The Big Problem with Big Data

Without a doubt, Big Data holds a lot of promise. But, Nate Silver reminds us that the mere availability of data will not change anything, even if it’s coming in large servings

Updated:Dec 15, 2012 05:45:15 PM IST

Pages: 534

These days it’s hard not to hear someone or the other talk about Big Data, especially if you are a journalist covering IT in Bangalore. The term always comes up in press conferences, seminars, PowerPoint presentations, and sometimes even in casual conversations. The pronouncements on Big Data are often delivered with the passion of an evangelist, and arise from an awareness that every day, astronomical amounts of data get generated.

A McKinsey report, released last year, glowingly quoted an IDC analysis, saying that in 2009, 800 exabytes of data was created - “enough to fill a stack of DVDs reaching to the moon and back.” Nearly all sectors in the US economy, it said, had at least an average of 200 terabytes of stored data per company with more than 1,000 employees.

The consulting firm studied five domains - healthcare, public sector administration, personal location data, retail and manufacturing - and in each of these it found Big Data can generate significant financial value.

It’s no wonder, then, that everyone’s excited. The other day, I heard an executive from a mid-sized IT firm speak about what they found out after an exercise in this field. When they placed pharma product sales data on top of weather data, they saw that the sales of Band-Aids went up on rainy days. “I don’t know what causes it, but there is a correlation. And that’s the key. Imagine the value this insight gives to the clients, to their supply chain,” he said.

The idea that data is all you need to navigate the rough waters - and who cares about the mechanics of causation - has been around for some time. In 2008, Chris Anderson, editor of Wired and author of The Long Tail and Free, wrote gushingly about how Google uses data and said, “The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.” His piece was called The End of Theory.

Similar sentiments were expressed recently, after Nate Silver, a geeky blogger at The New York Times, accurately predicted the US presidential election with nothing more than his data sets, his statistical models and his computer. The political pundits who mocked his data-driven model during the election had to eat their words later on. When the results were out, the media was ready with its tributes: Nate Silver-Led Statistics Men Crush Pundits in Election | Bloomberg; Has Nate Silver destroyed punditry? | Christian Science Monitor; The Statisticians on the Bus | How a nerd named Nate Silver changed political reporting forever | Newsweek

Silver’s success was not a fluke. Before he got into electoral predictions, Silver made a name for himself in baseball and poker. He designed a system called PECOTA to predict the performance of major league baseball players, and sold it to Baseball Prospectus. For some time, he made a living on online poker - his income ran into six figures then, by one account. If Congress hadn’t banned online poker, he would have continued to play it, he joked at a Google event recently. As it happened, he turned his attention to US presidential elections. He started a blog called FiveThirtyEight (the name refers to the number of electoral college votes) to share the findings of his analysis. In 2008, he was right about 49 of the 50 states. For the 2012 elections, he moved his blog to The New York Times, where it became one of the biggest draws for the newspaper’s website and turned Silver into a superstar.

So, when I picked up Nate Silver’s book, I expected to read a strong case for a data-driven approach to everything, a set of arguments to demonstrate the supremacy of data. But it turned out to be different.

Now, Big Data is not the main theme of his book - and Silver touches on it only now and then. The book is about using data to make predictions, and that, in some ways, is at the core of Big Data. After all, people are primarily interested in the future - what will be the price of the produce that’s growing in your field now, how much of a particular product will we sell in a particular market, what kind of products should we develop, or even what will the traffic be on the route to the airport three hours from now. And the promise of Big Data is that it will give the answers by studying reams and reams of data.

From this perspective there are three big takeaways from the book. One, in the field of predictions, failures rule, and successes are rare. Two, more data will not solve this problem, and data alone is not sufficient. And three, there’s a way to improve your chances of getting your predictions right.

That Nate Silver achieved a kind of super-stardom just by predicting election results correctly says something about how rarely such a thing happens. He has gathered several badges (TED Talk, Talks @ Google, interviews with Jon Stewart and Stephen Colbert), and as I write this the book is 14th on Amazon’s bestsellers list and the most wished-for book in the Money & Markets category. His is not the sole success. The book itself talks about a number of others. Weather forecasts, for example, have become better and better, and are fairly reliable in the US today. There’s a fascinating chapter on how IBM’s machine won against Kasparov.

But tales of failure abound. Silver talks in his book about the Tohoku earthquake that led to the Fukushima disaster, and as if to remind us that it will take longer than a year to get better at these things, another earthquake hit Japan last week without warning. Predictions have also failed for terrorist attacks, in economics, and even in a good deal of academic research. Even where prediction has seen some success, that success is somewhat limited. Silver attributes his own success to well-chosen battles, and has spoken about the limitations of his model. (If you listen to his interviews, some of which are available on YouTube, you can’t help but notice how disarmingly honest he is.) His model didn’t work in parliamentary elections in the UK, and you only have to look at the historical data on the website of India’s Election Commission, and compare the number of pre-election polls here with the range and variety that Silver had access to in the US, to see why it won’t work in India either. Making predictions, as Yogi Berra said, is difficult, especially if it’s about the future. That has hardly changed. Doing his research, Silver says, “I came to realize that prediction in the era of Big Data was not going very well.”

Now, one might argue that more data could solve these problems - and in fact, that’s one of the reasons why people are very excited about Big Data. Silver draws our attention to the forecasting firm ECRI, which, in September 2011, predicted that the world was headed for a “double dip” recession (if it wasn’t already in one), and threw at its customers several leading indices that suggested so. He writes:

“Theirs was a story about data—as though data itself caused recessions—and not a story about the economy. ECRI actually seems quite proud of this approach. ‘Just as you do not need to know exactly how a car engine works in order to drive safely,’ it advised its clients in a 2004 book, ‘You do not need to understand all the intricacies of the economy to accurately read those gauges.’

“This kind of statement is becoming more common in the age of Big Data. Who needs theory when you have so much information? But this is categorically the wrong attitude to take toward forecasting, especially in a field like economics where the data is so noisy. Statistical inferences are much stronger when backed up by theory or at least some deeper thinking about their root causes.”

Besides, the problem with Big Data could be exactly what its name suggests: big data. To separate signal from noise, to deal with false positives, and to test hypotheses - all these will be difficult, because more data will also produce more noise. “For instance, the U.S. government now publishes data on about 45,000 economic statistics. If you want to test for relationships between all combinations of two pairs of these statistics—is there a causal relationship between the bank prime loan rate and the unemployment rate in Alabama?—that gives you literally one billion hypotheses to test.”
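The “one billion” figure is easy to check, and it also shows why more data means more noise. The sketch below counts the distinct pairs among 45,000 series, then estimates how many pairs would look “significant” by chance alone. The 5% significance threshold and the function names are assumptions made for illustration; neither comes from the book.

```python
from math import comb


def pairwise_hypotheses(n_series: int) -> int:
    """Number of distinct pairs that can be formed from n_series statistics."""
    return comb(n_series, 2)


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of spurious 'hits' if no real relationship exists.

    alpha is an assumed significance threshold, not a figure from the book.
    """
    return n_tests * alpha


n_tests = pairwise_hypotheses(45_000)
print(f"{n_tests:,} pairs to test")  # 1,012,477,500 pairs to test
print(f"about {expected_false_positives(n_tests) / 1e6:.0f} million chance correlations")
```

Under that assumed threshold, tens of millions of pairs would pass the test even if none of the series were truly related, which is the noise problem Silver is pointing at.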

As for ECRI’s double dip recession, it never happened.

But there’s a way to improve your chances of getting your predictions right: be a fox, rather than a hedgehog. And even more importantly, join the Bayesian club.

In 1953, philosopher Isaiah Berlin published an essay called The Hedgehog and the Fox, in which he spoke about two types of men - foxes, who knew a lot of things, and hedgehogs, who knew one big thing. It was definitely not one of his most serious works, but it turned out to be very popular, and the hedgehog and the fox became an enduring metaphor for certain types of thinkers and writers. Here’s a seven-minute video that explains the difference between a fox and a hedgehog.

[youtube]http://www.youtube.com/watch?v=WIbbFfz8nEQ[/youtube]

Silver says foxes are better at predicting too.

Bayes’s Theorem is a recurring theme in Silver’s book. The theorem was proposed by Thomas Bayes, an 18th-century mathematician. Probability buffs speak about Bayes in the same way physics buffs speak about Richard Feynman. Here’s a video that might help jog your memory.

[youtube]http://www.youtube.com/watch?v=E2pOJwSwWDk[/youtube]

Math apart, Silver highlights three habits of mind that the Bayesian approach encourages.

Consider these two sentences.

No investor can beat the stock market

It is hard to tell how many investors beat the stock market over the long run, because the data is very noisy, but we know that most cannot relative to their level of risk, since trading produces no net excess return but entails transaction costs, so unless you have inside information, you are probably better off investing in an index fund.

The first is simple - even sounds powerful - but it’s just an approximation. The second looks messy, full of uncertainties, but it’s a better description of reality. (The book has five other statements in between, showing how probabilistic thinking evolves, allowing you to add layer upon layer, and to include exceptions and refinements till your model of the world gets realistic enough to be useful. But the first and last statements give the idea.) “We have big brains, but we live in an incomprehensibly large universe,” Silver writes. “The virtue in thinking probabilistically is that you will force yourself to stop and smell the data—slow down, and consider the imperfections in your thinking. Over time, you should find that this makes your decision making better.”

One of the things we do in solving a problem using Bayes’s theorem is to ‘estimate a prior belief’. The best way to understand this is through an example, and let me quote one from the book, because it’s kind of mischievous. Suppose you discover another woman’s panties in your dresser drawer. What’s the probability that your partner is cheating on you? To answer this, we have to answer three more questions. One, what’s the probability of the panties appearing in your dresser if he were cheating on you? Call it x, and let’s say it’s 50%. Two, what’s the probability of them appearing there if he were not cheating on you? It’s possible - maybe his luggage got mixed up - but it’s remote. So let’s say it is 5%, and call it y. The third question is critical. It’s called assigning the prior. What is the probability you would have assigned to him cheating on you before you found the underwear? It’s not easy to be objective about it, but sometimes you can get empirical backing. Let’s say we have studies that say about 4 percent of married partners cheat on their spouses in any given year, and that will be z. From here, it’s just a question of applying the formula - xz/[xz + y(1-z)] - to find out how likely it is that you’re being cheated on, given that you’ve found the underwear. With these numbers, the answer comes to about 29 percent.
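For readers who want to check the arithmetic, here is a minimal sketch of the calculation using the three numbers from the example: a 50% chance of the evidence if he is cheating, a 5% chance if he is not, and a 4% prior. The function and parameter names are illustrative choices, not taken from the book.

```python
def posterior(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Bayes's theorem for a yes/no hypothesis.

    prior               -- belief in the hypothesis before seeing the evidence
    p_evidence_if_true  -- chance of seeing the evidence if the hypothesis is true
    p_evidence_if_false -- chance of seeing the evidence if the hypothesis is false
    """
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator


# Numbers from the example: 4% prior, 50% if cheating, 5% if not.
p = posterior(prior=0.04, p_evidence_if_true=0.50, p_evidence_if_false=0.05)
print(f"{p:.1%}")  # 29.4%
```

The evidence moves the estimate a long way, from 4% to roughly 29%, yet it is still more likely than not that there is an innocent explanation. That is the effect of starting from a low prior.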

The key point to remember here is that if the prior probability is either 1 or 0, additional evidence is not going to change your answer at all. Thus Bayes is also an exercise in finding out what our prior beliefs are. “To state your beliefs up front—to say “Here’s where I’m coming from” — is a way to operate in good faith and to recognize that you perceive reality through a subjective filter,” Silver writes.

Finally, keep updating your forecasts every time you get new information. “Staring at the ocean and waiting for a flash of insight is how ideas are generated in the movies. In the real world, they rarely come when you are standing in place. Nor do the ‘big’ ideas necessarily start out that way. It’s more often with small, incremental, and sometimes even accidental steps that we make progress.”
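Both points, that a prior of 0 or 1 never moves and that an open-minded prior shifts with each new piece of evidence, can be seen in a few lines. This is only a sketch: it assumes three independent pieces of evidence, each as strong as the one in the underwear example (50% likely if the hypothesis is true, 5% if it is false). That scenario is hypothetical and not from the book.

```python
def update(prior: float, p_if_true: float, p_if_false: float) -> float:
    """One Bayesian update for a yes/no hypothesis."""
    numerator = prior * p_if_true
    return numerator / (numerator + (1.0 - prior) * p_if_false)


def belief_after(prior: float, n_pieces: int) -> float:
    """Apply the same (assumed, hypothetical) evidence n_pieces times in a row."""
    belief = prior
    for _ in range(n_pieces):
        # Yesterday's posterior becomes today's prior.
        belief = update(belief, p_if_true=0.50, p_if_false=0.05)
    return belief


for prior in (0.0, 0.04, 1.0):
    print(f"prior {prior:.2f} -> belief after 3 updates {belief_after(prior, 3):.3f}")
```

The dogmatic priors stay at exactly 0 and 1 no matter how much evidence arrives, while the 4% prior climbs to about 98% after three such findings.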

Without a doubt, Big Data holds a lot of promise. But, Silver reminds us that the mere availability of data will not change anything, even if it’s coming in large servings. “The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning. Like Caesar, we may construe them in self-serving ways that are detached from their objective reality. Data-driven predictions can succeed—and they can fail. It is when we deny our role in the process that the odds of failure rise. Before we demand more of our data, we need to demand more of ourselves.”
