One year I spent a lot of time with professional magicians. A few showed me the secrets to their tricks. Whenever they did, the skill and dexterity required for sleight-of-hand struck me as far more impressive than the idea that magic had been performed. It reminded me of my own experience with statistics.
Data analysis is very similar to performing magic. With great skill you can pull things together and create the perception of surprising relationships. Often the magic is getting people to look at one thing, when they should be seeing another. Similarly with statistics, it’s often not the correlation that’s interesting but what you did to find it.
This is important to keep in mind as the world embarks on the big data revolution. Big data refers to the very large data sets collected by governments, corporations and institutions, which are becoming more widely available. Using this data, firms and policymakers can figure out which programs work (which health treatments are effective, say, or which people respond to government incentives) and what consumers want. The deluge of information is expected to increase efficiency and lower prices. In a recent report, the McKinsey Global Institute encourages the increased availability of big data. It estimates that greater access to big data has the potential to create $3 trillion a year in value.
It is generally true that more information is better, though big data comes at a cost in terms of privacy and data collection. Yet what concerns me is the proper interpretation of big data. An earlier McKinsey report addresses this issue. It notes a dearth of trained statisticians, estimating that America is short 140,000 to 190,000 workers with the skills to handle data. But lack of talent is not just an impediment; it’s a potential source of danger. People, even those who know better, often take correlations literally and make decisions based on them, without appreciating the magic behind the numbers.
Interpreting data is more of an art than a science. But unlike magicians, most researchers do not intentionally mislead people. Two big concerns when you run statistics are bias (over- or understating a relationship) and mistaking correlation for causation (concluding that X causes Y when the two merely tend to occur at the same time). You might get biased results by using either the wrong data or an inappropriate estimation technique. Minimizing bias requires making subjective judgments. If you ran numbers on a large data set without inspecting it, removing outliers and choosing the best model, you’d have much more bias than if you used some discretion.
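To make the point about outliers concrete, here is a minimal sketch in Python. The numbers are made up purely for illustration, and `pearson` is a hypothetical helper name, not a method from any study mentioned here. It shows how a single extreme observation can turn a data set with no linear relationship into one with an apparently strong correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data (an assumption for illustration): eight points with
# no linear relationship between x and y.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 3, 6, 4, 5, 3, 6, 4]

# The same data plus one extreme observation that nobody inspected.
xs_with_outlier = xs + [30]
ys_with_outlier = ys + [30]

r_clean = pearson(xs, ys)
r_outlier = pearson(xs_with_outlier, ys_with_outlier)

print(f"correlation without the outlier: {r_clean:.2f}")
print(f"correlation with the outlier:    {r_outlier:.2f}")
```

Whether that ninth point should be dropped is exactly the kind of subjective judgment described above. The code can flag the difference, but it cannot decide which answer is right.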
The process is complicated by human nature. It is easy to be seduced by your own results when they validate your prior expectation of what you’ll find. Take the financial crisis, in which bad statistics played a large role. Many quants priced exotic housing securities using models that were fed data from areas where house prices never fell. This made the price of risk look very attractive, but then the products couldn’t remain viable when house prices fell. In most cases the oversight was not intentional. It reflected the data available and the current industry standard. Without a significant drop in housing prices in recent memory, it was an easy mistake to make.
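The same trap can be sketched in a few lines of Python. The figures below are hypothetical annual house-price changes invented for this illustration, not data from any actual pricing model, and `summarize` is an assumed helper name. The sketch compares the risk you would estimate from a sample that contains only rising prices with the risk estimated once a downturn is included.

```python
from statistics import mean, pstdev

# Hypothetical annual house-price changes, in percent (made-up numbers).
boom_years = [4.0, 6.0, 5.0, 7.0, 3.0]

# The same history extended with two years of falling prices.
full_history = boom_years + [-10.0, -15.0]

def summarize(changes):
    """Average change, volatility and worst single year for a sample."""
    return {
        "mean": mean(changes),
        "stdev": pstdev(changes),
        "worst": min(changes),
    }

boom = summarize(boom_years)
full = summarize(full_history)

print("boom-only sample:", boom)
print("full history:    ", full)
```

In the boom-only sample the worst year is still a gain, so a model fed that data has no way of learning that prices can fall.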
Often what’s most interesting isn’t the statistical relationship itself, but the data that was required to find it. Take the oft-cited statistic that American life expectancy is lower than that of many other OECD countries. That would suggest that American healthcare is not as successful as other systems. But when you look more deeply at the data, a different story emerges. Once you account for people who died from injury (like violence or car accidents) or obesity-related disease, American life expectancy is similar to Canada’s. America’s lower life expectancy is alarming and should get the attention of policymakers. But to remedy it, we need to understand what’s causing more car fatalities and obesity, and what factors, like poverty or arcane drug laws, lead to so much violence. American healthcare is certainly inefficient, but depending on how you parse the data, it’s not clear that it’s delivering worse results in terms of mortality compared to other OECD countries.
Such examples may seem straightforward, but in practice they are hard to spot, even for the most experienced and well-intentioned professionals. That’s why in academia, statistical work undergoes a rigorous peer review process. In the same way a magician can discern an impressive or dirty trick, it takes a community with the same expertise to spot sources of bias. But expert peer review won’t be realistic as data becomes more widely available and used commercially. It should be a serious concern that people without adequate experience might unknowingly produce biased results and make important decisions based on them.
But the use of big data is worth the risk. Statistical analysis is an imperfect process, but it’s all we have to make sense of big data. With any transformative innovation there is the potential to take it too far or use it incorrectly. The same can be said for cars, airplanes or new financial products. The benefits of more innovation and information usually outweigh the costs. We can minimize these risks with greater awareness of an innovation’s limitations. McKinsey advocates more training and apprenticeships so we have more people who can analyze and manage data. This is certainly necessary, but not sufficient. We must also view any statistical result with the same humility and skepticism we experience when we see a magic trick.
Allison Schrager is a Reuters columnist. The opinions expressed are her own.