Spurious correlations and Big Data

I have been a Time magazine subscriber for decades. And while I generally enjoy being informed, I also read Time knowing that they have a definitive ideological slant, which is very evident in their over-reporting of certain topics. So I read Time carefully, knowing that I do not agree with all their viewpoints, and also that I should not believe everything I read. Time is also not generally known for graphical excellence—in fact, you will easily find Time being used as fodder for examples of how not to do charts. And I also find it amusing to look at their charts and infographics, although I must add that there are definitely times that Time does get it right (and in all fairness to them, it is unfortunately so that when they do get it right, we don’t find people singing their accolades).

I am, as I have indicated in an earlier post, doing a lot of work with Google Trends data. One of the biggest challenges with this data is the problem of dimensionality or overfitting, which, simply put, means that when we have data on masses of predictor variables, we are bound to find some which, by chance, correlate well with our variables of interest—in other words, the more dimensions we add, the more likely we are to run afoul of spurious correlations.

I have just again browsed through some of Tyler Vigen’s hilarious spurious correlations, like the 0.99 correlation between the number of lawyers and suicides or the .093 correlation between per capita cheese consumption and people who died by becoming tangled in their bedsheets. (Which in itself raises some interesting questions, like “How do between 400 and 900 people in the US get it right to get so tangled up in their bedsheets each year?” Perhaps the statistic includes infant deaths, which would be very tragic).

In fact, it seems you can “prove” anything with research these days, including, for example, that intelligent people are messy, cursing insomniacs (also here), who like trashy movies, and appear to be lazy (but note, again, how popular media slants research findings to support what they want it to say, not necessarily what it does say).

I also unearthed an old issue of Time (1 August 2016) to see this: Political foods.

   Time 2016-08-01 Vol188 no5 p17a  Time 2016-08-01 Vol188 no5 p17b  Time 2016-08-01 Vol188 no5 p17c

You can also read more on this page and this page.

This has to be one of the most spurious of all spurious correlations, and is a good example of overfitting. They have a big data set (Grubhub’s data). They choose 175 dishes. Then for each dish, they calculate the correlation between the number of orders and the two percentages of Democratic and Republication votes. That’s two correlations for each dish, for a total of 350 correlations. Presumably, each district is a data point: This page notes that they ran “correlations between the share of orders for 175 dishes and the average share of votes going to Democrats or Republicans in each district.” And then they just chose those dishes that showed the highest correlations.

Here, it is much more likely to say that certain foods enjoy more support in certain geographic areas (or are more freely available in certain areas, as Time does acknowledge), just as the Republican and Democratic parties also enjoy more support in certain geographic areas, but it is one of the most trite deductions imaginable to claim that one is more likely to eat certain foods because of one’s political affiliation (or, heaven forbid, that the food you eat determines how you vote—Time suggests, hopefully in jest, that “Of course, we all know that eating a hamburger makes you more likely to vote Republican”). the correlation should at least make sense. For example, certain religions have certain dietary prescriptions, and so an association between religious affiliation and dietary preference makes intuitive sense, but political affiliation being directly associated with dietary preference just makes no sense. Indirectly, maybe (e.g., if certain religious or cultural groupings tend to hold a certain political affiliation), but not directly.

Remember that correlation is only a measurement that shows the degree of association between two variables (not even agreement, as Bland and Altman pointed out). So remember that correlation does not prove anything. That also does not mean that nothing is proved by correlation. The main point is that correlation should be correctly understood. It only shows the association of two variables. But there must at least be some understandable link between the variables, and, more importantly, all spurious variables must be excluded.