The Big Data Backlash
Reposted from Google+ (Mar 30, 2014)
So, the inevitable backlash against 'Big Data' has begun. In an article in the Financial Times yesterday, journalist Tim Harford suggests that Big Data analyses are more dangerous than we thought.
He points to reports that Google Flu Trends, Google’s system for predicting flu outbreaks from search queries, has become less reliable in recent years. For Harford, big data analyses, such as Google Flu Trends, are more fragile than traditional analyses because the models they produce are less interpretable:
'But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.'
I can understand why talk of 'big data' would lead someone to think that a mysterious and subtle model lay behind Google’s analysis. Crunching through 50 million search queries must have led to an unbelievably complex model that no one could ever understand, right?
In fact, 'Google Flu Trends' deliberately used an extremely simple model: a linear model with a mere 45 search query terms as inputs. Table 1 from the original paper (behind a paywall, but attached here) summarises the queries that were used:
Does Harford really believe that 'we have no idea what is behind a correlation' between googling for things like 'Cold/flu remedy' and people actually having flu? I'm not a doctor, but I'll try to speculate about what's behind that correlation: people with flu want to stop having the flu.
The fact that, five years on, changing user behaviour has made the algorithm less accurate is not unexpected. Indeed, the authors of the paper warn about this in their penultimate paragraph:
'Despite strong historical correlations, our system remains susceptible to false alerts caused by a sudden increase in ILI-related queries. An unusual event, such as a drug recall for a popular cold or flu remedy, could cause such a false alert.'
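The failure mode the authors describe is easy to see in a model this simple. Here is a minimal sketch, assuming a one-input linear fit on the log-odds scale; the function names and every number in it are invented for illustration and are not taken from the paper or from Google's system.

```python
import math

def logit(p):
    """Log-odds of a proportion p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse of logit: maps a log-odds value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up weekly numbers, for illustration only (NOT from the paper):
# the fraction of all searches that are flu-related, and the fraction
# of doctor visits that are for influenza-like illness (ILI).
query_fraction = [0.002, 0.004, 0.006, 0.008, 0.010]
ili_fraction = [0.010, 0.018, 0.027, 0.035, 0.044]

# Fit the line on the log-odds scale (an assumption of this sketch).
b0, b1 = fit_line([logit(q) for q in query_fraction],
                  [logit(p) for p in ili_fraction])

def estimate_ili(q):
    """Estimated ILI visit fraction for a given flu-query fraction."""
    return inv_logit(b0 + b1 * logit(q))

normal_week = estimate_ili(0.005)
recall_week = estimate_ili(0.015)  # searches spike, e.g. after a drug recall
print(f"normal week: {normal_week:.3f}, recall week: {recall_week:.3f}")
```

The fitted slope is positive, so any jump in flu-related searches pushes the estimate up whether or not anyone is actually ill. Nothing about that is mysterious: the whole model is two numbers, and you can read the weakness straight off them.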
I suggest that 'Big Data' analyses are no more prone to this kind of problem than any other kind of analysis.