Big data – bad, bad data!
Ah, big data! Cloud computing is so 2012! Last year’s [2013] buzzword was big data, and there is plenty of use (and misuse) of the term in the media. Consequently, a lot of organisations are interested in the subject and in how it can help them improve their businesses. The problem is that the vast majority of the people I have come across talking about big data are hardly qualified to discuss the subject.
Last year I read two articles in particular on the subject of big data, both cover stories in important international publications. One was the cover story of Foreign Affairs, and the other the cover story of VEJA. While both articles have their merits, they also helped to perpetuate a lot of misconceptions, which I am tackling in this post.
The first misconception concerns the value of big data. On its own, data means nothing; it must be processed into information before any value can be obtained. “Big data is not about the data, it’s about the analytics, according to Harvard University professor Gary King” (Tucci, 2013). And to obtain good value, we need good data experts (i.e. statisticians) to analyse the data, or to create a model that analyses it. Yet curiously, I have come across many experts with little to no knowledge of core statistical principles; experts who would not be able to carry out some of the simplest statistical tests. This lack of expertise leads to bad analytical models, bad conclusions and, ultimately, poor decisions. In sum, data is only as good as the people analysing and interpreting it.
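To make “simplest statistical tests” concrete, here is a minimal sketch of one of them, a two-sample t-test, in Python. It assumes SciPy is available, and the two groups and their numbers are invented purely for illustration:

```python
# A minimal sketch of a two-sample t-test, assuming SciPy;
# the two groups below are synthetic and purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=100, scale=15, size=50)  # e.g. a control group
group_b = rng.normal(loc=108, scale=15, size=50)  # e.g. a treatment group

# Test whether the two group means differ more than chance alone would suggest.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```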
The second misconception about big data: big data provides correlation, so we don’t need to understand the cause. This one makes me cringe. As Svetlana Sicular pointed out in an article entitled The Illusions of Big Data (Gartner, 2013), “more data and more analytics will create an illusion of solutions while the problem [i.e. the cause] still persist[s]”. In many cases, finding the cause of the problem is the ultimate goal. After all, if we don’t understand the causes, how can we ever be proactive in resolving a problem?
With big data we move from deterministic to probabilistic answers, but probability is an illusion of control. People can’t agree on elementary things, and “while an individual usually behaves rationally, an aggregated crowd is irrational”. In addition, an analytical model is never 100% accurate, and it might not apply to a single person or event, since that person or event can be an outlier. Organisations willing to disregard this and take a blanket, “one-answer-fits-all” approach risk antagonising their stakeholders (mostly their customers). I can sum up this thought by saying the following:
People are imperfect and different by nature, and to follow strict mathematical reasoning in order to predict or dictate human behaviour is to reject their individuality, and thus their humanity.
This brings me to another misconception: that we can cross anything with everything and reach a reasonable conclusion. One of the first things we learn in statistics is that correlation does not imply causation. Ultimately, we can’t create a sensible model that identifies a person’s favourite band based on their favourite ice-cream flavour. However, the issue at hand is more subtle than that. I am referring to analysing unstructured data in a model. In analytics, and particularly in data mining, variable selection methods must be applied to statistical models, as redundant or irrelevant variables are likely to create noise that can render the model useless.
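To illustrate the point about variable selection, here is a minimal sketch of a simple filter-based approach in Python. It assumes scikit-learn, uses a synthetic data set, and the number of variables kept is purely illustrative:

```python
# A minimal sketch of filter-based variable selection, assuming scikit-learn;
# the synthetic data set and the choice of k are illustrative only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 candidate variables, only 5 of which carry real signal.
X, y = make_classification(n_samples=1_000, n_features=20,
                           n_informative=5, n_redundant=5,
                           random_state=42)

# Keep the 5 variables with the strongest univariate association with the
# target; the rest are treated as noise and dropped before modelling.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("Variables kept:", selector.get_support(indices=True))
```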
Moreover, when crossing variables we must also consider variable transformation. The analogy I like to make here is that we can’t compare apples and oranges in a model, but we can certainly compare fruits. So when bringing different variables into our model, we sometimes need to perform a transformation that allows us to compare them. An obvious example is unit conversion, such as when we want to compare weights and some entries are in pounds while others are in kilograms. Often, however, the cases that require variable transformation are much more subtle than a simple conversion; logarithmic or square root transformations, for instance, may be needed. Variable transformation can be used to reduce skewness in the distribution of interval variables, which in turn yields a better-fitting model.
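As a quick illustration of the skewness point, here is a minimal sketch in Python, assuming NumPy and SciPy; the log-normal “income” variable is synthetic and only there to show the effect of the transformation:

```python
# A minimal sketch of a skewness-reducing transformation, assuming NumPy and
# SciPy; the synthetic income variable is purely illustrative.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=1, size=10_000)  # heavily right-skewed

log_income = np.log(income)  # logarithmic transformation

print(f"Skewness before: {skew(income):.2f}")      # strongly positive
print(f"Skewness after:  {skew(log_income):.2f}")  # close to zero
```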
Another point, which is not so much a misconception as a caveat, is the idea of using the entire data set rather than just a sample in our analysis. We need to consider this approach carefully in data mining models, as we can over-train a model and thereby reduce its predictive performance. Working with samples is still a required approach in some types of analysis and models, regardless of computational power.
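To illustrate why a held-out sample still matters, here is a minimal sketch in Python, assuming scikit-learn; the data set and the unconstrained decision tree are illustrative, not a recommendation:

```python
# A minimal sketch of over-training, assuming scikit-learn; the synthetic data
# set and the unconstrained tree are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, so memorisation does not generalise.
X, y = make_classification(n_samples=2_000, n_features=20, flip_y=0.2,
                           random_state=0)

# Hold back 30% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# An unconstrained tree effectively memorises the training data...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Training accuracy:", tree.score(X_train, y_train))  # near 1.0
print("Hold-out accuracy:", tree.score(X_test, y_test))    # noticeably lower
```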