Correlation is not causation. Whilst big data is currently all the rage, it’s being applied with insufficient care and attention in business, especially within sales and marketing.

Big Data, Little Theory

When the ancients looked up at the night sky, they saw all manner of patterns: bears, scorpions, mythical figures – the constellations. Of course now we laugh at this naivety, but even the assumption that constellations are neighbouring groups of stars is in fact wrong. Whilst the three stars that make up the belt of Orion are in more or less the same place, his shoulders are roughly twice as close to us as his belt is – a line-of-sight effect resulting from the fact that a nearby dim star can look the same as a faraway bright star.

People have a tendency to see patterns everywhere, and will employ considerable creativity in ascribing causal reasons for these patterns. But for any correlation between two observations A and B, there are at least four possible explanations:

  1. It’s a statistical fluke: A and B are in fact uncorrelated
  2. A and B are both caused by some common factor (but independently of each other)
  3. A causes B
  4. B causes A
  5. Any or all of the above in some combination. Hint: anything you discover in sales and marketing data is likely to be of this type.
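Explanation 2 is easy to demonstrate with a simulation. In this sketch all the numbers are hypothetical, chosen purely for illustration: ice-cream sales and drownings are both driven by temperature, neither causes the other, and yet they come out strongly correlated.

```python
import random

random.seed(42)

# Hypothetical figures: a common factor C (temperature) drives both
# A (ice-cream sales) and B (drownings), with independent noise.
n = 10_000
temperature = [random.gauss(20, 5) for _ in range(n)]
ice_cream = [t + random.gauss(0, 2) for t in temperature]   # A = C + noise
drownings = [t + random.gauss(0, 2) for t in temperature]   # B = C + noise

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(f"corr(A, B) = {pearson(ice_cream, drownings):.2f}")
```

With these (made-up) noise levels the correlation lands around 0.86, despite there being no causal link in either direction between A and B.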

A good way to see this is to take a map and randomly sprinkle a handful of salt over it. You’ll note that the salt does not evenly cover the map, but naturally forms clusters and voids of greater and lesser density. If you look closely, you’ll note that some of these clusters correspond to cities or specific sites on the map.

The significance of this experiment becomes clear when we evaluate the strength of observations like the cancer cluster detected around one of Britain’s nuclear power plants, Sellafield. It’s now believed that whilst the cluster is real, it was caused not by radiation but by immigration – the sudden mixing of incoming workers with the local population.

Another example of things not being what they seem is a well-known meta-study of antidepressants. Long considered wonder drugs, a survey of both published and, crucially, unpublished data suggested their effects were much more modest, and in most cases too small to deliver any clinical benefit. But negative results don’t make good reading, and so inadvertently the peer-reviewed literature had become a textbook example of survivorship bias.

Entrepreneurial success stories are a classic example of this, and another experiment that illustrates it is coin-tossing. How likely is it that someone could toss 10 heads in a row? Must be a fix – surely it’s impossible? In fact an unbiased coin will fall heads 10 times in a row roughly once in every thousand attempts – that is, with 1,000 people, you’d expect on balance to see one such streak of luck. Incidentally, this is how stock-picking newsletter scams work – you send an email saying X will soar to half your addresses, and one saying X will tank to the other half. Repeat with the winners three times, and the 1/16 fraction left will be convinced of your stock-forecasting credentials. The losers are forgotten.
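The arithmetic above is easy to check in a few lines of Python (the 16,000-address mailing list is a made-up figure, chosen so the survivors work out to a round number):

```python
import random

random.seed(7)

# Probability of 10 heads in a row with a fair coin: (1/2)**10
p_ten_heads = 0.5 ** 10
print(f"chance of 10 heads in a row: 1 in {1 / p_ten_heads:.0f}")

# Simulate 1,000 people each tossing 10 coins; count perfect streaks.
people = 1_000
lucky = sum(
    all(random.random() < 0.5 for _ in range(10))  # all ten tosses heads
    for _ in range(people)
)
print(f"lucky streaks among {people} people: {lucky}")

# The newsletter scam: send opposite predictions to each half of the
# list, then repeat with the winning half. Four halvings leave 1/16.
addresses = 16_000
for round_no in range(4):
    addresses //= 2
print(f"convinced prospects after 4 rounds: {addresses}")
```

The exact chance is 1 in 1,024, and the scam leaves 16,000 / 16 = 1,000 recipients who have seen you call the market correctly four times running.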

The moral of the story is that data is not just data. Data always exists within a context, and is always interpreted via a perceptual prism. Theory is not an optional nice-to-have, but a critical element in interpreting data. If you don’t understand why something is the way it is, then you can’t even be sure you know it really is that way. There is simply no such thing as “raw data”. Every piece of information we have is the product of the thing in itself and the process of measurement, and a key part of that process is you, with all of your prejudices and inadequacies factored in. There are no hard facts, and numbers absolutely do lie – just ask the accounting profession if you don’t believe me.

Any scientist reading this will surely have fallen asleep by now. Everything I’ve said is well understood in the physical sciences. But sadly this learning has not yet been fully assimilated in business, where the current meme of big data has taken over as the next big thing. The idea is that the more data I crunch, the more ‘insights’ I get. Tuning the dials, I can maximise profit without ever leaving the comfort of the office.

There is a tendency to dress up arguments in mathematical and scientific language in order to bamboozle and impress the innocent. The reputation of science, cemented by such important advances as big-screen TVs and plastic surgery, means that anything “scientific” has to be true. Absolute. Objective. Inarguable. There’s even a name for it: physics envy.

But the world of business is not the world of science. When physicists talk about statistical mechanics, they talk in huge numbers of 1,000,000,000,000,000,000,000,000 and more. And crucially the things they talk about – atoms, electrons, etc – are simple. An electron can be described by just three numbers (rest mass, charge, and intrinsic spin). As best as we can tell, electrons have no shape and are dimensionless points. And as best as we can tell, every single electron we have ever seen is identical to every other electron in the universe, now and forever.

But sales and marketing is fundamentally about people. And if I’ve learnt one thing about people it’s that you need more than just three numbers to understand them. People are not simple. Regardless of stereotypes, they are not identical (or even consistent from one week to the next). Few experiments or A/B tests manage much more than a few thousand individuals at best, and that is not a huge number. In short, all the characteristics that make statistical mechanics – and indeed most statistics – work don’t exist in sales and marketing. It’s why opinion polls so often get it wrong.

The data warehousing story of beer and diapers (nappies) is well known. It goes that a store was crunching its data and found a correlation in its shopping baskets between beer and diapers, and that by co-locating them, sales increased. The explanation was that men were picking up diapers on the way home from work, plus some beer. However, it turns out this story is in fact an urban legend – one with a germ of truth, but not much.

Which if you think about it makes sense. After all, whilst your wife might ask you to pick up some diapers on the way home, if she is anything like mine, she is equally likely to ask you to pick up some carrots, or milk, or any one of a dozen things. It’s why in every store you’ve visited you’ve never seen diapers and beer co-located. The story isn’t true. And by the way, the implicit assumption about gender roles here was never seen in the data. As an aside, my own family’s commute patterns mean it would be more common for my wife rather than me to pick up a six-pack of Corona and a lime on the way home.

I was recently pointed to some data showing a decline in traditional IT systems management topics, and if you look at the charts of web search data, it seems hard to argue with their conclusions. For example, the systems management tool Nagios shows search interest halving:

When I saw this, I thought it looked a bit suspicious, as many vendor-neutral terms like SNMP showed a similar pattern, so I checked out something I was pretty sure was not declining – CRM:

Now I smelt a rat, so I looked at some terms I knew were not declining – London, Paris, New York:

At least London shows a spike for the 2012 Olympics, but it’s hard to look at all these charts together and not question the entire basis of the data. I don’t know the reason for these results, but it may be some form of measurement artifact. One possible contributing factor is the growth of Bing over this period: it now powers about 25% of web searches, leaving Google with “just” 65%, so perhaps the charts simply show a decline in Google searches? More work is needed, but I’m pretty sure interest in the world’s capitals hasn’t collapsed in the past few years.
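To illustrate the kind of artifact I mean, here’s a toy calculation – every figure is made up, including the assumed earlier 90% Google share – showing how perfectly flat real-world interest can look like a steep decline if you only ever measure one search engine:

```python
# Hypothetical figures only: suppose total monthly searches for a term
# like "London" are flat, but Google's share of all web searches drops
# from an assumed 90% to 65% over the same period.
total_searches = 1_000_000          # per month, unchanged throughout
share_then, share_now = 0.90, 0.65  # assumed Google share, then vs now

google_then = total_searches * share_then
google_now = total_searches * share_now

decline = 1 - google_now / google_then
print(f"apparent decline in Google-measured interest: {decline:.0%}")
```

Under these invented numbers, Google-only data shows interest down by about 28% even though nothing real has changed – which is exactly why you need a theory of the measurement before trusting the chart.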

The takeaway is that you can’t just accept raw data without understanding it fully. If you don’t have a theory of what you are seeing versus what you should be seeing, then you’ve no idea if the effect you are looking at is real or not.

There is science to be had, and this is what psychologists do, but the key difference is that they do it for a living: they spend years training, and they take great care over confirmation bias, adequate sample sizes, statistical analysis to determine whether an effect is real or just noise, and proper controls (which is what went wrong with the Openxtra systems management survey above). And even then they still make mistakes, because people aren’t electrons. Most of what passes for A/B testing in marketing today is garbage.

There is an old joke about an astronomer, a physicist and a philosopher walking in the Highlands of Scotland, when they come across a black sheep. “Oh look”, exclaims the astronomer, “all the sheep in Scotland are black!”. The physicist grimaces and face-palms: “No, all you can say is that there are some black sheep in Scotland”. The philosopher sighs and tuts: “I’m afraid you’re both getting carried away; all we know is that the creature over there would appear to be black on the side facing us”.

There is an unfortunate tendency these days to put too much emphasis on metrics, and not enough on actual understanding. It bothers me when I see VPs of Sales and Marketing put together ever more elaborate charts and reports to explain why the leads and revenue aren’t coming in. Nine times out of ten it’s a product problem, but fixing that is hard and takes a long time. It’s easier to say “Let’s just fire the sales reps”, regardless of the root issue.

Marketing is more about opinion than optimization. Selling is more about psychology than statistics. Data is important but please make sure you know what it is and where it came from before elevating it to the status of gospel.