
An interesting article on predictive analytics appeared in the New York Times on 16 February. The main focus was on how the US retailer Target was using its information on customers to predict which of them were pregnant, and then send them information on pregnancy and baby-related products. Simultaneously impressive and scary. The article couldn't establish how accurate the predictions really are, but the company's revenue has grown.
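
The article describes Target assigning shoppers a 'pregnancy prediction' score based on what they buy. A minimal sketch of that kind of purchase-based scoring might look like the following; the products, data and choice of logistic regression here are all my invention, since the actual model isn't public.

```python
# A minimal sketch of purchase-based prediction along the lines the
# article describes. The products, data and model choice are all
# invented; Target's actual features and model aren't public.
import numpy as np
from sklearn.linear_model import LogisticRegression

PRODUCTS = ["unscented_lotion", "supplements", "cotton_wool", "beer", "crisps"]

rng = np.random.default_rng(0)
n = 1_000
X = rng.integers(0, 2, size=(n, len(PRODUCTS)))    # 1 = customer bought it
# Invented ground truth: the first three products weakly signal pregnancy.
signal = X[:, :3].sum(axis=1)
y = rng.random(n) < 1 / (1 + np.exp(-(signal - 2)))

model = LogisticRegression().fit(X, y)

# A 'pregnancy prediction score' for one new basket of purchases.
basket = np.array([[1, 1, 1, 0, 0]])
print(f"score: {model.predict_proba(basket)[0, 1]:.2f}")
```

The unnerving part in practice is the scale: thousands of products and millions of customers, rather than five products and a thousand shoppers.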

Many other companies are doing the same thing, as you may be aware if you belong to a loyalty programme or shop online.

This is an example of what these days is called ‘Big Data’. It has many firms, governments and consultants very excited. McKinsey Global Institute, along with a lot of others, has been looking at the implications of ‘Big Data’. The expectation is that with detailed analysis of lots of different information companies can sell you more stuff, or stuff that better meets your needs, while governments can plan more effectively or efficiently, and provide services tailored for you.

This graphic illustrates where McKinsey see better use of data helping a range of sectors:

MGI Big Data Chart Feb 2012

You could dispute how McKinsey have characterised the ease of collection or the value of big data. I for one think that current departmental silos make it hard to collate a range of government data; it can be extremely difficult for people in the same agency to get access to even some of their own data. But that is a relatively minor quibble that will be overcome, although, judging by the past record of IT system upgrades, probably not as cheaply or easily as decision makers hope.

Who wouldn’t want better health care services through better data analysis, smarter urban design through study of mobile data, or creating new businesses that make data useful? Or even reduced insurance costs because the insurance company can tell I’m being a good boy?

I can see benefits to that, but there are also plenty of challenges. I’m reminded of what my PhD adviser (take a bow David Penny [PDF]) warned me of back in the day when ‘big data’ were 500 base pairs each from about 20 species. I paraphrase, but it was something like ‘it can feel nice running barefoot through the data, but it’s not much use if you can’t make sense of it’.

Technology Review covers the findings of a paper by Boyd & Crawford arguing that having more data isn’t necessarily better. The data may not be as objective or as accurate as you assume: users of social networks are (so far) generally not representative of the whole population, and what they choose to post, or not post, is biased.
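
A toy simulation makes the representativeness point concrete (all the numbers here are invented): a huge but biased sample can give you a confidently wrong answer that a small random sample avoids.

```python
# Toy illustration (all numbers invented): estimating how common an
# opinion is from a big biased sample versus a small random one.
import numpy as np

rng = np.random.default_rng(1)
population = rng.random(1_000_000) < 0.30      # true rate: 30%

# Small but genuinely random sample.
small = rng.choice(population, size=500, replace=False)

# Huge sample, but opinion-holders are three times as likely to post.
weights = np.where(population, 3.0, 1.0)
idx = rng.choice(len(population), size=100_000, p=weights / weights.sum())
big_biased = population[idx]

print(f"true rate:      {population.mean():.3f}")   # ~0.30
print(f"small random:   {small.mean():.3f}")        # ~0.30, but noisier
print(f"big but biased: {big_biased.mean():.3f}")   # ~0.56, precisely wrong
```

More data shrinks the error bars; it does nothing about the bias.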

Other issues Boyd & Crawford highlight are:

  • Quantity of data isn’t a substitute for quality. A UK Foresight report [PDF] notes that being able to demonstrate the authority of the data in real time will be a critical requirement in the future, particularly for government data.
  • Not all data are equivalent. Not every connection in a network is equivalent: frequency of contact on social networks (or mobile devices) does not necessarily identify a person’s closest friends or colleagues.
  • There are ethical judgements to be made when using ‘public’ information. They note that at least one study demonstrated that individual privacy could be compromised from supposedly anonymised data (see the toy sketch after this list). Since the ‘Big Data’ field is new, it is unclear what ethical issues may arise from aggregating information about individuals.
  • Finally, not everyone has the same access to large databases, so there are risks of creating new digital divides and inhibiting the testing of conclusions drawn from large data sets. Researchers in some fields are incentivised to share their data and generally do so; others aren’t there yet.
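
The anonymisation point above is easy to make concrete with a toy linkage attack: join an ‘anonymised’ dataset to a public one on shared quasi-identifiers such as postcode, birth year and sex. All the records below are invented.

```python
# Toy linkage attack (all records invented): re-identifying people in an
# 'anonymised' dataset by joining it to a public one on quasi-identifiers.
anonymised = [  # names stripped, sensitive field kept
    {"postcode": "6012", "birth_year": 1970, "sex": "F", "condition": "diabetes"},
    {"postcode": "6021", "birth_year": 1985, "sex": "M", "condition": "asthma"},
]
public = [      # e.g. an electoral roll or public profiles
    {"name": "A. Smith", "postcode": "6012", "birth_year": 1970, "sex": "F"},
    {"name": "B. Jones", "postcode": "6021", "birth_year": 1985, "sex": "M"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")
for record in anonymised:
    matches = [p for p in public
               if all(p[k] == record[k] for k in QUASI_IDENTIFIERS)]
    if len(matches) == 1:  # the combination is unique: anonymity fails
        print(f"{matches[0]['name']} -> {record['condition']}")
```

The more datasets that get aggregated, the more such unique combinations exist.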

Boyd & Crawford exhort researchers not to overlook the value of ‘small data’: what data you use depends upon the questions you have.

Leaving aside how you analyse very large data sets, the organisational and technical challenges of moving from handling small data to big data won’t be trivial (and it’s not just a question of being cloud-based). Governments and firms will also need real-time information to inform their decisions and services, rather than only being able to analyse what happened previously.

Another article in the New York Times notes that use of ‘Big Data’ may exacerbate the economic divide between low-paid lousy jobs and well-paid interesting ones, as previously middle-class work is taken over by computerised analytics.

Wired lost the plot a few years ago when it claimed that you don’t need the scientific method if you have lots of data. The Target example is a case where there was quite a clear outcome being sought: the company did have a hypothesis to test and threw strong statistical analyses at the problem. Whether such applications are acceptable uses is up for debate.

One final concern is whether we all, as members of the public and consumers, become reduced to sets of data points in cosmic analytical frameworks and suffer the tyranny of regression to the mean. Choice still matters. When buying goods online, being told what items other people also bought is inoffensive and you can choose whether you buy them too. I don’t want to go to my doctor and be told that others with most of my symptoms had treatment X, so that’s all I can have. Big data can better help identify problems and options, but they shouldn’t necessarily dictate the response.