Trusting Secret Data: Dunedin edition

By Eric Crampton 12/03/2013


Unless you’ve run the regressions yourself, it’s often hard to trust empirical results. Many results are fragile: small changes in specification, whether shifting the date range or adding seemingly irrelevant variables, can flip the findings. And endogeneity always makes inference hard in the social sciences.
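To make “fragile” concrete, here’s a toy simulation (all numbers invented, nothing to do with any real dataset): regressing y on x alone shows a sizeable “effect” that largely disappears once a lurking confounder z enters the specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder: some background trait driving both variables.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)   # "treatment" partly driven by z
y = 1.0 * z + rng.normal(size=n)   # outcome driven by z, NOT by x

def ols_coefs(y, cols):
    """OLS with an intercept; returns the non-intercept coefficients."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

naive = ols_coefs(y, [x])[0]          # y on x alone: picks up z's effect
controlled = ols_coefs(y, [x, z])[0]  # add z: the "effect" evaporates

print(f"naive slope: {naive:.2f}, with control: {controlled:.2f}")
```

Same data, one extra regressor, and the headline result vanishes. That’s the kind of check that’s impossible from outside when the data are secret.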

First best is running things yourself. If the data and code are available for replication and extension, outright fraud is impossible to maintain, and fragile results eventually become commonly known to be fragile. Boing Boing recently ran a very nice two-part summary of why we see so many contradictory results on guns and crime: all the specifications are fragile. On some questions, the science isn’t strong enough to give much policy direction.
When the data is not publicly available, we really have to trust the people running the specifications. There can be good reasons for keeping data private. Most self-interestedly, if the researchers did a ton of work putting together the dataset, they might well want to get a few papers out of it before letting everyone else pile on. It’s not an edifying reason, but even where data collection involves public funding, it doesn’t seem unreasonable to give the survey team the first few kicks at the can. On the other hand, if the data cannot be shared without violating the privacy of the individuals who answered the surveys, and if there aren’t clever ways of anonymising it so that it cannot be used to identify individuals, that’s a pretty good reason to be stingy with the data.
But that there can be good reasons for secret data isn’t sufficient reason to trust the consequent analysis. And so we come to Dunedin vs Rogeberg.
The Dunedin Longitudinal Survey is secret data, likely for the good reason of wishing to maintain the anonymity of survey respondents. We really have to trust the people running regressions on secret data. The Dunedin group recently found that early marijuana use predicts IQ decline. Ole Rogeberg wondered whether the results were due to cohort selection effects: the kids most likely to try marijuana early could well have had different outcomes even had marijuana not existed. Rogeberg provided the Dunedin group with a list of tests that might sort things out. He summarises things here:

I’ll start with a short recap: Researchers published article august 2012 arguing that adolescent-onset cannabis smoking harms adolescent brains and causes IQ to decline. I responded with an article available here arguing that their methods were insufficient to establish a causal link, and that non-cognitive traits (ambition, self-control, personality, interests etc) would influence risks of adolescent-onset cannabis use while also potentially altering IQs by influencing your education, occupation, choice of peers etc. For various reasons, I argued that this could show up in their data as IQ-trends that differed by socioeconomic status (SES), and suggested a number of analyses that would help clarify whether their effect was biased due to confounding and selection-effects. In a reply this week (gated, I think), the researchers show that there is no systematic IQ-trend difference across three SES groups they’ve constructed. However, as I note in my reply (available here), they still fail to tell us how different the groups of cannabis users (never users, adolescent-onset users with long history of dependence etc) were on other dimensions, and they still fail to control for non-cognitive factors and early childhood experiences in any of the ways I proposed. In fact, none of the data or analyses that my article asked for have been provided, and the researchers conclude with a puzzling claim that randomized clinical trials only show “potential” effects while observational studies are needed to show “whether cannabis actually is impairing cognition in the real world and how much.”

I’ll be interested in seeing the Dunedin group’s reply when it comes out. Rogeberg points to evidence the Dunedin group themselves have provided showing reasonable cohort differences between users and non-users. It’s not implausible that these kinds of differences are responsible for at least some of their measured marijuana effect. It would be simplest if Dunedin sent Rogeberg the data and had him sign a confidentiality agreement; it’s not particularly plausible that a Norwegian labour econometrician wants the data to identify NZ individuals. But if they’re not willing to do that, they should at least be willing to run his code on their data.
I’ll be downgrading my trust in all the Dunedin Longitudinal results if they don’t handle this well; secret data requires trust.

There is a way around all of these kinds of problems, though.

The General Social Survey in the US would have almost as many privacy concerns as the Dunedin survey. And yet they’re able to make an online resource available that lets anybody with a web browser run analyses on the data, for free. I regularly set assignments in my undergrad public choice class where students without much in the way of metrics background have to go and muck about in the data.

As I understand things, the Health Research Council funds the Dunedin surveys. There’s a worldwide movement toward open data in government-funded projects. HRC could fund a Dunedin Longitudinal equivalent of the GSS browser analytics. Anything personally identifying, like census meshblock, could be culled. Nobody would see the individual observations. And any cross-tab that would reduce to too small a number of observations could return nulls. But folks with concerns about Dunedin studies could do first-cut checks without access to the bits that might cause legitimate privacy worries. Somebody like Rogeberg should be able to run a t-test on whether those who later go on to report marijuana use differed on other important variables before they started consuming.
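Both pieces of that, the small-cell suppression and the baseline comparison, are simple to implement. A rough sketch, with entirely made-up numbers standing in for the secret data (the group sizes, means, and score here are invented for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented stand-in data: a baseline score for respondents who never
# report marijuana use vs those who later report adolescent-onset use.
never_users = rng.normal(loc=100, scale=15, size=400)
later_users = rng.normal(loc=95, scale=15, size=120)

def welch_t(a, b):
    """Welch's t statistic: compare two group means without assuming equal variances."""
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

def suppress_small_cell(count, threshold=5):
    """Return None instead of any cell count small enough to identify respondents."""
    return count if count >= threshold else None

print(f"Welch t on baseline score: {welch_t(never_users, later_users):.2f}")
print(suppress_small_cell(3))    # None: cell too small to release
print(suppress_small_cell(42))   # 42: safe to release
```

An online analytics front-end would run this sort of thing server-side and return only the test statistic and the suppressed cross-tabs, never the individual observations.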

We could all trust Dunedin results more if we could check things like this. And it shouldn’t be hard to set up, either: see how SDA coded things for the GSS, put in the Dunedin data instead, then host it. If HRC did this, one small bit of funding could have a whole ton of researchers publishing useful studies based on the already-funded survey, instead of HRC having to fund the Dunedin group for every new regression. Let the Dunedin group keep each new wave of the survey under wraps for a couple of years before releasing it to the online analytics, so they get a fair kick at the can. After that, why not open things up?