Trusting Secret Data: Dunedin edition
Unless you’ve run the regressions yourself, it’s often hard to trust empirical results. A lot of findings are fragile: small changes in specification, whether shifting the date range or adding seemingly irrelevant variables, can flip results. And endogeneity always makes causal inference hard in the social sciences.
I’ll start with a short recap. Researchers published an article in August 2012 arguing that adolescent-onset cannabis smoking harms adolescent brains and causes IQ to decline. I responded with an article, available here, arguing that their methods were insufficient to establish a causal link, and that non-cognitive traits (ambition, self-control, personality, interests, etc.) could influence the risk of adolescent-onset cannabis use while also altering IQ by shaping education, occupation, choice of peers, and so on. For various reasons, I argued that this could show up in their data as IQ trends that differed by socioeconomic status (SES), and I suggested a number of analyses that would help clarify whether their effect was biased by confounding and selection effects.

In a reply this week (gated, I think), the researchers show that there is no systematic difference in IQ trends across the three SES groups they constructed. However, as I note in my reply (available here), they still fail to tell us how different the groups of cannabis users (never-users, adolescent-onset users with a long history of dependence, etc.) were on other dimensions, and they still fail to control for non-cognitive factors and early childhood experiences in any of the ways I proposed. In fact, none of the data or analyses my article asked for have been provided, and the researchers conclude with a puzzling claim that randomized clinical trials only show “potential” effects while observational studies are needed to show “whether cannabis actually is impairing cognition in the real world and how much.”
There is a way around all of these problems, though.
The General Social Survey in the US would raise almost as many privacy concerns as the Dunedin survey. And yet it offers an online resource letting anybody with a web browser run analyses on the data, for free. I regularly set assignments in my undergrad public choice class where students without much in the way of a metrics background have to go and muck about in the data.
As I understand things, the Health Research Council funds the Dunedin surveys. There’s a worldwide movement toward open data in government-funded projects. HRC could fund a Dunedin Longitudinal equivalent of the GSS browser analytics. Anything personally identifying, like census meshblock, could be culled. Nobody would see the individual observations. And any cross-tab that would reduce to too few observations could return nulls. But folks with concerns about the Dunedin studies could run first-cut checks without ever touching the bits that raise legitimate privacy worries. Somebody like Rogeberg should be able to run a t-test on whether those who later went on to report marijuana use already differed on other important variables before they started consuming; a rough sketch of what that could look like follows below.
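For concreteness, here is a minimal sketch of the two pieces just described, small-cell suppression in cross-tabs and a baseline t-test, written in Python with pandas and scipy. The column names (later_user, iq_age13, ses_group), the threshold of 10, and the toy data are hypothetical stand-ins of mine, not anything from the actual Dunedin data.

```python
# Minimal sketch: cross-tabs that return nulls for too-small cells, plus
# the kind of first-cut baseline t-test described above. All names and
# numbers here are hypothetical stand-ins, not Dunedin variables.
import pandas as pd
from scipy import stats

MIN_CELL = 10  # suppress any cross-tab cell with fewer observations than this


def safe_crosstab(df: pd.DataFrame, row: str, col: str) -> pd.DataFrame:
    """Cross-tabulate two variables; cells below MIN_CELL come back as nulls."""
    tab = pd.crosstab(df[row], df[col])
    return tab.mask(tab < MIN_CELL)  # small cells become NaN, never shown


def baseline_t_test(df: pd.DataFrame, group: str, outcome: str):
    """Welch t-test: did later users differ on `outcome` before onset?"""
    users = df.loc[df[group] == 1, outcome].dropna()
    never = df.loc[df[group] == 0, outcome].dropna()
    return stats.ttest_ind(users, never, equal_var=False)


# Made-up example data, standing in for (say) age-13 IQ by later use status.
df = pd.DataFrame({
    "later_user": [0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
    "iq_age13":   [102, 98, 110, 95, 99, 104, 97, 108, 101, 96, 93, 105],
    "ses_group":  ["low", "mid", "high"] * 4,
})
print(safe_crosstab(df, "later_user", "ses_group"))
print(baseline_t_test(df, "later_user", "iq_age13"))
```

In a hosted service the suppression threshold would of course be enforced server-side, so raw cell counts and individual observations never leave the host; the user only ever sees the suppressed table and the test statistics.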
We could all trust Dunedin results more if we could check things like this. And it shouldn’t be hard to set up either: see how SDA coded things for the GSS, drop in the Dunedin data instead, then host it. If HRC did this, one small bit of funding could have a whole ton of researchers publishing useful studies based on the already-funded survey, instead of HRC having to fund the Dunedin group for every new regression. Let the Dunedin group keep each new wave of the survey under wraps for a couple of years before releasing it to the online analytics, so they get a fair kick at the can. After that, why not open things up?