The recent report that machine learning is “causing a science crisis” – giving misleading results from data analysis – brought to mind something my PhD advisor said over 20 years ago.
“The bad news is all the models are wrong. The good news is the journal editors don’t know it yet!”
This was in the early days of programmes to build evolutionary (or phylogenetic) trees. Lots of DNA sequence data were being produced, and geneticists needed methods to make sense of them. Happily, mathematicians and mathematically-inclined biologists were creating models that could do this. The problem was that the outputs were, initially, accepted uncritically, without an understanding of the assumptions and limitations of the data and the models being used.
It’s an age-old problem. Give someone a hammer for the first time, and everything starts to look like a nail.
We saw the same issue when high-throughput sequencing and gene expression studies became common, and a new range of models was, and still is, being developed to interpret the data. I’m sure it’s the same in other branches of science too.
The application of machine learning is, in one sense, just more of the same. It’s part of the trend of the growing complexity of scientific research and the increasing data density of scientific papers.
An article in Wired magazine a decade ago suggested the scientific method was becoming obsolete because of the “data deluge”. The idea was that, with enough data, the “right answer” would naturally emerge. I thought it was stupid at the time, and I still do.
But there is a risk that the scientific process will be undermined by an abundance of data and algorithms if it doesn’t retain its critical, hypothesis-driven practice. It is as important as ever for scientists to ask about the assumptions of models, and their uncertainties.
Trust in, and the reliability of, evidence are going through hard times. And there is an expectation that more data and more powerful analytical methods will be applied to many policy and strategic questions. Use of poor-quality data and poor models has implications beyond science, and will in turn influence, fairly or not, how science in general is viewed.
Somewhat ironically, DARPA is funding research to see whether automated methods can help assess the level of confidence in particular research findings. There’s a risk that we’ll end up with layers of algorithms assessing each other, when all we need is to keep the old spirit of science alive by asking a few good critical questions.