The story your ancestors didn’t write

By Peter Dearden 31/08/2012

by SM Morgan


I love books.  I’m just that kind of person.  I also love genetics, so it is convenient (or deliberate, depending on your world view) that I am, or soon will be, a professional research geneticist.  Earlier this month a research group split between Baltimore and Boston published a brief paper detailing their method for storing large amounts of data in DNA form – specifically, a book encoded in DNA.

The research was designed as an effort towards solving the “Big Data” problem – the term affectionately given to the problem facing a society which, now that the ability has presented itself, wants to save everything.  (Those file copies of assignments you did at uni seven years ago, which you still have saved on your hard drive, for example.)  For big businesses like Google, the problem is vastly exacerbated.  Having access to smaller and smaller means of data storage makes saving bigger and bigger amounts of data possible.  (Let’s not debate the self-perpetuating cycle for the moment.)

The researchers, from the departments of Genetics and Biomedical Engineering, worked on the proposal of using DNA as a format for large-scale data storage.  DNA is, as you learnt at high school, the means through which living organisms store the data required for building and maintaining a body through its lifetime.  This information is encoded in four nucleotide bases: adenine, thymine, cytosine and guanine, which pair up A-T, C-G to form a ladder or helix shape, which in turn can be packaged up to fit inside the nucleus of a single cell.  If you are unsure of just how small a single cell is (or how completely mind-blowing our universe is), I suggest you have a look at this delightful interactive infographic.

As a proof of concept, the researchers converted a book – a draft containing 53.5 thousand words, 11 .jpg images and one JavaScript program (I would love to know what it was; why was that information not more notable?!) – into a 5.27-megabit stream of binary code.  This string of 0s (encoded as A or C) and 1s (encoded as G or T) was then synthesised into DNA in 159-nucleotide-long blocks.  Each block contained 96 nucleotides of book data, a 19-nucleotide barcode giving the ‘address’ or location of the piece, and a 22-nucleotide tag sequence at either end for amplification and sequencing of the DNA.
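The bit-to-base mapping above can be sketched in a few lines of Python.  This is purely an illustration of the scheme as described, not the researchers’ actual pipeline; the function names are my own:

```python
import random

# One bit per base: 0 -> A or C, 1 -> G or T. Having two bases per bit
# value gives the encoder freedom to avoid long runs of a single base,
# which are harder to synthesise and sequence accurately.
ZERO_BASES = "AC"
ONE_BASES = "GT"

def bits_to_dna(bits):
    """Map a string of '0'/'1' characters to a DNA sequence."""
    return "".join(
        random.choice(ZERO_BASES if b == "0" else ONE_BASES) for b in bits
    )

def dna_to_bits(dna):
    """Reverse the mapping: A/C -> 0, G/T -> 1."""
    return "".join("0" if base in ZERO_BASES else "1" for base in dna)

message = "0110100001101001"  # the ASCII bits for "hi"
strand = bits_to_dna(message)
assert dna_to_bits(strand) == message  # the round trip recovers the bits
```

Note that the reverse mapping doesn’t care which of the two bases was chosen for each bit – any A or C reads back as 0, any G or T as 1.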

The DNA strands were synthesised using what is basically a very, very small inkjet printer, on a glass slide or ‘microchip’.  Once the book was printed in this way, the researchers reversed the process to prove their ability to do so.  The ‘library’ was amplified using PCR and sequenced on an Illumina HiSeq – a brand name for a machine capable of reading the sequence of bases contained within a DNA strand and displaying it in a readable format.  Each wee piece of the book was read 3000 times, a common practice in DNA sequencing, to ensure the code was correct and the reading trustworthy.  In this particular case, the entire book was retrieved from the DNA with an error rate of 10 bit errors out of 5.27 million (so 10 tiny errors in 5.27 million digits of binary code).
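The point of reading each piece many times is that occasional sequencing errors get outvoted.  A minimal sketch of that consensus idea (my own simplified illustration, assuming the reads are already aligned – not the study’s actual error-correction method):

```python
from collections import Counter

def consensus(reads):
    """Return the most common base at each position across aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )

reads = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGAACGT",  # one read carries a single-base sequencing error
]
recovered = consensus(reads)
assert recovered == "ACGTACGT"  # the erroneous base is outvoted
```

With 3000 reads per fragment rather than three, a stray error at any one position is vanishingly unlikely to win the vote – which is how the study got away with only 10 bit errors in 5.27 million.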

The use of DNA as a data storage mechanism is favourable due to its ability to pack large amounts of information into a tiny space, to last, potentially, for millennia, and to store that data in three dimensions – you could have a cup full of information, for example, instead of a single sheet.  As the technology for synthesising and reading DNA becomes more exact, simpler and cheaper, this option for data storage becomes more and more attractive.

This experiment has actually been attempted before.  In 1988, the storage of small messages in DNA was demonstrated as a science-art collaboration, the fascinating journal article of which is actually worth a read.  The scale and specificity of the current study, however, are unique, and the complete storage of a book is delightful.  It was this aspect of the study which fuelled the media attention, and misled some readers: “Other applications of the technology they are overlooking… 1st off, the ability to upload or inject a file of data directly into a subject. Secondly, being able to extract our DNA and literally break down every experience into reading material. Imagine learning to play a violin with a simple injection of DNA or find out that you already know how because it has been recorded into your DNA and passed down for generations!”  That commenter is perhaps misunderstanding the function and reality of DNA…

However, it also induced a pithy comment from Mashable writer Peter Pachal who, when discussing the immortal language of DNA and whether future generations will retain the ability to read it (ignoring the fact that the language converted is still a modern one, and subject to change), noted that this is a premise “that assumes artificial intelligence doesn’t exterminate or replace human society, of course”.  At which point, one would assume, the ‘Big Data’ problem is no longer a pressing issue.