First find your tuatara (or how to sequence a genome)

By David Winter 25/06/2013

So, now that we’ve told you why we’re so keen on sequencing the tuatara genome, you might want to know exactly how we are going to do it. As the project goes on we will get the experts working on each stage to describe exactly what they’re up to, and how it’s going. But we also want to give you a broad overview of how the project will proceed. Here then, in five simple steps, we present a guide to sequencing a genome:

1. First, find a tuatara


Genome sequencing projects are easier if you use DNA from a single representative of the species you are studying. For tuatara, getting that representative isn’t entirely easy. Although they once lived pretty much everywhere in New Zealand, tuatara are now almost entirely restricted to offshore islands (the only tuatara living on the mainland these days are in sanctuaries or, like the fine-looking fellow above, in museums).

Quite a few of the islands that retain tuatara populations are in the rohe of Ngatiwai, who act as kaitiaki or guardians of the animals on these islands. Ngatiwai are partners in the tuatara genome project, so in 2011 representatives of the iwi, along with Dr Nicky Nelson and PhD student Lindsay Mickelson from Victoria University, collected a small blood sample from a large male tuatara on Motumuka (Lady Alice Island) in the Hen and Chickens group.

2. Put your tuatara back

It should go without saying but, as tuatara are an endangered species, none will be harmed in this project. Sequencing a genome only requires a little bit of blood, and reptiles like the tuatara require even less blood than most since their red blood cells contain DNA (unlike mammals, where red blood cells lose their DNA before they circulate). Once we had taken about two mL of blood, and about two minutes of time, from our tuatara we let him get back to his life on Motumuka.

3. Prepare your DNA

Once you have your blood sample, you need to prepare DNA from that sample for sequencing. This really is just a matter of following a recipe very carefully. By adding chemicals, heating your sample and spinning it at great speed you can break the cells in your sample down into their chemical components, and isolate DNA from that mixture. This, like almost everything you ever do in a molecular biology lab, will get you a tube with a very small volume of clear almost colourless liquid*.

That small volume of liquid will have millions of DNA molecules, each one being a long chain built from four different nucleotide ‘bases’, which we usually refer to using the abbreviated names ‘A’, ‘C’, ‘T’ and ‘G’. Your goal in sequencing the genome is to work out the order of those chemical bases in the DNA molecules.



The first step toward determining the base sequence is to smash the long chains of DNA into shorter, more manageable chunks. There are lots of different ways of doing this. For the tuatara DNA we blasted the DNA with very high frequency sound. Once your DNA fragments are sufficiently small you need to prepare them for the sequencing reaction. In this case we had to add some extra DNA bases to each of the fragments – these so-called “adapter sequences” are crucial for the next step.

4. Sequence at room temperature for two weeks

The machines that work out the base sequence of DNA molecules are incredible wonders of modern engineering. For the tuatara genome, all the steps described below are going to be controlled by an Illumina HiSeq 2000 which looks like this:



All the DNA prepared above will be loaded in one of these flow cells, which is about the size of a microscope slide.



Each of the eight lanes in those flow cells is coated with millions of very short DNA molecules, which are designed to catch on to the adapter sequences you added to your DNA fragments in Step 3. Once your DNA fragments attach themselves to the flow cell, they are copied to create a tiny population of cloned molecules. In time, millions of DNA fragments will attach to the flow cell and create clusters of identical DNA molecules:


To work out the base sequence of each fragment, the DNA is copied one more time. This time, the replication is carefully controlled so that only one base is added to each fragment at a time. The bases added at this step have fluorescent labels, with a different colour for each of the four possible bases.



At the end of each single-base step of this reaction, the flow cell is scanned by a laser and the light shining out from each cluster on the flow cell is recorded as an image. By keeping track of the colour (and therefore the base being added) at each cluster, the sequence of hundreds of millions of DNA fragments can be determined in a single run.**
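If it helps to see the bookkeeping, here is a toy sketch in Python of turning a stack of images into reads. Everything in it is an assumption made for illustration: the colour names, the colour-to-base table and the cluster labels are hypothetical, and a real sequencer has to deal with noisy images and quality scores that this ignores.

```python
# Hypothetical colour-to-base table; the dyes a real machine uses may differ.
COLOUR_TO_BASE = {"green": "A", "blue": "C", "red": "T", "yellow": "G"}


def call_bases(images):
    """Build one read per cluster from a list of per-cycle images.

    Each image is a dict mapping a cluster label to the colour seen at
    that cluster in that cycle (one cycle = one base added).
    """
    reads = {}
    for image in images:
        for cluster, colour in image.items():
            reads[cluster] = reads.get(cluster, "") + COLOUR_TO_BASE[colour]
    return reads


# Three cycles, two made-up clusters.
images = [
    {"cluster_1": "green", "cluster_2": "red"},
    {"cluster_1": "blue", "cluster_2": "red"},
    {"cluster_1": "yellow", "cluster_2": "green"},
]
reads = call_bases(images)
print(reads)
```

Each extra cycle adds one base to every read at once, which is why a single run can produce so many reads in parallel.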



5. Assemble your genome

Once your sequencing machine has finished its run you will have all the data you’ll need to build your genome. But none of it will make any sense. Because the fragments that are sequenced are so small (about 100 bases) and the genome is so big (about 5,000,000,000 bases) it’s very hard to know how to put all the fragments together.
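To get a feel for the scale, here is a back-of-the-envelope calculation in Python using the two figures above. The idea of "coverage" (how many times, on average, each base in the genome is read) and the example value of 30 are assumptions added for illustration, not numbers from the project.

```python
GENOME_SIZE = 5_000_000_000  # bases, the rough figure quoted above
READ_LENGTH = 100            # bases per read, the rough figure quoted above


def reads_needed(coverage):
    """Number of reads whose total length is `coverage` times the genome size."""
    return coverage * GENOME_SIZE // READ_LENGTH


print(reads_needed(1))   # reads needed to total one genome's worth of bases
print(reads_needed(30))  # 30 is a hypothetical coverage, chosen only as an example
```

Even a single genome's worth of bases works out to 50 million reads, and those reads land at random, so in practice you need to read the genome many times over to be confident every region has been seen.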

Imagine trying to piece together a sentence in English if all you had were these fragments and no context as to what they might mean:


'ave postulat', ' have postul', 'aped our not', 'not escaped ', 
'airing we ha', 'escaped our ', 'ave postulat', 'It has not e', 
' immediately', 't escaped ou'


If you look carefully, you might notice a few of those fragments seem to overlap with each other. By linking up the overlapping segments, you can reconstruct some of the words in the original sentence. In this example we can reconstruct two chunks of the original sentence:

airing we ha
          have postul
           ave postulat
           ave postulat
It has not e
       not escaped
         t escaped ou
           escaped our
              aped our not
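The linking-up described above can be sketched in a few lines of Python. This is a minimal "greedy" approach written for this example only: it repeatedly joins the two pieces that share the longest overlap. The function names and the minimum overlap of three characters are assumptions, and this is not the method the real assembly software uses.

```python
def overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of pieces with the longest overlap."""
    pieces = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_pair = 0, None
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    k = overlap(left, right)
                    if k > best_len:
                        best_len, best_pair = k, (i, j)
        if best_len < min_overlap:
            return pieces  # nothing left that overlaps convincingly
        i, j = best_pair
        merged = pieces[i] + pieces[j][best_len:]
        pieces = [p for n, p in enumerate(pieces) if n not in (i, j)] + [merged]


fragments = [
    'ave postulat', ' have postul', 'aped our not', 'not escaped ',
    'airing we ha', 'escaped our ', 'ave postulat', 'It has not e',
    ' immediately', 't escaped ou',
]
chunks = greedy_assemble(fragments)
for chunk in chunks:
    print(repr(chunk))
```

Run on the ten fragments above, this gives the two chunks shown in the diagram, 'It has not escaped our not' and 'airing we have postulat', plus ' immediately', which overlaps nothing and is left on its own. Notice that 'aped our not' also ends with the same three letters that 'not escaped ' starts with; joining the longest overlaps first is what keeps that coincidence from producing a wrong join here, and repeats like this are one reason real genomes are hard to assemble.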

Now you have a new problem. You have two fragments of the original sentence, but no idea how they relate to each other. This is a common problem in assembling genomes too. One way to get around this problem is to sequence some longer DNA fragments. Although we can’t sequence all the way across these larger fragments, we can sequence both ends of them and, if we prepare them carefully, we can know how long they are. Combining these bits of information can help us join up unconnected sections of DNA. Here are the two sentence-chunks connected by some of these “paired-end” fragments:

         ot escaped o---------------the specific
            escaped our ---------------specific pa
                       notice that ---------------iring we hav
                                    the specific---------------e postulated

With enough short and long fragments you should be able to cover the whole genome (or sentence!) multiple times and reconstruct what you started out with. In our example, it’s the line with which Watson and Crick famously described the importance of their discovery of the structure of DNA:

It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.

Of course, assembling a genome is much harder than putting together a sentence. You will have millions of reads, each made of only four bases, and there will be no sentence structure, syntax, or even words to provide clues as to how they might go together. Because assembling a genome is such a hard task, lots of different software packages have been developed to do just this, and the scientists building the tuatara genome plan to test different assembly software to find the one that does the best job with this dataset.

It’s important to note that, even though the software (and the scientists) that build up genomes are very good at what they do, there will always be some uncertainty in a genome sequence. The sequence you produce will be a draft, with most regions very well understood and a few others that will remain murky for some time.

Steps 6 – infinity

Congratulations. By the time you finish step 5 you will have a draft genome sequence. In many ways, this is just the start of the project. Now you will want to know what all those bases mean. Which genes do tuatara share with other reptiles, and which seem to be unique to the species? Can we work out the genetic basis of the tuatara’s unique biology? Which genes might be of interest to conservation biologists trying to manage the species?

To answer these questions you’ll need to compare the tuatara genome sequence to DNA sequences from other organisms. We will answer some of these questions as part of the tuatara genome project. More importantly, the draft sequence we produce will be available for any scientist who wants to work on any question.

*Rob Day, who prepared the sample, tells me it had a yellow-ish tinge.
**There are a couple of nice animations showing how the sequencing process works on YouTube. One from Illumina themselves and another from Aiden Flynn.
The large DNA image is a composite of “Chromosome” by Wikimedia user KES47 and “difference DNA RNA” by Sponk. Our image is provided under a CC-BY-SA license.
The image of the flow cell is courtesy of the DOE Joint Genome Institute and is CC-BY-ND-NC.
Other images produced for this post are CC-BY.

0 Responses to “First find your tuatara (or how to sequence a genome)”

  • What a charming description on how genome sequencing works! Ticks all the boxes for me. I look forward to seeing how the sequence gives us all insights into tuatara, not to mention human evolution. Well done!

  • Awesome topic! I imagine we’ll soon have little gadgets sequencing our genomes in no time, just like little gadgets of today for blood sugar!
