By Genomics Aotearoa 05/04/2019

Peter Dearden

Sequencing is an essential part of creating a genome, be it human, stick insect, kākāpō or sheep (all genomes Genomics Aotearoa is currently working on).

But what is sequencing, how does it work, and why does it take so long?  And why does it matter to us?

Sequencing is a laboratory process that works out the order of the base pairs in segments of DNA, so researchers can know where genes are located on the chromosomes.

Genomics Aotearoa is finding the best ways to use new genomic technologies to assemble the genomes of organisms important to NZ medicine, agriculture and conservation, and to turn those methods into technology ‘pipelines’. That will streamline processes and make the technology more readily available to other research projects.

Current sequencing technology means the whole genome can’t be produced in one go. Researchers, therefore, cut DNA into pieces, then they reassemble the sequences into the proper order.

A major challenge is that genomes we work with are very large, but at present, we can mostly only sequence relatively small bits of DNA.

The revolution in DNA sequencing technology since the first genome was produced has been mainly about throughput – using next generation sequencing technology we can now sequence huge numbers of DNA fragments at once, but each read covers only 100-250 base pairs. Given an average sized animal genome is 3.2 billion base pairs, there is still a great deal of sequencing to do.

So to get this done as efficiently as possible, researchers use a mix of long read and short read technology. This sequencing approach helps both the speed and the coverage quality of genome production – internationally and in New Zealand.

So how do sequencing and assembly work?

To sequence a genome, we break it up into small overlapping chunks, sequence all the chunks, then assemble the genome from all those chunks. Assembly works by comparing the sequence of each chunk with the others to find chunks that overlap (aligning).
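To make the overlap idea concrete, here is a toy sketch in Python. It is an illustration only, not the software used for real genomes: the function names, the three short example reads and the minimum overlap of three bases are all made up for this example, and it assumes error-free reads from a single strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Returns 0 if the best match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left that overlaps
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical overlapping reads taken from a 15-base toy "genome".
toy_reads = ["ACGTTAGC", "ATGGCGTA", "CGTACGTT"]
print(greedy_assemble(toy_reads))
```

With only three clean reads the pieces fit back together in one way. Comparing every read with every other read, as this sketch does, quickly becomes impractical as the number of reads grows, which is part of why real assembly is so computationally demanding.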

Of course, as you have to use overlapping fragments, and you need to be sure of the sequence of each base pair, you can’t just sequence the genome once. For a good-quality human genome you might aim for 30 x coverage, so 30 x the 6.4 billion base pairs (both parental copies of the 3.2 billion base pair genome) is 192 billion base pairs – split into 200 base pair chunks, that is 960 million chunks. We call these ‘reads’.
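The arithmetic is easy to check or to rerun with other numbers. In this small Python sketch the function name is invented for illustration, and the inputs are simply the figures quoted above.

```python
def reads_needed(genome_bp: int, coverage: int, read_bp: int) -> int:
    """Number of reads needed to sequence a genome to a given coverage."""
    total_bp = genome_bp * coverage   # total bases that must be sequenced
    return -(-total_bp // read_bp)    # ceiling division: round up to whole reads


genome_bp = 6_400_000_000  # base pairs, the figure used in the text
coverage = 30              # 30 x coverage
read_bp = 200              # base pairs per read

print(genome_bp * coverage)                        # 192000000000
print(reads_needed(genome_bp, coverage, read_bp))  # 960000000
```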

So now we need to accurately assemble a genome from 960 million pieces. This is the world’s worst jigsaw puzzle because of the number of pieces (reads), and because each read is made up of only four different bases (A, C, G and T), meaning there are very likely reads that are similar or identical. Identical reads become more of a problem when you know that bits of most organisms’ genomes are repetitive.
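A tiny Python sketch shows why identical reads are a headache. The ten-base sequence and the function name are made up for illustration. The code lists every read of a given length that occurs at more than one position, which is exactly the case where an assembler cannot tell where the read belongs.

```python
from collections import Counter


def ambiguous_reads(genome: str, read_len: int) -> dict[str, int]:
    """Reads of length `read_len` that occur at more than one position."""
    counts = Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )
    return {read: n for read, n in counts.items() if n > 1}


# Hypothetical toy sequence in which ACGT appears twice.
print(ambiguous_reads("ACGTACGTTT", 4))  # {'ACGT': 2}
```

Longer reads make such clashes rarer, because a longer stretch of sequence is less likely to be repeated exactly elsewhere.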

Twenty-five per cent of the human genome is made up of repetitive sequences. Larger genomes are even more repetitive.

Diversity is also a challenge. In the human genome, any genetic variation where our maternal genome differs from the paternal one is a problem. It’s solvable, given we get one set of alleles from each parent, so there are only two possibilities. But if we want to sequence the genome of something small and diverse, for example the invasive Argentine stem weevil, and have DNA from lots of individuals, there may be multiple options for the sequence at each site. Assembling a decent genome from this mess is impossible.

To solve this, we have a whole bunch of new technologies not available to those doing the first genome assemblies, including 10x Chromium technology (which allows us to know which reads come from a similar region of the genome), Hi-C technology (which allows us to work out which reads are close to each other in the genome), and long read technology.

About long read technology

Long read technology involves new forms of sequencers that can read long distances down one strand of DNA. There are currently two effective long read technologies:

  • Pacific Biosciences (PacBio) technology is an imaging approach that detects the incorporation of single labelled bases, one after another, into a strand of DNA being replicated.
  • Oxford Nanopore (Nanopore) technology uses tiny charged pores that a strand of DNA is drawn into, and as each base passes through the hole it changes the charge in a way that can be measured. These changes in charge are then assigned to each base and the sequence built from there.

Neither of these technologies is currently as accurate as short read technology, but they deliver considerably longer reads: over 200,000 base pairs for PacBio and up to 100,000 base pairs for Nanopore.

So if we have a genome that is large and hard to assemble, we can now use long reads to build a first approximation, and then ‘polish’ the sequence with short reads. The long reads build a good assembly, and the short reads ensure the sequence is accurate. This is computationally challenging and greedy, but having the long reads allows you to read through repetitive DNA and build a more complete, ordered genome – in effect, a more complete jigsaw puzzle.
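The ‘polishing’ step can be pictured as a vote at each position. The Python sketch below is a heavily simplified illustration and not how real polishing software works: it assumes the short reads have already been lined up against the draft (each read comes with a hypothetical start position), that the only errors are single-base substitutions, and the function name and toy sequences are invented.

```python
from collections import Counter


def polish(draft: str, aligned_reads: list[tuple[int, str]]) -> str:
    """Correct a draft sequence by majority vote of aligned short reads.

    `aligned_reads` holds (start position in draft, read sequence) pairs.
    """
    votes = [Counter() for _ in draft]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(draft):
                votes[pos][base] += 1

    polished = []
    for draft_base, counter in zip(draft, votes):
        if counter:
            top_base, top_count = counter.most_common(1)[0]
            # Only overrule the draft when a strict majority of reads agree.
            if top_count * 2 > sum(counter.values()):
                polished.append(top_base)
                continue
        polished.append(draft_base)
    return "".join(polished)


# Toy long-read draft with one wrong base (position 5 should be G).
draft = "ATGGCTTACG"
short_reads = [(0, "ATGGCG"), (3, "GCGTAC"), (4, "CGTACG")]
print(polish(draft, short_reads))  # ATGGCGTACG
```

The long-read draft supplies the order of the pieces, and the more accurate short reads outvote its errors wherever enough of them cover a position.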

An example of long and short read technology in NZ

Genomics Aotearoa and Bioprotection Research Centre staff took DNA from one single Argentine stem weevil (3.5 mm x 1.5 mm), amplified the whole genome using neat molecular biology tricks, sequenced it on a Nanopore to 30X coverage (30 x genome size), assembled that as much as possible, then polished it with short reads. This is really at the limits of this technology – so far we have a draft that includes over 75% of the genome. Although there is some way to go, the efficiency of the technologies has enabled us to progress the project significantly.

The creation of the Genomics Aotearoa platform means we are now better positioned to make use of bioinformatics tools and pipelines as they come on stream, and to see that they benefit New Zealand production, environment and wellbeing.

2 Responses to “A long read, and a tricky jigsaw puzzle”

  • Assuming you are talking about the length of the reads ‘200,000 base pairs for PacBio…’, this is a huge understatement for Nanopore sequencing. A major benefit of Nanopore sequencing is that there is no theoretical upper limit to the length of reads; the only limit comes from the process by which the DNA is extracted. This has led to the generation of ‘ultra long reads’, exemplified here:

    P.S; How did the Weevil Genome assemble?