Beginning in 1990, the Human Genome Project (HGP) delivered the sequence of the whole human genome by 2003 (Gibson and Muse, 2009). Sanger sequencing is a gold standard for sequencing DNA and was instrumental to the HGP (Gibson and Muse, 2009). Unfortunately, Sanger sequencing is slow and expensive, and over the decades other sequencing methods have been developed to reduce the time and cost of large-scale projects. Beyond time and cost, new hurdles have arisen, such as correctly assembling repeated regions of DNA and navigating very large datasets. This essay summarizes throughput, cost, and other technological and biological aspects of the next generation of sequencing technologies.
Throughput and Cost
There are three generations of sequencing techniques: the first, second and third. The first generation is the Sanger sequencing method, which uses ddNTPs as chain terminators so that labeled fragments can be separated by gel electrophoresis. Dye-terminator chemistry coupled with capillary electrophoresis later gave higher throughput than the original 1977 Sanger technique (Pavlopoulos et al., 2013). The HGP cost about $3 billion using this type of method. Second-generation sequencing produces up to millions of short reads at high throughput, which enables clinical applications because the read depth is much greater than in the previous generation. Second-generation sequencing can reach 30x coverage or more and is widely used today to perform complex sequencing investigations (Pavlopoulos et al., 2013). The cost of sequencing a whole genome with this technology has fallen drastically, from millions of dollars to thousands of dollars. The third generation, or next generation, of sequencing would allow whole human genome sequencing in a matter of hours at low cost. Some of the players in this space are Helicos BioSciences, Pacific Biosciences, Oxford Nanopore and Complete Genomics (Pavlopoulos et al., 2013). The cost of sequencing a whole genome is estimated to be $1,000 or less. Even though the next generation of sequencing reduces the time and cost of generating data, clinical findings still tend to be confirmed by Sanger sequencing because these next-generation methods have difficulty with copy number variants (Ghazani, 2017; Zhao et al., 2013). These next-generation systems build on advances from second-generation systems in DNA-seq for determining unknown genomes or analyzing variation, RNA-seq for analyzing gene expression, and ChIP-seq for locating protein binding sites on DNA (Pavlopoulos et al., 2013).
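The 30x coverage figure can be made concrete with a little arithmetic: average coverage is the total number of sequenced bases divided by the genome size. The sketch below is only an illustration. The function names are hypothetical, and the genome size (about 3.1 billion base pairs) and the read length (150 bases) are assumptions chosen for the example, not values taken from the sources cited here.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose combined length reaches the target coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


# Assumed values for illustration only.
HUMAN_GENOME_BP = 3_100_000_000  # approximate haploid human genome size
READ_LENGTH_BP = 150             # hypothetical short-read length

n_reads = reads_needed(30, READ_LENGTH_BP, HUMAN_GENOME_BP)
print(n_reads)                                                   # 620000000
print(mean_coverage(n_reads, READ_LENGTH_BP, HUMAN_GENOME_BP))   # 30.0
```

Under these assumptions, 30x coverage of a human genome takes on the order of hundreds of millions of short reads, which is why throughput matters as much as per-base cost.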
Faster and cheaper sequencing may therefore come at the expense of accuracy, particularly for copy number variations, because of the inherent statistical methods used for assembly.
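To illustrate why copy number variants depend on statistical inference rather than direct observation, the following is a deliberately simplified sketch of the read-depth idea, one of the families of approaches surveyed by Zhao et al. (2013). It is not the method of any particular tool. The window depths, the `gain` and `loss` thresholds, and the function names are all hypothetical and chosen only for the example.

```python
import math
from statistics import median
from typing import List


def log2_depth_ratios(window_depths: List[float]) -> List[float]:
    """Log2 ratio of each window's read depth to the median depth of all windows."""
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    # A window with zero depth gets -inf, which is always called a loss below.
    return [math.log2(d / baseline) if d > 0 else float("-inf")
            for d in window_depths]


def call_windows(window_depths: List[float],
                 gain: float = 0.4,
                 loss: float = -0.6) -> List[str]:
    """Label each window as 'gain', 'loss' or 'neutral' using assumed thresholds."""
    calls = []
    for ratio in log2_depth_ratios(window_depths):
        if ratio >= gain:
            calls.append("gain")
        elif ratio <= loss:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Hypothetical mean read depths for nine consecutive genomic windows.
depths = [30, 31, 29, 45, 46, 30, 15, 14, 30]
print(call_windows(depths))
```

Because the calls rest on comparing noisy depth estimates against thresholds, lower or less even coverage makes false gains and losses more likely, which is consistent with the practice of confirming clinical findings by another method.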
Other Technological and Biological Aspects
Several technological aspects of next generation sequencing methods are important to consider: file formats, alignment tools, genomic browsers and visualization methods for comparative genomics. With the expansion of sequencing data, and with different software companies developing methods to use these datasets, genomic data needs to be archived in standardized formats. For example, FASTQ, SAM/BAM and VCF are standard formats for holding gigabytes or terabytes of data. Pavlopoulos et al. (2013) have shown that at least twelve different software packages can be used for predicting structural variations in sequences, some of which use proprietary input files. Some of these packages are better suited to a combination of single-end reads, a reference genome, insertion detection and deletion detection, whereas others perform better with paired-end reads and translocations across and within chromosomes (Pavlopoulos et al., 2013). With all these different tools, researchers can become bewildered when trying to work out which one suits a specific structural variation prediction task. Genomic browsers display sequences and annotations within a graphical user interface. Most of these browsers share common features that improve the user’s experience, but navigating large sequences remains a challenge even though their search algorithms are becoming more efficient. This raises the need to visualize the genome sequence better, not just as linear sequence data but as a 3-dimensional representation of the DNA itself. Virtual Reality (VR) and Augmented Reality (AR) might help researchers navigate large datasets and could offer flythrough views of the sequence structure in three dimensions. VR and AR systems have been used in biological research, for example to visualize protein shape, but they are far from being commercially viable.
Until VR and AR systems mature, the focus has been on algorithms that handle the alignment of unfinished genomes, intra- and inter-chromosomal rearrangements, and functional element identification for comparative genomics (Pavlopoulos et al., 2013).
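The value of standardized file formats is easier to see with a concrete case. The following is a minimal sketch of reading FASTQ, assuming the common layout of four lines per record (header, sequence, separator, quality) with no line wrapping, and assuming Phred+33 quality encoding. The names `FastqRecord`, `read_fastq` and `phred_scores` are hypothetical, and production work would normally rely on an established library rather than a hand-written parser.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped, four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\r\n")
        if not header:          # skip blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Two made-up reads, held in memory instead of a file for the example.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n")
for record in read_fastq(example):
    print(record.name, record.sequence, phred_scores(record.quality))
```

Because every FASTQ producer and consumer agrees on this record layout, the same few lines of parsing work regardless of which instrument or software generated the reads. That interoperability is what proprietary input formats give up.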
References
Ghazani, A. A. (2017). Introduction to Genomics [PowerPoint slides]. Retrieved from https://canvas.harvard.edu/courses/35084/modules
Gibson, G., & Muse, S. V. (2009). A Primer of Genome Science (3rd ed.). Sunderland, MA: Sinauer Associates Inc.
Pavlopoulos, G. A., Oulas, A., Iacucci, E., Sifrim, A., Moreau, Y., Schneider, R., … Iliopoulos, I. (2013). Unraveling genomic variation from next generation sequencing data. BioData Mining, 6(13). http://www.biodatamining.org/content/6/1/13.
Zhao, M., Wang, Q., Wang, Q., Jia, P., & Zhao, Z. (2013). Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics, 14(Suppl 11), S1. http://www.biomedcentral.com/1471-2105/14/S11/S1