Advances in whole genome sequencing have sparked a revolution in digital biology.
Genomics programs around the world are gaining momentum as the cost of next-generation high-throughput sequencing has fallen.
Whether for sequencing ICU patients with rare diseases or for population-scale genetic research, whole genome sequencing is becoming a fundamental step in clinical workflows and drug discovery.
But genome sequencing is only the first step. Analysis of genome sequencing data requires accelerated computing power, data science, and AI to read and understand the genome. With the end of Moore’s Law, the observation that the number of transistors in an integrated circuit doubles every two years, new computational approaches are needed to reduce the cost of data analysis, increase throughput and accuracy of reads, and ultimately unlock the full potential of the human genome.
An explosion of bioinformatics data
Sequencing the entire genome of an individual generates around 100 gigabytes of raw data. That figure more than doubles once the genome is analyzed using complex algorithms and applications like deep learning and natural language processing.
As the cost of sequencing a human genome continues to decrease, the amount of sequencing data is increasing exponentially.
An estimated 40 exabytes will be required to store all human genomic data by 2025. For reference, that’s 8 times more storage than would be required to store every word spoken in history.
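To put those figures side by side, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1 gigabyte = 10^9 bytes, 1 exabyte = 10^18 bytes) and treats "more than doubles" as exactly 2x, so the second number is an upper bound. The function name is illustrative only.

```python
# Rough capacity arithmetic using the figures quoted above.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 EB = 10**18 bytes).
GIGABYTE = 10**9
EXABYTE = 10**18


def genomes_per_capacity(capacity_bytes: int, bytes_per_genome: int) -> int:
    """Number of whole genomes that fit in a given storage capacity."""
    return capacity_bytes // bytes_per_genome


capacity = 40 * EXABYTE

# Raw data only: about 100 GB per genome.
raw_only = genomes_per_capacity(capacity, 100 * GIGABYTE)

# Raw plus analysis output: assumed here to be exactly 2x (the article says "more than").
with_analysis = genomes_per_capacity(capacity, 200 * GIGABYTE)

print(f"Raw data only:       {raw_only:,} genomes")
print(f"With analysis (>=2x): at most {with_analysis:,} genomes")
```

Under these assumptions, 40 exabytes corresponds to raw data for a few hundred million genomes, and roughly half that once analysis output is included.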
Many genomics pipelines struggle to keep up with the massive amounts of raw data being generated.
Accelerated genome sequencing analysis workflows
Sequence analysis is complicated and computationally intensive as numerous steps are required to identify genetic variants in a human genome.
Deep learning is becoming important for base calling directly within the genomic instrument, using recurrent neural network (RNN) and convolutional neural network (CNN) based models. Neural networks interpret image and signal data generated by instruments and infer the 3 billion nucleotide pairs of the human genome. This improves read accuracy and brings base calling closer to real time, further accelerating the entire genomics workflow, from sample to variant call format to final report.
For secondary genome analysis, alignment technologies use a reference genome to help assemble a genome after sequencing DNA fragments.
BWA-MEM, a leading alignment algorithm, helps researchers quickly match DNA sequence reads to a reference genome. STAR is another gold standard alignment algorithm used for RNA-seq data that provides accurate, ultra-fast alignment to better understand gene expression.
The Smith-Waterman dynamic programming algorithm is also commonly used for alignment, a step that’s accelerated 35x on the NVIDIA H100 Tensor Core GPU, which includes a dynamic programming accelerator.
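For readers unfamiliar with the algorithm, the following is a minimal pure-Python sketch of Smith-Waterman local alignment. It is a teaching example, not how BWA-MEM, Clara Parabricks, or the H100 implement it. The scoring values (match +2, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration.

```python
def smith_waterman(query, reference, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_query, aligned_reference) for one best local alignment.

    Scoring parameters are illustrative defaults with a simple linear gap penalty.
    """
    rows, cols = len(query) + 1, len(reference) + 1
    # score[i][j] = best score of a local alignment ending at query[i-1], reference[j-1]
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming matrix; cells never drop below zero,
    # which is what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == reference[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align the two bases
                score[i - 1][j] + gap,      # gap in the reference
                score[i][j - 1] + gap,      # gap in the query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    aligned_q, aligned_r = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == reference[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            aligned_q.append(query[i - 1])
            aligned_r.append(reference[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_r.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_r.append(reference[j - 1])
            j -= 1

    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_r))


if __name__ == "__main__":
    print(smith_waterman("ACGT", "TTACGTTT"))
```

Each cell depends only on its three neighbors, so the matrix takes time proportional to the product of the two sequence lengths. Because cells along the same anti-diagonal are independent of one another, the computation lends itself to parallel hardware.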
Uncover genetic variants
One of the most critical phases of sequencing projects is variant calling, where researchers identify differences between a patient’s sample and the reference genome. This helps clinicians determine what genetic disease a critically ill patient might have, or helps researchers screen a population to discover new drug targets. These variants can be single nucleotide changes, small insertions and deletions, or complex rearrangements.
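As a toy illustration of those three variant types, the sketch below walks a single pairwise alignment and reports the differences. It assumes the sample has already been aligned to the reference, with `-` marking gaps. Production callers such as GATK HaplotypeCaller or DeepVariant work very differently, weighing evidence across many overlapping reads. The function name and tuple layout are hypothetical.

```python
def call_variants(aligned_ref, aligned_sample):
    """List differences between two gapped, equal-length aligned sequences.

    Returns tuples of (kind, reference_position, ref_bases, sample_bases) with
    1-based reference positions. An insertion is reported at the position of
    the reference base immediately before it.
    """
    if len(aligned_ref) != len(aligned_sample):
        raise ValueError("aligned sequences must have the same length")

    variants = []
    ref_pos = 0  # number of reference bases consumed so far
    i, n = 0, len(aligned_ref)

    while i < n:
        if aligned_ref[i] == "-":
            # Gap in the reference: the sample carries extra bases (insertion).
            start = i
            while i < n and aligned_ref[i] == "-":
                i += 1
            variants.append(("insertion", ref_pos, "", aligned_sample[start:i]))
        elif aligned_sample[i] == "-":
            # Gap in the sample: reference bases are missing (deletion).
            start = i
            while i < n and aligned_sample[i] == "-" and aligned_ref[i] != "-":
                i += 1
            deleted = aligned_ref[start:i]
            variants.append(("deletion", ref_pos + 1, deleted, ""))
            ref_pos += len(deleted)
        else:
            ref_pos += 1
            if aligned_ref[i] != aligned_sample[i]:
                # Single nucleotide variant (SNV).
                variants.append(("snv", ref_pos, aligned_ref[i], aligned_sample[i]))
            i += 1

    return variants


if __name__ == "__main__":
    for variant in call_variants("ACGT-ACGTA", "ACCTGAC--A"):
        print(variant)
```

Complex rearrangements cannot be recovered from a single pairwise alignment like this, which is one reason real pipelines rely on statistical or deep learning models.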
GPU-optimized and accelerated callers like the Broad Institute’s GATK – a genome analysis toolkit for germline variant calling – accelerate analysis. To help researchers remove false positives in GATK results, NVIDIA has partnered with the Broad Institute to introduce NVScoreVariants, a deep learning tool for filtering variants using CNNs.
Deep learning-based variant callers like Google’s DeepVariant increase the accuracy of calls without requiring a separate filtering step. DeepVariant uses a CNN architecture to call variants. It can be retrained on the output of any genomic platform, fine-tuning it for improved accuracy.
Secondary analysis software in the NVIDIA Clara Parabricks tool suite has accelerated these variant callers by up to 80x. For example, the runtime of Germline HaplotypeCaller is reduced from 16 hours in a CPU-based environment to less than five minutes with GPU-accelerated Clara Parabricks.
Accelerating the next wave of genomics
NVIDIA is helping enable the next wave of genomics by supporting both short- and long-read sequencing platforms with accelerated AI base calling and variant calling. Industry leaders and startups partner with NVIDIA to push the frontiers of whole genome sequencing.
For example, biotech company PacBio recently announced the Revio system, a new long-read sequencing system powered by NVIDIA Tensor Core GPUs. Enabled by a 20x increase in processing power over previous systems, Revio is designed to sequence human genomes at scale with high-precision long reads for under $1,000.
Oxford Nanopore Technologies offers the only single technology capable of real-time sequencing of DNA or RNA fragments of any length. These traits allow for the rapid discovery of greater genetic variation. Seattle Children’s Hospital recently used the PromethION high-throughput nanopore sequencing instrument to understand a genetic disorder in a newborn’s first hours of life.
Ultima Genomics offers high-throughput whole genome sequencing for as little as $100 per sample, and Singular Genomics’ G4 is the most powerful benchtop system.
At NVIDIA GTC, a free online AI conference March 20-23, speakers from PacBio, Oxford Nanopore, Genomics England, KAUST, Stanford, Argonne National Laboratory and other leading institutions will share the latest AI advances in genome sequencing and analysis, and present genomics large language models for understanding gene expression.
The conference will feature a keynote address from NVIDIA Founder and CEO Jensen Huang on Tuesday, March 21 at 8:00 am PT.
NVIDIA Clara Parabricks is free for students and researchers. Get started today or try a free hands-on lab to see the toolkit in action.