DNA Sequencing Explained: Reading the Genetic Code

Updated May 2026
DNA sequencing is the process of determining the precise order of nucleotide bases (adenine, thymine, guanine, cytosine) in a DNA molecule. Since Frederick Sanger developed the first practical sequencing method in 1977, technology has advanced through multiple generations, with modern platforms capable of reading an entire human genome in hours for under 200 dollars. Sequencing underpins virtually all modern genetics research, clinical diagnostics, forensic identification, evolutionary biology, and agricultural science, making it one of the most transformative technologies in the history of biology.

Sanger Sequencing: The First Generation

Sanger sequencing (the chain termination method) was the dominant sequencing technology for three decades and remains in use for targeted applications today. Developed by Frederick Sanger in 1977 (earning him his second Nobel Prize in Chemistry), the method exploits modified nucleotides called dideoxynucleotides (ddNTPs) that lack the 3-prime hydroxyl group needed to extend a DNA chain. When a dideoxynucleotide is incorporated during DNA synthesis, it terminates the growing strand because no further nucleotides can be added.

In a Sanger reaction, many copies of the target DNA are simultaneously synthesized from a primer, with a small proportion of dideoxynucleotides mixed among normal nucleotides. Termination occurs randomly at every position in the sequence, producing a collection of fragments that differ in length by exactly one nucleotide. Each of the four dideoxynucleotides (ddATP, ddTTP, ddGTP, ddCTP) carries a different fluorescent label, so the identity of the terminal base can be determined by its color. Capillary electrophoresis separates these fragments by size, and a detector reads the fluorescent colors as fragments pass, producing a chromatogram that represents the sequence.
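The chain-termination logic above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the strand, the ddNTP fraction, and the number of copies are hypothetical values, and real chemistry (unequal incorporation rates, signal noise) is ignored.

```python
import random

def sanger_fragments(strand, ddntp_fraction=0.05, copies=20000, seed=0):
    """Simulate chain termination on many copies of one synthesized strand.

    Returns the set of (fragment_length, terminal_base) pairs observed.
    """
    rng = random.Random(seed)
    fragments = set()
    for _ in range(copies):
        for length, base in enumerate(strand, start=1):
            # With a small probability the incorporated nucleotide is a ddNTP,
            # which ends this copy at the current length.
            if rng.random() < ddntp_fraction:
                fragments.add((length, base))
                break
    return fragments

def read_by_size(fragments):
    """Sort fragments by length, as electrophoresis does, and read each label."""
    return "".join(base for _, base in sorted(fragments))

strand = "GATTACAGGCTTACCGATAG"  # hypothetical newly synthesized strand
recovered = read_by_size(sanger_fragments(strand))
print(recovered == strand)
```

Because termination is random, enough copies are needed for every length to be represented at least once; the low ddNTP fraction is what lets some copies extend all the way to the end of the strand.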

Modern automated Sanger sequencing produces reads of 600 to 1,000 bases per reaction with very high accuracy (over 99.99 percent per base after quality filtering). Though largely replaced by next-generation methods for genome-scale projects, Sanger sequencing remains the gold standard for confirming specific mutations identified by other platforms, sequencing individual PCR products, verifying plasmid constructs in molecular cloning, and clinical confirmation of variants in diagnostic settings where the target region is known and limited in size.

Next-Generation Sequencing

Next-generation sequencing (NGS) platforms sequence millions to billions of DNA fragments simultaneously through massively parallel approaches, dramatically increasing throughput while reducing cost per base by orders of magnitude. The conceptual breakthrough was eliminating the need to separate individual clones into separate reactions: instead, millions of spatially separated DNA clusters are sequenced together on a single surface, with imaging capturing data from all clusters simultaneously in each cycle.

Illumina platforms, which dominate the NGS market with over 80 percent of global sequencing data production, use sequencing by synthesis (SBS). DNA fragments are attached to a glass flow cell surface via adapter sequences, amplified in place by bridge amplification into clusters of approximately 1,000 identical molecules, and sequenced one base at a time. In each cycle, fluorescently labeled nucleotides with reversible terminators are incorporated into all clusters simultaneously, the surface is imaged to record which base was added at each cluster position, and the labels and terminators are chemically removed to allow the next cycle. This process repeats for 150 to 300 cycles per read direction.

The cost of sequencing a human genome has fallen from approximately 100 million dollars in 2001 to under 200 dollars in 2026, a reduction that outpaces Moore's Law by orders of magnitude. This has democratized genomics, enabling individual research laboratories, clinical diagnostic facilities, and direct-to-consumer genetics companies to perform genome-scale sequencing that was previously possible only at large genome centers. A single modern Illumina NovaSeq X instrument produces over 16 terabases of data per run, enough for approximately 128 whole human genomes at standard 30x coverage depth.
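The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below makes two assumptions that are not in the text: Moore's Law is modeled as a halving of cost every two years, and the human genome is taken as roughly 3.1 gigabases.

```python
import math

cost_2001 = 100_000_000   # dollars per genome, from the text
cost_2026 = 200           # dollars per genome, from the text
years = 2026 - 2001

fold_reduction = cost_2001 / cost_2026

# Assumption: Moore's Law modeled as cost halving every 2 years.
moore_fold = 2 ** (years / 2)
excess_orders = math.log10(fold_reduction / moore_fold)

run_output_gb = 16_000    # 16 terabases per run, from the text
genomes_per_run = 128     # from the text
gb_per_genome = run_output_gb / genomes_per_run

genome_size_gb = 3.1      # assumption: approximate human genome size
raw_depth = gb_per_genome / genome_size_gb

print(f"cost reduction: {fold_reduction:,.0f}-fold")
print(f"Moore's Law over {years} years: {moore_fold:,.0f}-fold")
print(f"excess over Moore's Law: about {excess_orders:.1f} orders of magnitude")
print(f"data per genome: {gb_per_genome:.0f} Gb, raw depth about {raw_depth:.0f}x")
```

Under these assumptions the drop is 500,000-fold, roughly two orders of magnitude beyond a two-year halving. The 128-genome figure corresponds to 125 gigabases per genome, somewhat more than 30 times the genome size, which would be consistent with leaving a margin for reads lost to filtering, though the text does not say so.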

Ion semiconductor sequencing (Ion Torrent) detects nucleotide incorporation through pH changes rather than fluorescence. When a nucleotide is incorporated into a DNA strand, a hydrogen ion is released, slightly lowering the pH in the surrounding solution. Semiconductor sensors beneath each sequencing well detect this pH change directly, eliminating the need for expensive optical systems. This approach produces shorter reads (200 to 600 bases) but offers faster run times and lower instrument costs, making it suitable for targeted sequencing panels and smaller laboratories.
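One way to picture this detection scheme is as a series of nucleotide flows, where only one nucleotide species is supplied at a time and the size of the pH signal reflects how many bases were incorporated in that flow. The sketch below is an idealized model: the repeating flow order "TACG" and the noise-free integer signals are illustrative assumptions, not instrument specifications.

```python
def flowgram(strand, flow_order="TACG", max_flows=200):
    """Return a list of (nucleotide, incorporations) for successive flows.

    strand is the sequence being synthesized. A flow gives a signal of 0
    when the supplied nucleotide does not match the next base, and n when
    it matches a run of n identical bases (a homopolymer).
    """
    signals = []
    position = 0
    flow = 0
    while position < len(strand) and flow < max_flows:
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        while position < len(strand) and strand[position] == nucleotide:
            incorporated += 1
            position += 1
        signals.append((nucleotide, incorporated))
        flow += 1
    return signals

def call_bases(signals):
    """Rebuild the sequence from idealized flow signals."""
    return "".join(nucleotide * count for nucleotide, count in signals)

signals = flowgram("TTAGGGC")
print(signals)
print(call_bases(signals))
```

In this model a run of three G bases produces a single signal three times as large as a single incorporation, which hints at why estimating the exact length of long homopolymers from signal size is a known difficulty for this kind of chemistry.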

Long-Read Sequencing

Long-read technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies read individual DNA molecules tens of thousands to over one million bases long, without the fragmentation required by short-read platforms. These third-generation sequencing methods observe single molecules directly, providing read lengths limited primarily by the length of the input DNA rather than by the chemistry of the sequencing reaction.

PacBio uses single-molecule real-time (SMRT) sequencing, observing a DNA polymerase enzyme as it incorporates fluorescent nucleotides in real time. Each polymerase molecule is fixed at the bottom of a tiny well (zero-mode waveguide) that illuminates only the immediate vicinity of the enzyme. As each labeled nucleotide is incorporated, its fluorescent emission is recorded, identifying the base. By circularizing the DNA template and allowing the polymerase to traverse it multiple times, PacBio generates high-accuracy consensus sequences (HiFi reads of 10,000 to 25,000 bases with accuracy exceeding 99.9 percent).
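The benefit of traversing the same circular template several times can be sketched as a per-position majority vote. This is a deliberately simplified illustration, not PacBio's consensus algorithm: it assumes the passes are already aligned, have equal length, and contain only substitution errors, and the sequences are made up.

```python
from collections import Counter

def consensus(passes):
    """Majority vote at each position across equal-length, pre-aligned passes."""
    if len({len(p) for p in passes}) > 1:
        raise ValueError("passes must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )

# Five hypothetical passes over the same molecule, each with at most one error.
passes = [
    "ACGTTGCA",
    "ACGATGCA",  # error at position 3
    "ACGTTGCA",
    "TCGTTGCA",  # error at position 0
    "ACGTTGGA",  # error at position 6
]
print(consensus(passes))
```

Because independent errors rarely fall on the same position in most passes, the vote removes them, which is the intuition behind consensus reads being far more accurate than any single pass.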

Oxford Nanopore Technologies threads single-stranded DNA through protein nanopores embedded in a synthetic membrane. As the strand moves through the pore, the handful of nucleotides occupying its narrowest point at any moment partially block the flow of ionic current in a characteristic pattern. A sensor measures these current fluctuations, and computational algorithms (including neural networks) translate the current signal into base sequence. Oxford Nanopore devices range from the MinION (a portable USB device costing around 1,000 dollars) to the PromethION (a production-scale instrument), with ultra-long reads exceeding 4 million bases achieved under optimal conditions.

Long reads resolve repetitive regions, structural variants, and full-length gene isoforms that short reads cannot span. They are essential for de novo genome assembly (building a genome sequence without a reference), detecting large insertions, deletions, and inversions, reading through tandem repeat expansions that cause neurological diseases (Huntington's disease, fragile X syndrome, myotonic dystrophy), characterizing complex genomic regions like the major histocompatibility complex, and phasing variants to determine which alleles sit on the same chromosome. The telomere-to-telomere completion of the human genome in 2022 was possible only by combining long-read technologies, namely PacBio HiFi reads and Oxford Nanopore ultra-long reads.
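The point about tandem repeat expansions can be made concrete with a toy example. In the sketch below, the flanking sequences, the repeat count of 60, and the 150-base window are all hypothetical; the CAG unit is used because CAG expansions are the kind involved in Huntington's disease.

```python
import re

def longest_repeat_run(read, unit="CAG"):
    """Return the largest number of consecutive copies of unit found in read."""
    runs = re.findall(f"(?:{re.escape(unit)})+", read)
    return max((len(run) // len(unit) for run in runs), default=0)

# A hypothetical long read that spans the whole repeat and both flanks.
long_read = "GGCCTTC" + "CAG" * 60 + "CAACAGCCGCC"

# A 150-base window taken from inside the repeat, standing in for a short read.
short_window = long_read[20:170]

print(longest_repeat_run(long_read))
print(longest_repeat_run(short_window))
```

The long read contains both flanks, so the count is anchored at each end and is exact. The short window sees nothing but repeat, so it can only give a lower bound on the repeat length.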

Bioinformatics: From Raw Data to Biological Meaning

Sequencing instruments produce raw data that requires extensive computational processing before biological interpretation. Base calling algorithms convert raw signals (fluorescent images for Illumina, current traces for Nanopore) into nucleotide sequences with associated quality scores indicating confidence in each base call. Quality filtering removes low-quality reads and adapter sequences that would confuse downstream analysis.
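Quality scores are commonly expressed on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). The sketch below assumes that convention and the Phred+33 ASCII encoding used in standard FASTQ files; the filtering threshold and function names are illustrative choices rather than a recommendation.

```python
def phred_to_error(q):
    """Convert a Phred quality score to the probability the base call is wrong."""
    return 10 ** (-q / 10)

def decode_qualities(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]

def mean_error(quality_string):
    """Average per-base error probability across a read."""
    errors = [phred_to_error(q) for q in decode_qualities(quality_string)]
    return sum(errors) / len(errors)

def passes_filter(quality_string, max_mean_error=0.01):
    """Keep a read only if its expected error rate is at most the threshold."""
    return mean_error(quality_string) <= max_mean_error

print(decode_qualities("II5!"))
print(passes_filter("IIIIIIII"))
print(passes_filter("!!!!!!!!"))
```

Averaging error probabilities rather than the scores themselves is a deliberate choice here: a few very poor bases then weigh on the result as heavily as they should, instead of being masked by many high scores.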

Sequence alignment maps reads to a reference genome, determining where in the genome each fragment originated. Short-read aligners like BWA and Bowtie2 map billions of short reads to reference genomes in hours, accommodating mismatches from sequencing errors and genuine variants. Variant calling algorithms then identify positions where the sequenced individual differs from the reference, distinguishing true variants from sequencing errors based on the number of reads supporting each observation and their quality scores.
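The two steps described above, placing reads on a reference and then calling variants from read support, can be sketched in miniature. This is a toy illustration and not how BWA or Bowtie2 work internally: it tries every ungapped placement by brute force, ignores quality scores, and uses made-up sequences and thresholds.

```python
from collections import Counter, defaultdict

def align(read, reference, max_mismatches=2):
    """Return (offset, mismatches) for the best ungapped placement, or None."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best

def call_variants(reads, reference, min_depth=3, min_fraction=0.8):
    """Report (position, reference_base, observed_base) where reads agree on a change."""
    pileup = defaultdict(Counter)
    for read in reads:
        hit = align(read, reference)
        if hit is None:
            continue
        offset, _ = hit
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1
    variants = []
    for position in sorted(pileup):
        counts = pileup[position]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if base != reference[position] and depth >= min_depth and support / depth >= min_fraction:
            variants.append((position, reference[position], base))
    return variants

reference = "ACGTACGGTCAATGCCGTAGGCTA"
sample = reference[:11] + "G" + reference[12:]   # true variant at position 11
reads = [sample[i:i + 10] for i in (4, 6, 8, 10)]
reads.append("CGTTCAGTGC")  # sample[5:15] with one simulated sequencing error
print(call_variants(reads, reference))
```

The simulated sequencing error appears in only one of the reads covering its position, so it is outvoted, while the true variant is seen in every read that covers it.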

Genome assembly constructs complete genome sequences from sequencing reads when no reference genome exists, using algorithms that identify overlaps between reads and piece them together into continuous sequences (contigs). Hybrid assembly approaches combine the accuracy of short reads with the spanning power of long reads to produce highly contiguous and accurate assemblies. Annotation pipelines then identify genes, regulatory elements, and other functional features within assembled genomes.
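The overlap-and-merge idea can be sketched with a greedy assembler that repeatedly joins the pair of sequences with the longest suffix-to-prefix overlap. This is a minimal illustration with hypothetical error-free reads; production assemblers use more elaborate graph-based methods and must cope with errors and repeats.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair until no overlaps remain; return contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]

reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

The minimum overlap length guards against joining reads on short chance matches. Repeats longer than the reads defeat this kind of approach, which is one reason long reads improve contiguity.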

Clinical and Applied Applications

Clinical diagnostics uses sequencing to identify disease-causing mutations in patients with suspected genetic conditions. Whole-exome sequencing (WES) analyzes the protein-coding regions of all of the roughly 20,000 protein-coding genes simultaneously, diagnosing 25 to 40 percent of patients with previously undiagnosed rare diseases. Whole-genome sequencing (WGS) provides even more comprehensive analysis, detecting mutations in non-coding regulatory regions, structural variants, repeat expansions, and deep intronic variants missed by exome sequencing, with diagnostic yields reaching 40 to 60 percent in some patient populations.

Cancer genomics sequences tumor DNA to identify the specific driver mutations in each patient's cancer, enabling selection of targeted therapies directed against those particular molecular vulnerabilities. Comprehensive genomic profiling panels test hundreds of cancer-related genes simultaneously, identifying actionable mutations, determining tumor mutational burden for immunotherapy decisions, and detecting gene fusions that indicate sensitivity to specific inhibitors. Liquid biopsy sequences circulating tumor DNA from blood samples, enabling non-invasive monitoring of treatment response and early detection of resistance mutations without repeated tissue biopsies.
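Tumor mutational burden is usually reported as somatic mutations per megabase of sequenced territory. The sketch below shows only that normalization; the mutation count and panel size are hypothetical, and which mutations are counted varies between assays.

```python
def tumor_mutational_burden(somatic_mutations, panel_size_bases):
    """Return somatic mutations per megabase of sequenced territory."""
    if panel_size_bases <= 0:
        raise ValueError("panel_size_bases must be positive")
    return somatic_mutations * 1_000_000 / panel_size_bases

# Hypothetical example: 18 somatic mutations found across a 1.2-megabase panel.
tmb = tumor_mutational_burden(18, 1_200_000)
print(f"{tmb:.1f} mutations per megabase")
```

Normalizing by panel size is what allows results from panels of different sizes to be compared on one scale.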

Metagenomic sequencing identifies all organisms present in a sample by sequencing total DNA without prior culture or amplification of specific targets. In clinical microbiology, metagenomic sequencing can identify causative pathogens directly from patient specimens (cerebrospinal fluid, respiratory samples, blood) within hours, detecting bacteria, viruses, fungi, and parasites simultaneously, including organisms that are difficult or impossible to culture. Environmental metagenomics characterizes microbial communities in soil, water, and other ecosystems, revealing the vast diversity of unculturable organisms.

Forensic DNA analysis identifies individuals through short tandem repeat (STR) profiling, traditionally performed by sizing fragments with capillary electrophoresis and increasingly supplemented by massively parallel sequencing, which provides additional information including physical appearance predictions, biogeographic ancestry, and kinship analysis. Agricultural genomics uses sequencing for marker-assisted breeding, genomic selection, pathogen surveillance, and food safety testing.

Key Takeaway

DNA sequencing reads the base-by-base order of nucleotides in DNA molecules. Technology has advanced from Sanger sequencing (one fragment at a time) through massively parallel next-generation platforms (billions of fragments simultaneously) to single-molecule long-read methods (individual molecules of 10,000 or more bases). The cost of a human genome has dropped roughly 500,000-fold, from about 100 million dollars in 2001 to under 200 dollars in 2026, making sequencing accessible for clinical diagnosis, cancer treatment, forensics, agriculture, and fundamental biology research.