Genomes, variants, and massively parallel methods

Abstract

Background

One of the defining achievements of the early 21st century is the sequencing and alignment of more than 90% of the human genome. Of course, there is not a single human genome: individuals differ from each other by about 0.1% and from other primates by about 1%. Variation comes in many different forms, including single base changes and copy number differences in large segments of DNA. Even more challenging than sequencing the whole genome is documenting and understanding the clinical significance of human sequence variation. We are still very early in our understanding of the human genome.

Content

Beginning with a historical perspective, the structure of the human genome is described in detail, followed by comparison to other interesting species. Then, different types of genomic variation are covered, including single base changes (substitutions, deletions, insertions), copy number variations, translocations and fusions, short tandem repeats of different size and number, and larger repetitive segments, some of which can hop around the genome as transposons. The function of different genomic elements is considered, along with many different classes of RNA transcribed from the DNA. How to name all the different genes, variants, and elements is a daunting task, and accepted nomenclature is presented. Massively parallel methods for genomic analysis including microarrays and massively parallel sequencing are detailed. We end with a description of basic informatics tools that provide a pipeline from massive amounts of raw sequencing data to finished sequence, including variant annotations.

Introduction

It is easy to be carried away by the detectable peculiarities and to forget that much underlying variability is still hidden from view until some new technical device discloses the finer structure of chromosomes... Lionel Penrose, Chicago, IL. Third ISCN Consensus Conference, 1966

In 1966 it was recognized that the effort to characterize human cytogenetic variation was only the tip of the iceberg in terms of our understanding of genetic detail and that many more types of variation would be revealed with advancing technology. Since the time when DNA was discovered as the major molecule for genetic inheritance, there has been a need to understand how DNA variations affect growth, development, and disease. Even after over 50 years of advances in DNA technology, many types of DNA variation have yet to be identified, named, cataloged, and studied.

Human genome

The word genome signifies the collection of genes in an organism and is believed to have been coined by the German botanist Hans Winkler in the 1920s. The human genome encompasses all of the information needed for growth, development, and heredity. This information is copied in the nucleus of every cell in the body.

The determination of 46 chromosomes in humans occurred in 1956. The following years were marked by increased activity in human cytogenetics, but it soon became apparent that there was no coordination of how findings were named or classified. Beginning in 1960, a consensus meeting of laboratories (Denver Conference) established basic guidelines for naming large chromosomal variations. The findings from multiple subsequent consensus conferences were unified in a single document, “An International System for Human Cytogenetic Nomenclature (1978).”

Some of the basic concepts of chromosomes are described in ISCN 1978: autosomes are numbered from 1 to 22 in descending order of length. The sex chromosomes are named X and Y. The symbols p and q designate the short (p for “petit,” meaning small in French) and long arms of the chromosomes, respectively. A chromosome band is the part of the chromosome that is clearly distinguishable from adjacent segments, which are darker or lighter in appearance. G-bands are the bands resulting from Giemsa dye staining. In addition to describing the normal state of chromosomal features, ISCN 1978 considered the naming of chromosomal rearrangements such as inversions, deletions, and translocations.

In more recent times, the ISCN 2013 version introduced new features such as the term “hg” for “human genome build or assembly,” and a chapter titled “Microarrays,” which is devoted to naming changes identified by oligonucleotide microarrays. Of note, there is a separate consortium focused on microarrays known as ISCA (The International Standard for Cytogenomic Arrays). ISCA works to improve microarray test quality through projects such as variant databases linked to clinical data.

Throughout the 1990s, there was an international effort to sequence the human genome. The first draft was released in 2001, followed by a more complete version in 2004. The 2004 version contains 2.85 billion nucleotides (bases) and was considered 99% complete for euchromatic DNA. The overall size of the genome, including both euchromatic and heterochromatic sequence (tightly compact DNA found at centromeres and telomeres), was estimated to be 3.08 billion nucleotides. Thus the total overall genome was only 92.5% sequenced when it was first declared “essentially” complete. Within the 2.85 billion nucleotides of euchromatic DNA there were 19,438 known genes and an additional 2188 predicted genes. The total number of nucleotides encoding protein was approximately 34 million (1.2%) of the genome. This portion of the genome encoding proteins is also known as the exome. Genomic terms and definitions used in this chapter are given in Box 65.1.
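The percentages quoted above follow directly from the nucleotide counts. The short Python sketch below reproduces that arithmetic; the constant and function names are illustrative choices and do not come from the original publications.

```python
# Figures quoted in the text for the 2004 human genome release.
EUCHROMATIC_SEQUENCED = 2.85e9   # nucleotides in the 2004 version
TOTAL_GENOME = 3.08e9            # euchromatic plus heterochromatic estimate
PROTEIN_CODING = 34e6            # nucleotides encoding protein (the exome)


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


fraction_sequenced = percent(EUCHROMATIC_SEQUENCED, TOTAL_GENOME)
coding_share = percent(PROTEIN_CODING, EUCHROMATIC_SEQUENCED)

print(f"Sequenced: {fraction_sequenced:.1f}% of the estimated total genome")
print(f"Protein coding: {coding_share:.1f}% of the sequenced genome")
```

With these inputs the script prints 92.5% and 1.2%, matching the values stated in the text.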

BOX 65.1

Genomic Terms and Definitions

Adapter: Oligonucleotides that are ligated to library fragments in order to provide consensus priming sites.
Annotation: Biologic information attached to genomic sequence.
Annotation track: Optional metadata in a genome browser that allows viewing of genes, exons, SNVs, repeats, etc.
Assembly: Reconstruction of short sequence reads on a scaffold of reference DNA.
Binary alignment map (BAM): After alignment to a reference genome, the aligned data for each read produces a sequence alignment map (SAM file). The BAM file is the binary equivalent of the SAM file, and allows for efficient random access of the data.
Browser extensible data (BED): A tab delimited text file that defines the data lines in an annotation track, including the chromosome name, the starting and the ending positions.
Contig: A linear stretch of consensus sequence assembled from smaller overlapping sequence fragments.
Copy number polymorphism (CNP): A copy number variant present at more than 1% in a population.
Copy number variant (CNV): A structural variant of a large region of the genome that has been deleted or duplicated.
Coverage: The percent of target bases that were sequenced at least a given number of times.
Deletion: A DNA sequence that is missing in one sample compared to another. Deletions may be as small as one nucleotide or as large as an entire chromosome.
De novo assembly: Formation of a contig without using a reference sequence.
DNA library: A collection of DNA fragments with ligated adapters that will be sequenced.
DNA microarray: An array of microscopic DNA spots attached to a solid surface or surface within a chamber. Each DNA spot consists of a specific DNA sequence, known as a probe. Probe-target hybridization is usually detected and quantified by detection of fluorescently labeled targets to determine the relative abundance of target nucleic acid sequences.
FASTA file: A nucleotide sequence text file.
FASTQ file: A text output file of sequencing reads in a run, along with the quality scores of each position.
Fusion: A translocation, inversion, large deletion, or large duplication resulting in a hybrid gene formed from originally separate genes.
Gb: Gigabase (1,000,000,000 bases).
Indel: Originally referred to a unique class of sequence variants that included both an insertion and a deletion usually (but not always) resulting in an overall change in the number of base pairs. Today more commonly refers to either insertions or deletions or a combination thereof.
Insert: Part of the original DNA that has been fragmented before ligation to adapters.
Insertion: An extra DNA sequence that is present in one sample compared with a reference sequence.
Heteroplasmy: A mixture of more than one type of mitochondrial sequence in one cell.
Intergenic: DNA sequence between genes.
kb: Kilobase (1000 bases).
Mate-pair sequence: Sequence obtained from both ends of a DNA fragment that is typically 5000–10,000 bases long.
Mb: Megabase (1,000,000 bases).
Missense: A nucleotide substitution that changes a codon to the code for a different amino acid. Although these sequence changes are commonly referred to as missense “mutations,” this is strictly a misnomer because missense variants may be benign and cause no disease.
Mutation: A disease-causing sequence variation. Historically, the term has been interchangeable with variant to describe any change in DNA sequence regardless of relation to disease causation. For current clinical descriptions or reporting, the use of mutation is reserved for the scenario when disease causation is known. Many clinical laboratories no longer use the term “mutation” and instead favor “likely pathogenic variant” or “pathogenic variant.”
Nonsense: A nucleotide substitution that results in a stop codon, prematurely terminating the protein.
Nonsynonymous: Nucleotide substitutions that are predicted to change the coding amino acid to a different amino acid (missense) or stop codon (nonsense).
Oligonucleotide: A short single-stranded polymer of nucleic acid.
Paired-end sequence: Sequence from both ends of a DNA fragment typically hundreds of bases long.
Phred score: Estimate of the error probability for a base called in DNA sequencing. It is represented as a Q-score; the higher the number, the higher the probability of a correct call.
Plasmid: An extrachromosomal ring of double-stranded, closed DNA found in bacteria.
Polony: A microscopic colony of clonal templates used in massively parallel sequencing. A polony may be generated by PCR, bridge amplification, or isothermal amplification.
Pseudogene: A genetic element that does not code for a functional gene product, usually because of accumulated sequence variations.
Sequence alignment map (SAM file): A file generated by alignment of sequence data to a reference genome. This file type is often converted to a BAM file to save space.
Short tandem repeat (STR): A simple sequence repeat that is 1–13 bases long.
Simple sequence repeat (SSR): A sequence from 1 to 500 bases that is repeated end to end. If the repeat unit is 1–13 bases, it is a microsatellite or STR. If the repeat is 14–500 bases it is a minisatellite.
Single nucleotide polymorphism (SNP): A benign single nucleotide variant (substitution, deletion, or insertion) that occurs in a population at a frequency of at least 1%.
Single nucleotide variation (SNV): A single nucleotide variant (substitution, deletion, or insertion). SNVs may be benign or may cause disease.
Structural variation: A region of DNA greater than 1000 bases in size that is inverted, translocated, inserted, or deleted.
Synonymous variant: A nucleotide change that results in no change to the amino acid sequence. Although synonymous variants are typically considered to be benign since there is no protein coding change, there is the possibility of pathogenicity by changes in splicing, gene expression, or mRNA stability.
Transposon: A mobile genetic element that can delete and insert itself variably into the genome.
Variant call format (VCF): After aligning all reads onto a reference sequence, variants that are different from the reference genome at a given nucleotide position are stored in a text file in a specific format.
Variation: A change in DNA sequence. It may be benign or may cause disease.
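The Phred score and FASTQ entries in Box 65.1 can be made concrete with a small example. By the standard definition, a Phred quality Q corresponds to an error probability P through Q = −10 log10(P). The Python sketch below converts between the two and decodes a FASTQ quality string. It assumes the common Phred+33 ASCII encoding, which some older files do not use, and the example read is invented for illustration.

```python
import math
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into per-base Phred scores.

    The offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_line]


# Hypothetical four-line FASTQ record: name, bases, separator, qualities.
record = ["@read_1", "GATT", "+", "I5+!"]
for base, q in zip(record[1], decode_fastq_quality(record[3])):
    print(base, q, f"{phred_to_error_probability(q):.4f}")
```

A score of 30, for example, corresponds to an error probability of 1 in 1000, which is why higher Q-scores indicate more reliable base calls.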

The 2004 genome contained 341 gaps in heterochromatic regions. These regions contain DNA that is difficult to sequence (e.g., repetitive elements, GC-rich sequence) or where no clone/template could be made. Commonly used DNA sequencing technologies require a scaffold on which sequence fragments are pieced together. The first human reference sequences were assembled by the University of California at Santa Cruz (UCSC) and were numbered starting with “hg1” in May of 2000. The National Center for Biotechnology Information (NCBI) produced their own genome builds starting in December 2001 as NCBI build 28 (equivalent to hg10 from UCSC) as the genome was further refined. This led to the publicly available 2004 version of the human genome known as NCBI35/hg17. This template or reference sequence has subsequently undergone continuous improvement under the international Genome Reference Consortium (GRC), producing GRCh37/hg19. In the future, only one designation will be given, such as the currently released GRCh38.

Since the 2004 genome publication, there have been continuing efforts to create “Platinum Genomes” that address the missing information (gaps) and improve the quality of data. Prior gaps have been sequenced by utilizing DNA from a haploid cell line. Long-read sequencing technologies can create de novo assemblies that do not require the use of reference genomes. Hybrid sequencing methods are emerging that combine the advantages of short-read sequencing for single nucleotide base accuracy with the advantages of long-read sequencing to further decrease gaps in human genome data and reveal new mechanisms of human variation. For example, the use of long-read sequencing technology has provided the first assembly of the highly repetitive centromere region of the Y chromosome, and all gaps in the entire human X chromosome have been removed. Complete “telomere-to-telomere” genome sequencing of the malaria parasite, Plasmodium falciparum, has been accomplished. Eventually, it is likely that all human chromosomes will be sequenced from telomere to telomere, including the repetitive centromeric regions.

Each human cell contains two copies of the 3.08-billion-base genome divided into 46 chromosomes. Table 65.1 summarizes statistics for the human genome and the types of variations that are important in clinical diagnostics. Three quarters of human DNA is intergenic, that is, located between genes. More than 60% of this intergenic sequence consists of “parasitic” DNA regions of mostly defective transposable elements 100 to 11,000 bases in length. Between 2 and 3 million of these “retrotransposons” are present in each copy of the genome. They contribute to genetic recombination and chromosome structure and provide an evolutionary record of sequence variation and selection.

TABLE 65.1

The Human Genome and Its Sequence Variation

Data from Lander et al., Venter et al., and the International Human Genome Sequencing Consortium.

The Human Genome
3.08 billion base pairs in 24 chromosomes
23 chromosome pairs (46–244 million base pairs per chromosome)
*75% Intergenic Sequences*
Transposable elements	45%
Segmental duplications	5%
Simple sequence repeat	3%
Structural (centromeres, telomeres)	2%
Other	20%
*25% Genes That Code for Proteins*
Introns	23%
Exons	1.9%
  Coding segments	1.2%
  Untranslated regions	0.7%
Number of genes	19,438 known; 2188 predicted
Average gene	27,000 base pairs; 10.4 exons; 9.1 transcribed exons; 1340 exonic bases; 446 amino acids
Sequence Variants
99.9% identity (one difference every 1250 bases between randomly selected haploid genomes)
*Single-Nucleotide Variants (SNVs): Identified Every 75 Bases on Average*
Noncoding	97%
Average number within a gene	126
Average number within the coding region of a gene	5
*Copy Number Variants (CNVs): Involves 5–12% of the Genome*
*Disease-Causing Variants*
SNVs	68%
  Missense (amino acid substitution)	45%
  Nonsense (termination)	11%
  Splicing	10%
  Regulatory	2%
Small insertions or deletions (or both)	24%
Structural variants (copy number variations, inversions, translocations, rearrangements, repeats)	8%
Epigenetic Alterations
Variable initiation and alternative splicing
Cytosine methylation
Histone phosphorylation, methylation, acetylation

Segmental duplications constitute 5.3% of the human genome. They are over 1 kilobase (a thousand bases, or kb) in length, have a sequence identity of at least 90%, and are not transposable. Segmental duplications are common in the human genome and are prone to deletion or rearrangements, often with medical consequences. Intergenic DNA also carries most of the simple sequence repeats (SSRs) present in the genome. A subset of SSRs, the short tandem repeats (STRs), have repeat units of 1 to several bases that may be repeated up to thousands of times. STRs have played a large role in genetic linkage studies and in forensic and medical identity testing. They are formed by slippage during replication and are highly polymorphic between individuals. The most common STRs are dinucleotide repeats, such as ACACAC and ATAT. On average, one STR occurs every 2000 bases.
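To illustrate what an STR looks like in raw sequence, the following Python sketch scans a DNA string for short units repeated end to end. It is a minimal regular-expression approach for teaching purposes, not a substitute for dedicated repeat-finding software. The function name, the default thresholds, and the test sequence are all assumptions made for this example.

```python
import re
from typing import List, Tuple


def find_strs(sequence: str, min_unit: int = 1, max_unit: int = 6,
              min_repeats: int = 3) -> List[Tuple[int, str, int]]:
    """Return (start, repeat_unit, repeat_count) for each tandem repeat found.

    The lazy quantifier makes the regex prefer the shortest repeat unit,
    so ACACACAC is reported as AC x 4 rather than ACAC x 2.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        count = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, count))
    return hits


# Invented sequence containing one dinucleotide repeat.
print(find_strs("GGACACACACTT"))
```

For the invented sequence above, the scan reports a single AC repeat of four units starting at position 2 (zero-based). Because the scan reads left to right, the reported unit depends on where the repeat begins, so the same locus could be reported as AT or TA depending on the flanking bases.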

Approximately 2% of DNA is required to maintain the structure of chromosomes and is located at chromosome centers (centromeres) and ends (telomeres) and makes up heterochromatic DNA. Centromeric DNA includes many tandem copies of nearly identical 171 base pair (bp) repeats encompassing 0.24 to 5.0 Mb per chromosome. Each chromosome end is capped with several kb of the telomeric 6 base repeat TTAGGG. Although intergenic DNA does not code for protein and was originally considered “junk,” much of this DNA is transcribed to RNA, producing a complex “transcriptome” network of RNA control elements whose function and mechanics are active areas of investigation.

There are about 20,000 genes that code for about 200,000 transcripts. Alternative splicing of exons produces many more transcripts than there are genes. The average gene covers 27,000 bases, but only about 1300 of these bases code for amino acids. The primary RNA transcript is processed by splicing to retain exons that are interspersed throughout the gene and have a higher GC content than noncoding regions. On average, 95% of a gene is excised as introns, retaining a mean of 10.4 exons, of which on average 9.1 are translated into proteins. Exons make up only 1.9% of the total genome, with 1.2% of the genome coding for proteins. Some important genes are present in many copies, so that overall protein expression is not affected if a chance variation occurs in one copy. If extra copies of genes lose their function, they are known as pseudogenes. At least as many pseudogenes as functional genes are present in the human genome. It is important to distinguish pseudogenes from functional genes because variants in pseudogenes are seldom of clinical importance, and they often complicate DNA diagnostic assays.

POINTS TO REMEMBER

Human Genome

Contains approximately 3 billion base pairs per haploid genome
Protein coding nucleotides are about 1% (30 million base pairs)
Noncoding sequence has important regulatory roles

Nonhuman genomes

Before the human genome was completed, other genomes of smaller size were sequenced, enabling the advancements in technology and logistical organization needed to sequence the human genome. The genomes of different species vary in size, and the complexity can be surprising. One of the largest known genomes is that of the white spruce tree (Picea glauca) at 26.9 billion bases. On the opposite end of the spectrum is Porcine circovirus-1, a single-stranded DNA virus with a genome of less than 2000 bases. There is overlap in the genome size of eukaryotes (animals, plants, fungi), viruses, and bacteria (Table 65.2 and Fig. 65.1).

TABLE 65.2

Homo sapiens in Comparison to Other Genomes

Data from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/genome).

Organism/Name	Group	Size (Mb)
Human (Homo sapiens)	Animals	3080
White spruce tree (Picea glauca)	Plants	26,900
Migratory locust (Locusta migratoria)	Animals	5760
Mouse (Mus musculus)	Animals	∼2500
Rat (Rattus norvegicus)	Animals	∼2750
Apple tree (Malus domestica)	Plants	742
Roundworm (Caenorhabditis elegans)	Animals	97
Aspergillus fumigatus	Fungi	∼30
Baker’s yeast (Saccharomyces cerevisiae)	Fungi	12.3
Haemophilus influenzae	Bacteria	1.8
Human immunodeficiency virus (HIV) 1	Viruses	0.0092
Porcine circovirus-1	Viruses	0.00173

FIGURE 65.1, Range of genome sizes. Among different organisms, there is wide variation in genome size. In this plot of publicly available genomes, the y-axis is in megabases, and the x-axis lists various organisms: Eukaryota (animals, plants, fungi), bacteria, and viruses. On average, Eukaryota have larger genomes compared with bacteria and viruses; however, there are exceptions in which virus genomes are larger than bacteria or Eukaryota. The difference between the smallest and largest known genomes is more than six orders of magnitude. Several specific genome sizes are illustrated in Mb (megabases, million).
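The span of genome sizes can be checked against the entries in Table 65.2. The Python sketch below computes the number of orders of magnitude separating the largest and smallest genomes listed there; the dictionary holds a subset of the table, and the variable names are illustrative.

```python
import math

# Genome sizes in megabases, copied from Table 65.2 (subset).
GENOME_SIZES_MB = {
    "White spruce tree": 26900,
    "Migratory locust": 5760,
    "Human": 3080,
    "Baker's yeast": 12.3,
    "Haemophilus influenzae": 1.8,
    "HIV-1": 0.0092,
    "Porcine circovirus-1": 0.00173,
}


def orders_of_magnitude(larger_mb: float, smaller_mb: float) -> float:
    """Base-10 logarithm of the size ratio between two genomes."""
    return math.log10(larger_mb / smaller_mb)


largest = max(GENOME_SIZES_MB, key=GENOME_SIZES_MB.get)
smallest = min(GENOME_SIZES_MB, key=GENOME_SIZES_MB.get)
span = orders_of_magnitude(GENOME_SIZES_MB[largest], GENOME_SIZES_MB[smallest])
print(f"{largest} vs {smallest}: about {span:.1f} orders of magnitude")
```

For the extremes in Table 65.2 the ratio works out to roughly seven orders of magnitude, consistent with the "more than six" stated in the figure legend.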

Primates

Comparison of the chimpanzee genome with the human genome shows a genome-wide difference of only 1.23%. This approximate 1% difference translates to 35 million nucleotides and 5 million insertion/deletion differences. There are also differences at the level of proteins between humans and chimpanzees. Only 29% of proteins are identical at the amino acid level, but proteins that are different only differ by an average of two amino acids.

Two orangutan species have been sequenced. Their genome sizes are similar to that of humans at 3 billion bases. During evolutionary development, the number of structural rearrangements in orangutans has been lower than in the human and chimpanzee branches. For example, the number of genome rearrangements greater than 100 kb was 38 in the orangutan, but 85 and 54 in the chimpanzee and human, respectively.

An improved understanding of the ape genome was demonstrated by combining sequencing technologies including long-read DNA sequencing and cDNA sequencing. Previous genomic analysis of apes relied to some extent on human genomic analysis. In this novel analysis, the genomes of two humans, one chimpanzee, and one orangutan were assembled independently without the use of reference genomes. This independent assembly of genomes improved the understanding of differences between human and ape genomic variation including approximately 17,789 structural variants predicted to disrupt 479 genes in humans. When DNA genomics were compared to RNA expression in neural progenitor cells, 41% of genes with downregulated expression in humans compared to chimpanzees had an associated disrupting structural variant. This type of loss of expression in humans compared to apes is supportive of a theory that human evolution involved the loss of neuronal gene expression.

An example of nonprotein coding variation between primates is the number and types of DNA insertions. A comparison of 5 primate genomes (chimpanzee, gorilla, orangutan, gibbon, and macaque) identified regions of human DNA that were absent in nonhuman primates. More than 200,000 human-specific DNA insertions were identified; the majority of these were less than 10 nucleotides in length and were eliminated from further study. There were 5582 genes identified that contained larger insertions; 2450 of these genes were expressed in brain tissue. Many of the human-specific insertions were transposable elements and long terminal repeats.

Rodents

The mouse genome is 14% smaller than the human genome (2.5 billion bases compared to the human size of approximately 3 billion bases). The rat (Rattus norvegicus) genome, at 2.75 billion bases, is intermediate in size between the human and mouse genomes. The number of genes is similar among all three species. About 40% of the rat, mouse, and human genomes are in alignment. Another 30% of the rat and mouse genomes match each other but not the human genome.

Fungi

Fungi are eukaryotes and their genomes are less complex than the human genome. Common fungi that cause human disease have genome sizes of 7.5 to 30 million bases and 8 to 16 chromosomes, as well as mitochondrial genomes. Some fungi have diploid genomes, and others have haploid genomes. Many of their genes have introns. For instance, Aspergillus fumigatus (a fungus that causes allergic reactions and systemic disease with a high mortality rate) has a haploid genome of about 30 million bases with more than 9900 predicted genes on eight chromosomes. Its genes are smaller than human genes, with an average length of 1400 bp and 2.8 exons per gene.

The first eukaryotic genome sequenced was Saccharomyces cerevisiae (baker’s yeast). This fungal genome has 12 million bases arranged into 16 chromosomes. In addition to the importance of yeast in baking breads and brewing alcohol, yeast is an important model organism and pathogen. With the identification of the approximately 6000 genes within the S. cerevisiae genome, systematic alteration of each gene or combination of genes can now be explored to examine the role of genes in yeast and higher organisms.

Bacteria

Bacterial genomes are considerably less complex than human or fungal genomes. Common bacteria have only one chromosome, usually a circular DNA double helix of 4 to 5 million base pairs, about 1000 times less than the amount of DNA in a human cell. About 90% of the DNA in bacteria codes for protein. There are no introns, but there are multiple small intergenic regions of repetitive sequences that are dispersed throughout the genome. Escherichia coli, a common bacterium in the human intestinal tract, has about 4300 genes.

In addition to the large circular chromosome that carries essential genes, bacteria also carry accessory genes in smaller circles of double-stranded DNA (dsDNA) known as plasmids. Plasmids range in size from 1000 to more than 1 million base pairs. Plasmids are important in the molecular diagnosis of bacterial infections because they often encode pathogenic factors and antibiotic resistance.

The bacterial repertoire of DNA can be altered by (1) gain or loss of plasmids; (2) single-base changes, small insertions, and deletions as in eukaryotic genomes; and (3) large segmental rearrangements, including inversions, deletions, and duplications. Some genes, such as those for ribosomal RNA, are present in many copies, making them good targets for molecular assays to identify species of bacteria. In addition, the intergenic repetitive sequences serve as multiple targets for oligonucleotide probes, enabling the generation of unique DNA profiles or fingerprints for individual bacterial strains.

The first genome sequenced by random fragmentation and computational assembly was that of the pathogenic bacterium Haemophilus influenzae. Its genomic DNA was fragmented into 19,687 templates and inserted into plasmids and bacteriophages. A total of 24,304 sequences were successfully generated over 3 months. The sequencing data required 30 hours of computational time to be assembled. A total of 11 million bases of DNA were sequenced and used to generate the 1.8 million bases of the H. influenzae genome. In addition to being the first genome solved by shotgun sequencing, it was also the first bacterial genome sequenced. Multiple strains of H. influenzae have been subsequently sequenced. These additional genomes have revealed heterogeneity in the number of genes between different strains. Of the approximately 3000 genes identified, only 1461 are common to all strains. The differences in genes between different strains may be associated with differences in the infectious pathogenicity of H. influenzae.

Viruses

Viral genomes are considerably less complex than bacterial genomes. Common viruses that infect humans vary in size from about 5000 to 250,000 bases, or 20 to 1000 times less than the amount of nucleic acid in E. coli. Because viruses use the host’s cellular machinery, they do not need as many genes as bacteria do. Small viruses may encode only several genes, but the larger viruses can encode hundreds. The viral genome consists of either DNA or RNA, and the nucleic acid may be single stranded or double stranded, linear, or circular with one or multiple fragments or copies per viral particle. As in bacteria, there are no introns. In fact, in some viruses the exons overlap with different reading frames that code different products from the same nucleic acid sequence. Noncoding regions are usually present at the terminal ends of linear genomes. Repeat segments are often found as terminal or internal repeats and may be inverted.

Sequence alterations in viruses are common. Areas of high sequence variation may be interspersed between conserved domains. Higher frequencies of variation correlate with lower polymerase fidelity and may allow escape from antibody recognition and antiviral drugs. Common sequence variants in viruses include single base changes, insertions, and deletions. Sequence diversity within a viral species may be so great that consensus sequences for molecular typing are difficult to find.

DNA that codes for RNA but not protein

Even though 99% of the human genome does not code for protein, most of it is transcribed into noncoding RNA. At least 93% of the genome is transcribed, producing more than 10 times the amount of RNA that is produced from the coding segments of genes. Both strands of DNA may be transcribed, and long noncoding transcripts may overlap coding regions, producing a complex transcriptome of functional RNA molecules that may variably regulate transcription of coding regions, RNA processing, mRNA stability, translation, protein stability, and secretion. In addition to long noncoding RNA, ribosomal RNA, and transfer RNA, specific classes of noncoding RNAs include small nuclear RNAs critical for splicing, small nucleolar RNAs that modify rRNA, telomerase RNAs for maintenance of telomeres, small interfering RNAs, and microRNAs (miRNAs) that regulate gene expression. In a recent review on RNA, 54 different categories were identified. Some of the more important types of RNA are listed in Table 65.3.

TABLE 65.3

Some Common, Interesting, and Important Types of RNA

Abbreviation	Description
mRNA	Messenger RNA is translated to protein by the ribosome.
rRNA	Ribosomal RNA is a major component of ribosomes.
tRNA	Transfer RNA pairs an amino acid with its anticodon in protein synthesis.
ncRNA	Noncoding RNA is not translated to protein.
lncRNA	Long noncoding RNA is greater than 200 bases and is not translated to protein.
hnRNA	Heterogeneous nuclear RNA is the initial RNA transcript that includes introns.
Ribozyme	RNA that has catalytic activity.
Riboswitch	RNA that switches between 2 conformations under certain conditions (ligand exposure).
Telomerase RNA	Structural part of telomerase that also provides a hexamer template.
Xist RNA	X-inactive–specific transcript RNA inactivates one X chromosome in females.
snRNA	Small nuclear RNA is found in the eukaryotic nucleus.
snoRNA	Small nucleolar RNA are intron fragments essential for pre-rRNA processing.
siRNA	Small interfering RNA can cleave perfectly complementary target RNA.
gRNA	Guide RNA pairs with an RNA target and guides proteins for cleavage and so on.
miRNA	MicroRNA affects target mRNA regulation or decay.

MicroRNAs (or miRNAs) are particularly interesting as potential markers for disease. For example, concentrations of specific, circulating miRNAs correlate with many different types of cancer. MicroRNAs are noncoding but functional single-stranded RNAs that are 21 to 22 bases long and are expressed in a tissue-specific manner. They are initially transcribed as longer precursors that undergo two rounds of truncation as they are transported from the nucleus to the cytoplasm in the cell. The mature miRNA is then integrated into a protein complex called the RNA-induced silencing complex, which regulates translation of mRNA. MicroRNAs hybridize to a 6 to 8 base sequence in the 3′ untranslated region of target mRNAs and inhibit mRNA expression either by mRNA degradation if the bases are perfectly complementary, or by blocking of translation if they are imperfectly complementary. Currently for humans, there are 1917 precursor miRNAs and 2654 mature miRNAs cataloged in miRBase. Despite the promise of miRNAs as tumor markers, the literature is often contradictory and inconsistent, with few accepted conclusions.
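The hybridization step described above can be sketched as a simple string search: a short segment of the miRNA is reverse-complemented and located in the 3′ untranslated region. The Python example below assumes the commonly used convention that the matching segment (the "seed") is bases 2 through 8 from the 5′ end of the miRNA, a detail not stated in the text. Both sequences are invented for illustration and do not represent real miRNAs or transcripts.

```python
from typing import List

# Watson-Crick complements for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(seq: str) -> str:
    """Return the reverse complement of an RNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def seed_sites(mirna: str, utr: str, seed_start: int = 1,
               seed_len: int = 7) -> List[int]:
    """Return zero-based UTR positions that are perfectly complementary to the seed.

    seed_start=1 and seed_len=7 select miRNA bases 2-8 (an assumed convention).
    """
    seed = mirna[seed_start:seed_start + seed_len]
    site = reverse_complement_rna(seed)
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)
    return positions


EXAMPLE_MIRNA = "AUGGCACUUACGGAUCCGUAAC"  # hypothetical 22-base miRNA
EXAMPLE_UTR = "CCCAGUGCCAUUU"             # hypothetical 3' UTR fragment
print(seed_sites(EXAMPLE_MIRNA, EXAMPLE_UTR))
```

This sketch finds only perfect seed matches. Real target prediction tools also weigh imperfect pairing, site context, and conservation, none of which is modeled here.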

Variation in the human genome

If the DNA of any two individuals is compared, on average one difference is noted every 1250 bases (i.e., approximately 99.9% of the sequence is identical between randomly chosen copies of the genome). However, copy-number variants involve a greater amount of the genome, with 0.5% of the genome differing on average between two individuals when copy-number variants greater than 50 kb are considered. Between individuals, at least five times as many bases are affected by copy number changes as by small sequence differences.
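The comparison between small sequence differences and copy-number variation can be worked through numerically. The Python sketch below uses only the figures given in this chapter (one difference per 1250 bases, 0.5% of the genome in copy-number variants, and a 3.08-billion-base genome); the variable names are illustrative.

```python
GENOME_BASES = 3.08e9        # haploid genome size used in this chapter
BASES_PER_DIFFERENCE = 1250  # one small sequence difference per 1250 bases
CNV_FRACTION = 0.005         # 0.5% of the genome differs by CNVs > 50 kb

# Small sequence differences between two randomly chosen genome copies.
identity_percent = 100 * (1 - 1 / BASES_PER_DIFFERENCE)
snv_bases = GENOME_BASES / BASES_PER_DIFFERENCE

# Bases affected by copy-number variants.
cnv_bases = GENOME_BASES * CNV_FRACTION

print(f"Sequence identity: {identity_percent:.2f}%")
print(f"Bases differing by small changes: {snv_bases:,.0f}")
print(f"Bases affected by CNVs: {cnv_bases:,.0f}")
print(f"Ratio of CNV to small-change bases: {cnv_bases / snv_bases:.2f}")
```

Under these figures, about 2.5 million bases differ by small changes while about 15 million are affected by copy-number variants, a ratio slightly above 6, in line with the "at least five times" stated above.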

Most human genetic material is present in two copies, with the exception of the unpaired sex chromosome in males and mitochondrial DNA. The presence of only single gene copies on the X and Y chromosome in males leads to well-known sex-linked disorders. In contrast, the 16,500-bp mitochondrial genome is present in multiple copies per cell, constituting about 0.3% of human DNA, depending on the tissue source. Allele fractions may vary over a wide range when all mitochondria in a cell are considered. That is, sequence variations in mitochondrial DNA are heteroplasmic, meaning that the ratio of the wild-type allele to a variant allele can vary almost continuously, sometimes resulting in a wide range of symptoms even when only one sequence variant is involved.
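Because heteroplasmy is a continuous ratio rather than a fixed 0%, 50%, or 100% genotype, it is often summarized as the fraction of sequencing reads carrying the variant allele at a given mitochondrial position. The Python sketch below shows that calculation under the simplifying assumption that read counts are an unbiased sample of the mitochondrial genomes present. The function name and the read counts are hypothetical.

```python
def heteroplasmy_fraction(variant_reads: int, reference_reads: int) -> float:
    """Estimate the variant allele fraction at one mitochondrial position."""
    total = variant_reads + reference_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return variant_reads / total


# Hypothetical read counts at one position in three tissues of one individual.
samples = {"blood": (120, 880), "muscle": (610, 390), "urine": (450, 550)}
for tissue, (variant, reference) in samples.items():
    fraction = heteroplasmy_fraction(variant, reference)
    print(f"{tissue}: {fraction:.0%} variant allele")
```

The invented example illustrates the point made in the text: the same variant can be present at quite different levels depending on the tissue sampled.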

Large-scale human genome sequencing projects have cast a wide net across many diverse populations. These projects have provided a wealth of knowledge of the genetic diversity that exists in humans. An alternative approach to human genetic diversity is to examine more homogenous populations. Several studies have examined the genetics of a large number of individuals from Iceland. A whole genome sequencing study of 2636 Icelanders observed 20 million single nucleotide variants (SNVs) and 1.5 million insertions/deletions. The data from this whole genome sequencing study were combined with a previous data set of 104,220 Icelanders who had been SNV typed at 676,913 locations. By applying whole genome sequencing data from only a small subset of individuals, the full genetics could be inferred for a larger set of over 100,000 individuals who had only had SNV typing.

Another interesting result of the Icelandic whole genome study was the identification of 6795 loss of function single nucleotide variants, insertions, or deletions in 4924 genes. Loss of function changes (homozygous or compound heterozygous) were found in 7.7% of the individuals sequenced. In essence, this study identified a surprisingly high percentage of individuals with “knocked-out” or functionally silenced genes.

Any sequence change (compared to a reference sequence) is called a sequence variant or variation. Many variations do not affect human health and are benign or silent. For example, most (1) copy-number variations, (2) SNVs, and (3) STRs found between genes are seldom associated with disease.
