High-Throughput Genomic and Proteomic Technologies in the Postgenomic Era


Key Points

  • Completion of the Human Genome Project has provided scientists with a detailed map of the human genome and predicted coding regions that have facilitated the emergence of high-throughput genomic and proteomic technologies.

  • A number of mature platforms exist for high-throughput profiling of gene expression in human tissue, including serial analysis of gene expression (SAGE), deoxyribonucleic acid (DNA) microarrays, and real competitive polymerase chain reaction.

  • Proteomic technologies, including mass spectrometry and protein arrays, have begun to explore the dynamic and complex protein composition of healthy and diseased human tissue.

  • DNA microarray and SAGE technologies have identified diagnostic gene-expression signatures for a number of hematologic and solid malignancies that are often difficult to distinguish using traditional histologic analysis.

  • Prognostic gene and protein expression profiles have been identified in a large number of cancer settings, including lymphoma, lung cancer, breast cancer, and acute myeloid leukemia, among others.

  • Validation in large clinical trials, standardization of techniques and controls, and inclusion of analytic standards are needed before widespread clinical implementation of these technologies can be achieved.

Overview

With the complete sequence of human and other genomes recently elucidated, we have witnessed an explosion of information and high-throughput tools that are profoundly altering biomedical research and the culture of science. This revolution, which began in the mid-1980s, emerged from developments in three areas: (1) molecular biology, most notably, breakthroughs in rapid deoxyribonucleic acid (DNA) sequencing; (2) information technology, in particular, the ability to store and analyze unprecedented quantities of data; and, most important, (3) progress in human genetics, especially the identification of thousands of single-gene human disorders. The convergence of these technological and scientific advances holds promise for the rapid identification of disease-related genes, leading to improved diagnostic tests and more effective therapies.

Detailed maps of human and other genomes provide the information needed to chart a course toward understanding and treating many diseases. However, this course remains a long and difficult one. Although progress will come most readily for disorders following Mendelian patterns of inheritance, even these disorders will pose significant difficulties. For example, although the biology and genetics of sickle cell anemia have been reasonably well understood for more than half a century—a single valine replaces glutamic acid at position 6 in the β-globin chain—effective treatment has been slow to develop. The penetrance of genetic diseases thought to be due to single mutations often depends on more complex interactions between the mutation and a variety of concurrent gene polymorphisms, such as those seen in the genes encoding surface proteins on postcapillary venule endothelial cells, which, in part, account for sickle cell disease severity.

Many common diseases—in particular, cardiovascular disease, mental illness, and almost all cancers—stem from multigenic causes. In addition, these diseases invariably have strong environmental components, presenting a substantial challenge to the development of effective and economical diagnostic and prognostic tests. Even in the absence of a confounding environmental influence, linking the quantifiable phenotype of a disease to a set of distinct alleles is a complex undertaking. Further complicating matters, many disorders do not have sharply defined, quantifiable phenotypes.

In a chapter filled with information on promising technologies, we make these sobering remarks to emphasize that greater understanding never guarantees cures or therapies. However, greater understanding does aid the development of rational strategies to detect and control disease. The reference genome allows us to rapidly characterize polymorphisms across the human population; it also enables molecular fingerprinting technologies that permit identification of the precursors and consequences of normal and pathologic changes in gene and protein expression. We can feel confident that the power of genomic technologies is well beyond anything previously available, and that it will make possible during the next several decades a host of new diagnostics and therapeutics for cancer and other common diseases, profoundly altering the practice of medicine. As discussed in Chapter 75 , genomic technologies have already made an impact in the area of pharmacogenomics, in which drug protocols are being designed from the cytochrome P450 genetic profiles of patients, allowing for maximally effective therapeutic regimens for each patient.

This chapter focuses on several high-throughput genomic and proteomic technologies that have the potential to influence disease classification and prognostication ( Fig. 80.1 ). These molecular tools have affected virtually all forms of human pathology. However, given the public health implications and the preponderance of recent publications in the field, we will concentrate on diagnostic and prognostic applications related to cancer, although we will mention other salient disease states where applicable. After presenting a brief overview of the Human Genome Project and resultant high-throughput technologies, we will discuss examples of applications in the setting of several hematologic and solid malignancies. The scope of the chapter has been constrained to focus on high-throughput technologies; therefore, this chapter does not represent a comprehensive overview of all technologies being used in clinical genomic and proteomic studies. In addition, among the technologies presented, some are highlighted in greater detail because of their widespread use. As more data are produced using these technologies, strategies combining data sets may become powerful approaches to developing accurate clinical tools. Along with measuring gene and protein expression levels, understanding human genetic variation in DNA will be important in elucidating disease markers and mechanisms. A discussion of high-throughput genotyping technologies to assay single-nucleotide polymorphisms is beyond the scope of this chapter, however (for reviews, see ; ).

Figure 80.1, High-throughput platforms and the central dogma of biology. The three major technologies responsible for rapid analysis of biological systems include mass spectrometry, sequencing, and microarrays. Examples of each technology have been listed as applied to each broad stage of biological information, that is, DNA, RNA, and protein. ICAT, Isotope-coded affinity tags; m/z, mass-to-charge ratio; MS, mass spectroscopy; SELDI-TOF-MS, surface-enhanced laser desorption ionization, time-of-flight mass spectroscopy.

The Human Genome Project

Public Sequencing Effort (Hierarchical Shotgun Sequencing)

Sequencing of the human genome was a 15-year, $3 billion project, initiated in 1990 as a joint effort between the U.S. Department of Energy and the National Institutes of Health ( ). From the time of the project’s inception through 1995, genetic and physical maps of human and mouse genomes were constructed, and yeast and worm genomes were sequenced. These initial projects, coupled with advances in sequencing technology and sequence data analysis, outlined cost-effective strategies and techniques while demonstrating the feasibility of sequencing the human genome. In March 1999, the effort to sequence the human genome commenced in earnest. Sequencing was set to be completed in two phases: the first phase would include completion of a draft sequence and the second would be a finishing phase for resolution of misassembled regions and filling in of sequence gaps. By June 2000, centers involved in the project were producing raw sequence data at a rate of about 1000 nucleotides per second, 24 hours a day, 7 days a week ( ).

The first phase of the Human Genome Project, a collaborative endeavor of 20 groups in six countries, was completed and published in February 2001 in the journal Nature ( ; ). The draft sequence covered about 96% of the euchromatic (gene-rich) part of the human genome and 94% of the entire genome, with an average of fourfold coverage (i.e., each base sequenced an average of four times). The International Human Genome Sequencing Consortium employed a sequencing strategy referred to as hierarchical shotgun sequencing (also known as map-based, bacterial artificial chromosome [BAC]-based, or clone-by-clone sequencing) ( ). Genomic DNA obtained from volunteers from a variety of racial and ethnic backgrounds was partially digested with restriction endonucleases. Fragments of 100 to 200 kb in length were cloned into BACs. Eight DNA libraries containing overlapping insert clones were created, representing 65-fold coverage of the genome (i.e., each base is represented on average 65 times across all eight libraries).

BACs containing fragments of the human genome are inserted into bacteria and are replicated as the bacteria grow and divide. Each BAC clone is completely digested with a restriction enzyme to produce a unique pattern of DNA fragments, known as a fingerprint. The fingerprints from different BAC clones can be compared to allow selection of a set of overlapping BAC clones that cover a portion of the genome (a large region of the genome covered by overlapping clones is called a contig). Contigs can be positioned along the chromosome by using known markers from previously constructed genetic and physical maps of the human genome. Selected BAC clones are sheared into smaller overlapping fragments, subcloned, and sequenced. Subcloning is necessary because each sequencing reaction can reliably read only about 500 base pairs (bp). The sequence of the BAC clone can be reconstructed from the set of sequences obtained from the subclones, and BAC fingerprints guide the assembly of several BAC clones into contigs.
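
To make the subclone-to-BAC reconstruction step concrete, the following Python sketch merges overlapping reads into a single contig by greedy suffix-prefix matching. It is a toy illustration only; the consortium's assemblers (e.g., PHRAP) additionally weigh base-quality scores and handle repeats, and the reads, the minimum-overlap threshold, and the function names here are invented for the example.

```python
# Minimal sketch (not the consortium's assembler): reconstructing a clone
# sequence from overlapping subclone reads by greedy suffix-prefix merging.
# Reads and the minimum-overlap threshold are illustrative assumptions.

def overlap(a: str, b: str, min_len: int = 5) -> int:
    """Return the length of the longest suffix of `a` that matches a prefix of `b`."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def greedy_merge(reads, min_len: int = 5) -> str:
    """Repeatedly merge the pair of reads with the largest suffix-prefix overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_i is None:        # no overlaps left; stop merging
            break
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return max(reads, key=len)    # longest assembled contig

# Toy reads sampled from one hypothetical subcloned fragment
reads = ["ATGCGTACGTT", "ACGTTAGGCCA", "AGGCCATTTGA"]
print(greedy_merge(reads))        # ATGCGTACGTTAGGCCATTTGA
```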

Draft sequences were required to obtain an average of fourfold coverage with 99% accuracy as determined by software (i.e., PHRED, PHRAP; CodonCode Corporation, Dedham, MA) that assigns base quality scores and assembles sequences according to the scores. Throughout the duration of the project, sequences longer than 2 kb were required to be deposited in public databases within 24 hours (data are available from the Genome Browser of the University of California at Santa Cruz, www.genome.ucsc.edu ; the GenBank of the National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/genbank/ ; and Ensembl of the European Bioinformatics Institute and the Sanger Centre, www.ensembl.org ). Assembly of the draft sequence was a three-part process that involved filtering the sequence data to eliminate bacterial and mitochondrial sequences, constructing a layout of clones along the genome, and merging overlapping clones to produce a draft sequence.

The hierarchical shotgun sequencing approach was chosen by the public consortium for several reasons. Dividing the work among sequencing centers was straightforward with the use of clones. Also, the assembly of clones to produce a draft sequence probably would enhance accuracy because approximately 46% of the human genome comprises repeat sequences and exhibits widespread individual sequence variation. In addition, this approach could address cloning bias, and underrepresented sequences could be targeted for sequencing.

Private Sequencing Effort (Whole-Genome Shotgun Sequencing)

In 1998, a private company, Celera Genomics (Alameda, CA), led by Craig Venter, announced its intention of sequencing the human genome in 3 years using a different approach, known as whole-genome shotgun sequencing. Celera’s draft of the human genome sequence was reported in the February 2001 issue of Science ( ). Celera generated 14.8 billion bp of DNA sequence in 9 months to produce a 2.91 billion–bp consensus sequence of the euchromatic part of the human genome with an average of fivefold coverage. The company used genomic DNA obtained from three females and two males of the following ethnogeographic groups: African American, Asian-Chinese, Hispanic-Mexican, and white. A total of 16 different DNA libraries were constructed with three different insert sizes: 2 kb, 10 kb, and 50 kb. Both ends of the insert from clones chosen at random were sequenced to produce “mate pair” sequencing reads. The average distance between mate pairs was known because the range of insert sizes for a clone taken from a particular library could be characterized by calculating the distance between mate pairs in previously sequenced stretches of the genome. Celera generated a set of 27.26 million reads with an average length of 543 bp.

Celera combined its sequence data with all data from the publicly funded efforts available up to September 2000 in GenBank and pursued two different assembly strategies. The whole-genome assembly strategy involved “shredding” the publicly funded sequences into small fragments, combining these fragments with Celera’s reads, identifying overlapping sequences, and joining them to produce long, continuous consensus sequences. These contigs were ordered, and gaps between contigs were quantified using information obtained from the mate pair reads. A variant of the process described previously, known as compartmentalized shotgun assembly , yielded slightly better results because sequences were first clustered, based on mapping information, to a region of the chromosome before the process already described was performed.
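
The statement that mate-pair reads were used to size the gaps between ordered contigs can be illustrated with a small calculation: the known insert length, minus the portions of the insert that land inside each flanking contig, leaves an estimate of the unsequenced gap. The Python sketch below uses invented coordinates and a hypothetical 10-kb insert library; it is not Celera's assembler logic, only the arithmetic behind it.

```python
# Rough sketch of how a mate pair sizes a gap between two ordered contigs:
# the known insert length minus the sequence accounted for inside each contig
# leaves the unsequenced gap. All numbers below are invented for illustration.

def estimate_gap(insert_size: float,
                 read1_to_contig_a_end: int,
                 contig_b_start_to_read2: int) -> float:
    """Gap length implied by one mate pair spanning contigs A and B."""
    return insert_size - read1_to_contig_a_end - contig_b_start_to_read2

# Three mate pairs from a hypothetical 10-kb insert library spanning the same gap
spans = [(10_000, 6_200, 2_100), (10_000, 5_900, 2_600), (10_000, 6_400, 1_900)]
estimates = [estimate_gap(*s) for s in spans]
print(f"per-pair gap estimates: {estimates}")
print(f"mean gap estimate: {sum(estimates) / len(estimates):.0f} bp")
```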

Analyses of draft sequences performed by the two groups yielded similar results. Some genes were derived from bacteria or transposable elements, and large segmental duplications were apparent throughout the genome. The distribution of genes, CpG islands, recombination sites, and repeats was found to be highly variable across the genome. Widespread genetic variation was apparent, and about 2.1 million single-nucleotide polymorphisms—1 per 1250 bp—were discovered. One of the most surprising results revealed by completion of the draft sequence was the estimate that the genome contained about 30,000 protein-coding genes, considerably fewer than the 100,000 or more that had been postulated. This number is only about one-third greater than that of the worm. However, the number of distinct proteins produced is probably much larger because of alternative splicing ( ). Scientists used many gene-prediction methods to arrive at the estimate of 30,000 genes. Gene-prediction algorithms predict the location of unknown genes in the genome using sequence characteristics learned from known genes, including codon and nucleotide composition within coding regions and conserved sequences at exon/intron boundaries and within promoter regions. The human genome sequence has a low signal-to-noise ratio because coding regions represent only about 3% of the genome; therefore, algorithms produce a large number of false-positive predictions. Most algorithms, however, use two other important sources of information: (1) similarity to known human proteins and expressed transcripts and (2) homology to proteins and sequences characterized in other organisms. (For a review of gene-prediction algorithms, see .)
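
Gene-prediction algorithms are far more sophisticated than any short example can show, but the weakest signal they exploit, an open reading frame of plausible length, is easy to demonstrate. The Python sketch below scans one strand in three frames for ATG-to-stop stretches; the sequence, the minimum length, and the function name are illustrative assumptions, and real predictors add codon-usage statistics, splice-site models, and homology evidence.

```python
# Minimal sketch of the crudest ab initio signal a gene finder can use: scanning
# a DNA strand for open reading frames (ATG ... stop) in all three frames.
# Real gene-prediction algorithms combine this with codon-usage statistics,
# splice-site models, and homology evidence; this toy scan is illustration only.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 10):
    """Yield (start, end, frame) for ORFs of at least `min_codons` codons."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3, frame
                start = None

# Hypothetical fragment; composition and threshold are arbitrary for the example
fragment = "CC" + "ATG" + "AAA" + "GCT" * 9 + "TAG" + "GG"
for start, end, frame in find_orfs(fragment, min_codons=5):
    print(f"ORF at {start}-{end} in frame {frame}")
```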

Finishing the Sequence of the Human Genome

In October 2004, in the journal Nature, the International Human Genome Sequencing Consortium published an article entitled, “Finishing the Euchromatic Sequence of the Human Genome” ( ). This article reported that the draft sequence was missing about 10% of the euchromatic portion of the genome, contained about 150,000 gaps, and had many sequence segments that had not been assigned an order or orientation. The finished sequence contains 2.85 billion nucleotides, covers about 99% of the euchromatic genome, and has only 341 gaps (with an error rate of about 1 mistake every 100,000 bases). The revised sequence predicts between 20,000 and 25,000 protein-coding genes; the discrepancy from the earlier estimate of about 30,000 is due largely to differences in gene-prediction algorithms. The newly published sequence reveals that segmental duplications cover about 5.3% of the euchromatic portion, providing insight into the evolution of the human genome and aiding the study of diseases caused by deletions and rearrangements of these regions (e.g., DiGeorge syndrome) ( ). This latest sequence also makes it possible to trace the birth and death of genes—genes recently born as a result of gene duplication and genes lost as a result of mutation.

Sequencing of the human genome is a monumental achievement that “holds an extraordinary trove of information about human development, physiology, medicine, and evolution” ( ). It has provided the infrastructure for sequencing other genomes and for understanding the structure and complexity of human genetic variation. This detailed map of the human genome and predicted coding regions has facilitated the emergence of several high-throughput technologies through which scientists have begun to explore the complete set of gene and protein expressions in healthy and diseased human tissue. These technologies promise to revolutionize the classification of human disease and to usher in an era of individualized molecular medicine.

High-Throughput Technologies

Genomic

An intimate understanding of cellular machinery represents the first step toward unraveling the complexity of human disease. Important insights into the function of a gene can be deduced by determining the cell type and conditions under which a gene is expressed, and by quantifying the level of message transcribed. A technique commonly used to assay the level of expression of a single gene (represented by messenger ribonucleic acid [mRNA]) across a few different conditions is a Northern blot. This “gene-by-gene” approach began to change in the early 1990s as a result of the success of the Human Genome Project and various technological advances that enabled the development of high-throughput gene-expression analyses whereby several genes could be assayed simultaneously. Three different genomic high-throughput technologies are discussed here. Serial analysis of gene expression (SAGE) ( ) and DNA microarrays ( ; ) were both developed in the 1990s and are currently in widespread use. Real competitive polymerase chain reaction (PCR) using matrix-assisted laser desorption ionization time-of-flight mass spectroscopy (MALDI-TOF-MS) was first published in 2003 ( ). The basic principles underlying DNA microarrays are discussed in Chapter 68 ; those underlying real competitive PCR are discussed in Chapter 69 .

Depending on the experimental design, samples are obtained from cell cultures or surgical tissues, and total RNA is isolated. Techniques such as laser capture microdissection ( ) are often used before RNA isolation to obtain a homogeneous population of cells from tissue specimens. After total RNA or mRNA is obtained and is reverse transcribed to make complementary DNA (cDNA) (see Chapter 69 ), each technology uses a different protocol to rapidly measure the transcript levels of the genes in each sample. The principles of each method, as well as corresponding advantages and disadvantages, are outlined in the following sections.

Serial Analysis of Gene Expression

SAGE measures the expression level of genes in a sample by isolating and sequencing several thousand short 10- to 14-bp tags derived from cDNA. Two important pieces of information can be deduced from the output: (1) the sequences of the tags usually allow identification of their corresponding genes, and (2) the number of times a particular tag is sequenced is a measure of transcript abundance. The ability to uniquely identify a transcript with a sequence as short as 9 bp rests on the number of distinct sequences of that length: a 9-bp tag can distinguish 262,144 (4⁹) possible sequences, a number greater than current estimates of the number of transcripts in the human genome ( ; ).
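
As a quick worked check of this tag-length argument, the short Python snippet below computes how many distinct sequences a tag of a given length can represent; the tag lengths chosen are those mentioned in this section and in the longSAGE discussion that follows.

```python
# Worked check of the tag-length argument: a tag of k bases can distinguish
# 4**k distinct sequences, so even 9-10 bp exceeds current estimates of the
# number of human transcripts; longSAGE's 17-bp tags are far more specific still.
for k in (9, 10, 14, 17):
    print(f"a {k}-bp tag distinguishes {4 ** k:,} sequences")
```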

The SAGE protocol involves isolating RNA from a sample of interest. RNA is converted to double-stranded cDNA using a biotinylated oligo(dT) primer for first-strand synthesis. SAGE then uses two types of restriction enzymes. The first, called an anchoring enzyme, recognizes a specific 4-bp sequence (e.g., NlaIII). Any enzyme recognizing a 4-bp site may be used because, on average, such a site occurs every 256 bp (4⁴). The anchoring enzyme leaves a short, overhanging, single-stranded piece of DNA at the 5′ end of the site of cleavage. The biotinylated fragment, which represents the 3′ end of the gene, is then bound to streptavidin beads, capturing only digested cDNA fragments that contain a portion of the poly(A+) tail. Captured fragments are purified and are randomly split equally into two pools. The second enzyme, called the tagging enzyme, behaves differently and cleaves DNA 14 to 15 bp immediately 3′ of its recognition sequence (e.g., BsmFI). The recognition sequence for the tagging enzyme is engineered into the sequence of the linkers described in the next paragraph.

After the purified fragments are split into two pools, two different oligonucleotide linkers (labeled A and B for purposes of discussion) are designed and synthesized; each contains a distinct PCR primer sequence, the tagging enzyme recognition sequence, and a single-stranded DNA overhang that is part of the anchoring enzyme recognition sequence. Linker A is ligated to the 5′ ends of cDNA fragments in one pool, and linker B is ligated to the 5′ ends of cDNA fragments in the other pool. Ligation proceeds by means of base pairing between complementary single-stranded overhanging DNA ends on both cDNA fragments and linkers, creating an intact anchoring enzyme recognition sequence. The cDNA fragments in each pool are then cut with the tagging enzyme, resulting in a new fragment that contains the linker plus a short, 10-bp region of cDNA, known as a tag (the remaining cDNA fragment and the poly[A+] portion of the tail are removed). The two pools of cDNA fragments are ligated together, creating ditags (the new sequence is as follows: linker A–tag–tag–linker B). Ditags are amplified by PCR using primers designed on the basis of sequences in linkers A and B.

Creation of the ditags is important for several reasons. First, ditags can be amplified by PCR for subsequent cloning steps. Second, each tag within a ditag is linked tail to tail and is flanked by anchoring enzyme–recognition sequences, providing important orientation information used to identify the genes corresponding to each tag. Finally, even if tags are highly abundant, the probability of creating identical ditags is extremely low. As a result, the occurrence of identical ditags indicates PCR bias; these ditags are excluded from the final analysis to ensure accurate quantification of transcript abundance ( ; ).

After amplification, the anchoring enzyme is used to cleave the linkers from the ditags. Ditags from different reactions are ligated together end to end to form strings of 10 to 50 tags. The concatenated strings of ditags are cloned into plasmids and are sequenced. Typically, about 50,000 tags are sequenced for each sample of interest using a high-throughput sequencer. (For additional details on the method, see the review by .) The absolute expression levels of genes in the sample are quantified by counting the number of sequenced tags that correspond with each gene. Figure 80.2 provides a visual scheme of the procedure.
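
The final quantification step, counting how often each tag was sequenced, can be sketched in a few lines of Python. The tags and the tag-to-gene mapping below are fabricated stand-ins (a real analysis would map tags through a resource such as SAGEmap or SAGE Genie), and the tags-per-million normalization is one common convention rather than part of the original protocol.

```python
# Minimal sketch of SAGE quantification: count how often each sequenced tag
# occurs and report abundance, here normalized to tags per million.
# The tag-to-gene mapping is a made-up stand-in for a real annotation resource.
from collections import Counter

sequenced_tags = ["CATGTTTGGA", "CATGAAACCC", "CATGTTTGGA",
                  "CATGGGGTTT", "CATGTTTGGA", "CATGAAACCC"]   # toy 10-bp tags
tag_to_gene = {"CATGTTTGGA": "GENE_A",                        # hypothetical mapping
               "CATGAAACCC": "GENE_B",
               "CATGGGGTTT": "GENE_C"}

counts = Counter(sequenced_tags)
total = sum(counts.values())
for tag, n in counts.most_common():
    gene = tag_to_gene.get(tag, "unmapped")
    print(f"{gene}\t{tag}\tcount={n}\ttags per million={n / total * 1e6:,.0f}")
```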

Figure 80.2, Serial analysis of gene expression (SAGE) is a technique that measures the expression levels of genes in a sample of interest. RNA is isolated from the sample, complementary DNA (cDNA) is synthesized, and several thousand short base pair (bp) tags are isolated. SAGE uses two types of restriction enzyme: one, called the anchoring enzyme (i.e., NlaIII), recognizes a specific 4-bp sequence and cleaves DNA every 256 bp, immediately 5′ of the sequence tag. The second enzyme, called the tagging enzyme (i.e., BsmFI), behaves differently and cleaves DNA 14 to 15 bp immediately 3′ of its recognition sequence. First, cDNA fragments are cut with the anchoring enzyme, captured using streptavidin-biotin affinity chromatography, purified, and split into two pools—A and B. Second, two different linkers (A and B) are ligated to the 5′ ends of the cDNA fragments in their respective pools. The cDNA fragments in each pool are then cut with the tagging enzyme (TE) , resulting in a new fragment that contains the linker plus a short 10-bp region of the cDNA, known as a tag . The two pools of cDNA fragments are ligated together, creating ditags, which are amplified by polymerase chain reaction (PCR) using primers designed on the basis of sequences in linkers A and B. After amplification, the anchoring enzyme is used to cleave the linkers off the ends of the ditags. Ditags from different reactions are ligated together to form strings of 10 to 50 tags. The concatenated strings of ditags are cloned into plasmids and sequenced. Sequences of the tags are used to identify the corresponding genes and represent a measure of transcript abundance.

SAGE technology has continued to mature, and several modifications have refined its application, including increased sequencing efficiency (deepSAGE), improved tag-to-transcript mapping (longSAGE), and a reduced requirement for input RNA (microSAGE). The expansion of SAGE from purely transcriptome-based analysis to genomic analysis has given rise to Serial Analysis of Chromatin Occupancy, which identifies genomic signature tags that pinpoint transcription factor–binding sites. Because of the advent of microSAGE, small amounts of material obtained from needle biopsies and from specific cell types (obtained via fluorescence-activated cell sorting or laser microdissection) are sufficient to allow characterization of global gene expression ( ).

The Cancer Genome Anatomy Project of the National Cancer Institute has chosen SAGE technology to sequence more than 5 million tags across more than 100 different human cell types. Data are stored in a public database known as SAGEmap ( ; ). Several tools, such as SAGE Genie ( ), are available for reliably assigning tags to genes and for performing data analysis and visualization. Additional information and details can be obtained at https://sagebionetworks.org/.

SAGE differs from microarray analysis in that the former employs a sequence-based sampling technique that is not contingent on hybridization and does not require well-defined known genes or sequences. Through this approach, novel genes or gene variants can be elucidated. Furthermore, SAGE provides better gene quantification because it directly counts the number of gene transcripts and is less subject to the background “noise” of the microarray; however, it is more expensive. Modifications of SAGE, such as longSAGE, use a different restriction endonuclease as the tagging enzyme, one that cuts 17 bp 3′ from the anchoring site, generating a tag with a uniqueness probability of >99% and providing the ability to tag a greater proportion of the unannotated genome. Other modifications that require minimal quantities of mRNA for library construction (SAGE-Lite and microSAGE) have also been described ( ; ).

Microarray

DNA microarrays are orderly arrays of spots, each composed of DNA representing a single gene and immobilized onto a solid support such as a glass slide, as described in Chapter 70 . DNA microarrays take advantage of Watson-Crick base pairing; therefore, only strands of DNA that are complementary will hybridize and produce a signal that can be used as a measure of gene expression. Production and use of microarrays requires several steps—including creation of probes, array fabrication, target hybridization, fluorescence scanning, and image processing—to produce a numeric readout of gene expression. Throughout this chapter, a probe will refer to a nucleic acid sequence that is attached to a solid support and the target will be a complementary free sequence of nucleic acids measured for its abundance with the use of microarrays. A detailed description regarding the fabrication of cDNA and oligonucleotide arrays, along with how the RNA is prepared and hybridized to these platforms, is found in Chapter 70 .

DNA microarray experiments can produce millions of data points; this requires a suite of data-processing steps to select relevant genes. Although no standard protocol is available, the following steps usually are included in analysis of a DNA microarray experiment ( Fig. 80.3 ). Image files are converted to numeric values that are normalized and summarized using a software program. Both free and commercially available programs, such as Affymetrix (Affymetrix Inc., Santa Clara, CA) and Agilent (Agilent Technologies, Wilmington, DE), among others, may be used ( ; ). Poor-quality arrays are removed from the analysis. Genes that are not accurately detected by the array and genes that show little variation across samples are filtered out. Then, a variety of computational and statistical analyses are performed. Exploratory data analyses, including classification and identification of differentially expressed genes, can be divided into two categories: supervised and unsupervised methods. Supervised methods, such as class-prediction algorithms, use predefined groups of samples (referred to as the training set ) to identify genes that can distinguish between groups to accurately classify unknown samples (referred to as the test set ). A large number of supervised class-prediction algorithms have been applied to microarray data, including linear or quadratic discriminant analysis ( ), k-nearest neighbors ( ), weighted voting ( ), artificial neural networks ( ), support vector machines ( ), and shrunken centroids ( ). Supervised algorithms—such as significance analysis of microarrays (SAM) ( ), as well as parametric and nonparametric statistical tests between groups of samples—can be used to identify differentially expressed genes. Unsupervised methods, also known as class-discovery methods , can be used to find previously unknown classes, such as novel cancer subtypes, within a data set. Various clustering techniques, such as hierarchical clustering ( ) and self-organizing maps, are commonly used class-discovery algorithms (for a general review, see ; for a cDNA array analysis review, see ; and for an oligonucleotide array analysis review, see ).
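
As an illustration of the unsupervised arm of this workflow, the Python sketch below filters low-variance genes from a simulated expression matrix and then performs average-linkage hierarchical clustering of the samples using a correlation-based distance. The data, the variance cutoff, and the choice of linkage and distance are assumptions made for the example, not recommendations.

```python
# Illustrative sketch of unsupervised microarray analysis: filter low-variance
# genes, then hierarchically cluster samples. The expression matrix is simulated;
# real data would come from a normalized array experiment, and the method and
# threshold choices here are assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_genes, n_samples = 1000, 12
expr = rng.normal(size=(n_genes, n_samples))   # log-ratio-like values
expr[:50, 6:] += 2.0                           # 50 genes elevated in samples 7-12

# Filter: keep genes whose variance across samples is in the top 10%
variances = expr.var(axis=1)
filtered = expr[variances >= np.quantile(variances, 0.90)]

# Unsupervised class discovery: average-linkage clustering of samples,
# using 1 - Pearson correlation as the distance between samples
Z = linkage(filtered.T, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")
print("sample cluster assignments:", labels)
```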

Figure 80.3, Analysis of DNA microarray data. The schematic diagram outlines the various steps required in the analysis of DNA microarray data. Image files are converted to numeric values; normalized, poor-quality arrays are removed from the analysis; and genes that are not accurately detected by the array and genes that show little variation across samples are filtered out. Downstream computational and statistical analysis of the data can be divided into two categories: supervised and unsupervised methods. Supervised methods, such as class prediction algorithms , use predefined groups of samples to identify genes that can distinguish between groups and can accurately classify unknown samples. Unsupervised methods, also known as class discovery methods , try to find previously unknown classes, such as novel cancer subtypes, within a data set.

The Gene Expression Omnibus database (GEO; available at http://www.ncbi.nlm.nih.gov/geo ) is a central repository for high-throughput gene-expression data through which a wide range of microarray data from published experiments is publicly available. All microarray data deposited in GEO adhere to the Minimum Information About a Microarray Experiment (MIAME) guidelines, established to provide basic information about experimental designs, samples, and types of technology used. In addition to GEO, a wide range of bioinformatics tools is available for microarray data analysis ( ; ). No standards for analyzing microarray data have been established, and different methods can often produce different results (for a discussion, see ). Validation studies with larger sample sizes or using a different technology are usually necessary to confirm significant findings.

Real Competitive Polymerase Chain Reaction

Real competitive PCR combines conventional competitive PCR techniques with single-base extension and MALDI-TOF-MS to measure gene-expression levels. The principles of mass spectroscopy are discussed in Chapter 5, and its applications to drug analysis, identification and speciation of microorganisms, and proteomics of cancer-associated proteins are discussed in Chapters 24, 57, and 77, respectively. The use of MALDI-TOF-MS in high-throughput genomic studies is a recent innovation that is capable of absolute gene quantification with extremely high sensitivity and produces results consistent with real-time PCR and DNA microarrays. The basic principles underlying real-time PCR are discussed in Chapter 69.

Analogous to real-time PCR, use of this technique requires previous knowledge of the sequences of the genes of interest. Through this approach (Figs. 80.4 and 80.5), total RNA from a sample is reverse transcribed using random hexamers or gene-specific primers. An 80- to 100-bp region is selected from the gene of interest, and primers are designed to amplify this region in a PCR. A known concentration of an 80- to 100-bp DNA oligonucleotide of the same length and sequence, except for a single-point mutation (known as the competitor), is added to the PCR for amplification with the gene of interest. The competitor and the gene of interest will be amplified with the same kinetics because their sequences are almost identical. As a result, the concentration of the gene of interest can be calculated on the basis of the amount of competitor present in the PCR. A series of PCRs is performed with different concentrations of the oligonucleotide competitor to accurately titrate the final concentration of the candidate gene. Next, each PCR is subjected to a base extension reaction. A short base extension primer (approximately 23 nucleotides long) is designed to anneal to both 80-bp amplified PCR products adjacent to the site of the single-point mutation. A base extension reaction is then carried out with three dideoxynucleotide triphosphates and one deoxynucleotide triphosphate to produce two extension products that differ in their terminal nucleotide. As a result of this one-nucleotide difference, MALDI-TOF-MS is able to identify and quantify the two products on the basis of their different molecular weights. The throughput of the assay can be increased by a technique known as multiplexing, whereby several genes are quantified in a single PCR and primer extension reaction using unique primers, competitors, and extension oligonucleotides for each gene.
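
The quantification logic is simple enough to express directly: because target and competitor amplify with essentially identical kinetics, the ratio of their extension-product peaks in the mass spectrum estimates the ratio of their starting copy numbers. The Python sketch below applies that ratio to a known competitor spike-in; the peak areas and competitor amount are invented numbers for one hypothetical titration point.

```python
# Minimal sketch of the quantification behind real competitive PCR: the ratio
# of target to competitor MALDI-TOF-MS peak areas after base extension estimates
# the ratio of their starting amounts. All numbers below are invented.

def target_molecules(competitor_molecules: float,
                     target_peak_area: float,
                     competitor_peak_area: float) -> float:
    """Estimate starting target molecules from the known competitor spike-in."""
    return competitor_molecules * (target_peak_area / competitor_peak_area)

# Example titration point: 1e5 competitor molecules spiked into the PCR,
# extension-product peaks measured by MALDI-TOF-MS (arbitrary units)
estimate = target_molecules(competitor_molecules=1e5,
                            target_peak_area=8400.0,
                            competitor_peak_area=4200.0)
print(f"estimated target input: {estimate:.2e} molecules")   # ~2e5
```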

Figure 80.4, Real competitive polymerase chain reaction (PCR) coupled with matrix-assisted laser desorption ionization time-of-flight mass spectroscopy (MALDI-TOF-MS) is a technique used to measure the transcript abundance of a gene in a sample of interest. RNA is isolated, complementary DNA (cDNA) is synthesized from a sample, and a region (≈80 bp) of the gene of interest is selected for PCR amplification. A known concentration of an oligonucleotide competitor of the same length and sequence except for a single-point mutation is added to the PCR for amplification. A short base extension primer (≈23 bp) is designed to anneal to both amplified products adjacent to the site of the single-point mutation. A base extension reaction is then carried out with three dideoxynucleotide triphosphates (ddNTPs) and one deoxynucleotide triphosphate (dNTP) to produce two extension products that differ in length by one base. As a result of this difference of one nucleotide, MALDI-TOF-MS is able to identify and quantify the two products on the basis of their different molecular weights.

Figure 80.5, Matrix-assisted laser desorption ionization time-of-flight mass spectroscopy (MALDI-TOF-MS). A laser pulse provides energy for the matrix solution to ionize peptides and oligonucleotides, which travel downstream to the mass analyzer.

Figure 80.6, Proteomic pattern diagnostics. Pattern analysis identifies m/z ratios with the “most fit” combination of proteins for distinguishing between clinical states of interest.

This system, initially developed for high-throughput genotyping, was adapted for gene-expression analysis in 2003 ( ) and was marketed by a California-based company (Sequenom, Inc., San Diego, CA). Because the platform is relatively new, public data and analysis resources are still in development. Applications of the technique include measurement of expression levels for three genes in RNA isolated from buccal mucosal cells obtained from smokers versus nonsmokers ( ), measurement of allele-specific expression of ABCD1 , a gene involved in X-linked adrenoleukodystrophy ( ), and detection of infectious diseases ( ; ). Variations of this methodology, such as phage display–mediated immunopolymerase chain reaction (PD-IPCR) and competitive quantitative real-time polymerase chain reaction (cqPCR), have also been applied in the food industry to detect toxins ( ) and allergens ( ), respectively.

The three genomic high-throughput technologies highlighted previously use very different techniques to measure the abundance of gene transcripts; each technique has a unique set of advantages and disadvantages. A brief overview comparing the three different platforms can be found in Table 80.1 . The major limitation of SAGE is that it requires many laborious PCR and sequencing reactions per sample and, thus, is high-throughput in terms of genes, not samples. Another drawback of SAGE is that it requires a large amount of starting RNA. However, as noted earlier, several modifications have been made to the protocol; a new technique known as microSAGE requires only 1 to 5 ng of poly(A+) RNA ( ). Advantages of the technology include that it is not based on prior sequence information and that it is capable of discovering novel transcripts. The output from SAGE consists of sequence data that allow direct transcript identification with the use of public sequence databases.

TABLE 80.1
Comparison of High-Throughput Genomic Technologies
Equipment needed
  SAGE: Sequencer
  DNA microarray: Arrayer (cDNA only), array scanner
  Real competitive PCR: Nanodispenser, MALDI-TOF-MS
Throughput (genes)
  SAGE: Medium; ≈2000–20,000 tags/day, depending on sequencer
  DNA microarray: Highest; ≈40,000 genes/array
  Real competitive PCR: Medium; ≈100 genes/day
Throughput (samples)
  SAGE: Low; ≈1/week
  DNA microarray: Medium; ≈10–20/day
  Real competitive PCR: High; ≈100/day
Cost
  SAGE: $1500–$2500/sample if 50,000–100,000 tags are sequenced
  DNA microarray: $500–$1000/chip
  Real competitive PCR: $1–$2/gene/sample
Amount of starting material required
  SAGE: 500–1000 μg total RNA/sample
  DNA microarray: 1–15 μg total RNA/sample; as little as 10–100 ng total RNA/sample with a modified protocol
  Real competitive PCR: 5 ng total RNA/gene/sample
Gene sequences must be known a priori
  SAGE: No
  DNA microarray: Yes
  Real competitive PCR: Yes
Absolute gene quantification
  SAGE: Yes
  DNA microarray: No
  Real competitive PCR: Yes
cDNA, Complementary DNA; MALDI-TOF-MS, matrix-assisted laser desorption ionization time-of-flight mass spectroscopy; PCR, polymerase chain reaction; SAGE, serial analysis of gene expression.

DNA microarrays, on the other hand, are higher throughput than SAGE in terms of genes and samples. DNA microarrays are practical for studies assaying clinical samples for which high throughput of samples is required and only small amounts of starting RNA can be obtained. The gene-expression levels obtained from DNA microarrays are only relative transcript levels (in contrast to absolute levels with SAGE) that are dependent on probe selection, making cross-platform comparisons difficult.

Finally, with the use of real competitive PCR, it is difficult to assay several thousand genes, because each gene requires the design of specific primers for the PCR and base extension reactions. Once the assay has been designed, however, the technique can be high throughput in terms of samples. As a result, real competitive PCR probably will not be used as a discovery platform in the way that SAGE and DNA microarrays are used. Rather, it will serve as a validation tool and a potential clinical tool for assaying relatively small numbers of genes across large numbers of individual samples. Continuously evolving ultra high–throughput genomic technologies offered by various vendors (Illumina/Solexa, San Diego, CA; ABI/SOLiD, Foster City, CA; 454/Roche, Branford, CT; and Helicos, Cambridge, MA) will likely be able to provide even greater throughput at reduced cost.

Multiplex PCR

Multiplex PCR (mpPCR) uses multiple primer sets within a single PCR mixture to produce amplicons of differing sizes that specifically identify different DNA sequences. Primer sets are designed so that their annealing temperatures are optimized to work correctly within a single reaction. The resultant amplicons are different enough in size to form distinct bands when visualized by gel electrophoresis. In its original design, this assay is typically efficient for elucidating the presence and relative concentrations of 2 to 20 distinct messages and is limited by the resolution capacity of electrophoretic gel separation (for a review, see ).
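
A trivial design check implied by this description is that the chosen amplicons must be far enough apart in size to resolve as separate gel bands. The Python sketch below walks a hypothetical set of amplicon sizes and flags neighbors that fall within an assumed minimum separation; the target names, sizes, and 30-bp threshold are illustrative, not validated design rules.

```python
# Toy check of a multiplex PCR design constraint: amplicons must differ enough
# in size to resolve as separate gel bands. Sizes and the minimum separation
# are illustrative assumptions, not validated design rules.
amplicon_sizes = {"target_1": 120, "target_2": 190, "target_3": 250,
                  "target_4": 310, "target_5": 330}     # in base pairs
MIN_SEPARATION_BP = 30                                   # assumed gel resolution

ordered = sorted(amplicon_sizes.items(), key=lambda kv: kv[1])
for (name_a, size_a), (name_b, size_b) in zip(ordered, ordered[1:]):
    gap = size_b - size_a
    status = "OK" if gap >= MIN_SEPARATION_BP else "TOO CLOSE"
    print(f"{name_a} ({size_a} bp) vs {name_b} ({size_b} bp): gap {gap} bp [{status}]")
```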

xTAG Technology

xTAG technology (Luminex Corp, Austin, TX) is a next-generation form of multiplexing that overcomes the resolution limits of mpPCR by combining multiplex amplification with particle-based flow cytometry. As in mpPCR, multiple targets can be amplified in a single reaction; because of the added flow-cytometric readout, however, many more tests can be run and resolved at the same time.

Using a viral panel as an example, after obtaining a biological sample, the mRNA is reverse transcribed to cDNA. The cDNA is then amplified using a panel of primers that can specifically amplify many different pathologic/pathogenic nucleic acid sequences at the same time. Each pathogen-specific primer used is tagged with a unique oligonucleotide sequence (called the tag ) as well as a fluorophore. After the multiplex amplification step is completed, the reaction is mixed with microscopic beads that are internally tagged with varying amounts of fluorescent molecules at the time of production. Each different type of bead is also labeled with a unique oligonucleotide sequence that is complementary to the unique tag on the pathogen-specific primer (called the anti-tag ). If both the tag and the anti-tag are present, then hybridization occurs, binding the fluorophore-labeled amplicon to its appropriate fluorophore-labeled bead. The beads are then processed and placed in a special flow-enabled luminometer equipped with two lasers for reading. The first laser identifies the bead based on its internal dye content. The second laser detects how much, if any, tagged amplicon is bound to its surface.
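
Conceptually, the readout reduces to a lookup and a threshold: the classification laser identifies the bead region (and therefore which tag/anti-tag pair, hence which target), and the reporter laser's signal is compared with a cutoff to call the result. The Python sketch below mimics that logic with hypothetical bead regions, analyte names, and an assumed fluorescence cutoff; it is not vendor software, only an illustration of the decoding idea.

```python
# Conceptual sketch of interpreting an xTAG-style readout: the first laser
# identifies the bead region (which anti-tag, hence which assay), and the
# reporter signal is compared with a cutoff to call the result.
# Bead regions, analyte names, and the cutoff are hypothetical.
bead_region_to_analyte = {12: "Influenza A", 25: "Influenza B", 33: "RSV"}
REPORTER_CUTOFF = 150.0   # assumed median fluorescence intensity threshold

# (bead region, median reporter fluorescence) pairs from one well
events = [(12, 842.0), (25, 37.5), (33, 610.2)]

for region, mfi in events:
    analyte = bead_region_to_analyte.get(region, "unknown bead")
    call = "detected" if mfi >= REPORTER_CUTOFF else "not detected"
    print(f"{analyte}: MFI {mfi:.1f} -> {call}")
```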

This technology allows for the resolution of 100 or more tests from one sample at one time in one tube. It can be adapted to perform tests on nucleic acids, peptides, and proteins in a variety of sample matrices (for a review, see ).

High-Resolution Melting Analysis

High-resolution melting (HRM) analysis is a technique for fast, high-throughput post-PCR analysis of genetic mutations or sequence variation in nucleic acids. It enables researchers to rapidly detect and categorize genetic mutations (e.g., single-nucleotide polymorphisms), identify new genetic variants without sequencing (gene scanning), or determine the genetic variation in a population (e.g., viral diversity) before sequencing.

The first step of the HRM protocol is the amplification of the region of interest, using standard PCR techniques, in the presence of a specialized double-stranded DNA (dsDNA) binding dye (e.g., SYBR Green). This specialized dye is highly fluorescent when bound to dsDNA and poorly fluorescent in the unbound state. This change allows the user to monitor the DNA amplification during PCR (as in real-time or quantitative PCR). After completion of the PCR step, a high-resolution melt curve is produced by increasing the temperature of the PCR product, typically in increments of 0.008°C to 0.2°C, thereby gradually denaturing an amplified DNA target. Because SYBR Green is fluorescent only when bound to dsDNA, fluorescence decreases as duplex DNA is denatured, which produces a characteristic melting profile; this is termed melting analysis . The melting profile depends on the length, GC content, sequence, and heterozygosity of the amplified target. When set up correctly, HRM is sensitive enough to allow the detection of a single base change between otherwise identical nucleotide sequences (for a review, see ).
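
The melting analysis itself amounts to reading the melting temperature from the point of steepest fluorescence loss, commonly visualized as the peak of the negative first derivative of fluorescence with respect to temperature. The Python sketch below simulates a melt curve with a sigmoid and recovers the Tm from that derivative peak; the curve shape, temperatures, and step size are invented for the example.

```python
# Minimal sketch of melting analysis: fluorescence drops as the duplex denatures,
# and the melting temperature (Tm) is read from the peak of -dF/dT. The melt
# curve here is simulated; real instruments export fluorescence vs temperature.
import numpy as np

temps = np.arange(75.0, 92.0, 0.1)                # degrees C, fine temperature steps
tm_true = 84.3
fluorescence = 1.0 / (1.0 + np.exp((temps - tm_true) / 0.4))   # simulated melt curve

neg_dF_dT = -np.gradient(fluorescence, temps)     # derivative melt peak
tm_estimate = temps[np.argmax(neg_dF_dT)]
print(f"estimated Tm: {tm_estimate:.1f} C")       # close to 84.3 C
```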

Proteomic

Current proteomic technology offers a variety of promising high-throughput approaches to the investigation of cellular biology. As one moves from the slowly changing, relatively static genome to RNA transcription, and ultimately to protein translation and additional downstream modifications, the information becomes increasingly dynamic. An estimated 30,000 genes ( ) in the human genome give rise to more than 1 million proteins ( ); the complexity of this system defies simple global analysis ( ). Because proteins cannot be amplified, investigators must often work with small quantities of biological material. In addition, the range of protein concentrations, spanning several log units, can obscure signals of clinical utility.
