Toxicogenomics: A Primer for Toxicologic Pathologists


Acknowledgments

We would like to acknowledge the work of the authors of the previous edition of this chapter, “The Application of Toxicogenomics to the Interpretation of Toxicologic Pathology,” by William R. Foster, Donald G. Robertson, and Bruce D. Car, in the 3rd edition (W. M. Haschek, C. G. Rousseaux, and M. A. Wallig, eds., 2013, Academic Press). We would also like to express our appreciation for the critical review by Stacey Fossey and Brad Bolon as well as for the assistance provided by the image editor, Beth Mahler.

Introduction

Toxicologic pathologists evaluate and integrate data from multiple sources in nonclinical and environmental toxicity studies (e.g., clinical signs, clinical pathology, and gross and microscopic pathology data). Traditional pathology assays have been developed over the past two centuries and have become well established, with only incremental technological advances ( ). The acquisition and interpretation of pathology data are relatively standardized, providing reasonable confidence that the underlying biological alterations and their translational relevance are well understood and that these assays can be appropriately utilized for hazard identification and characterization as well as risk assessment ( ).

With the advent and continuing evolution of innovative research technologies, the number of biological endpoints that can be reliably measured from individual subjects in these toxicity studies has increased exponentially. Traditional pathology endpoints (e.g., macroscopic and microscopic assessments, organ weights, clinical pathology analysis) reflect numerous underlying molecular processes (see Morphologic Manifestations of Toxic Cell Injury , Vol 1, Chap 6 ). Technological advances enable the rapid measurement of alterations in DNA, RNA, proteins, lipids, metabolites, and epigenetic markers (i.e., chemical modifications [like methylation] or associated proteins [like histones] that control DNA transcription), thereby providing a basis to understand the molecular mechanisms that lead to a phenotype as measured by traditional pathology assays. This ability to simultaneously measure hundreds of thousands of molecular variables and, more importantly, enhance interpretation of the underlying biological processes represents a unique opportunity to substantially advance the field of toxicologic pathology and associated disciplines ( ).

Basics of Toxicogenomics

Approaches that are typically grouped under the broad field of toxicogenomics include several “-omics” disciplines such as genomics, transcriptomics, proteomics, metabolomics, and epigenomics (also referred to as epigenetics). These technologies enable a systems biology approach in which toxicogenomics data are integrated with traditional assay endpoints to generate a comprehensive understanding of toxic and carcinogenic mechanisms. Toxicologic pathologists are key contributors to a successful “systems toxicology” approach given that their knowledge of and experience with traditional toxicology endpoints provide the insight needed for the biological interpretation of toxicogenomics data. This approach is also referred to as “phenotypic anchoring,” in which morphologic endpoints are linked to the molecular alterations detected by -omics technologies. Phenotypic anchoring provides confidence that molecular changes are biologically meaningful and also supports the validation of bioinformatics methods. In addition, toxicogenomics is being used in predictive toxicology, where chemicals are evaluated against molecular signatures previously defined in in vitro and/or in vivo studies using prototype compounds or defined test conditions.

Toxicogenomics is a multidisciplinary area that requires expertise in traditional toxicology, pathology, molecular biology, and bioinformatics. As a result, it is important to at least understand the basics of each of these disciplines to optimally apply -omics technologies to toxicity investigations. Toxicogenomics technologies assess increasingly complex biological parameters by evaluating molecules ranging from DNA to metabolites (i.e., small [< 1.5 kDa] molecules that represent substrates, intermediates, and products of metabolism) within cells, biofluids, tissues, or organs ( Figure 15.1 ).

Figure 15.1, An overview of the molecules (e.g., DNA, RNA, proteins, and metabolites) in the central dogma and the corresponding toxicogenomics approaches (e.g., genomics, transcriptomics, proteomics, and metabolomics) to examine the biological program directed by these molecules.

The genetic material of all organisms is captured in the genome (i.e., the DNA). Examining the DNA (i.e., genomics) provides information regarding “what can happen” in an organism since DNA is relatively static across all tissues in an organism. In contrast, RNA expression is dynamic and may be altered in a tissue-specific manner under various physiological conditions and in response to exposure to xenobiotics. DNA is transcribed to generate RNA (e.g., messenger RNA [mRNA] and several types of noncoding RNAs [ncRNAs]), and examination of these transcripts (i.e., transcriptomics) provides information about “what might happen” when physiological homeostasis is altered. Translation of mRNA results in formation of proteins, and examination of the complement of proteins (i.e., proteomics) provides information regarding “what is happening” at the moment. Proteins can undergo posttranslational modifications and alteration by enzymatic reactions, resulting in the formation of metabolites; examination of the metabolites (i.e., metabolomics) gives an idea regarding “what already happened.” Epigenomics involves examination of heritable phenotypes that occur without altering the DNA sequence. Epigenetic alterations may be mediated through changes in chromatin packaging, histone modification, DNA methylation, imprinted genes, and ncRNAs ( ; ; ).

All of these -omics technologies can be applied to promote greater understanding of nonclinical study findings. However, transcriptomics largely dominates the field for several pragmatic reasons. First, it is relatively simple and inexpensive to capture the whole transcriptome compared to the whole genome, since the size of the transcriptome in any given tissue or organ is approximately 5% of the genome. Second, capturing the entire proteome and metabolome is technologically challenging compared to capturing the entire transcriptome due to the massive dynamic range (i.e., abundance and size) of proteins and metabolites as well as the lack of inexpensive, high-throughput bioanalytical platforms ( ). For a given sample, the dynamic range of proteins and metabolites can be on the order of billions, while that of transcripts is only in the range of tens to hundreds. Third, the transcriptome provides a snapshot at a point in time that offers insight into “what might happen” (i.e., it helps in generating hypotheses regarding the possible molecular alterations underlying the phenotype). These hypotheses can then be tested at the protein (“what is happening”) and/or metabolite (“what already happened”) levels. In most cases, the transcriptome provides sufficient actionable data in toxicity studies, so further investigations frequently are not pursued. As a result, this chapter will focus mainly on the principles of transcriptomics; however, the same concepts can be applied to other -omics technologies.

Overview of Toxicogenomic Technologies

Toxicogenomics may be broadly classified into nucleic acid-based approaches, such as genomics, transcriptomics, and epigenomics, and nonnucleic acid-based approaches, such as proteomics and metabolomics. The following section provides brief overviews of the key -omics technologies commonly applied in animal toxicity studies. In general, genomics, transcriptomics, and epigenomics utilize similar technologies, such as microarrays or sequencing, while proteomics and metabolomics employ nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS) methods.

Nucleic Acid-Based -Omics Platforms

Genomics refers to the systematic examination of the DNA sequences of the whole genome or part of the genome; however, genomics is also used as an umbrella term for all genome-wide -omics approaches. Genomics may be conducted using array-based technologies, such as single-nucleotide polymorphism (SNP) arrays and comparative genomic hybridization (CGH) arrays, or sequencing-based technologies, such as whole-genome sequencing (WGS), whole-exome sequencing (WES), or targeted gene panels ( ; ). Array-based chips and targeted sequencing panels are efficient tools for screening cell or animal populations for known target genes (represented by specific, shorter DNA sequences), whereas sequencing-based approaches, especially WGS, allow for discovery of new genomic biomarkers. Sequencing of DNA from the entire genome (WGS) is becoming more accessible as a research tool as sequencing costs fall, but it remains quite challenging in terms of data acquisition, analysis, and interpretation. Consequently, more targeted genomic sequencing approaches, such as WES or panels of gene targets designed to evaluate a specific aspect of biology, are increasingly being used in toxicogenomics and cancer research ( ; ; ). Overall, genomics has been applied in toxicology less frequently than transcriptomics.

Transcriptomics refers to the systematic examination of the products of gene transcription. The molecules that can be measured include fully processed mRNA coding for proteins; splice variants (versions of mRNA in which exons have been deleted or added during transcription to produce proteins with different functions); mRNA variants with SNPs; and various RNAs that do not code for proteins but instead regulate the expression of mRNAs, such as microRNAs (miRNAs) and long ncRNAs (lncRNAs). In general, most toxicogenomics data are focused on mRNA expression because these data are less expensive to generate and easier to interpret. The bulk of the transcriptomic data in the literature as well as in commercial or public databases consists of microarray data. However, with the advent of next-generation sequencing (NGS) technology such as RNA-sequencing (RNA-Seq), ncRNAs are also included in some toxicogenomics assessments.

Epigenomics refers to the systematic examination of the reversible modifications of DNA structure and conformation that affect gene expression without altering the DNA sequence. Epigenetic alterations may be mediated through alterations in chromatin packaging, histone modification, DNA methylation, imprinted genes, and ncRNAs ( ; ). Chromatin packaging determines the accessibility of transcription factors to DNA for downstream signaling. Chromatin is a tightly coiled nucleoprotein complex composed of DNA wrapped around histone proteins. Chromatin packaging may be altered by posttranslational modifications of the histone proteins by acetylation, methylation, phosphorylation, ubiquitination, SUMOylation (addition of small ubiquitin-like modifiers), and ADP-ribosylation. In general, histone acetylation decreases affinity of histone proteins for DNA and causes relaxation of chromatin packaging, leading to increased transcriptional activation. Methylation status of the cytosines in CpG dinucleotides within the promoter regions of protein-coding genes influences gene expression; hypermethylation of the promoter typically represses gene expression. Similarly, other histone modifications can increase DNA access either singly or in combination with other proteins and influence gene regulation.
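To make the methylation concept concrete, the short sketch below (an illustration with hypothetical read counts, not a protocol from this chapter) shows how promoter CpG methylation is commonly summarized from bisulfite-sequencing data as a “beta” value, i.e., the fraction of reads supporting methylation at each CpG site, averaged across the promoter. A markedly higher promoter average in treated samples would be consistent with the transcriptional repression described above.

```python
# A minimal sketch, assuming bisulfite-sequencing read counts per CpG site.
# Beta value = methylated reads / (methylated + unmethylated reads); all counts
# below are hypothetical and serve only to illustrate promoter hypermethylation.

def beta_value(methylated: int, unmethylated: int) -> float:
    """Fraction of reads supporting methylation at one CpG site."""
    total = methylated + unmethylated
    return methylated / total if total else float("nan")

def promoter_beta(cpg_counts: list[tuple[int, int]]) -> float:
    """Mean beta value across the CpG sites of a promoter region."""
    betas = [beta_value(m, u) for m, u in cpg_counts]
    return sum(betas) / len(betas)

# Hypothetical (methylated, unmethylated) counts for five promoter CpG sites.
control = [(2, 48), (1, 52), (3, 47), (0, 50), (2, 49)]
treated = [(35, 15), (40, 12), (38, 14), (33, 16), (41, 10)]

print(f"control promoter beta: {promoter_beta(control):.2f}")  # ~0.03 (largely unmethylated)
print(f"treated promoter beta: {promoter_beta(treated):.2f}")  # ~0.74 (hypermethylated, consistent with repression)
```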

ncRNAs constitute another major regulator of the epigenome. These transcripts are not translated into proteins but instead regulate gene expression at the transcriptional and posttranscriptional levels, typically acting to silence genes. They include small ncRNAs (19–31 nucleotides) such as miRNAs and piRNAs (PIWI [P-element-induced wimpy testis]-interacting RNAs), midsize ncRNAs (~20–200 nucleotides) such as snoRNAs (small nucleolar RNAs), and lncRNAs (>200 nucleotides) ( ; ). In toxicogenomics, the focus has been mainly on miRNAs and to a lesser extent on lncRNAs ( ; ). MiRNAs primarily function as negative regulators of gene expression at the posttranscriptional level. lncRNAs interact with DNA, RNA, and protein molecules and regulate gene expression and protein synthesis in various ways; for example, lncRNAs act as molecular scaffolds to modify chromatin complexes and interfere with the transcriptional machinery, modulate RNA processing events such as splicing, translation, and degradation, and regulate miRNAs ( ; ). Both miRNAs and lncRNAs are expressed in a tissue-specific manner, but in contrast to miRNAs, lncRNAs are not conserved across species ( ). Both miRNAs and lncRNAs can serve as sensitive biomarkers of toxicity ( ; ; ; ). Each of these ncRNAs not only regulates normal cellular homeostasis but also exhibits altered expression in various toxicities and in spontaneous and agent-induced cancers ( ; ; ). Several -omics technologies, such as array-based platforms (e.g., whole-genome arrays, miRNA arrays, and methylation arrays) and NGS-based approaches (e.g., chromatin immunoprecipitation sequencing [ChIP-Seq], DNase I hypersensitive site sequencing, assay for transposase-accessible chromatin sequencing [ATAC-Seq], whole-genome bisulfite sequencing, and RNA-Seq), are used to study epigenomics in various in vitro and in vivo biological systems.

Technologies that can comprehensively examine the genome (DNA), the transcriptome (RNA), and the epigenome primarily employ microarrays or NGS. Early toxicogenomics investigations used microarrays, the technology that initiated and drove this discipline ( ). Since then, microarray technologies and data analysis methods have been refined and standardized, and they have contributed the bulk of the existing toxicogenomics (mainly transcriptomic) data in the literature. Owing to decreasing sequencing costs and innovations in sequencing technology, NGS approaches are being used increasingly in toxicogenomics studies. Compared to microarrays, NGS technologies such as RNA-Seq are superior due to their higher dynamic range of quantification, absolute quantification of gene expression changes, low background signal, and potential for discovering novel transcripts and isoforms. Currently, microarray and NGS technologies are comparable in terms of reagent and assay costs; however, the costs associated with data analysis and storage are still higher with NGS approaches because data analysis methods for microarrays are standardized, while those for NGS are still in various stages of development. Most of the legacy transcriptomic data are based on microarrays and are deposited in easily accessible databases such as the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/ ), DrugMatrix ( https://ntp.niehs.nih.gov/data/drugmatrix/ ), TG-GATEs ( https://toxico.nibiohn.go.jp/english/ ), and the Comparative Toxicogenomics Database ( https://ctdbase.org ), where they serve as a critical reference for the analysis and interpretation of new toxicogenomic data ( ; ; ; ). In contrast, transcriptomic databases with RNA-Seq data are still in their infancy, which limits toxicogenomic data interpretation and metaanalysis ( ). Overall, if the goal is to examine gene expression in tissues after test article exposure and then compare the expression profiles to legacy transcriptomic data, microarray platforms are probably more appropriate. However, if the purpose is to discover novel transcripts, alternative splicing isoforms, fusion transcripts, and/or ncRNAs, along with generating whole-genome transcriptomic data, then NGS technologies are preferred.
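As a practical illustration of reusing such legacy data, the sketch below (an assumption about workflow, not part of the original text) loads a microarray “series matrix” file of the kind distributed by GEO into a probe-by-sample table with pandas; the file name is a placeholder for whatever series has actually been downloaded.

```python
# A minimal sketch, assuming a GEO "series matrix" file has already been downloaded.
# Metadata lines in these files begin with "!", so treating "!" as a comment character
# leaves only the probe-by-sample expression table. The file name is a placeholder.

import pandas as pd

series_matrix = "GSExxxxx_series_matrix.txt.gz"  # hypothetical local file from GEO

expr = pd.read_csv(series_matrix, sep="\t", comment="!", index_col=0)

print(expr.shape)         # (number of probes, number of samples)
print(expr.iloc[:5, :3])  # peek at the first few probes and samples
```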

In addition to the above platforms, more cost-effective, high-throughput transcriptomic platforms have been developed that are based on landmark transcriptomic biomarkers, which capture significant changes in biology using a small subset of the transcriptome. These platforms include the S1500+, which is based on TempO-Seq technology ( ); the L1000, which is based on Luminex technology ( ); and variations on these approaches ( ). These platforms are increasingly being used to screen chemical libraries that contain appropriate reference compounds (usually small molecules, although biomolecules may be used) ( ; ).
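The sketch below illustrates, with hypothetical gene names and fold-change values, the signature-matching idea behind such screening: a test article's profile over a handful of landmark transcripts is correlated against reference-compound signatures, and the highest-scoring reference suggests a candidate mechanism. It is a conceptual example only, not the scoring algorithm of any specific platform.

```python
# A minimal, self-contained sketch with hypothetical landmark genes, reference signatures,
# and fold-change values; it is not the scoring method of the S1500+ or L1000 platforms.
# Requires Python 3.10+ for statistics.correlation (Pearson correlation).

from statistics import correlation

landmark_genes = ["Cyp1a1", "Hmox1", "Gclc", "Mki67", "Casp3"]  # illustrative only

# log2 fold changes (treated vs. control) over the landmark genes.
reference_signatures = {
    "AhR agonist prototype":        [4.2, 0.3, 0.1, -0.2, 0.0],
    "oxidative stressor prototype": [0.2, 3.1, 2.5, -0.1, 0.4],
    "cytotoxicant prototype":       [0.1, 1.0, 0.6, -2.3, 2.8],
}
query_signature = [3.8, 0.5, 0.2, -0.4, 0.1]  # signature of an uncharacterized test article

scores = {name: correlation(query_signature, ref) for name, ref in reference_signatures.items()}
for name, r in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} r = {r:+.2f}")
# The query correlates most strongly with the AhR agonist prototype signature,
# suggesting a similar mode of action that would then be investigated further.
```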

Microarray Technologies

Various microarray platforms can be used in toxicity and carcinogenicity assessment depending on the objective. Microarrays (often termed “genechips” from Affymetrix) typically are used to assess various nucleic acids such as RNA for gene expression analysis, DNA for SNP genotyping, DNA for CGH to identify gene structural differences and copy number variants, and DNA/RNA bound to a particular protein that is immunoprecipitated (ChIP) for the analysis of epigenetic effects/gene regulation (“ChIP-on-chip” studies). The principles of microarray studies are very similar across various applications ( ).

A DNA microarray chip is a collection of predesigned oligonucleotide (nt) probes with known sequences (~25 nt per spot for Affymetrix, 60 nt for Agilent) attached to a glass surface in a two-dimensional grid in which each spot is assigned specific coordinates. A typical protocol for gene expression analysis, demonstrated here using an RNA-based example, includes the following steps. (1) High-quality RNA (with an RNA integrity number [RIN] of >7, assessed using an Agilent Bioanalyzer) extracted from cells or tissues is reverse-transcribed by reverse transcriptase into complementary DNA (cDNA); this cDNA represents the targets. (2) The resulting cDNA is amplified by in vitro transcription in the presence of biotinylated ribonucleotides to generate labeled complementary RNA (cRNA). (3) These biotinylated cRNA targets are hybridized to the antisense oligonucleotide (DNA) probes arrayed on the glass surface. (4) After hybridization, the arrays are stained with streptavidin-phycoerythrin (streptavidin binds to the biotin in the cRNA, and phycoerythrin is a fluorophore). (5) The fluorescent signals are captured with a scanner, and a differential gene expression list is generated based on the comparison of groups defined in the experimental design.
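Step (5) can be illustrated with a minimal sketch (hypothetical normalized intensities; real analyses use dedicated packages such as limma and correct for multiple testing): for each probe, the log2 fold change is the difference between the mean log2 intensities of the treated and control groups, and a t-test supplies a p-value for the differential expression list.

```python
# A minimal sketch with hypothetical normalized log2 intensities (rows = probes,
# columns = animals). Real microarray analyses use dedicated tools (e.g., limma)
# and adjust p-values for multiple testing across thousands of probes.

import numpy as np
from scipy import stats

probes = ["Cyp2b10", "Gstm3", "Actb"]
control = np.array([[7.1, 7.3, 6.9, 7.0],      # per-animal log2 intensities, control group
                    [8.0, 8.2, 7.9, 8.1],
                    [10.1, 10.0, 10.2, 10.1]])
treated = np.array([[9.4, 9.8, 9.6, 9.5],      # per-animal log2 intensities, treated group
                    [8.9, 9.1, 8.8, 9.0],
                    [10.0, 10.2, 10.1, 10.1]])

log2_fc = treated.mean(axis=1) - control.mean(axis=1)                      # log2 fold change per probe
pvals = stats.ttest_ind(treated, control, axis=1, equal_var=False).pvalue  # Welch t-test per probe

for probe, fc, p in zip(probes, log2_fc, pvals):
    print(f"{probe:10s} log2FC = {fc:+.2f}  p = {p:.4f}")
```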

Microarray experiments typically have one of two designs. In a single-color microarray experiment, the test and control samples are hybridized to separate chips containing an identical probe array, and differential gene expression is evaluated by comparing the fluorescent signals from the two chips. In a dual-color microarray experiment, the test and control samples are labeled with different fluorescent dyes (e.g., Cy3 and Cy5) and then hybridized simultaneously to the same chip, so that differential gene expression is obtained from a single chip. Due to the complicated experimental designs associated with dual-color microarrays, single-color microarrays have been and remain more popular. Single-color microarrays are also more efficient: because each chip carries a single sample, data from any chip can be compared with data from any other chip, whereas a dual-color experiment must be repeated if new comparisons between samples are needed. Hence, single-color microarrays enable comparison of each individual experimental sample, allow more efficient comparison of new data with findings from other microarray experiments, and better support metaanalyses. Much of the published microarray data are based on single-color microarray experiments. Major vendors of microarrays used in animal toxicity studies include Affymetrix, Agilent, and Illumina.

Next-Generation Sequencing Technologies

Sequencing is the process by which the order (“sequence”) of nucleotides is determined in DNA or RNA fragments. The technologies used for sequencing genomic material have evolved from low-throughput Sanger sequencing-based methods to NGS methods that use highly parallel protocols ( ; ). The principles of NGS are very similar for genomic, transcriptomic, and epigenomic endpoints. In order, the activities include library construction, sequencing, assembly, alignment, variant calling, and sometimes other downstream purpose-driven analyses. The basic premise of library construction is harvesting total DNA (or total RNA that is converted to cDNA) from a tissue sample followed by fragmentation into small pieces of a certain size (typically 100–5000 bp) using physical forces (sonication) and/or enzymes ( ). Next, sequencing adapters are attached to both the 3′ and 5′ ends of the uniformly sized fragments. These adapters not only help immobilize the DNA fragments onto a solid surface (e.g., beads or a glass surface) but also contain priming sites to permit clonal amplification of the molecules (DNA fragments) attached to the adapter. Sequencing priming sites may be present in one or both adapters to support single-end or paired-end sequencing, respectively. Generally, paired-end sequencing is preferred for long genomic DNA fragments, such as for de novo genome assemblies, while single-end sequencing is preferred for small fragments, such as in small RNA-Seq or ChIP-Seq. For some libraries, up to 96 unique indices (molecular barcodes comprised of unique nucleotide sequences) are added within the adapters of each sample to enable multiplexing of samples within a single sequencing run. There are several NGS platforms, such as the Roche 454, Illumina/Solexa, SOLiD, Ion Torrent, and PacBio sequencers, and each of these instruments has its own pros and cons ( ; ). Currently, Illumina is the industry leader due to its sequencing and cost efficiencies as well as its higher throughput, so most NGS data are being generated on Illumina sequencers.
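The multiplexing step can be illustrated with a short sketch (an assumption about file layout, not a production tool): during demultiplexing, each read is assigned back to its sample by matching the index (barcode) sequence recorded in the read header against the expected sample barcodes.

```python
# A minimal sketch, assuming Illumina-style FASTQ headers in which the index (barcode)
# read is the last colon-separated field. "run.fastq" and the barcodes are placeholders;
# this simply counts reads per sample rather than writing demultiplexed files.

from collections import Counter

sample_barcodes = {"ACGTACGT": "vehicle_1", "TGCATGCA": "high_dose_1"}  # hypothetical
read_counts = Counter()

with open("run.fastq") as fastq:
    for i, line in enumerate(fastq):
        if i % 4 != 0:   # each FASTQ record spans 4 lines; the header is the first
            continue
        index_seq = line.strip().split(":")[-1]
        read_counts[sample_barcodes.get(index_seq, "undetermined")] += 1

for sample, n in read_counts.items():
    print(f"{sample}: {n} reads")
```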

For RNA-Seq studies, some additional principles specific to library preparation from RNA samples should be considered ( ). Total RNA isolated from tissue comprises approximately 80% ribosomal RNA (rRNA), 15% transfer RNA (tRNA), and 5% mRNA. Because transcriptomic studies are mainly focused on biologically relevant protein-coding mRNA, RNA-Seq library preparation aims to selectively concentrate the mRNA population. This is accomplished by enriching transcripts with polyA tails or by depleting rRNA (e.g., using Ribo-Zero kits). Since some mRNAs lack polyA tails, the rRNA depletion method, which enriches mRNAs both with and without polyA tails, is usually preferred. The purified mRNA is enzymatically fragmented into pieces of 200–400 bp, after which reverse transcription is used to convert the RNA into cDNA. The cDNA is then carried through the NGS workflow as described above. Care must be taken when constructing RNA libraries to maintain an RNase-free environment.

The NGS workflow is significantly different from that used with microarrays. Therefore, several decisions must be made during development of the experimental design. A detailed discussion of these factors is beyond the scope of this chapter, but some of the key considerations are given below in Table 15.1 .

Table 15.1
Next-Generation Sequencing
Next-generation sequencing (NGS) technologies are increasingly used for toxicogenomics projects. Each new omics technology comes with its own technical jargon, and it is important to understand such terms in order to interpret the data.
The RNA/DNA samples are submitted to a sequencing core laboratory to obtain the transcriptomic/genomic data. In a typical microarray experiment, relatively few decisions need to be made by the investigator prior to sample submission; the key decisions are the desired platform (Affymetrix, Agilent, or other) and the type of gene chip for the desired molecular endpoints. In an NGS experiment, however, several issues need to be considered before samples are submitted to the core laboratory. Before starting the study, it is recommended to review the relevant literature, to consult educational resources and tutorials from Illumina ( https://www.illumina.com/science/technology/next-generation-sequencing/beginners.html ) and Genohub ( https://genohub.com/next-generation-sequencing-guide/#top ), and often to consult with expert technical staff at the core laboratory. Here, we discuss a few practical issues to be considered in planning a sequencing analysis.

  1. What is a sequencing read/cycle? Do I need a single-end read or paired-end reads?

Sequencing instruments “read” (detect) the order of the nucleotides in a DNA fragment from one end to the other in a single-end read assay, whereas in a paired-end assay the instrument reads the fragment from one end and then reads it again from the opposite end. Single-end reads are faster and cheaper and thus are typically used for general profiling studies such as RNA-Seq and small RNA-Seq. Paired-end reads give double the sequencing data at less than double the sequencing cost while providing additional confidence in the inferred nucleotide sequence. Paired-end reads also give additional positional information in the genome and thus are preferred for de novo genome assemblies as well as for studying structural rearrangements such as deletions, insertions, and inversions. Paired-end reads can also be used to examine splice variants, SNPs, and epigenetic modifications.

  2. What read length and how many reads do I need?

Read length refers to the number of base pairs (bp) that can be analyzed at a time. Since each base is read in one cycle, the read length also corresponds to the number of cycles. Typically, read lengths (cycles) of 50–75 are used for RNA-Seq applications and 150–300 for DNA-Seq applications on Illumina sequencing machines, although long-read sequencers such as those from PacBio can produce much longer reads. Currently, Illumina sequencing dominates the field. Read lengths of 75 bp with paired-end reads (2 × 75) for transcriptome analysis and 50 bp with single-end reads (1 × 50) for small RNA sequencing are typical. Likewise, read lengths of 2 × 150 bp are preferred for whole-genome or whole-exome sequencing, and 2 × 150 bp or longer for de novo sequencing.

  3. What read depth or coverage do I need?

Read depth refers to the total number of times a single base is read during a sequencing run, while coverage describes the reads obtained for the novel sequence in the context of the reference sequence (i.e., the whole genome or a targeted region). During sequencing, library fragments are read randomly and independently. As a result, some fragments are read (or “covered”) a greater (or lesser) number of times than others (i.e., the coverage is not uniform for all portions of a long sequence). If the sequencing reads align to only 90% of the target reference, then the coverage is 90%. Higher coverage provides greater confidence that all the fragments that overlap the entire target region are represented (or “covered”) adequately in the data. Ideally, uniform sequence coverage can be confirmed by demonstrating that the data produce a Poisson-like distribution with a small standard deviation and a low interquartile range. The recommended coverage or read depth varies depending on the goals of the experiment: for whole-genome sequencing, whole-exome sequencing, ChIP-Seq, and targeted sequencing, coverage of ~20×, 100×, 40×, and >500×, respectively, is recommended. For RNA-Seq studies, read depth is typically used instead of coverage; depending on the goals of the study, read depths may range from 5 million (M) to 100M reads (e.g., 20M for typical gene expression profiling, 60M or higher to obtain information on alternative splicing, and about 5M for small RNA-Seq). Sequencing capacities of different sequencers vary; as a result, multiple samples can be pooled (each with its own unique index [barcode] adapters) and run in a single lane (multiplexed) to take full advantage of the sequencing capacity of the instrument. Vendors and other sources provide coverage calculators to determine how many samples may be pooled together for a given run depending on the instrument and the type of sequencing kit being used ( https://support.illumina.com/downloads/sequencing_coverage_calculator.html ).
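The arithmetic behind these recommendations can be sketched with the commonly used approximation that mean coverage is roughly (read length × number of reads) / target size; the lane throughput used below is a hypothetical round number rather than a vendor specification.

```python
# A minimal sketch of the planning arithmetic, using the common approximation
# coverage ≈ (read length × number of reads) / target size. The lane throughput
# below is a hypothetical round number, not a vendor specification; vendor
# coverage calculators should be used for real study planning.

def mean_coverage(read_length_bp: float, n_reads: float, target_size_bp: float,
                  paired: bool = False) -> float:
    """Approximate mean depth of coverage across the target sequence."""
    bases_sequenced = read_length_bp * n_reads * (2 if paired else 1)
    return bases_sequenced / target_size_bp

# Whole-genome example: 2 x 150 bp reads, 400 million read pairs, ~3 Gb genome -> ~40x.
print(f"WGS coverage: ~{mean_coverage(150, 400e6, 3e9, paired=True):.0f}x")

# Multiplexing example: if a lane yields ~2 billion single reads (hypothetical) and a
# typical RNA-Seq profiling study needs 20 million reads per sample, ~100 samples fit.
lane_reads, reads_per_sample = 2e9, 20e6
print(f"Samples per lane: {int(lane_reads // reads_per_sample)}")
```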

Protein- and Metabolome-Based -Omics Platforms

Proteomics is broadly defined as “the effort to establish the identities, quantities, structures, and biochemical and cellular functions of all proteins in an organism, organ, or organelle, and how these properties vary in space, time, or physiological state” ( ). The field of protein sequencing had a head start (1950s) compared to RNA and DNA sequencing (1960s), but the technical limitations associated with proteins have hampered proteomics relative to genomics ( ). Some of these technical limitations include a wide dynamic range in abundance and size, difficulties in physical handling, extensive posttranslational modifications, and the tendency to form insoluble complexes, as well as preenzymatic, enzymatic, and nonenzymatic alterations that disrupt the native protein structure. Less than 40% of the variation in protein abundance can be explained by mRNA transcript levels. In addition, proteins are in a constant state of flux associated with the phases of translation, maturation (three-dimensional folding and posttranslational modifications), regulation, and termination ( ). There are an estimated 300,000 proteins considering only splice variants and immediate posttranslational modifications ( ). The proteome has a massive dynamic range in abundance (12 orders of magnitude), so MS approaches require a million to a billion copies of a protein molecule for detection even when combined with advanced analytical techniques such as depletion of abundant proteins and enrichment of particular proteins of interest ( ). In spite of these limitations, significant progress in bioanalytical technologies and bioinformatics approaches has been made in the proteomics arena in recent years, so proteomics is being used more frequently in toxicogenomic studies.

Proteomics Technologies

Proteins can be studied using antibodies (protein arrays), two-dimensional gel electrophoresis (2D-GE) or 2D differential gel electrophoresis (2D-DIGE), NMR spectroscopy, and MS ( ; ).

Protein arrays are limited in their application since antibodies are required against particular peptide targets but usually are not available for novel peptides. As a result, protein arrays are mainly used as screening tools.

2D-GE and 2D-DIGE approaches separate proteins based on their differences in charge (by isoelectric focusing) and size (as migration within the gel depends on molecular mass). For 2D-GE, proteins are visualized after staining with Coomassie blue. The 2D-DIGE method uses fluorescent probes (e.g., Cy2, Cy3, Cy5) to label the proteins from reference and test samples to compare their electrophoretic spectra when both are run on the same gel simultaneously. The individual protein spots on the 2D-GE and 2D-DIGE subsequently can be extracted and subjected to MS to further characterize the proteins.

NMR spectroscopy measures the characteristic frequency changes of the electromagnetic signals produced by the charged atomic nuclei in a molecule when placed in a strong magnetic field and pulsed with varying radiofrequencies (RFs). The frequencies at which the atomic nuclei resonate (i.e., match the frequencies of the RF pulses) produce an exclusive NMR spectrum for that sample (i.e., a unique pattern of resonance peaks at different frequencies). NMR spectra are widely used for measuring small molecules, and identification of proteins is usually limited to proteins smaller than 35 kDa.

MS measures the mass-to-charge ratio (m/z) of ionized molecules moving through electric and magnetic fields. Typically, samples for MS are bombarded with electrons to fragment and ionize them. The ionized peptide fragments migrate through the electric fields at speeds determined by their m/z and travel through the magnetic fields in specific paths (straight or deviated, since ions with a higher m/z deviate less in the magnetic field than ions with a lower m/z), and a detector captures the resulting mass spectrum. The mass spectrum provides information about the structure and elemental composition of the sample, and the isotopic makeup of the fragmented constituents is generally confirmed by comparison to a database of known molecules. Technologies based on MS have a limited dynamic range of detection (5 orders of magnitude). There are many variations of MS depending on the type of sample (e.g., solid, liquid, or gas; complexity of the protein molecule; mixtures), ionizer, mass analyzer, and detector. A full discussion of these variations is beyond the scope of this chapter, but further information can be found elsewhere ( ).

Proteomics using MS may be categorized into bottom-up (BU-MS) and top-down (TD-MS) approaches. Due to its scale and throughput for protein identification, BU-MS proteomics has dominated the field ( ; ). For more technical details on proteomics and potential future technologies, please refer to ( ).

The BU-MS approach (also called “shotgun” proteomics) is used in global protein identification as well as to systematically profile the dynamic proteome. The proteins are digested by trypsin into smaller peptides (0.8–3 kDa). The “tryptic peptides” are analyzed by electrospray ionization or matrix-assisted laser desorption/ionization, both of which ionize peptides in a gas phase, analyze their masses, and then fragment the ions to recover information about their sequence from MS. Alternatively, liquid chromatography–MS (LC-MS) can be used to separate molecules before they are ionized and subjected to MS. The resulting peptide information is not a sequence but more like a fingerprint that is compared to a database using search engines like Mascot ( http://www.matrixscience.com/ ) or Comet ( http://comet-ms.sourceforge.net/ ) to identify the protein ( ). The sensitivity of the BU-MS approach is compromised by the number of spectra required to accurately identify the protein sequence. It is estimated that about 60% of spectra are unresolved and do not provide a complete identification for the protein.
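The first, in silico step of this workflow can be sketched as follows (standard monoisotopic residue masses and an arbitrary illustrative sequence; real search engines also model missed cleavages, modifications, and fragment ions): the protein is digested according to trypsin rules (cleave after K or R, but not before P), and the monoisotopic mass of each tryptic peptide is computed for comparison against a sequence database.

```python
# A minimal sketch: in silico tryptic digestion (cleave after K or R, not before P,
# no missed cleavages) and monoisotopic peptide masses using standard residue masses.
# The protein sequence is an arbitrary illustrative fragment, not a specific database entry.

import re

RESIDUE_MASS = {  # monoisotopic residue masses in daltons
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added once per peptide for the free termini

def tryptic_peptides(sequence: str) -> list[str]:
    """Split after K/R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def monoisotopic_mass(peptide: str) -> float:
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

protein = "MKWVTFISLLLLFSSAYSRGVFRR"  # illustrative sequence only
for peptide in tryptic_peptides(protein):
    print(f"{peptide:20s} {monoisotopic_mass(peptide):10.4f} Da")
```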

The TD-MS approach is advantageous to identify protein isoforms, sequence variants, and posttranslational modification. In the TD-MS approach, intact protein ions are introduced into the gas phase by electrospray ionization and fragmented by collision-induced or electron transfer dissociation in the mass spectrometer. The masses of both the proteins and fragment ions provide a picture of the protein structure and its modifications. However, the TD-MS approach is limited to larger proteins (50–70 kDa) and is complicated by the fact that the fragmentation efficacy of the intact proteins decreases as the molecular weight and the complexity of the tertiary protein structure increases. Therefore, the TD-MS approach mainly focuses on proteins <70 kDa, and only a few projects focus on larger proteins >100 kDa.

Metabolomics is the systematic study of small molecule by-products in cells, tissues, or biofluids ( ; ; ). These small molecules are 50–2000 Da and include sugars, nucleotides, peptides, amino acids, and lipids of either endogenous or exogenous origin. Endogenous metabolites are produced by the cell or organism during normal physiological processes; the endogenous metabolome is considered to comprise 3000–5000 metabolites. The animal's microbiome is another significant contributor to the metabolome, accounting for more than 6700 unique metabolites ( ). The exogenous metabolome is much larger and depends on various sources such as foods, drugs (e.g., therapeutic or recreational), chemicals (e.g., agricultural and occupational), environmental attributes (e.g., air and water pollution), and lifestyle factors (e.g., alcohol- or caffeine-containing beverages, smoking). Most metabolites, regardless of source, are a direct reflection of the underlying biological functions of the organism and are usually conserved across species; hence, they serve as a valuable source of biomarkers of exposure and/or disease. In addition, the examination of metabolites in body fluids collected noninvasively and repeatedly enables longitudinal human epidemiological and animal studies. As a result, metabolomics is considered an ideal technology for exposome research, in which exogenous compounds as well as the consequent endogenous metabolites are queried simultaneously to gain a better understanding of an exposure and its consequences on the biological system ( ). Consequently, metabolomics may provide greater confidence when interpreting the underlying biology responsible for homeostatic processes in health. In general, metabolomics may be used for early detection and intervention, disease diagnosis, prognosis, and monitoring, identification of exposures and their impact on the organism, and target identification for therapeutics and toxicants.
