Protein Biomarkers for Detecting Cancer: Molecular Screening


Biomarkers for early cancer detection, markers specific to a malignancy type, and markers predictive of response to treatment will aid in the early diagnosis of cancer and in the selection of the most effective therapies. Technologies directed toward this goal have grown exponentially in the past decade. However, it is safe to say that the field of disease biomarkers has produced many more publications on the subject than actual “clinically actionable” targets. By no means a criticism, this statement reflects the status quo of a discipline that has been “trying really hard” but finding the goal more elusive with each step forward. Why has progress been slow? Is it just that the technology is not on par with the complexity of human biology? Or perhaps, by focusing mainly on the “DNA-mRNA-protein” paradigm as the fundamental driving force that defines a phenotype, are we oversimplifying and hence misinterpreting the system? We bring these very general questions to the attention of the reader for two reasons: first, to spark a debate on the fundamental issues that biomarker discovery entails, and second, to put our detailed discussion of proteomics into perspective within the larger context of biomarker discovery.

Establishing a panel of biomarkers for early diagnosis of cancer holds tremendous promise but also faces daunting obstacles. The major challenge stems from the very nature of a biological system: its complexity, dynamics, variability, and versatility all make it difficult to draw clear lines between a state that would be considered normal and one that appears to be slightly different and in danger of becoming abnormal. Then come logistical barriers, such as the need to rely mainly on population studies in the face of tremendous intra- and interindividual phenotypic variability, the definition of adequate controls, the availability of specimens, and the danger of their adulteration before analysis (i.e., in the course of collection, processing, and storage). Last but not least is the requirement for sensitive technologies capable of measuring target compounds directly from a biological milieu with a level of specificity and selectivity that allows differentiation among discrete cohorts of individuals/patients reliable enough to justify the risk of acting on specific clinical modalities. There is a growing recognition that achieving true breakthroughs in the use of biomarkers to improve the human condition, while delivering health care in an economically sustainable way, requires concerted and coordinated efforts of stakeholders across disciplines and across borders. A large consortium of authors has recently published an in-depth analysis of the status of implementation of proteomic biomarkers of disease, with a focus on postdiscovery/postvalidation barriers, and outlined the need for a roadmap to biomarker implementation. In our review, we concentrate on the preimplementation stages of the biomarker pipeline from the perspective of the technologies currently available for biomarker discovery and validation, with a focus on protein biomarkers, for which mass spectrometry (MS) plays a primary role.

The discipline of MS-based proteomics emerged through a serendipitous convergence of major technological developments in DNA sequencing, MS analysis of proteins/peptides, and bioinformatics. In the enthusiasm that followed the success of the Human Genome Project, it appeared that understanding how the functional phenotype of a cell/tissue/organism relates to its protein/peptide repertoire was imminent, leading to exaggerated expectations that the new technology of proteomics would deliver meaningful results in a short time. A rush for quick success resulted in controversies and disappointments, and unavoidably triggered questions as to the fundamental validity and practical usefulness of proteomics approaches. The involvement of funding agencies (e.g., the National Cancer Institute [NCI]), which funded programs aimed at addressing various stumbling blocks of proteomics technologies (e.g., the Clinical Proteomics Technologies for Cancer), and of professional organizations and consortia (e.g., the Human Proteome Organization [HUPO] and the Proteomics Specification in Time and Space [PROSPECTS] Network) has been crucial, providing funding and building foundations and collaborations among various disciplines to ensure future success in developing protein-based biomarkers of disease. Now, after more than a decade of intense effort, the promise of proteomics remains valid. Accumulated experience, however, has required a reassessment of the prospects of achieving quick payoffs, especially in the context of translational research, as illustrated by a recent breast cancer biomarker study using an animal model that is reviewed at the end of this chapter. Nevertheless, the field is progressing steadily, and the lessons learned in the early days of the proteomics bonanza are informing the directions of new developments. Thousands of papers on the subject have been published to date. Here we focus on selected aspects of method development that, in our opinion, will play a significant role in moving this effort toward its eventual success.

Before discussing different aspects of technology for protein biomarker discovery, it is important to ponder the nature of the objects of inquiry, that is, a protein and a proteome, in the context of the goal of finding biomarkers of the physiological processes leading to tumorigenesis. As a first approximation, a protein can be viewed as a gene that has been realized in a dimension defined by the available amino acid building blocks. With this simple, though simplistic, concept of protein-gene equivalence in mind, the proteome can be reduced to a catalogue of the gene products present in a cell. From this perspective, experiments aimed at monitoring gross changes in the repertoire of “protein parts” (i.e., the presence or absence of a specific gene product, or significant differences in protein levels) would lead to the detection of robust biomarkers. Most protein biomarker discovery studies to date have been based on these operational definitions of a protein and a proteome. The adopted strategies represented a natural extension of familiar approaches used in functional genomics and took advantage of the availability of the required, albeit imperfect, experimental tools. However, although proteins start their lives as strings of amino acid residues arranged in an order prescribed by DNA, they subsequently acquire a multitude of attributes that endow them with functional competence: they fold into specific shapes; they might need to be “cut to order” by proteolytic enzymes; they are decorated with modifications, many of which are transient by design; they are localized to the proper compartment of a cell; they interact with other cell components, including other proteins, to form “molecular machines”; and, once their mission has been fulfilled, they need to be disposed of in an orderly fashion. Defects in any of these processes could lead to, or result from, disease and hence could serve as valuable biomarkers. However, because of the very nature of experimental design and the confounding factors of system complexity and limitations of technology, these types of biomarker candidates are likely to be missed in studies geared solely toward detecting changes in peptide/protein concentration. Although technically more challenging, approaches that target specific attributes of protein structure and function, including but not limited to the examples listed earlier, are gaining importance in the biomarker discovery field. Of note, systems biology-based predictions are still far from being able to model all aspects of protein “life.” Thus, analysis of the final protein product remains the only approach currently available for characterizing protein-driven cell processes. As we discuss in this chapter, MS-based proteomics approaches enable the exploration of biomarkers in these various contexts.

Defining “Normal”

“Always remember that you are absolutely unique. Just like everyone else.” This maxim, attributed to Margaret Mead, is also an excellent comment on the challenges intrinsic to biomarker discovery. What is normal for one person might be outside the norm for another. Hence, the “perfect world” scenario of early detection of malignancy, or any other malady, would use “self” as the control and rely on serial analyses of specimens collected at different time points, thus focusing on changes in, rather than absolute levels of, putative markers of health or disease. For a variety of reasons, longitudinal collection of specimens from the population at large is not feasible. Therefore, we are limited to epidemiological approaches. In this context, assessing the level of variability within a population is vital to provide a baseline/reference point for evaluating the consequences of disease in terms of biomarkers. To this end, a number of tools have been proposed and/or are being developed. A protein equivalent of the HapMap and other repositories of human DNA sequences is the foundation on which disease-related changes in the protein repertoire will need to be built. Databases of MS-identified peptides are being generated for human and model organisms (e.g., mouse). In addition, the normal range of posttranslational modifications (PTMs) should be considered. For example, population proteomics proposes using targeted affinity capture approaches combined with high-throughput MS to screen large numbers of samples for a broad array of modifications. Antibodies or lectins could be used for enrichment. In this regard, the Human Protein Atlas is a valuable resource with respect to antibodies, and it will be important to generate a lectin equivalent. Examples of these datasets include publications on the normal urine, breast, oral epithelium, liver, and brain proteomes and on PTMs in various settings. These catalogues of normal proteomes and PTMs will be crucial for identifying disease-related changes, that is, candidate biomarkers.
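
To make the contrast between population-referenced and self-referenced baselines concrete, the Python sketch below compares a single measurement of a hypothetical biomarker against a population reference interval and against the same individual’s longitudinal baseline. All numbers, the marker, and the thresholds are invented for illustration; this is a minimal model of the statistical idea, not a clinical procedure.

```python
# Contrast between a population reference interval and a personal,
# longitudinal baseline for a hypothetical biomarker. All values are
# simulated; units are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

# "Healthy population" levels: wide spread, reflecting large
# inter-individual variability.
population = rng.normal(loc=100.0, scale=25.0, size=5000)
lo, hi = np.percentile(population, [2.5, 97.5])  # 95% reference interval

# One person's serial measurements form a much tighter baseline.
personal_baseline = np.array([78.0, 80.0, 79.0, 81.0, 77.0])
new_value = 95.0  # latest measurement for this person

mu, sd = personal_baseline.mean(), personal_baseline.std(ddof=1)
z_personal = (new_value - mu) / sd

print(f"Population 95% reference interval: {lo:.1f}-{hi:.1f}")
print(f"Within population interval: {lo <= new_value <= hi}")
print(f"z-score vs. personal baseline: {z_personal:.1f}")
```

A value can sit comfortably inside the population interval yet depart dramatically from the individual’s own baseline; that departure is exactly the signal a longitudinal, “self as control” design would capture and a purely cross-sectional design would miss.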

MS Proteomics for Biomarker Discovery and Validation: An Overview of Basic Methods

The ideal biomarker discovery methods should be comprehensive, sifting through as many potential targets as possible; verification/validation methods should be specific and accurate, filtering out all false positives. The “open-mindedness” of MS, which detects any species that can be ionized, transferred to the gas phase, and made to produce discernible signals under the chosen experimental conditions, makes it an excellent biomarker discovery tool (Figure 22-1). Because it is not necessary to know beforehand the identities of the compounds to be monitored, untargeted (shotgun) approaches can generate extensive information on sample content. Conversely, MS assays can also be executed in a targeted fashion by focusing data acquisition on prespecified compounds of interest, thus providing the high level of selectivity required for biomarker validation experiments. In their most sophisticated format, verification/validation assays employ standards labeled with stable isotopes to enable high-sensitivity detection, structure confirmation, and absolute quantification of selected protein targets (Figure 22-2). In contrast to classical immunochemistry-based approaches, MS-based biomarker validation assays, that is, stable isotope dilution (SID) multiple reaction monitoring (MRM) MS (also called selected reaction monitoring [SRM] MS), do not rely on protein-specific antibodies and offer high multiplexing capabilities. To achieve the ultimate sensitivity in detecting and quantifying selected proteins, immunoaffinity isolation of representative target peptides followed by MRM is used; this approach is referred to as stable isotope standards with capture by anti-peptide antibodies (SISCAPA). Last, recent developments in MS technology, namely improvements in mass resolution that yield significantly higher accuracy in detecting true targets, together with greatly enhanced scan rates, have enabled hybrid MS discovery workflows in which targeted and untargeted methods are executed in a single analysis. These hybrid workflows allow for the preferential analysis of predetermined ions while the excess capacity is used to analyze “unknowns” (see Figure 22-1, B).
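
As a rough illustration of the SID principle, the following Python sketch estimates the amount of an endogenous (“light”) peptide from the ratio of its MRM peak areas to those of a co-eluting stable-isotope-labeled (“heavy”) standard spiked in at a known amount. The peak areas, spike amount, and simple averaging of per-transition ratios are all hypothetical; real assays add calibration curves and interference checks on each transition.

```python
# Stable isotope dilution quantification: the endogenous ("light")
# peptide is measured against a spiked-in, isotope-labeled ("heavy")
# standard of known amount. Peak areas and the spike are hypothetical.
def sid_mrm_quantify(light_areas, heavy_areas, heavy_spike_fmol):
    """Estimate the endogenous peptide amount (fmol) from matched
    light/heavy MRM transition peak areas."""
    if len(light_areas) != len(heavy_areas):
        raise ValueError("light and heavy transitions must be matched")
    # The heavy standard co-elutes and fragments identically, so the
    # light/heavy area ratio cancels matrix and instrument effects.
    ratios = [l / h for l, h in zip(light_areas, heavy_areas)]
    mean_ratio = sum(ratios) / len(ratios)
    return mean_ratio * heavy_spike_fmol

# Three transitions monitored for one signature peptide.
amount = sid_mrm_quantify(
    light_areas=[12000.0, 8500.0, 4300.0],
    heavy_areas=[24500.0, 17000.0, 8900.0],
    heavy_spike_fmol=50.0,
)
print(f"Estimated endogenous peptide: {amount:.1f} fmol")
```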

Figure 22-1, Shotgun (A) and hybrid (B) proteomics discovery workflows. A protein mixture isolated from a specimen consists of a great variety of species present at different concentrations (the size of a “protein shape” reflects its relative abundance). In either workflow, only a fraction of the proteins/peptides present in the sample are identified. At step 1, proteins are cut into smaller, manageable pieces using proteolytic enzyme(s), most commonly trypsin, to generate peptides (circles; different sizes represent relative concentrations). The resulting peptide mixtures, which typically contain hundreds of thousands of species, are prefractionated using chromatography (not shown) and then submitted to LC MS analysis (step 2). To identify a peptide, and hence the protein from which it was derived, peptide ions must be fragmented into smaller, sequence-dependent parts. Peptide fragmentation is performed within the mass spectrometer and is referred to as tandem mass spectrometry (MS/MS). Most MS/MS approaches currently in use rely on analyzing each peptide molecular ion (precursor) one by one. Given the complexity of a sample, at any given time the mass spectrometer is presented with many more precursors than it can possibly analyze by MS/MS. Hence, ion prioritization for MS/MS analysis is necessary. In an untargeted, shotgun workflow (A), selection of a peptide molecular ion for fragmentation is based on signal intensity, leading to a stochastic choice of precursors from within a large array of potential contenders. Predictably, shotgun analysis is biased toward selecting highly abundant peptides at the expense of less abundant ones. In the hybrid workflow (B), a predefined set of precursor ions of interest is selected for MS/MS (step 1), followed by a shotgun, signal intensity-based acquisition routine (step 2), as described for (A). The targeted ions are analyzed by MS/MS regardless of their relative abundance, as long as they meet the analysis-wide signal-to-noise threshold for precursor ion intensity (note the small size of the red dots representing target peptides selected for MS/MS in workflow B). In step 3, the experimental MS/MS spectra are compared to theoretical, in silico-generated mock spectra predicted for peptides representing all proteins for which DNA sequences are known. In this process, the peptide identities, and hence the proteins, are not derived de novo from the MS/MS data. Rather, they represent the best matches between the observed and theoretically predicted spectra. Of note, some of the molecular ions selected for MS/MS analysis typically do not generate reliable matches, and hence the number of identified peptides is always lower than the number of acquired MS/MS spectra. In step 4, bioinformatics tools are used to combine peptide-based matches to generate a list of the proteins that are most likely present in the mixture. Because of the targeted nature of workflow B, low-abundance proteins (red) that are missed by the shotgun workflow (A) can now be identified. Thus, using a targeted approach for discovery overcomes the limitations of stochastic ion selection. Hence, to compare the compositions of two samples, it is necessary to target the species of interest in both samples rather than rely on stochastic sampling by the shotgun workflow.
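
The prioritization logic contrasted in the two workflows can be reduced to a few lines of code. The Python sketch below is a deliberately simplified model, with invented m/z values, capacity, and thresholds, of how a purely intensity-driven (data-dependent) top-N routine differs from a hybrid routine that first honors an inclusion list of target precursors and only then fills the remaining capacity by intensity.

```python
# Simplified precursor prioritization: top-N by intensity (shotgun)
# versus a hybrid scheme with an inclusion list of target m/z values.
# All m/z values, intensities, and thresholds are illustrative only.
def select_precursors(survey, capacity, inclusion=None, tol=0.01, noise=50.0):
    """survey: list of (mz, intensity) pairs from one MS survey scan.
    Returns the precursors chosen for MS/MS, targets first."""
    above_noise = [(mz, i) for mz, i in survey if i >= noise]
    chosen = []
    if inclusion:
        # Hybrid step 1: take detected targets regardless of their rank.
        for target in inclusion:
            hits = [p for p in above_noise if abs(p[0] - target) <= tol]
            if hits:
                chosen.append(max(hits, key=lambda p: p[1]))
    # Step 2: fill remaining capacity with the most intense leftovers,
    # i.e., the stochastic, abundance-biased shotgun behavior.
    rest = [p for p in above_noise if p not in chosen]
    rest.sort(key=lambda p: p[1], reverse=True)
    chosen.extend(rest[: max(0, capacity - len(chosen))])
    return chosen

survey = [(445.12, 9e5), (512.28, 3e3), (623.31, 120.0), (702.40, 5e4)]
print(select_precursors(survey, capacity=2))                      # shotgun top-2
print(select_precursors(survey, capacity=2, inclusion=[623.31]))  # hybrid
```

Note how the low-abundance precursor at m/z 623.31 is sequenced only when the inclusion list is supplied; under the plain top-N rule it is always outcompeted by more intense ions.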

Figure 22-2, Outline of stable isotope dilution multiple reaction monitoring mass spectrometry (SID MRM MS) experiments

Discovery and verification/validation platforms share a number of analytical steps, although their modes of execution can differ significantly. Typically, liquid chromatography (LC) is used to fractionate a sample before MS analysis. Eluting compounds are either transferred online in a continuous fashion to the mass spectrometer for electrospray ionization (ESI) MS or collected offline in discrete fractions for subsequent matrix-assisted laser desorption ionization (MALDI) MS. The classical experimental approach to proteomics MS data acquisition, which still dominates the field, involves rapidly toggling between two distinct, automatically executed modes of operation: MS and tandem MS (MS/MS). The MS mode operates under conditions that maintain the molecular integrity of analytes and delivers a survey of the mass-to-charge ratios (m/z) and relative intensities of the detected molecular ions (precursors). During MS/MS, a subset of precursor ions is transferred, one by one, to a collision cell, where they are dissociated under controlled conditions into a series of fragment (product) ions whose m/z values and relative intensities carry structure-specific information. Although these elemental steps are common to both verification/validation and discovery platforms, the paradigms of data acquisition and analysis differ significantly. In the former case, precursor ion selection before MS/MS is centered on the predetermined masses of the components of interest; all other species in the sample are ignored. In the latter case, the targets are yet to be discovered and, hence, precursors are selected in a stochastic fashion. As to data analysis, in discovery assays the experimental MS/MS spectra are compared to theoretical, in silico-generated mock spectra predicted for peptides representing all proteins for which DNA sequences are known. In verification/validation assays (SID MRM MS), confirmation of analyte identity is based on matching the MS/MS data to experimentally established, rather than predicted, MS/MS fragmentation features (i.e., the types and relative intensities of product ions).
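
The “predicted versus observed” comparison that underlies discovery-mode data analysis can be illustrated with a toy fragment-matching routine. The Python sketch below computes theoretical singly charged b- and y-ion m/z values for candidate peptides and counts matches against a small, fabricated MS/MS peak list; production search engines score matches far more rigorously and control false discovery rates, so this shows only the principle.

```python
# Toy database-search scoring: predict b/y fragment ions for candidate
# peptides and count matches to an "observed" spectrum. The spectrum
# and candidate peptides are fabricated for illustration.

# Monoisotopic residue masses (Da) for the residues used below.
RES = {'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
       'V': 99.06841, 'L': 113.08406, 'K': 128.09496, 'R': 156.10111,
       'E': 129.04259, 'D': 115.02694}
PROTON, WATER = 1.00728, 18.01056

def by_ions(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [RES[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y

def count_matches(observed_mz, peptide, tol=0.02):
    """Number of observed peaks explained by the peptide's b/y ions."""
    theo = by_ions(peptide)
    return sum(any(abs(o - t) <= tol for t in theo) for o in observed_mz)

# Fabricated "observed" spectrum scored against two candidates:
# the best match (here VLEDR) would be reported as the identification.
spectrum = [175.119, 290.146, 342.202, 419.189]
for cand in ("GASPK", "VLEDR"):
    print(cand, count_matches(spectrum, cand))
```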

Completeness of Shotgun Biomarker Discovery Proteomics

Limitations Inherent to Peptide-Centric Strategies for Protein Biomarker Discovery

With the exception of protein profiling, described later, the great majority of workflows that are currently used for protein biomarker discovery and validation rely on the analysis of peptides generated from proteins via enzymatic or chemical digestion. Consequently, the information about the structure of a protein identified in the sample is limited to the portion of the amino acid sequence encompassed by the observed peptides (Figure 22-3). When the goal of analysis is to detect differences in relative protein concentration, limited sequence coverage by the detected peptides often leads to ambiguity in identifying highly homologous proteins. Most importantly, any subtle but functionally relevant disease-related structural modifications resulting from posttranscriptional/posttranslational processing are likely to remain invisible when shotgun approaches are used (see Figure 22-3). To remedy this problem, a number of specialized methods based on either the chemical or the biochemical characteristics of specific modifications (e.g., affinity capture using antibodies or lectins) are being developed to increase the chances of detecting biomarkers that reflect proteome alterations extending beyond changes in protein expression and/or degradation. The ideal solution would be a protein-centric approach: analyzing intact proteins by MS to directly reveal potential changes in highly heterogeneous protein populations. Presently, high-resolution MS and MS/MS analysis of intact proteins in biological samples remains technically challenging. However, great strides in instrument design, along with novel method development, continue to advance this area of proteomics research.
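
A simple way to appreciate this limitation is to compute sequence coverage directly. In the Python sketch below, a handful of “identified” peptides are mapped back onto a protein sequence (both the peptides and the toy sequence are invented for illustration), and the fraction of residues observed is reported; any modification located in an uncovered stretch is invisible to the experiment.

```python
# Sequence coverage from identified peptides: residues never covered by
# an observed peptide carry no structural information. The protein and
# peptide strings below are toy examples, not real database entries.
def sequence_coverage(protein, peptides):
    """Fraction of protein residues covered by the identified peptides."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)

protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHK"   # toy sequence
identified = ["WVTFISLLLLFSSAYSR", "DTHK"]  # peptides "seen" by MS/MS
print(f"Coverage: {sequence_coverage(protein, identified):.0%}")
```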

Figure 22-3, Challenges and opportunities presented by special features of protein structure, e.g., protein posttranslational modifications in proteomics analysis

Technical Limitations of Data Generation

The great complexity and large dynamic range of protein concentrations in biological samples present major challenges to all MS proteomics workflows because of the mismatch between the number of molecular ions generated and the MS analyzer’s capacity to process them by MS/MS. As a result, a large fraction of the detectable ions are not identified in a single MS/MS experiment (see Figure 22-1). Furthermore, because of the combination of the system’s inherent nonlinearity and the stochastic selection of precursors, the subsets of ions analyzed by MS/MS will not be identical across replicate experiments, generating dissimilar, albeit overlapping, protein/peptide lists. The adverse effects of crowding-related competition lead to undersampling, which affects untargeted discovery platforms to a much greater degree than targeted approaches. The actual extent of these limitations was demonstrated through a series of controlled, parallel benchmarking experiments, which established that prior knowledge of signature peptides (i.e., peptides having sequences that are unique to the targeted proteins) markedly improves overall detection sensitivity and the reliability of quantification. Performing replicate LC MS runs increases the number of identified m/z features. To maximize efficiency and information content, it is advantageous to set up replicate analyses in an intelligent fashion to ensure that the same set of molecular ions will be interrogated across all samples, while allowing room for new discoveries in the course of each iteration. To this end, the hybrid untargeted/targeted workflows described earlier are best suited to monitor specific molecular ions that were missed in parallel analyses, thus minimizing information gaps in sample sets. Undersampling tends to skew the results in favor of the most abundant species, especially when ion intensity is used as the sole criterion for precursor selection, that is, when “data-dependent” routines are used. Thus, novel iterative “information-dependent” data acquisition routines are being developed that intelligently prioritize targets for tandem MS on the basis of previously collected data.
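
One simple embodiment of such an information-dependent strategy is exclusion carried across replicate runs: precursors already selected for MS/MS in earlier runs are ignored, so each new run spends its limited capacity on previously unsampled ions. The Python sketch below models this scheduling idea with invented m/z values and intensities; it does not represent any particular instrument’s implementation.

```python
# Iterative, cross-run exclusion: each replicate run skips precursors
# already fragmented earlier, pushing MS/MS deeper into the sample.
# The survey list and capacity are illustrative only.
def iterative_runs(survey, capacity, n_runs, tol=0.01):
    """survey: list of (mz, intensity); returns the m/z selected per run."""
    excluded, history = [], []
    for _ in range(n_runs):
        candidates = [(mz, i) for mz, i in survey
                      if all(abs(mz - e) > tol for e in excluded)]
        candidates.sort(key=lambda p: p[1], reverse=True)  # intensity-driven
        selected = [mz for mz, _ in candidates[:capacity]]
        excluded.extend(selected)  # do not re-sequence in later runs
        history.append(selected)
    return history

survey = [(450.1, 9e5), (520.2, 6e5), (610.3, 4e4), (705.4, 2e4), (810.5, 9e3)]
for run, ids in enumerate(iterative_runs(survey, capacity=2, n_runs=3), 1):
    print(f"Run {run}: {ids}")  # successive runs reach lower-abundance ions
```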
