Cancer Systems Biology: The Future


Over the past decade, complementary and at times antithetic views of tumor initiation and progression have emerged, often based on the introduction of novel high-throughput technologies for the characterization of the cell’s genetic and epigenetic landscape. On the one hand, the availability of a comprehensive map of the human genome has allowed the development of gene expression profiling techniques, mostly microarray based, to monitor the dynamic state of RNA transcripts in cancer cells. These efforts have revealed the existence of molecularly distinct subtypes of morphologically indistinguishable tumors, often associated with differential outcome, progression, and chemosensitivity. They have also helped identify key genetic programs that are consistently activated (e.g., proliferation, migration, immunoevasion), inactivated (apoptosis, senescence), or frequently modulated (adhesion, angiogenesis, etc.) in tumorigenesis. On the other hand, genome-wide studies of both heritable and somatic human variability have moved from theoretical concept to practical reality, opening a new window on both the heritable and the somatic components of cancer etiology. Yet, even as we achieve increased sensitivity in the identification of recurrent somatic alterations for several of the major tumor types, elucidation of the mechanistic role of genetic variability in cancer remains, overall, an elusive target.

Despite these advances, the relationship between genetic alterations and activation/inactivation of specific genetic programs contributing to cancer subtypes remains poorly understood, and the precise cascade of molecular events leading to tumorigenesis and progression is largely uncharted. For instance, although the mesenchymal subtype of glioblastoma is now universally accepted as a distinct subtype, only relatively rare mutations in the NF1 gene appear to co-segregate with it, and the mechanism by which NF1 drives the subtype has not been elucidated (Figure 20-1). Similarly, despite massive sequencing efforts, many mutations discovered in diffuse large B-cell lymphoma fail to precisely co-segregate with its two main functional subtypes, the activated B-cell (ABC) and germinal center B-cell (GCB) phenotypes, which are associated with differential outcome. Even in very common tumors, such as prostate cancer, the repertoire of genomic alterations that contribute to the indolent versus the more aggressive tumors is still unknown. Critically, because of impractical requirements for cohort sizes and lack of methodologies that maximize power for such detection, few epistatic interactions and low-penetrance variants have been identified so far.

Figure 20-1
Genomic alterations in glioma co-segregate with only some of the identified molecular subtypes.

(With permission from Verhaak RG, Hoadley KA, Purdom E, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010;17:98-110).

This chapter introduces a set of novel approaches and strategies, mostly developed over the past decade, for the elucidation of mechanisms associated with cancer initiation, progression, and chemosensitivity that, collectively, go under the name of cancer systems biology. A fundamental departure from previous methodologies is that, instead of being driven by the isolated analysis of a specific data modality, such as genomic alterations or gene expression profiles, the new discipline is both highly integrative and, more importantly, model driven. By the latter term, we mean that cancer-related datasets are analyzed using small- or large-scale models of the cellular machinery most likely to have generated them. These models are still in their infancy and are largely imperfect and incomplete. Yet, even in this embryonic state, they are starting to provide significant new insight and dissecting power, which will only increase as the models become more accurate and comprehensive.

Specifically, a key challenge for previous methods, such as genome-wide association studies (GWAS), is lack of statistical power once datasets become truly genome wide. Indeed, given the very large number of somatic events routinely discovered in cancer genomes, including mutations, translocations, gene fusions, aberrant copy number changes, and structural rearrangements, distinguishing “drivers” from “passengers” is challenging and often impossible on a purely statistical basis (Figure 20-2). Not surprisingly, the tumors where greater progress has been made are those with somewhat more benign mutational landscapes, such as leukemias and lymphomas. Still, a significant fraction of these tumors lacks an appropriate causal genetic characterization or mechanistic elucidation of the relationship between genetic alterations and molecular phenotypes. The same can be said for other genetic or epigenetic data modalities, from gene expression to DNA methylation profiles, which produce long lists of candidate genes with no intrinsic prioritization.

Figure 20-2
Circos plot showing the whole-genome catalogue of somatic mutations from the malignant melanoma cell line COLO-829.
This genome has approximately 30,000 somatic base substitutions and 1000 somatic insertions and/or deletions. In coding exons, 272 somatic substitutions are present, including 155 missense changes, 16 nonsense changes, and 101 silent changes. The numbers and types of mutations are highly variable across different cancer genomes. Chromosome number and karyotype are indicated on the exterior of the plot. Key: blue lines, copy number across each chromosome; red lines, sites of loss of heterozygosity (LOH); green lines, intrachromosomal rearrangements; purple lines, interchromosomal rearrangements; red spots, nonsense mutations; green spots, missense mutations; black spots, silent mutations; brown spots, intronic and intergenic mutations (merged).

(With permission from Garnett M, McDermott U. Exploiting genetic complexity in cancer to improve therapeutic strategies. Drug Discov Today. 2012;17:188-193).
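To make the statistical challenge concrete, the following is a minimal sketch of a purely count-based driver test: under the passenger hypothesis, the non-silent mutations observed in a gene should follow a binomial distribution governed by a background mutation rate. All numbers here are hypothetical, and real driver-detection tools (e.g., MutSigCV) additionally model per-gene covariates such as replication timing and expression; the point is only to illustrate why modest cohort sizes leave such tests underpowered.

```python
from scipy.stats import binomtest

# Hypothetical inputs: non-silent mutations observed in one gene across
# a cohort, total sequenced coding bases for that gene across samples,
# and a genome-wide background mutation rate (per base) estimated from
# silent and intergenic mutations.
background_rate = 3e-6          # mutations per sequenced base (assumed)
gene_mutations = 18             # non-silent mutations seen in the cohort
gene_bases = 1_500 * 500        # coding length (bp) x number of samples

# Under the passenger hypothesis, counts are Binomial(n, p); a small
# p-value flags a candidate driver. With ~20,000 genes tested, the
# p-value must also survive multiple-testing correction, which further
# erodes power in small cohorts.
result = binomtest(gene_mutations, n=gene_bases, p=background_rate,
                   alternative='greater')
print(f"driver-test p-value: {result.pvalue:.2e}")
```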

Recently, alternative approaches to GWAS-style statistical analysis have started to emerge. The rationale for these methods is that genome-wide regulatory models representing causal molecular interactions in the cell—for example, transcription factors regulating their transcriptional targets or protein kinases activating their substrates—may help us identify a relatively small number of candidate genes, upstream of the genetic programs that are dysregulated, which may then be tested for genetic and epigenetic alterations (Figure 20-3).

Figure 20-3
The -omics layers of the cell both encode and are processed by a context-specific regulatory logic.
At the atomic level, this logic is implemented via molecular interactions, such as protein-DNA, protein-protein, protein-RNA, and RNA-RNA. Dissection and interrogation of this logic in context-specific fashion, using systems biology approaches, is starting to allow elucidation of driver genes responsible for the presentation of relevant cancer-related phenotypes.
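As a sketch of how such a regulatory model can be interrogated, the snippet below tests whether a transcription factor's inferred targets (its regulon) are over-represented among the genes of a dysregulated tumor signature, using Fisher's exact test. The gene sets named in the usage comment are placeholders; production tools such as MARINa/VIPER use rank-based enrichment rather than a simple overlap test.

```python
from scipy.stats import fisher_exact

def regulon_enrichment(regulon, signature, background):
    """Test whether a transcription factor's inferred targets (regulon)
    are over-represented in a signature of dysregulated genes."""
    regulon, signature, background = set(regulon), set(signature), set(background)
    in_both = len(regulon & signature)
    regulon_only = len(regulon - signature)
    signature_only = len(signature - regulon)
    neither = len(background - regulon - signature)
    table = [[in_both, regulon_only], [signature_only, neither]]
    odds_ratio, p_value = fisher_exact(table, alternative='greater')
    return odds_ratio, p_value

# Hypothetical call: a candidate master regulator of the mesenchymal
# glioblastoma signature, against all expressed genes as background.
# odds, p = regulon_enrichment(candidate_targets, mes_signature, expressed_genes)
```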

Variants of such a genetical genomics approach were pioneered in plants and metabolic disease and have been used successfully in cancer-related studies. For instance, identification of the novel HUWE1-MYCN-DLL3 cascade in brain tumors was made possible by using reverse engineering algorithms to infer posttranslational modulators of MYCN activity as well as its downstream targets. Similarly, the role of RUNX1 as a tumor suppressor mutated in T-cell acute lymphoblastic leukemia (T-ALL) was elucidated based on the highly significant overlap of its regulatory program with those of the TLX1 and TLX3 oncogenes. In some cases, a network-based view of cancer biology may allow elucidation of the dependency of a phenotype on an entire collection of genetic events, which would be virtually impossible to dissect using statistical approaches. For example, it was recently shown that deletion of any combination of 13 genetic loci distributed across the entire genome leads to functional inactivation of PTEN in glioma patients, via a novel interaction mechanism involving competitive endogenous RNA (ceRNA). Indeed, cancer systems biology applications have exploded over the past 3 years, ranging from the identification of key drivers of tumorigenesis in melanoma to the dissection of tyrosine kinase signals downstream of ERBB receptors.
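The reverse engineering step itself can be illustrated with a toy mutual-information approach in the spirit of ARACNe, which prunes likely indirect edges using the data processing inequality (DPI). This sketch uses crude histogram estimators and omits the statistical thresholding, DPI tolerance, and bootstrapping that a real implementation requires.

```python
import numpy as np
from itertools import combinations

def mutual_info(x, y, bins=8):
    """Histogram estimate of mutual information between two expression profiles."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def aracne_like(expr, names, mi_threshold=0.1):
    """expr: samples x genes matrix. Keep edges above an MI threshold,
    then, in every fully connected triplet, drop the weakest of the
    three edges as likely indirect (a crude DPI step)."""
    g = expr.shape[1]
    mi = np.zeros((g, g))
    for i, j in combinations(range(g), 2):
        mi[i, j] = mi[j, i] = mutual_info(expr[:, i], expr[:, j])
    keep = mi > mi_threshold
    for i, j, k in combinations(range(g), 3):
        if keep[i, j] and keep[j, k] and keep[i, k]:
            _, a, b = min((mi[i, j], i, j), (mi[j, k], j, k), (mi[i, k], i, k))
            keep[a, b] = keep[b, a] = False
    return [(names[i], names[j], mi[i, j])
            for i, j in combinations(range(g), 2) if keep[i, j]]
```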

Such a regulatory-model–driven view of cancer biology is thus emerging as an important systems-level contribution to the study of this disease. By taking a more holistic view of tumor-related processes, anchored in gene regulatory mechanisms, cancer systems biology mediates between the genetic and the genomic views of cancer to provide novel insight into its mechanisms. Specifically, the proponents of these approaches argue that among all genetic and epigenetic alterations in a tumor, those contributing to its initiation, progression, or drug sensitivity cannot affect regulatory interactions in a random way but must co-segregate within specific regulatory subnetworks that are thus globally dysregulated across different samples of a given tumor subtype. Hence, if the full complement of regulatory interactions governing the behavior of a specific cancer cell population were known, it should be possible to use its structure to separate driver from passenger alterations. The example of RUNX1 in T-ALL is particularly revealing in this regard. Here, the functional role of RUNX1 mutations could be elucidated only after determining that its targets virtually overlap with those of two previously established oncogenes, TLX1 and TLX3. Without this regulatory insight, it would have been impossible to identify these mutations as statistically significant across the full repertoire of genes.
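The regulon-overlap reasoning behind the RUNX1 result can be phrased as a simple hypergeometric test: given two inferred target sets, how improbable is their observed intersection under a null model of randomly chosen targets? A minimal sketch follows; the gene counts and set names are illustrative.

```python
from scipy.stats import hypergeom

def regulon_overlap_pvalue(regulon_a, regulon_b, n_genes):
    """P-value of observing at least this much overlap between two
    target sets, under a hypergeometric null (random target choice)."""
    a, b = set(regulon_a), set(regulon_b)
    overlap = len(a & b)
    # sf(k - 1) = P(X >= k) for X ~ Hypergeom(n_genes, |a|, |b|)
    return hypergeom.sf(overlap - 1, n_genes, len(a), len(b))

# Hypothetical call: inferred RUNX1 targets vs. the TLX1 regulon,
# against a background of ~20,000 expressed genes.
# p = regulon_overlap_pvalue(runx1_targets, tlx1_targets, 20_000)
```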

A key issue, then, is how to assemble accurate and comprehensive repertoires of molecular interactions to create a quantitative regulatory model that may be interrogated to elucidate drivers of tumor-related phenotypes. This is an important question, because virtually all cancer-related publications today contain appealing graphical presentations of molecular pathways in cancer. These bona fide models could provide a starting point for a systems-level study of cancer, as proposed, for instance, by pathway-wide association study (PWAS) strategies.

Unfortunately, knowledge of the molecular pathways governing physiological and tumor-related traits is still very poor. Indeed, canonical cancer pathways are more reflective of the researcher’s desire to understand biological processes as a relatively linear and interpretable set of events than of the true complexity of cellular regulation. Specifically, these representations have two major limitations. First, they are not context specific. For instance, the EGFR pathway would be identically represented for a glioma cell and for a lung cancer cell.

Second, they constitute a manually curated collection of published facts, several of which are actually incorrect, and which collectively represent less than 1% of the total complement of regulatory interactions in the cell. Hence, their use introduces a strong bias toward what is already known (prior knowledge). Indeed, in the absence of a prior hypothesis, interrogation of canonical cancer pathways has been largely unsuccessful in elucidating novel tumor-related mechanisms. To understand the difference between a true regulatory network and a canonical cancer pathway, consider Figure 20-4, A, showing the differential phosphorylation of canonical EGFR pathway proteins in the H1650 cell line, in which EGFR harbors an activating mutation, compared to the average of all cell lines. In contrast, Figure 20-4, B, shows the differentially phosphorylated proteins for the same cell line in a signal transduction network inferred de novo from a large-scale collection of phosphopeptide profiles of non–small-cell lung adenocarcinoma. Whereas the pathway-based representation provides no clue that the EGFR pathway may be dysregulated, the network-based representation shows a clear pattern of hyperphosphorylated proteins surrounding both EGFR and MET.

Figure 20-4
Pathway-based vs. network-based representation of differential protein phosphorylation in H1650 cells.
(A) Pathway-based representation of differential EGFR pathway protein phosphorylation in H1650 cells, harboring an EGFR-activating mutation. Proteins tagged with a red circle are hyperphosphorylated, those tagged with a green circle are hypophosphorylated, and those with an orange circle are unchanged. Phosphopeptide abundance for the remaining proteins was not detected. (B) Network-based representation of differential protein phosphorylation in H1650 cells. Signal transduction network was inferred de novo from a large phospho-proteomic dataset for non–small-cell adenocarcinoma. Red proteins are hyperphosphorylated, whereas those in green are hypophosphorylated. The red and blue circles represent EGFR and MET substrates, respectively.

In the following, we discuss the simultaneous, de novo reconstruction of context-specific gene regulatory networks from large-scale molecular profile data, together with the genetic and epigenetic variability they harbor and mediate. A classic systems biology workflow generally involves three steps. First is acquisition of molecular profiles for a variety of molecular species, several of which represent gene products, from mRNA to phosphopeptide abundance, as well as of genetic and epigenetic alterations. Second is data integration and reconstruction of regulatory models for the specific cellular context of interest. The final step is regulatory model interrogation, using genetic and genomic signatures that represent the cellular states of interest. Given the abundance and prior coverage of molecular profile data for cancer, we concentrate on the latter two steps.

Reverse Engineering Regulatory Networks

From a systems biology perspective, cell behavior is driven by the processing of endogenous and exogenous signals and the maintenance of homeostasis by a complex network of molecular interactions, that is, the regulatory model of the cell. The latter consists of several cross-interacting layers, including transcriptional, posttranscriptional, signal transduction, stable protein-complex formation, and metabolic interactions. Disruption of network topology or dynamics, within one of these layers or, more frequently, across layers, can aberrantly reprogram the cell by activating specific genetic programs, with the potential outcome of a stable phenotypic transformation such as is observed in tumorigenesis. Systems biology, as a field, has evolved on the premise that these regulatory models, from simple kinetic models describing a handful of genes to probabilistic models of genome-wide regulation, can be dissected or “reverse engineered” from experimental data to infer their topology and behavior. One should be aware, however, that regulatory interactions in the cell are both dynamic and context dependent. For instance, the Stat3 transcription factor must be phosphorylated to be transcriptionally active. Hence, the presence or absence of kinase activity or upstream signals may activate or inactivate its role as a transcriptional regulator in dynamic fashion. Of course, one could generate a fully representative, multivariate model of regulation that captures both states of the transcription factor, but this requires the ability to detect changes in the pairwise interactions between the transcription factor and its targets as a result of the presence or absence of other molecular species. In addition, the complete model is likely to be so complex and unwieldy that it may be more convenient to use simpler, contextualized models of regulation.
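One pragmatic way to recover such context dependence from expression data alone is to ask whether the association between a transcription factor and a target changes with the abundance of a candidate modulator, the logic behind conditional-mutual-information methods such as MINDy. Below is a simplified, correlation-based sketch of that idea; assessing significance would additionally require permutation testing.

```python
import numpy as np
from scipy.stats import spearmanr

def modulated_interaction(tf, target, modulator, frac=1/3):
    """MINDy-style check: does the TF-target association differ between
    samples with low vs. high expression of a candidate modulator
    (e.g., a kinase that must phosphorylate the TF, as for Stat3)?

    tf, target, modulator: expression vectors over the same samples."""
    order = np.argsort(modulator)
    k = int(len(order) * frac)
    low, high = order[:k], order[-k:]          # bottom and top tertiles
    r_low, _ = spearmanr(tf[low], target[low])
    r_high, _ = spearmanr(tf[high], target[high])
    # A large |delta| suggests the interaction is conditional on the
    # modulator; significance should be assessed by permuting samples.
    return r_high - r_low
```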

Over the past decade, multiple strategies have been developed by systems biologists to reconstruct the regulatory networks of living cells. Initially, these efforts were driven by the study of yeast and bacteria as simple model organisms. One advantage of these organisms is that regulatory regions of the genome, that is, regions where transcription factors and other chromatin-binding proteins bind and regulate gene expression, are relatively short, allowing the efficient use of sequence information in reverse engineering. For instance, in yeast, promoter regions have an average length of 600 bp, whereas human genes may have distal regulatory elements hundreds of kilobases away from the transcription start site. In addition, gene regulation in higher eukaryotes is made dauntingly more complex by the presence of alternative splice variants, alternative transcription start sites, and multiple polyadenylation sites.

Fortunately, as data generation technologies and computational algorithms advance, regulatory models are becoming increasingly quantitative and predictive, capturing the regulation of biological processes more precisely. Currently, reverse-engineering methods can be grouped into four broad categories. The following is not intended to provide a comprehensive description of all reverse engineering approaches in systems biology, but rather a more general understanding of the key differences between approaches.

Optimization-Driven Machine Learning Approaches

Because of the high-dimensional nature of the regulatory space covered by molecular profiles and the comparatively small number of distinct molecular profiles available in tumor repositories, such as those assembled by The Cancer Genome Atlas (TCGA) consortium and the Catalogue of Somatic Mutations in Cancer (COSMIC), classical methods such as maximum likelihood are not directly applicable to inferring causal relationships between regulators and regulated gene products. However, several assumptions, such as maximum parsimony, have allowed the successful use of machine learning (ML) approaches. In this context, ML addresses the problem by asking which regulatory model has the largest posterior probability of having generated the observed molecular profile data. This cannot be addressed by enumerating all possible models, of course. As a result, many approaches rely on greedy algorithms and underlying approximations, such as assuming that regulatory models can be effectively represented as directed acyclic graphs (DAGs) that lack feedback loops. The final model can then be used to make inferences about system behavior from future data. Examples of such methods include the analysis of regulators of gene expression modules, as well as the use of Bayesian and dynamic Bayesian networks for reverse engineering transcriptional and signal transduction networks. For a general review of these methods, see Refs. Factors that affect the precision of predictions by ML approaches include dataset quality, feature preselection, and the choice of algorithm appropriate to the purpose and data type.
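To illustrate the greedy, DAG-constrained search these methods rely on, here is a toy score-based structure learner for linear-Gaussian networks: edges are added one at a time whenever they improve the Bayesian information criterion (BIC), with a reachability check to forbid cycles. Real Bayesian network learners use richer scores, random restarts, and bootstrapping; this is only a sketch.

```python
import numpy as np

def bic_family(data, child, parents):
    """BIC of a linear-Gaussian local model: child ~ parents + intercept."""
    n = data.shape[0]
    X = (np.column_stack([data[:, parents], np.ones(n)])
         if parents else np.ones((n, 1)))
    y = data[:, child]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = max(float(((y - X @ beta) ** 2).mean()), 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return loglik - 0.5 * (X.shape[1] + 1) * np.log(n)

def creates_cycle(parents, u, v):
    """Adding edge u -> v closes a cycle iff v is already an ancestor of u."""
    stack, seen = [u], set()
    while stack:
        node = stack.pop()
        if node == v:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_dag(data, names, min_gain=1e-6):
    """Greedy hill climbing: repeatedly add the single edge that most
    improves the total BIC score while keeping the graph acyclic."""
    g = data.shape[1]
    parents = {i: [] for i in range(g)}
    score = {i: bic_family(data, i, []) for i in range(g)}
    while True:
        best_gain, best_edge = min_gain, None
        for u in range(g):
            for v in range(g):
                if u == v or u in parents[v] or creates_cycle(parents, u, v):
                    continue
                gain = bic_family(data, v, parents[v] + [u]) - score[v]
                if gain > best_gain:
                    best_gain, best_edge = gain, (u, v)
        if best_edge is None:
            return [(names[u], names[v]) for v in parents for u in parents[v]]
        u, v = best_edge
        parents[v].append(u)
        score[v] += best_gain
```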

Integration of Prior Knowledge and Experimental Evidence

Rather than predicting interactions from a single data modality, such as gene expression profiles, systems biologists have embraced the vast number of repositories containing experimental data from high-throughput approaches. These range from gene expression profiles, to genome-wide chromatin immunoprecipitation data (GW-ChIP), to yeast two-hybrid and nuclear pull-down assays. Although partial and often inaccurate in isolation, the knowledge contained in these repositories can be effectively integrated into a single unified model, using computational methods that combine evidence for a specific event (e.g., the interaction between two molecular species) from a wealth of independent observations. For instance, predictions of transcriptional interactions may combine data from GW-ChIP, DNA binding site motif analysis, and co-expression, among a number of other relevant data types. Use of ML frameworks for the integration of multiple weak clues, from naïve Bayes classifiers, to Bayesian networks, to a variety of consensus scoring methods, has been very successful in generating more accurate and comprehensive molecular interaction models. Recently, an intriguing result emerged from the analysis of the Dialogue on Reverse Engineering Assessment and Methods (DREAM) challenges, an effort to objectively measure the ability of computational approaches to correctly infer regulatory network structure. Specifically, it was shown that integrating the results of many different inference algorithms generally performs better than, or at least as well as, the best individual algorithm. This is an important result, as we often do not have a principled approach to objectively assess the quality of each method and may instead want to use the integrated results of several of them. An additional value of integrative methods is that they allow the integration of completely heterogeneous types of data. For instance, it was recently shown that protein structure information from x-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy can be effectively integrated with functional data to accurately predict protein-protein interactions. For a more comprehensive review of integrative approaches, see Ref.
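The simplest of these integration schemes, a naïve Bayes combination of conditionally independent evidence sources, reduces to multiplying the prior odds of an interaction by one likelihood ratio per observed clue. The ratios and prior below are invented for illustration; in practice they are calibrated against a gold-standard interaction set.

```python
import math

# Hypothetical likelihood ratios P(evidence | interaction) /
# P(evidence | no interaction), calibrated on a gold standard.
LIKELIHOOD_RATIOS = {
    'gw_chip_peak': 5.0,   # ChIP peak near the target's promoter
    'motif_match':  3.0,   # TF binding-site motif in the promoter
    'coexpression': 2.5,   # significant expression correlation
    'y2h_positive': 4.0,   # yeast two-hybrid evidence
}
PRIOR_ODDS = 1 / 1000.0    # assumed prior odds of a true interaction

def posterior_odds(evidence):
    """Naive Bayes: multiply prior odds by the likelihood ratio of each
    (conditionally independent) evidence source that was observed."""
    log_odds = math.log(PRIOR_ODDS)
    for source in evidence:
        log_odds += math.log(LIKELIHOOD_RATIOS[source])
    return math.exp(log_odds)

# e.g., a candidate edge supported by a ChIP peak and co-expression:
odds = posterior_odds(['gw_chip_peak', 'coexpression'])
print(f"posterior probability ~ {odds / (1 + odds):.3f}")
```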

Regression Analysis

Regression techniques have long been used to estimate parameters for kinetic models from experimental data and could, at least in theory, be extended to the inference of parameters for entire regulatory models. Various regression methods have been proposed for pathway or network inference, including maximum likelihood, least squares, and Bayesian inference, to obtain estimates of model parameters. Maximum likelihood approaches select the parameter values that maximize the likelihood, that is, the probability of the experimental data under the model; least squares approaches determine the parameter values that minimize the sum of the squares of the residuals, that is, the differences between each experimental and model data point; and finally, Bayesian methods place prior distributions on the unknown parameters and use them to compute the most likely (posterior) values. A key problem of regression methods is that they are generally underdetermined. A determined problem is one where the number of independent observations of a system is equal to the number of parameters that must be estimated. Overdetermined problems—that is, those with more observations than parameters—have the advantage that estimates have some level of statistical robustness. When the number of parameters is much larger than the number of observations, however, an infinite number of parameter settings becomes equally plausible, thus requiring additional heuristics. This is a common issue in real biological systems, as the number of parameters for a system with tens of thousands of interacting molecular species can run into the hundreds of thousands, whereas few datasets with more than a few hundred independent experimental profiles are available for the same system. To address this problem, a number of dimensionality reduction approaches have been developed, which work either by splitting a single high-dimensional problem into a number of independent lower-dimensional ones, for example via singular value decomposition, or by penalizing models with a larger number of connections via sparsity constraints.
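The sparsity-constraint idea can be illustrated with an L1-penalized (lasso) regression on simulated data: even with 100 samples and 2,000 candidate regulators, the penalty makes the underdetermined problem well posed by driving most regulatory coefficients exactly to zero, recovering a target gene's few true regulators. All data below are simulated for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_regulators = 100, 2000    # far fewer samples than parameters

# Simulated expression of candidate regulators and one target gene that
# truly depends on only three of them (sparse ground truth).
X = rng.standard_normal((n_samples, n_regulators))
true_coef = np.zeros(n_regulators)
true_coef[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = X @ true_coef + 0.1 * rng.standard_normal(n_samples)

# The L1 penalty prefers models with few nonzero coefficients, turning
# an infinite solution set into a unique, sparse estimate.
model = Lasso(alpha=0.1).fit(X, y)
recovered = np.flatnonzero(model.coef_)
print("recovered regulators:", recovered)
```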
