Statistical Analysis of High-Dimensional Data for Pancreatic Cancer


Acknowledgments

TTW and EMC were supported by NSF grants CCF-0926194 and 0926181. HG was supported by a new faculty start-up grant from Saint Louis University.

Introduction

Pancreatic ductal adenocarcinoma (PDAC) is the most common type of pancreatic cancer, arising from lesions that occur in the pancreatic ducts. It is one of the leading causes of cancer deaths in the United States because it is poorly diagnosed at an early stage. Despite extensive research over the past 30 years, the five-year survival rate is still less than 5%, and we are still far from developing effective strategies for the early diagnosis and treatment of this lethal disease. PDAC is characterized by early and aggressive metastasis and by high resistance to conventional chemotherapy and radiotherapy, owing to the stromal interaction between the pancreatic cancer cells and fibrous tissue composed of extracellular matrix (ECM) proteins. Tumor signaling is regulated by the complex interactions of thousands of genes and tens or hundreds of signaling pathways. Cross talk between different signaling pathways may be responsible for pancreatic cancer cell survival even when some pathways are blocked by single-gene targeted therapies.

Modern molecular pathologic and genetic technologies have changed the way we study complex biological systems. Researchers in the area of pancreatic cancer are able to perform genome-wide expression profiling within tumors in order to better understand the nature of the disease and eventually design multigene targeted therapy. To analyze such high-throughput data, statistical methods have been designed and developed specifically for those types of data. Because the data are high-dimensional, two major approaches, feature extraction and feature selection, have been taken to extract the important information from the massive data. The feature extraction approach, for example, principal component analysis (PCA), projects the high-dimensional feature space into lower dimensions, while the feature selection approach, for example, the least absolute shrinkage and selection operator (LASSO), selects a subset of features from the large number of candidates in the data. The shortcomings of the feature extraction approach are well recognized and include a lack of meaningful scientific interpretation. Therefore we focus on the second approach and introduce two statistical methods for high-dimensional data in this chapter. The advantages of the feature selection approach include, but are not limited to, alleviating the effect of the curse of dimensionality, retaining scientific meaning, enhancing generalization capability, improving stability, and accelerating computation. This approach can be used in many areas, including pancreatic cancer studies.

For different purposes, different statistical models and methods are used to explore novel molecular and epigenetic targets for pancreatic cancer from DNA microarray data and RNA sequencing data. A number of differentially expressed and metastasis-associated genes have been found in pancreatic cancer. For example, recent global genomic analyses identified 69 gene sets and 12 core signaling pathways that are frequently mutated in most pancreatic cancers. Most of those studies focused on the inference and identification of the frequently mutated or metastasis-associated genes. However, an important clinical factor—survival time—has been neglected for a long time. The two methods introduced in this chapter focus on survival analysis in high-dimensional data. Similar ideas can be used in other statistical models (e.g., regression and generalized linear regression) in different studies.

A comprehensive understanding of the genetic signatures and signaling pathways that are directly correlated with pancreatic cancer survival will help cancer researchers to develop effective multigene targeted and personalized therapies for pancreatic cancer patients at different stages of the disease, and improve their survival rates. Stratford et al. analyzed the microarray data of 102 PDAC patients and identified six genetic signatures associated with metastatic pancreatic cancer using a sequence of statistical techniques, including the significance analysis of microarray (SAM), centroid-based predictor, Pearson correlation, X-Tile, Kaplan–Meier estimator, and Cox model. Though the authors applied the Cox model to test whether the six-gene signature is significantly correlated with survival time, the prediction was not based on survival time. These genes could only help discriminate high- and low-risk patients, and they are not directly related to pancreatic cancer survival. The two methods in this chapter aim to infer and identify the genetic information that is directly associated with survival time. Both are based on the Cox proportional hazards model, which is a classic model used to describe the relationship between survival time and predictor variables.

Survival data differ from the data we usually observe because of partial missingness. Partial missingness occurs in two forms: censoring and truncation. Right censoring is the most commonly observed type in survival analysis, and we deal with right-censored data here. Ideally, the time when the event of interest happens is observed and collected, but sometimes we only know that the event happens after some time, not exactly when. The former is a true survival time and the latter is a right-censored survival time. For example, suppose a study is performed to measure the lifetime of cancer patients. If the death of a patient is observed, the lifetime of the patient is known (complete data, as it is safe to assume the birth date is known). If the patient is still alive when the study ends, it is only known that the death date is after the study's endpoint (right-censored). Here the event of interest is the death of the pancreatic cancer patient, so the event/survival time is the time at which the patient dies. Right censoring also occurs for subjects who are lost to follow-up. With censored observations, the censoring effect has to be taken into account to obtain unbiased estimates. The Cox proportional hazards model can handle right-censored data with a simple form and easy interpretation. The downside of the Cox model is that it requires the hazard rates to be proportional, an assumption that is strong and not valid in some cases.

In high-dimensional data, the number of features/predictors (genes) far exceeds the number of subjects (patients). The goal of the feature selection approach is to identify a subset of predictors from the large pool of candidates. To this end, one can use a regularization method and penalize the Cox model. For example, a LASSO penalty can be imposed on individual variables to automatically remove unimportant ones by shrinking their regression coefficients to exactly zero. In this chapter, we first describe a LASSO penalized Cox regression method on individual genes. This model has been applied to analyze the localized and resected PDAC data collected between 1999 and 2007. Using this model, 12 signature genes that are directly correlated with pancreatic cancer survival time were found among 43,376 probes. Eight of the 12 genes have been confirmed in in vivo and in vitro experiments to be genetically altered and differentially expressed in cancers of the stomach, colon, ovaries, breast, skin, kidney, lung, and pancreas. Because some genes belong to the same pathways and are involved in the same biological processes, it is important to incorporate the pathway information into the analysis. The pathway information is biologically essential to our understanding of gene regulatory networks and cancer development. Therefore, we next introduce a doubly penalized Cox regression method. By imposing two penalties, one on pathways and the other on individual genes, we can achieve both group and within-group selection for the pancreatic cancer survival analysis. Both models need well-designed computational algorithms for high-dimensional data. Cyclic coordinate descent algorithms will also be described in this chapter.

In the next section, we describe the LASSO regularized Cox regression based on the partial likelihood function and the coordinate descent algorithm, which can quickly dismiss irrelevant variables and speed up the estimation of the regression coefficients. We then describe the non-overlap and overlap cases of the doubly regularized Cox (DrCox) regression model, together with a modified version of the cyclic coordinate descent algorithm for parameter estimation. Finally, we apply the two methods to analyze the high-dimensional microarray data of pancreatic cancer patients with localized and resected PDAC collected between 1999 and 2007. The genes and pathways found by the two methods are presented in that final section.

LASSO Penalized Cox Regression

The LASSO penalized regression method is a popular variable selection technique used for the analysis of the high-throughput and high-dimensional data. Given high-dimensional microarray data, the LASSO method can identify the most important genes that are related to the phenotype of interest in a fast and effective way. Variable selection and estimation of regression coefficients are performed simultaneously—important variables will have nonzero regression coefficients and unimportant variables will have zero coefficients in the model.

We first describe the general framework for variable selection through the regularized partial likelihood of the Cox model. We then derive the cyclic coordinate descent algorithm, which, coupled with Newton's method, can be used to estimate the regression coefficients.

Variable Selection via LASSO Regularized Partial Likelihood

Suppose there are $n$ subjects, and each subject has $p$ predictor variables $X = (X_1, \ldots, X_p)^t$. The survival time and the censoring time for the subject $i$ are denoted by $T_i$ and $C_i$, respectively. We use the triplets $\{(Y_i, \delta_i, X_i),\ i = 1, \ldots, n\}$ to represent the observed survival data, where $Y_i = \min(T_i, C_i)$ denotes the observed survival time since it might be right-censored, and $\delta_i = I(T_i \le C_i)$ is a censoring indicator which equals 1 if the actual death is observed and 0 otherwise. The censoring mechanism is assumed to be noninformative. The censoring time $C_i$ and the survival time $T_i$ are assumed to be conditionally independent given the predictor $X_i$.
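As a minimal sketch of how the observed pair $(Y_i, \delta_i)$ is formed from latent survival and censoring times, consider the following Python snippet. The function name `observe` and the simulated exponential times are illustrative assumptions, not part of the chapter; in real data only $Y_i$ and $\delta_i$ are available, never both $T_i$ and $C_i$.

```python
import numpy as np


def observe(T, C):
    """Return (Y, delta) for latent survival times T and censoring times C.

    Y_i     = min(T_i, C_i)   observed (possibly right-censored) time
    delta_i = I(T_i <= C_i)   1 if the death is observed, 0 if censored
    """
    T = np.asarray(T, dtype=float)
    C = np.asarray(C, dtype=float)
    Y = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return Y, delta


# Hypothetical simulated data: exponential survival and censoring times.
rng = np.random.default_rng(0)
T_sim = rng.exponential(scale=12.0, size=8)  # latent survival times (assumed)
C_sim = rng.exponential(scale=15.0, size=8)  # latent censoring times (assumed)
Y_sim, delta_sim = observe(T_sim, C_sim)
print(np.column_stack([Y_sim.round(2), delta_sim]))
```

Subjects with `delta == 0` contribute only the information that their survival time exceeds `Y`.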

The Cox proportional hazards regression model is written as


$$h(t \mid X) = h_0(t) \exp\Big( \sum_{j=1}^{p} \beta_j X_j \Big),$$

where $h_0(t)$ is the nonparametric baseline hazards function, and $\beta_j$ is the regression coefficient for $X_j$. It is reasonable to assume no ties in the observed time when the failure time is continuous. The partial likelihood of the Cox model is


$$L_n(\beta) = \prod_{i \in D} \frac{\exp(X_i^t \beta)}{\sum_{l \in R_i} \exp(X_l^t \beta)},$$

where $D$ is the set of indices of uncensored events (i.e., observed deaths), and $R_i$ is the set of the subjects available for the event (i.e., still alive) at time $Y_i$. Variable selection can then be conducted by minimizing the negative log-partial likelihood function $l_n(\beta) = -\log\{L_n(\beta)\}/n$ plus a penalty function $P_\lambda(\beta)$ on the coefficients $\beta$:


$$g(\beta) = l_n(\beta) + P_\lambda(\beta).$$

The penalty function $P_\lambda(\beta)$ has to be singular at the origin in order to achieve the desired sparsity, and hence variable selection. In the LASSO penalized method, we penalize the negative log-partial likelihood by the LASSO penalty $P_\lambda(\beta) = \lambda \sum_{j=1}^{p} |\beta_j|$, which is nondifferentiable at the point $\beta_j = 0$ and is therefore able to eliminate the irrelevant variables and keep the most relevant ones. The objective function is written as


$$g(\beta) = l_n(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|,$$

where $\lambda > 0$ is a tuning constant, which controls the number of variables included in the final model. The larger $\lambda$ is, the fewer variables are retained in the model. By minimizing the objective function, one can select important predictor variables and estimate regression coefficients simultaneously. The variables with nonzero coefficients will be selected and the rest are eliminated.
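To make the objective concrete, the following Python sketch evaluates the negative log-partial likelihood $l_n(\beta)$ and the LASSO objective $g(\beta)$ on a toy data set. It is an illustration under stated assumptions rather than the chapter's implementation: the function names and the toy data are hypothetical, the risk set is taken to be $R_i = \{l : Y_l \ge Y_i\}$, there are no tied event times, and no safeguards against numerical overflow are included.

```python
import numpy as np


def neg_log_partial_likelihood(beta, X, Y, delta):
    """l_n(beta) = -(1/n) * sum_{i in D} [X_i^t beta - log sum_{l in R_i} exp(X_l^t beta)].

    Assumes no tied event times and takes R_i = {l : Y_l >= Y_i}.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    delta = np.asarray(delta)
    n = X.shape[0]
    eta = X @ np.asarray(beta, dtype=float)  # linear predictors X_i^t beta
    total = 0.0
    for i in np.flatnonzero(delta == 1):     # D: indices of observed deaths
        at_risk = Y >= Y[i]                  # R_i: subjects still at risk at Y_i
        total += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return -total / n


def lasso_objective(beta, X, Y, delta, lam):
    """g(beta) = l_n(beta) + lam * sum_j |beta_j|."""
    beta = np.asarray(beta, dtype=float)
    penalty = lam * np.sum(np.abs(beta))
    return neg_log_partial_likelihood(beta, X, Y, delta) + penalty


# Hypothetical toy data: 3 subjects, 2 predictors, last subject censored.
X_demo = np.array([[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]])
Y_demo = np.array([1.0, 2.0, 3.0])
delta_demo = np.array([1, 1, 0])

for lam in (0.0, 0.1, 0.5):
    print(lam, lasso_objective([1.0, -2.0], X_demo, Y_demo, delta_demo, lam))
```

At $\beta = 0$ every subject in a risk set has equal weight, so for this toy data the value of $l_n(0)$ reduces to $(\log 3 + \log 2)/3$, which is a convenient sanity check. Increasing `lam` raises the cost of any nonzero coefficient, which is what drives more coefficients to zero when the objective is minimized.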
