Clinical trials began to emerge in their modern form only in the early 20th century, with the first randomized controlled trials conducted in the 1940s, but are now firmly established as the fundamental basis of modern evidence-based medicine. An evidence-based approach grounded in clinical trials enjoyed an early start in cancer, with commitment and sponsorship by the US National Cancer Institute (NCI), beginning in the mid-1950s in partnership with academic investigators for the treatment of patients with acute leukemia. From these programs grew the multicenter Cancer Cooperative Groups (currently the National Clinical Trials Network program) and other academic-based cancer centers and institutes that are now the mainstay of clinical cancer research and related translational science in the United States, with similar entities around the world. These programs and clinical trials have been invaluable in discovering and advancing effective therapies while greatly enhancing our understanding of cancer. Despite their primacy, clinical trials are under challenge to continually innovate and adapt as new knowledge is gained and as patients and caregivers seek treatments that are better on many measures, including effectiveness, safety, economics, and long-term welfare. For example, with the development of “omics”-based technology, different cancer types are no longer narrowly defined purely by clinical and pathological taxonomic systems. Instead, the so-called precision medicine paradigm, which uses biologically relevant, molecular-level information, is becoming a reality in some cancer types. Also, cancer treatment has increasingly become multidisciplinary, requiring collaborative investigations among surgeons, medical oncologists, and radiation oncologists to advance new therapies.
Moreover, medical oncologists often face myriad options with combinations of traditional cytotoxic, cytostatic, molecularly targeted, and, more recently, immunotherapy agents, while radiation oncologists and surgeons have available a wide array of continually evolving technologies. In conjunction with these developments, advances in computing and the advent of electronic storage of virtually all medical and scientific information have led to an interest in “big data” sources as an alternative to clinical trials for evidence generation. All of these developments have posed new challenges in the design and analysis of all phases of oncology clinical trials.
In this chapter, we present readers with a concise review of the fundamentals of clinical trials that have remained largely unchanged. With it, we aim to provide an appreciation of the current critical issues in clinical trial design and conduct needed to ensure that trials continue to provide state-of-the-art therapy evaluation. The focus here is predominantly on study design, as analysis naturally follows a well-specified and purposeful trial design. The material is presented conceptually; we refer the reader to several more comprehensive texts that present details of oncology clinical trial design and conduct as well as recommend close collaboration with biostatisticians well versed in cancer clinical trials.
In the clinical trial–based development paradigm, a series of sequential steps advances a new cancer treatment from first use in humans to establishment as clinically effective therapy. These steps, referred to as phases, are designed to answer specific questions. If a candidate treatment is successful in one phase, it proceeds to further testing in the next phase. More broadly, trial phases can be conceived of as developmental stages whose goals and information obtained may overlap. During early development (Phases I and II), researchers determine whether a new treatment is safe, what the best dose may be, and which specific adverse effects, both expected and unexpected, may be encountered. In the latter portion of early development, moving into Phase II, researchers formally assess whether the treatment demonstrates some benefit, such as slowing tumor growth or influencing other intermediate disease endpoints. In the later phase (Phase III), researchers definitively evaluate whether the treatment works better than the current standard therapy and further evaluate safety. In addition to comparing safety of the new treatment with that of the current standard in an objective manner, additional adverse-event information is obtained that may emerge as the new regimen is used in larger numbers of patients with longer follow-up. Phase III trials typically include randomized treatment assignment (the virtues of which are discussed later) and a sufficient number of participants to ensure that the result is valid and reliable. The specific statistical considerations for each phase will be elaborated later in this chapter.
From a statistical design perspective, there are five key components in any cancer clinical trial regardless of its phase: (1) clearly written objectives, (2) well-defined endpoints, (3) a rigorous study design appropriate for the question, (4) a well-justified sample size, and (5) an appropriate and detailed statistical analysis plan. Cursory attention to any of these five components could lead to a trial with flawed or uninterpretable results.
Identifying the primary objective requires careful thought about what key conclusions are to be made at the end of the trial. In Phase I trials, the primary objective typically is to identify a suitable dose (which may be optimal by some metric) and summarize the toxicities observed. In Phase II trials, the types of objectives may vary depending on the specific context. Historically, the primary objective has been to evaluate preliminary efficacy at the established dose to (1) determine whether there is sufficient cause to warrant a more definitive (Phase III) trial and (2) obtain preliminary clinical efficacy estimates to help plan the Phase III trial. Phase II trials also provide for further exploration of the safety and toxicity of the experimental regimens. The primary objective of a Phase III trial is to provide a definitive head-to-head comparison of alternative treatment regimens, typically an experimental regimen versus a standard-of-care comparator.
Every clinical trial should have a single, clearly identified primary objective. While there may be numerous secondary objectives, the distinction should be clear. It is generally not appropriate to assume or plan for additional development goals being met—for example, planning an early-phase trial with the hope that an extraordinary benefit will lead to rapid adoption as established treatment. Instead, sufficient rigor should be focused on the trial's current developmental objective.
The selection of endpoints is the next critical step in determining the appropriate design and analysis for a trial. An endpoint generally refers to a measure of disease status, a symptom, or a laboratory value that constitutes one of the target outcomes of the trial. An endpoint should have the following traits: it should be clearly defined, quantitatively measurable in an unbiased fashion, and directly linked to the trial's primary objective. The choice will depend on the phase of the trial and other factors, such as cost and feasibility of assessment. In early-phase safety development, endpoints may be simple frequencies of adverse events, but even these must be carefully defined with respect to the type(s) of interest, time frame of occurrence, probable attribution to the intervention, and other constraints. Examples of early efficacy endpoints in pilot efficacy (Phase II) trials include tumor response (i.e., reduction or stabilization), which is generally based on radiographic tumor measurements and expressed as the proportion responding (the response rate), and time free of evidence of further disease progression. In later-phase definitive trials, overall survival (time surviving with respect to death from any cause) is historically considered the most meaningful efficacy endpoint, although, depending on the context, disease-specific endpoints may also be considered definitive. For time-to-event data, composite endpoints, in which several event types such as disease progression and death from any cause are combined to define progression-free survival (or disease-free survival, which may include second primary cancers as events), are often the endpoint of choice in definitive trials and may also be used in Phase II trials. A large body of health-related quality of life (HRQoL) endpoints—which may include various caregiver assessments and, increasingly, patient-reported outcome (PRO) endpoints—are also used in all phases of trials. Depending on the trial question, they may serve as primary endpoints as well.
The general study design is the structure under which the inferential procedure will address the trial objective. The merits of an experimental treatment are assessed against either target specifications (such as a maximum adverse-event rate permissible), an expected clinical outcome based on historical experience in a similar clinical scenario, or a concurrent control group incorporated into the study. Depending on the trial objective, type, and other factors, one of these may form the basis of valid inference, but all require careful design considerations in order to ultimately provide reliable conclusions.
In dose establishment and safety evaluation, the critical concern is to proceed deliberately so that patients do not incur excess risk, while at the same time avoiding administering subtherapeutic doses to many patients (although therapeutic benefit is not an explicit goal of the study). For this reason, Phase I trials enroll patients singly or in small cohorts and proceed through dose determination sequentially. In some cases, there may be multiple substudies enrolling concurrently; groups may then be randomized or deterministically assigned, but the same deliberate enrollment scheme pertains.
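The sequential cohort logic described above can be made concrete with the classic “3 + 3” dose-finding rule (one of the designs listed in Table 14.1). The following Python sketch encodes one common textbook formulation of the per-dose decision; it is illustrative only, as actual protocol rules vary and should be specified with a biostatistician.

```python
def cohort_decision(n_treated, n_dlt):
    """Decision at the current dose level under a common formulation of the
    classic 3 + 3 rule (illustrative; protocol-specific rules vary).

    n_treated: patients evaluated at this dose so far (3, or 6 after expansion)
    n_dlt:     dose-limiting toxicities (DLTs) observed among them
    Returns 'escalate', 'expand' (treat 3 more at the same dose), or
    'deescalate' (this dose exceeds the maximum tolerated dose).
    """
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"    # 0/3 DLTs: advance to the next dose level
        if n_dlt == 1:
            return "expand"      # 1/3 DLTs: enroll 3 more at this dose
        return "deescalate"      # >=2/3 DLTs: too toxic
    # After expansion to 6 patients at this dose:
    return "escalate" if n_dlt <= 1 else "deescalate"
```

Applying the rule cohort by cohort yields the familiar sequential escalation, with the maximum tolerated dose typically declared as the highest dose at which at most 1 of 6 patients experiences a DLT.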
For pilot efficacy evaluation, the most economical design from a resource perspective is a single cohort, all receiving the experimental regimen. This approach provides the maximum information about the new intervention under study and may ease enrollment because all patients have an opportunity to receive what is usually perceived (correctly or otherwise) as a promising treatment. In a traditional single-arm pilot efficacy (Phase II) trial, all patients receive the same treatment, and results are typically assessed against previous historical experience (historical control). The historical control population should have similar patient characteristics, a similar standard of care, and the same diagnostic and screening procedures as the patients anticipated to be entered into the new study. In addition, the primary outcome should be objective and consistently defined so that results are comparable and interpretable when set against historical estimates. Unfortunately, a historical estimate may not always be available, for example, when the patient population for a current study is defined by newly discovered biomarkers. In addition, historical estimates for an apparently identical treatment and patient population may vary substantially and be subject to temporal changes in prognosis due to influences of ancillary care, diagnostic definitions, and other unknown factors. These issues make it difficult to choose an appropriate value against which the new treatment should be assessed, introducing uncertainty and potential unreliability into findings from single-arm trials (considerations to mitigate these problems are discussed in more detail later). Nonetheless, the single-cohort design remains an important part of oncology clinical trials.
Trials may alternatively include a parallel cohort receiving a different intervention, in which case randomization is invariably employed to assign patients to the treatment groups. Randomization ensures that patients are assigned to treatment arms without systematic differences in any characteristics that may influence outcomes. Randomization is the cornerstone of clinical trials methodology as it pertains to evidence generation, addressing the fundamental problem of confounding of the treatment effect. Confounding factors are those that are related to both treatment assignment (choice or receipt) and outcomes. These can be any known demographic or disease prognostic factors or other, as yet unknown, factors. Confounding factors also include characteristics that might influence someone to participate in, or withdraw from, a trial, potential conscious or unconscious bias in patient selection by treating physicians, and self-selection bias by patients themselves. Randomization does not completely ameliorate these concerns (although consistent, well-defined trial entry criteria do), but it still promotes external validity, as patients in each of the treatment arms have characteristics similar to the sample obtained from the population. That said, randomization itself does not ensure that the study will include a representative sample of all patients with the disease. However, internal validity is ensured, as patients are similar between arms and the confounding of the treatment effect by both known and unknown factors is minimized. In large studies, simple randomization is sufficient to ensure that treatment arms will be balanced with respect to patient characteristics, while in small or moderately sized studies, imbalances in important patient characteristics can occur by chance. In all randomized studies, additional design features (e.g., stratified randomization and analysis, discussed later) can correct this influence.
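To illustrate how stratified randomization protects against chance imbalance, the following sketch implements permuted-block randomization within strata in plain Python. It is a minimal teaching example, not production randomization software; the two-arm labels, block size, and stratum names are arbitrary choices for illustration.

```python
import random

def permuted_block_assignments(strata, block_size=4, seed=2024):
    """Stratified permuted-block randomization (illustrative sketch).

    strata: list of stratum labels, one per patient in enrollment order.
    Within each stratum, assignments are drawn from shuffled blocks that
    each contain equal numbers of 'A' and 'B', so the arm sizes within a
    stratum can never differ by more than block_size / 2.
    """
    rng = random.Random(seed)   # fixed seed only for reproducibility here
    pending = {}                # stratum -> remaining entries of current block
    assignments = []
    for stratum in strata:
        if not pending.get(stratum):
            block = ["A", "B"] * (block_size // 2)
            rng.shuffle(block)
            pending[stratum] = block
        assignments.append(pending[stratum].pop())
    return assignments

# Example: 8 high-risk and 8 low-risk patients enrolling in alternating order
arms = permuted_block_assignments(["high", "low"] * 8)
```

Because each stratum completes two full blocks of four, this example ends with exactly four patients per arm in each stratum, whereas simple randomization of 16 patients could easily produce a noticeable imbalance.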
Under the classical (frequentist) hypothesis testing paradigm, which continues to be the dominant approach, the inferential (i.e., testing) procedure and subsequent material decision regarding a treatment evaluation depends critically on a small number of quantities that must be specified as part of trial design. Alternative frameworks, such as the Bayesian inference paradigm (which is not covered in detail here), likewise require some conditions and assumptions that will determine the scope of the study and ensure an informative conclusion.
In the classical statistical decision framework, one posits a null hypothesis (e.g., absence of a treatment effect) and an alternative hypothesis (presence of a treatment effect) and collects data from a sample to provide an estimate of the true state of nature in the population. A decision in favor of the null or alternative hypothesis follows. The type I error probability—or α, or significance level—is the probability of concluding that a treatment effect is present (based on the data) when, in fact, it is not (a false-positive conclusion). This is an inevitable consequence of sampling from a population and probabilistic reasoning and, thus, is not an “error” in the sense of a mistake. The acceptable type I error rate is decided on in the planning stages of the trial; the conventional 0.05 level may or may not fit a particular problem. A concept closely related to the significance level α of a test is the p value: the probability, under the null hypothesis, of a result equal to or more extreme than the one observed. When the p value is smaller than the prespecified α level, the result is declared statistically significant, as the observed effect is unlikely to have arisen under the null hypothesis (of no treatment effect). While, by definition, the smaller the p value, the less likely the observed result under the null hypothesis, it must be recognized that false declaration of an effect when there is none is a real and natural phenomenon when using this decision paradigm. To decrease false-positive findings in certain circumstances (discussed later), the significance criteria are made more stringent. Even outside of these circumstances, there have been general calls, both historically and more recently, for redefining significance criteria to address the problems created by placing too much emphasis on statistical significance as a hallmark of meaningful findings.
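A concrete, hypothetical illustration of a p value: suppose a single-arm trial observes 12 responses among 30 patients, where the historical response rate is 20%. The one-sided exact binomial p value, the probability of a result at least this extreme under the null rate, can be computed with only the Python standard library; the numbers here are invented for illustration.

```python
from math import comb

def binomial_upper_p(n, x, p0):
    """Exact one-sided binomial p value: P(X >= x) when X ~ Binomial(n, p0).
    This is the probability, under the null response rate p0, of observing
    a result at least as extreme as x responses among n patients."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k)
               for k in range(x, n + 1))

# Hypothetical example: 12/30 responses against a 20% historical rate.
p_value = binomial_upper_p(30, 12, 0.20)   # falls below the conventional 0.05
```

At a prespecified α of 0.05, this result would be declared statistically significant, yet the reasoning in the text still applies: roughly 1 in 20 truly ineffective treatments tested this way would clear the same bar by chance.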
The statistical power of a test for a particular alternative hypothesis is defined to be 1−β, where the quantity β (or type II error) represents the probability of not rejecting the null hypothesis when, in fact, a treatment effect is manifest in the population (i.e., the alternative hypothesis is the true state). Thus, power equals the probability of detecting a difference that is really there. Ideally, trials should be designed to provide high power for differences that are realistic and clinically meaningful. This is mostly driven through adequate sample size relative to the variability of the outcome measure. In designing trials, it is important for the researchers and statistician to discuss the magnitudes of clinical improvement that would be meaningful to detect, in order to design a study with a small enough type II error to make the conclusion credible in either direction. If the true treatment effect were very large, it might be relatively easy to detect even with a small sample size. However, when the treatment effect is more modest, yet meaningfully different clinically, an adequately large number of patients is required to detect this effect with high probability. Thus, a trial that failed to detect an effect, but was based on a small sample size, does not provide reliable evidence of no effect and should be interpreted with caution (since a type II error or false-negative result is likely). In general, we set β to 0.1–0.2; that is, we aim to have at least 80% to 90% power to detect a truly effective treatment.
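The interplay of α, power, and effect size can be made concrete with the standard normal-approximation sample size formula for comparing two proportions. This is a sketch of the common pooled-variance formula; dedicated software (and exact methods) may give slightly different answers, and the example rates are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients per arm to detect a difference between outcome
    proportions p1 (control) and p2 (experimental) with a two-sided test.
    Uses the common pooled normal-approximation formula; results may differ
    slightly from exact or software-specific calculations."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g., 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g., 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Hypothetical: detecting a 15-point improvement (50% -> 65%) takes on the
# order of 170 patients per arm; a 10-point improvement takes far more.
```

Running the function for progressively smaller differences shows directly why modest but clinically meaningful effects demand large trials, echoing the caution above about small negative trials.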
As alluded to earlier, in conjunction with the somewhat abstract type I and type II error parameters, the most critical aspect of the study design is the expected effect size, or magnitude of effect that one aims to detect, as from these quantities the sample size arises. Inherent in this formulation is also the expected outcome under the standard or comparator treatment, upon which the alternative hypothesis posits an improvement. Historical information from earlier studies is useful in specifying the assumptions required for sample size calculations and must be accurate both in single-arm noncomparative studies (because otherwise, “improvements” in outcome will be falsely attributed to treatment) and in randomized comparative studies (because, although the comparison remains internally valid, the scope and scale of the study are affected by these quantities). For fixed-time endpoints, such as the proportion responding or failure free at a given time landmark, the total sample size is then implied by the absolute difference or relative difference (often expressed as a rate ratio). For time-to-event endpoints (e.g., overall survival), the effect to be detected is typically expressed in terms of a failure rate (events/time) ratio, known as a hazard ratio (HR). The required sample size is then driven by the number of failure events required rather than the number of patients accrued. Hypothetically speaking, if all patients were followed to the failure event, the number to be accrued would be the same as the number of events expected. While the number of events is determined by α, β, and the HR, the total number of patients to be accrued will be influenced by several additional factors: the failure rate under standard therapy, any study attrition or loss of event observation due to other factors, the accrual rate, and the amount of follow-up time to be allowed until planned study reporting.
A relatively small study of patients with rapidly lethal disease may have the same power as a very large study of patients with a low death rate and lengthy follow-up, as long as the numbers of deaths are the same. Furthermore, the rate of accrual to the trial and the total calendar time in which the trial is aimed to be completed are two factors that must be considered together, as there are trade-offs between them in arriving at the requisite number of patients. In addition, if the expected rate of failure among enrolled patients is substantially inaccurate, the timeline of the trial will also be affected. Last, it should be noted that when we express the effect size in terms of an HR, it may be worth considering whether there is evidence that this standard assumption (proportional hazards) is suitable for the particular disease under study, as power may be reduced if the assumption is incorrect.
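The principle that power in time-to-event trials is driven by the number of events rather than the number of patients is often operationalized with Schoenfeld's approximation for the required event count. A sketch under common simplifying assumptions (1:1 randomization, two-sided log-rank test, proportional hazards); real designs must also convert events to patients using the accrual, event-rate, and follow-up considerations described above.

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Schoenfeld's approximate number of failure events needed to detect a
    given hazard ratio with 1:1 randomization and a two-sided test.
    The number of patients to accrue is then back-calculated from the
    expected event rate, accrual rate, and follow-up time (not shown here)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)

# A modest benefit (HR = 0.75) requires on the order of 380 events,
# whereas a large benefit (HR = 0.5) requires far fewer.
```

Because only the event count appears in the formula, a small trial in a rapidly lethal disease and a much larger trial in an indolent one can indeed deliver the same power once they accumulate the same number of events.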
In terms of effect size, ideally, a trial should be designed to have sufficient power to detect the smallest difference that is clinically meaningful. A study is doomed to failure if it is designed to have sufficient power to detect only unrealistically large differences. In practice, trials are often designed to have adequate power only to detect the smallest feasible difference, where feasibility is dictated by funding resources, the available patient population and timeframe of trial conduct, and other practical constraints. Consideration should be given to whether the feasible difference is plausible enough to warrant doing the study at all, since it is considered a waste of resources and by some even an ethical breach to conduct a trial with little chance of yielding a definitive conclusion.
Each trial phase has specific critical design features, although some are shared across phases. Also, increasingly, there is a movement toward approaches that combine phases, facilitating transitions between developmental steps in a more seamless and efficient manner. In this section, we review some general design considerations for Phase I through Phase III trials and recent innovations related to trial design. Table 14.1 provides representative examples of many of the trial designs discussed here. The reader may wish to use these as a reference point for how the concepts discussed here are put into practice in trial conduct.
Study Title | Type of Trial | Primary Trial Question ^a |
---|---|---|
NRG-DT001: A Phase Ib Trial of Neoadjuvant AMG 232 Concurrent with Preoperative Radiotherapy in Wild-Type P53 Soft-Tissue Sarcoma | Phase I “3 + 3” dose finding, with expansion cohorts | To assess the maximum tolerated dose of AMG 232 in combination with standard-dose radiotherapy. |
NRG-BN002: Phase I Study of Ipilimumab, Nivolumab, and the Combination in Patients with Newly Diagnosed Glioblastoma | Phase I rolling-six dose finding, with a modification to slow accrual | To evaluate the safety of (i) single-agent treatment with ipilimumab, (ii) nivolumab, and (iii) the combination of ipilimumab and nivolumab, each with maintenance temozolomide. |
NRG/RTOG 0813: Seamless Phase I/II Study of Stereotactic Lung Radiotherapy (SBRT) for Early Stage, Centrally Located, Non–Small Cell Lung Cancer (NSCLC) in Medically Inoperable Patients | Phase I time-to-event continual reassessment method (TITE-CRM), followed by Phase II single-arm response evaluation | To test the safety of stereotactic body radiation therapy (SBRT) at a range of increasing dose levels. |
NRG/RTOG 0933: A Phase II Trial of Hippocampal Avoidance during Whole Brain Radiotherapy for Brain Metastases | Phase II pilot efficacy: nonrandomized design | To evaluate delayed recall as assessed by the Hopkins Verbal Learning Test–Revised (HVLT-R) 4 months after hippocampal avoidance during whole brain radiotherapy (HA-WBRT). |
NRG/RTOG 0712: A Phase II Randomized Study for Patients with Muscle-Invasive Bladder Cancer Evaluating Transurethral Surgery and Concomitant Chemoradiation by Either BID Irradiation plus 5-Fluorouracil and Cisplatin or QD Irradiation plus Gemcitabine Followed by Selective Bladder Preservation and Gemcitabine/Cisplatin Adjuvant Chemotherapy | Phase II pilot efficacy: randomized noncomparative design with clinical efficacy endpoint | To estimate the rate of distant metastasis at 3 years of two induction chemoradiotherapy regimens, including 5-FU, cisplatin, and BID irradiation (FCI) or gemcitabine and QD irradiation (GI), followed by radical cystectomy if the tumor response is incomplete or by consolidation chemoradiotherapy if the tumor has cleared, with both followed by adjuvant chemotherapy. |
NRG/RTOG 0915: A Randomized Phase II Study Comparing 2 Stereotactic Body Radiation Therapy (SBRT) Schedules for Medically Inoperable Patients with Stage I Peripheral Non–Small Cell Lung Cancer | Phase II pilot efficacy: randomized noncomparative design with adverse event endpoint | To evaluate the rate of 1-year grade 3 or higher adverse events that are definitely, probably, or possibly related to SBRT treatment. |
NRG-BN001: Randomized Phase II Trial of Hypofractionated Dose-Escalated Photon IMRT or Proton Beam Therapy versus Conventional Photon Irradiation with Concomitant and Adjuvant Temozolomide in Patients with Newly Diagnosed Glioblastoma | Phase II pilot efficacy: two-arm randomized design | To determine whether dose-escalated photon IMRT or proton beam therapy with concomitant and adjuvant temozolomide improves overall survival as compared with standard-dose photon irradiation with concomitant and adjuvant temozolomide. |
NRG-GY003: A Phase III Study Comparing Single-Agent Olaparib or the Combination of Cediranib and Olaparib to Standard Platinum-Based Chemotherapy in Women with Recurrent Platinum-Sensitive Ovarian, Fallopian Tube, or Primary Peritoneal Cancer | Phase III: superiority design with clinical efficacy endpoint | To assess the efficacy of either single-agent olaparib or the combination of cediranib and olaparib, as measured by progression-free survival, as compared with standard platinum-based chemotherapy. |
NRG-CC001: A Randomized Phase III Trial of Memantine and Whole-Brain Radiotherapy with or without Hippocampal Avoidance in Patients With Brain Metastases | Phase III: superiority design with neurocognitive toxicity endpoint | To determine whether HA-WBRT increases time to neurocognitive decline on a battery of tests: the Hopkins Verbal Learning Test–Revised (HVLT-R) for Total Recall, Delayed Recall, and Delayed Recognition, Controlled Oral Word Association (COWA), and the Trail Making Test (TMT) Parts A and B. |
NRG/RTOG 0415: A Phase III Randomized Study of Hypofractionated 3DCRT/IMRT versus Conventionally Fractionated 3DCRT/IMRT in Patients Treated for Favorable-Risk Prostate Cancer | Phase III: noninferiority design with clinical endpoint | To determine whether hypofractionated 3D-CRT/IMRT will result in disease-free survival (DFS) that is no worse than DFS following conventionally fractionated 3D-CRT/IMRT. |
NRG-GU003: A Randomized Phase III Trial of Hypofractionated Post-Prostatectomy Radiation Therapy (HYPORT) versus Conventional Post-Prostatectomy Radiation Therapy (COPORT) in Treating Patients with Prostate Cancer | Phase III: noninferiority design with adverse event endpoint | To demonstrate whether hypofractionated postprostatectomy radiotherapy does not increase 2-year patient-reported GI and GU symptoms over conventionally fractionated postprostatectomy radiotherapy. |
NRG/RTOG 1216: Randomized Phase II/III Trial of Surgery and Postoperative Radiation Delivered with Concurrent Cisplatin versus Docetaxel versus Docetaxel and Cetuximab for High-Risk Squamous Cell Cancer of the Head and Neck | Phase II/III integrated design | Phase II: To select the better of two experimental arms to potentially improve disease-free survival over radiation and cisplatin. Phase III: To determine whether the selected experimental arm will improve overall survival over a radiation and cisplatin control arm. |
a Trial synopsis and current status available at https://www.nrgoncology.org/Clinical-Trials/Protocol-Table .