Clinical Trial Designs in Oncology


Summary of Key Points

  • Phase I trials to determine recommended doses and schedules for further testing of new treatments need to be designed to minimize the number of patients exposed to unacceptable toxicity.

  • The use of formal interim monitoring of accruing outcome data in phase II and phase III trials is important in order to allow trials to be stopped as soon as possible because of positive or negative results, while retaining the statistical validity of the trial conclusions.

  • Phase II/III trial designs, multiarm designs, and the use of master protocols, which allow treatment arms to be added to an ongoing trial, can speed the development of new treatments but are not without disadvantages.

  • It is important that the primary end point of a trial be chosen so that the results of the trial will meet its clinical objectives.

  • Trial designs have been developed to assess whether a biomarker can be used to identify a subgroup of patients who benefit from an agent that targets tumors that express a particular molecular abnormality.

  • Simultaneous screening of patients for multiple, possibly rare, biomarkers in order to assign them to different biomarker-restricted subtrials is a highly efficient way to assess multiple biomarker-driven treatment hypotheses.

Clinical trials are designed to answer specific clinical questions in the development of new treatments. In order to address the study question, it is important for the trial protocol to describe prospectively how patients will be treated and how the resulting data will be analyzed. This chapter begins with a review of the traditional classification of clinical trial designs: phase I trials (to find safe doses and schedules of new agents), phase II trials (to assess the biologic activity or to offer a preliminary assessment of clinical efficacy), and phase III trials (to provide definitive evidence of treatment efficacy for changing clinical practice). Interim monitoring, which allows trials to be stopped early based on the accrual of positive or unpromising results, is discussed next. Phase II/III trials, which combine the phase II and phase III trials into one trial, can speed the development of new treatments. For all phases of clinical trials, it is important to choose the primary end point to meet the objectives of the trial; some commonly used end points are discussed. The chapter ends with a discussion of trial designs that use and evaluate biomarkers and treatments together, considering both situations in which there is a single biomarker and associated experimental treatment and situations in which there are multiple biomarkers that can potentially help choose which among many treatments would be best for the patient. The chapter offers a brief survey of the important elements of the designs of clinical trials to address various cancer treatment questions. More detailed expositions including reviews of statistical methods are given elsewhere.

Phase I Designs

The purpose of a phase I trial is to find a dose and schedule of a new therapy to take forward for further evaluation. This is typically done by sequentially treating small cohorts of patients with advanced disease (e.g., three patients), starting at a low dose of the agent(s) that is not expected to cause toxicity, and then increasing dose levels until unacceptable rates of toxicity (dose-limiting toxicity [DLT]) are seen. Phase I trials are designed to be small and are typically not limited to a specific histologic type, so that the investigators can quickly move on to testing the therapy for efficacy.

Although there are many possible phase I dose escalation designs, there are two general approaches to choosing the next dose level to be tested based on the DLTs seen up to that point. One approach uses only the information on the patients being treated at the current dose level (and, if available, at the dose levels immediately above and below it). For example, the commonly used 3 + 3 design treats cohorts of three patients and decides the next dose level (escalate one level, de-escalate one level, or remain at the same level) based on the number of DLTs seen at the current dose: no DLTs in three patients or one DLT in six patients, escalate; one DLT in three patients, remain at the same level; and two or more DLTs in three patients or two or more DLTs in six patients, de-escalate. The recommended dose is the highest dose level at which there was no more than one DLT in six patients. A modification of the 3 + 3 design that is slightly more aggressive, the “rolling six design,” is sometimes used when accrual is rapid as compared with the evaluation period. Additional modifications that accelerate the design with one- or two-patient cohorts initially until some (possibly less than DLT) toxicity is seen have also been considered. These designs implicitly target a 20–25% DLT rate as maximally acceptable; designs that target different DLT rates are possible.
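The 3 + 3 decision rules above can be written out as a simple lookup. The following is a minimal sketch (the function name and the returned action strings are ours, for illustration only):

```python
def three_plus_three(n_treated: int, n_dlt: int) -> str:
    """Dose-escalation decision for the classic 3 + 3 design, given the
    number of patients treated and DLTs observed at the current dose level."""
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"                     # 0/3 DLTs: go up one level
        if n_dlt == 1:
            return "expand to 6 at same level"    # 1/3 DLTs: treat 3 more here
        return "de-escalate"                      # 2 or 3 DLTs in 3 patients
    if n_treated == 6:
        if n_dlt <= 1:
            return "escalate"                     # at most 1 DLT in 6 patients
        return "de-escalate"                      # 2 or more DLTs in 6 patients
    raise ValueError("3 + 3 decisions are made after 3 or 6 patients")
```

The recommended dose is then the highest level at which the trial accumulated six patients with at most one DLT.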

The other general approach to phase I dose escalation is based on a statistical model for the probability of DLT as a function of the dose level. An example of a model-based method is the continual reassessment method. Model-based escalation approaches use the DLT data from all the current patients treated to determine the dose level for the next cohort of patients. For example, the decision to escalate from dose level 3 to level 4 will partially depend on the proportion of DLTs seen at dose level 1. The benefit of model-based escalations is that by using all the data, the investigators can better choose the next dose level at which to treat patients (if the assumed model is correct). The problem with these methods is that if the assumed model is incorrect, too many patients may be treated at dose levels that are too high. To mitigate this problem, it is useful to specify some constraints on model-based approaches—for example, not allowing the treatment of new patients at (or above) a dose level at which 33% or more of at least four patients have had a DLT. Regardless of whether one uses a statistical model to guide the dose escalation, one could use a model to fit the toxicity data (as a function of dose) after the trial is over to identify the dose to recommend for further testing.

Combinations of Agents

For phase I trials involving a combination of agents, the goal is to identify the doses of each of the agents to be used in the combination. There may be many dose combinations that have acceptable toxicity. For example, for a combination of agents X and Y, both high-dose X plus low-dose Y and low-dose X plus high-dose Y may have acceptable toxicity (but not high-dose X plus high-dose Y), so the choice between the two acceptable combinations (and the dose escalation scheme) will need to be made based on biologic and clinical considerations. If it is known that a certain minimal dose of X is necessary for its effectiveness, then the escalation would be of agent Y, with the dose of X fixed at its appropriate level. In addition, the toxicity profiles of the individual agents may suggest dose escalation schemes.

Late Dose-Limiting Toxicities

For some treatments, some DLTs of concern may occur late—for example, toxicities occurring up to a year after radiation treatment. This can lead to a very long phase I trial if one has to wait a lengthy evaluation period (e.g., 1 year) for each cohort of patients before the next cohort can be treated. One approach to this problem is to model the occurrence times of DLTs, for example, by assuming that DLTs are uniformly likely to occur over the evaluation period. If this were true, then one could extrapolate from the early toxicity experience of the patients to allow dose escalation before the patients have been followed for the full evaluation period. An example of this approach is the time-to-event continual reassessment method (although this particular method has apparently not worked well in practice). If one designs a trial assuming the uniform occurrence of DLTs over the evaluation period and the assumption does not hold in reality, then one is at risk of exposing too many patients to DLTs. For example, if all the DLTs typically occur near the end of the evaluation period, then, based on not seeing any early DLTs, too many patients may be accrued at that dose level and at a higher dose level. Because of this concern, additional constraints on these approaches are useful—for example, not accruing more than three patients at a dose level until the initial three patients at that dose level have been observed for at least one-half the evaluation period.

Biologic End Points

Phase I designs that escalate to find the highest dose level with acceptable toxicity (maximum tolerated dose) are based on the assumptions that (1) higher doses are more effective than lower doses, (2) higher doses are associated with more toxicity than lower doses, and (3) there is a dose with acceptable toxicity that will be effective. As opposed to cytotoxic agents, with targeted agents the first two assumptions may not hold, and with immunologic agents the first assumption may not hold. This suggests the use of a biologic (nontoxicity) end point to guide the dose escalation and to choose the dose for further testing. Although attractive in theory, designs using nontoxicity end points have been infrequent in practice. Instead, it may be preferable to assess the biologic end point at the maximum dose level determined according to DLT, and possibly at lower dose levels for comparison. In the context of a targeted agent, “phase 0 trials” are very small first-in-human trials that are designed not to find the recommended dose of the new agent, but instead to assess the agent's effect on its molecular target through use of low exposures of the agent not expected to cause any toxicity.

Phase II Designs

We first define some statistical parameters used to design clinical trials to evaluate treatment effects. The three principal parameters are (1) the target treatment effect—the treatment effect of interest, which the study should have a reasonable chance to identify (referred to as the alternative hypothesis, contrasted with the null hypothesis of no treatment effect); (2) the false-positive error rate (type I error)—the probability of the study findings being positive when the null hypothesis is true; and (3) the false-negative error rate—the probability of the study findings being negative when the alternative hypothesis is true, that is, the target treatment effect is present. (Note that the study power, the probability of a positive study result under the alternative hypothesis, is 1 minus the false-negative error rate.) An appropriate statistical design is obtained by selecting values for the three parameters corresponding to the study goal.

Phase II trials are designed to provide preliminary efficacy data to identify therapies that have sufficient activity to be worthy of further testing in definitive (phase III) trials. Because they are preliminary, the designs allow the probability of moving an ineffective therapy forward to be larger than one would use in a phase III trial (e.g., a type I error of 10% instead of 2.5%). More important, phase II trials typically use end points that measure treatment activity (e.g., tumor shrinkage); the effect of the treatment on these end points is expected to be larger and available more quickly than the effect of the treatment on a phase III end point that measures direct patient benefit (e.g., survival). Both of these factors allow for a smaller sample size and shorter time to completion than a phase III trial.

For a phase II trial of a single agent, the historical approach is a single-arm trial in a specific histologic type designed to assess whether the experimental agent yields sufficient responses (partial or complete responses as defined by the Response Evaluation Criteria in Solid Tumors [RECIST]); the denominator for calculating the response rate is typically taken to be all patients who started the agent. For example, one could consider a trial of 32 patients in which the agent would be deemed worthy of further study if four or more responses were seen. This design would have both false-positive and false-negative error probabilities of less than 10% (a typical value chosen) for the null and alternative response rates of interest. If the true response rate was 5% or less (the null hypothesis, considered too low to be interesting), then there would be less than a 10% probability of declaring the agent worth pursuing, and if the true response rate was 20% or higher (the alternative hypothesis), then there would be less than a 10% probability of a negative conclusion.
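The error probabilities for this 32-patient example can be checked directly from the binomial distribution. A sketch using only the Python standard library:

```python
from math import comb

def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

n, cutoff = 32, 4                        # declare "worthy" if >= 4 responses in 32
alpha = binom_tail(cutoff, n, 0.05)      # false-positive rate under a 5% true response rate
beta = 1 - binom_tail(cutoff, n, 0.20)   # false-negative rate under a 20% true response rate
print(f"type I error = {alpha:.3f}, type II error = {beta:.3f}")
```

Both error probabilities come out just under 10%, as the text states.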

To minimize the number of patients treated with an inactive agent, various two-stage designs have been developed that allow for early stopping because of negative results. For example, in a study with a Simon minimax two-stage design for targeting 5% versus 20% response rates (with 10% error rates), 18 patients are treated at the first stage. If there is at least one response among these 18 patients, a second stage of 14 patients is accrued. The agent is considered worthy of further study if there are at least four responses seen among the 32 patients. The required sample size of a single-arm phase II trial (either one-stage or two-stage) depends on the hypothesized targeted response rates of interest (e.g., 20% versus 5%), and the error probabilities (both 10% in the aforementioned examples). The sample size will be smaller when the difference in target response rates is larger (e.g., 25% versus 5%) or when the error probabilities are larger (e.g., 15% instead of 10%).
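The operating characteristics of this two-stage design can be verified exactly by summing over the possible stage 1 outcomes. A sketch (pure standard library; the design parameters are those quoted above):

```python
from math import comb

def pmf(x: int, n: int, p: float) -> float:
    """Binomial probability mass function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def reject_prob(p: float, n1: int = 18, r1: int = 0, n2: int = 14, r: int = 3) -> float:
    """Probability of declaring the agent worthy of further study:
    continue past stage 1 only if responses exceed r1, and conclude
    positively if total responses exceed r among the n1 + n2 patients."""
    total = 0.0
    for x1 in range(r1 + 1, n1 + 1):      # trial continues to stage 2
        for x2 in range(n2 + 1):
            if x1 + x2 > r:               # >= 4 responses in 32 patients overall
                total += pmf(x1, n1, p) * pmf(x2, n2, p)
    return total

early_stop = pmf(0, 18, 0.05)             # P(stopping after stage 1 | null 5% rate)
print(f"type I error = {reject_prob(0.05):.3f}")
print(f"power        = {reject_prob(0.20):.3f}")
print(f"P(early stop | null) = {early_stop:.3f}")
```

Note that under the null 5% response rate, the trial stops at 18 patients roughly 40% of the time, which is the point of the two-stage design.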

Randomized Screening Designs

There are two general situations in which a single-arm phase II trial design will not be appropriate because the results would be difficult to interpret. The first is when the agent is thought to be primarily cytostatic rather than cytotoxic, so an effective agent might not shrink tumors in a nonnegligible proportion of patients. One could, in theory, use a time-to-event end point such as progression-free survival (PFS) or overall survival (OS) in this situation. However, because of trial-to-trial variability in time-to-event outcomes, it can be difficult to identify a benchmark for an experimental treatment to beat in a single-arm trial. The other situation in which a single-arm phase II trial design would not be appropriate is when the treatment being evaluated is a combination of agents (agent X + agent Z), one of which is known to be active (agent Z). The problem is again identifying a null benchmark (e.g., response rate), in this case to isolate the contribution of agent X; there may be some data on the response rate of agent Z alone, but typically not from enough different trials in comparable settings that one could feel comfortable moving the combination therapy forward if its observed response rate was moderately higher than that of the response rate of Z alone.

For situations in which a single-arm design is not appropriate, a randomized screening design is an alternative. In this design, patients are randomized to the experimental treatment versus a control treatment to demonstrate superior activity with use of a phase II end point. Some examples of sample sizes for a response rate end point and a time-to-event end point (e.g., PFS, OS) are shown in Tables 18.1 and 18.2, respectively. Note that for time-to-event end points, in which the time-to-event (survival) curves are typically compared with a log-rank statistic, the design can be specified in terms of the hazard ratio and the required number of events to be observed. For example, a trial to detect a hazard ratio of 0.5 with 90% power and 10% one-sided type I error would require 55 events. If one were targeting an improvement to a median PFS of 6 months (for the experimental treatment) from a median PFS of 3 months (for the control treatment), then one could randomize 62 patients over 1 year; the analysis would be performed when 55 events were observed, which would be about 9 months after the last patient was randomized. There is a choice regarding which patients should be included in the primary analysis of a randomized screening design: an intent-to-treat analysis, which includes all eligible randomized patients, or an analysis of only those eligible randomized patients who started their assigned treatment. Restricting the analysis to patients starting treatment will provide a more sensitive assessment of treatment efficacy but is subject to bias if substantially more patients in one arm drop out of the trial before starting treatment; in the context of a phase II trial, this is not an issue if the trial is blinded.

Table 18.1
Examples of Required Total Sample Sizes for Phase II and Phase III Randomized Trials With a Response Rate End Point to Achieve 90% Power for the Designated Target Response Rates
Control response rate        20%   20%   20%   40%   40%   40%   60%   60%
Experimental response rate   30%   40%   50%   50%   60%   70%   80%   90%
PHASE II (ONE-SIDED TYPE I ERROR = 10%)
Sample size                  530   156    78   688   182    82   156    66
PHASE III (ONE-SIDED TYPE I ERROR = 2.5%)
Sample size                  824   236   116  1076   280   126   236    98

Table 18.2
Examples of Required Total Numbers of Events for Phase II and Phase III Randomized Trials With a Time-to-Event (Survival) End Point to Achieve 90% Power for the Designated Target Hazard Ratio a
Hazard ratio        0.90   0.80   0.71   0.67   0.60   0.56   0.50   0.40
1/Hazard ratio      1.11   1.25   1.40   1.50   1.67   1.80   2.00   2.50
PHASE II (ONE-SIDED TYPE I ERROR = 10%)
Number of events    2367    528    232    160    101     76     55     31
PHASE III (ONE-SIDED TYPE I ERROR = 2.5%)
Number of events    3786    844    371    256    161    122     87     50

a For outcomes that are approximately exponentially distributed, the reciprocal of the hazard ratio is approximately equal to the ratio of the median survival in the experimental treatment arm to the median survival in the control arm.
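The event counts in Table 18.2 follow from the standard Schoenfeld approximation for a log-rank test with 1:1 randomization: events = 4(z_alpha + z_beta)^2 / (log hazard ratio)^2. A sketch using the Python standard library (rounding conventions may differ slightly from the table):

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hr: float, alpha_one_sided: float, power: float) -> int:
    """Schoenfeld approximation: number of events needed for a log-rank
    test with equal (1:1) allocation to detect hazard ratio `hr`."""
    z = NormalDist().inv_cdf
    return ceil(4 * (z(1 - alpha_one_sided) + z(power)) ** 2 / log(hr) ** 2)

print(required_events(0.5, 0.10, 0.90))   # phase II column for HR 0.50: 55 events
print(required_events(0.5, 0.025, 0.90))  # phase III column: ~88 (table lists 87 by rounding)
```

Note that the formula depends on the hazard ratio and the error probabilities only; the number of patients needed to produce these events depends on the event rate and follow-up.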

Unequal randomization is possible in a screening design, whereby, for example, twice as many patients are randomized to the experimental arm as to the control arm. Unequal randomization makes the required sample size larger—for example, 13%, 33%, or 178% larger for 2 : 1, 3 : 1, or 9 : 1 randomization, respectively. It is sometimes justified on the grounds that the unequal randomization will make the trial more attractive to patients, thereby speeding accrual (although we know of no hard evidence for this). A better justification for unequal randomization is that in situations with limited experience with use of the experimental agent, it will allow more experience to be obtained (e.g., 40 patients instead of 30 in a 60-patient trial using 2 : 1 randomization instead of 1 : 1 randomization). Other issues concerning randomized screening designs are discussed later in the section on phase III randomized designs.
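The inflation factors quoted above come from comparing the variance of a two-sample comparison under k:1 versus 1:1 allocation; the relative sample size is (k + 1)^2 / (4k). A minimal sketch:

```python
def inflation(k: float) -> float:
    """Relative total sample size of a k:1 randomization versus 1:1,
    holding power fixed (variance ratio (k + 1)**2 / (4 * k))."""
    return (k + 1) ** 2 / (4 * k)

for k in (2, 3, 9):
    print(f"{k}:1 randomization needs {100 * (inflation(k) - 1):.1f}% more patients")
```

For 2 : 1 allocation the factor is 9/8, i.e., 12.5% (about 13%) more patients; for 9 : 1 it is 25/9, nearly triple the equal-allocation sample size.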

Randomized Selection Designs

Another type of randomized phase II design is the randomized selection design. In this design, the two (or more) treatment arms are experimental treatments. The goal of the trial is to select which of the treatment arms to take forward to further testing. At the end of the trial, the treatment arm that has the better efficacy outcomes is selected for further development. For example, if response rate was chosen as the outcome, then the arm with the better response would be selected. The sample size for this design is chosen so that there is a high probability that one will not select a worse treatment arm, if in fact there is one that is worse by a specified indifference margin. Selection designs require smaller sample sizes than screening designs because the conclusions from the design are much weaker (demonstrating that treatment X is not much worse than treatment Y instead of demonstrating that treatment X is better than treatment Y). When it is possible to evaluate each arm individually for efficacy—for example, when a response rate end point is being used for evaluating single agents—it is possible to embed within the selection design minimum (single-arm) response rates. For example, one could select the treatment arm with the better response rate provided that the response rate was at least 15%; otherwise, neither arm would warrant further study. The typical use of selection designs is for deciding which of several schedules of a new agent to use for further agent development.

Designs With Biomarkers

Phase II trials offer an opportunity to assess biomarkers for their ability to enable identification of a patient population for which the experimental treatment will be effective or especially effective. Unless there is strong evidence that the treatment will work only in biomarker-positive patients, the phase II trial should not restrict enrollment to such patients. Instead, the trial design should accommodate the possibility that benefit will be restricted but not assume it. For example, in a single-arm trial with a response rate end point, two-stage designs can be used wherein both biomarker-positive and biomarker-negative patients are accrued at the first stage; depending on what responses are seen at the first stage, the trial is either stopped or more biomarker-positive or unselected patients are accrued.

When there is strong evidence that a molecularly targeted agent will work only in biomarker-positive patients and some evidence that its efficacy may not depend on tumor histologic type, a single phase II trial that pools response data across biomarker-positive patients in different histologic groups may be useful. However, in the absence of reliable evidence that the activity level is uniform across these histologic groups, a separate evaluation may have to be conducted in each group to allow for a reliable evaluation. Note that the use of Bayesian hierarchical modeling to borrow information across groups does not work in this setting, although simpler pooled futility rules can be useful.

In settings in which a randomized screening design is used, one would again generally not restrict enrollment to biomarker-positive patients. After enrolling all comers, one could consider in an ad hoc manner the experimental-versus-control treatment effect in the overall group and the biomarker-positive and biomarker-negative subgroups. Alternatively, one could use a formal phase II biomarker trial design that directly addresses the relevant drug development question: Should a phase III trial be performed, and if so, how should the biomarker be incorporated into that design? Phase II designs with multiple biomarkers are discussed later under Designs With Multiple Biomarkers.

Phase III Designs

Phase III trials are the definitive randomized comparisons of treatments and should provide sufficiently compelling evidence to change clinical practice. Accordingly, they are designed with a small false-positive rate—for example, a one-sided type I error rate of 2.5%, and high power for detecting truly positive effects (e.g., 80% to 90%). In addition to depending on the type I error and power, the required sample size depends inversely on the magnitude of the targeted treatment effect between the treatment arms and is typically hundreds or thousands of patients (see Tables 18.1 and 18.2). As noted previously, for time-to-event end points the power is determined by the number of events observed rather than the number of patients enrolled. Therefore the required sample size will be larger in a clinical setting with lower event rates than in one with higher event rates. For example, with 3 years of accrual and 3 years of follow-up, a setting with 70% 2-year event-free survival will have a required sample size roughly 40% larger than that needed for a setting with 50% 2-year event-free survival. For time-to-event end points, the definitive analysis and interim analyses are typically performed when the expected number of events has been observed in both treatment arms together, but there are other alternatives; the specific timing of the definitive and interim analyses should be specified in the trial protocol. The trial protocol should also specify the analyses that will be performed and which patients will be included in the analyses. Typically, the intent-to-treat principle is used, wherein all eligible randomized patients are included in the analyses, and eligibility is determined before randomization or assessed blindly based on prerandomization patient samples.
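The "roughly 40%" figure can be illustrated under an exponential survival model with uniform accrual (a sketch; the exact inflation depends on these distributional assumptions):

```python
from math import exp, log

def event_prob(surv_2yr: float, accrual_yrs: float = 3.0, followup_yrs: float = 3.0) -> float:
    """Probability a patient has an event by the final analysis, assuming
    exponential survival with the given 2-year event-free rate and uniform
    accrual: follow-up ranges from followup_yrs to accrual_yrs + followup_yrs."""
    lam = -log(surv_2yr) / 2.0                      # exponential hazard rate
    a, f = accrual_yrs, followup_yrs
    mean_surv = (exp(-lam * f) - exp(-lam * (a + f))) / (lam * a)
    return 1.0 - mean_surv

# The required number of events is the same in both settings, so the
# required N scales inversely with the per-patient event probability.
ratio = event_prob(0.50) / event_prob(0.70)
print(f"sample size inflation for the 70% EFS setting: {100 * (ratio - 1):.0f}%")
```

Under these assumptions the 70% event-free setting needs on the order of 40% more patients than the 50% setting to observe the same number of events by the analysis time.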

Randomization and Stratification

The randomization in a phase III trial is typically 1 : 1 (to avoid the inefficiency of an unbalanced randomization), and, after informed consent has been obtained, is ideally performed right before the experimental treatment is given (to avoid having to include patients who drop out of the trial before receiving the treatment under study). The randomization is typically stratified on a small number of variables thought to be highly associated with outcome to balance the distribution of these variables across the treatment arms and thus ensure that approximately equal proportions of patients in each arm of the trial will have relatively good and bad prognoses (as determined according to these stratifying variables). A stratified analysis can be used even if stratification was not part of the randomization—for example, when the stratifying variable is defined with a biomarker assay that is performed on a prerandomization sample after the randomization. Stratification is more important in smaller trials than larger ones because the chances of a worrying imbalance are larger in a small trial. When outcomes have a subjective component, blinding of the treatment assignment with placebos (when feasible) can remove a concern about potential bias in the trial results.

Multiarm Trials

A trial with multiple experimental arms and a single control arm can be an efficient way to test multiple treatments in a single disease setting. For example, a single randomized clinical trial (RCT) with a control arm and three experimental arms can be 33% smaller than three separate RCTs evaluating each experimental treatment separately. Multiarm trials can add complexity because they may require multiple placebos, restriction of eligibility to ensure that all the agents can be given safely, agreement among multiple industry partners to participate and agreement regarding how the data will be analyzed and shared, complex funding arrangements, and additional regulatory complexities.

A factorial trial design is a multiarm trial in which the arms represent all possible combinations of the experimental and control treatments. For example, in a 2 × 2 factorial design evaluating experimental treatments A and B (relative to no treatment), patients are randomized to one of the following four arms: (1) no treatment (arm C), (2) treatment A (arm A), (3) treatment B (arm B), or (4) treatment A plus treatment B (arm AB). (In some applications, a common backbone treatment will be given in all the arms.) A factorial analysis of a factorial design assumes that the effect of treatment A does not depend on whether the patient received treatment B—that is, the benefit of arm A over arm C is the same as the benefit of arm AB over arm B. (This is equivalent to assuming that the effect of treatment B does not depend on whether the patient received treatment A.) This assumption, which is referred to as no statistical interaction among the treatments, is very strong. When the assumption is satisfied, one can assess the efficacy of A and B in a trial with the same sample size as would be required to assess a single agent, a remarkable savings. However, if the assumption is not true, then a factorial analysis can incorrectly suggest that agents work when they do not, and do not work when they do. Even when a factorial analysis is problematic, the factorial trial design with a nonfactorial analysis is an efficient way to study two treatments and their possible interaction.

When a multiarm trial allows new trial arms to be added as it is ongoing, it is referred to as a “master” protocol or a “platform” trial. The idea is that when promising new agents become available they can be added (possibly along with corresponding control arms). In addition, trial arms are dropped when the questions involving these arms have been answered. An example of a master protocol is the STAMPEDE trial, which evaluates various agents in addition to a standard hormone therapy for advanced prostate cancer. The advantage of a master protocol is that it allows testing of agents more quickly because a new protocol does not need to be developed for each new treatment. This can be especially important when new treatment arms are suggested by the ongoing discovery of new biomarkers and targeted treatments. For any master protocol, one must avoid potential pressure to add new treatment arms or trial questions that may not be of the highest priority just to keep the trial ongoing.

Noninferiority Trials

In a noninferiority trial, one is interested in showing that the new treatment is not inferior to the standard treatment. These trials arise when the new treatment is expected to be less toxic or more convenient than the standard therapy. The sample size considerations are similar to those of a superiority trial (see Tables 18.1 and 18.2) with one important exception: the targeted treatment effect will typically be smaller, leading to much larger trials. The targeted effect is small because one does not want to recommend a new treatment that is inferior to a standard treatment by a nonnegligible amount. In particular, it has been suggested that the targeted difference should be no greater than a fraction of the benefit (e.g., 50%) seen for the standard treatment, which itself may not be large. In designing a noninferiority trial, the roles of type I error and power are reversed, so that priority is given to minimizing the probability of declaring noninferiority when the new treatment is actually inferior; this probability should be kept very small (e.g., 2.5%), whereas the probability of declaring noninferiority when the treatments are equally effective could be set at 80% to 90%. For noninferiority trials, using the intention-to-treat principle can be problematic if there is a nonnegligible proportion of patients who do not receive their assigned treatments, because this can lead to underestimation of any true loss in efficacy; to accommodate this, an additional analysis of only those patients who received their assigned treatments is typically performed.

In some situations, it is expected that the new treatment will actually be modestly better than the standard treatment, but because it is less toxic, it would be sufficient to demonstrate that it is noninferior to the standard treatment. In these special situations, the required sample size can be reduced.
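To see numerically why small targeted differences inflate noninferiority trials, the same Schoenfeld-style events calculation used for superiority trials can be applied to a noninferiority margin. A sketch (the HR = 1.2 margin here is a hypothetical illustration, not a recommendation):

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hr: float, alpha_one_sided: float, power: float) -> int:
    """Schoenfeld approximation for a log-rank comparison with 1:1 allocation."""
    z = NormalDist().inv_cdf
    return ceil(4 * (z(1 - alpha_one_sided) + z(power)) ** 2 / log(hr) ** 2)

# A superiority trial targeting HR = 0.67, versus a noninferiority trial that
# must rule out a margin of HR = 1.2, both at one-sided 2.5% error and 90% power
print(required_events(0.67, 0.025, 0.90))  # a few hundred events
print(required_events(1.2, 0.025, 0.90))   # well over a thousand events
```

Because the log of a margin such as 1.2 is much closer to zero than the log of a typical superiority target, the required number of events grows severalfold.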
