Bayesian Statistical Methods in Stem Cell Transplantation and Cellular Therapy


Elements of Bayesian Statistics

The main objects in a statistical model are observable data, which may include one or more outcomes, Y, and possibly a vector, X, of patient covariates and treatments, and a parameter vector, θ. Parameters are unobserved conceptual objects, such as the probability of response, median survival time, effects of X on Y, or latent effects of individual patients or subgroups, such as disease subtypes. In the Bayesian paradigm, θ is considered to be random. To reflect one’s knowledge or uncertainty about θ before observing data, a Bayesian model includes a prior probability distribution, prior(θ). In contrast, classical frequentist statistics considers θ to be fixed but unknown.

The next model component is the likelihood function, lik(data | θ), which is a probability distribution for the data for given θ. Some commonly assumed likelihoods are the normal or ‘bell-shaped’ distribution for numerical Y, the binomial distribution for Y equal to the number of responders in a sample of subjects, and the Weibull distribution for Y an event time, such as progression-free survival (PFS) or overall survival (OS) time. In Bayesian statistics, the observed data are used to learn about the distribution of θ by applying Bayes’ Law, which combines the prior and likelihood to compute the posterior distribution,


posterior(θ | data) = lik(data | θ) × prior(θ) / prob(data)

where prob(data) is defined as the average of lik(data | θ) prior(θ) over θ. Bayesian statistical inferences about θ are based on the posterior, which is a function of the data and the prior. When prob(data) cannot be computed mathematically, modern Markov chain Monte Carlo (MCMC) algorithms are used to compute posterior(θ | data) numerically. This produces a large posterior sample of parameter values θ (1), θ (2), ..., θ (M) that can be used to compute any posterior quantity of interest, such as a mean or 95% credible interval, for a given entry of θ.
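To illustrate how such a posterior sample may be summarized, the following minimal Python sketch (not from the original chapter) computes a posterior mean and an equal-tail 95% credible interval from a sample of draws; here the draws are generated directly from a beta distribution as a stand-in for MCMC output, and the parameters, sample size, and random seed are arbitrary.

```python
# A minimal sketch (not from the chapter) of summarizing a posterior sample
# theta^(1), ..., theta^(M).  The beta draws below are only a stand-in for the
# output of an MCMC sampler; the parameters, sample size, and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
draws = rng.beta(11.5, 30.5, size=10_000)             # hypothetical posterior sample

posterior_mean = draws.mean()
ci_low, ci_high = np.percentile(draws, [2.5, 97.5])   # equal-tail 95% credible interval
print(f"posterior mean = {posterior_mean:.3f}, 95% ci = ({ci_low:.3f}, {ci_high:.3f})")
```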

When θ is a probability, it commonly is assumed to follow a beta prior distribution. Because of its simplicity, the beta distribution will be used here to illustrate some general Bayesian ideas. A beta (a, b) distribution has positive-valued hyperparameters a and b, which determine its mean a/(a + b) and effective sample size ESS = a + b. The ESS quantifies a distribution’s informativeness, with an ESS ≤ 1 corresponding to a noninformative prior. Fig. 4.1A shows three beta distributions with ESS = 1, 10, and 20, and means = .30, .30, and .80. Fig. 4.1B shows a beta (.30, .70) (dotted line) and three betas with ESS = 20 and means .20, .40, and .80. Fig. 4.1C shows three betas with mean = .40 and ESS = 20, 40, and 100, corresponding to samples with 8/20, 16/40, or 40/100 responses. This illustrates that a nominal “40% response rate” is a statistical estimator that can have very different levels of reliability, depending on sample size.

Fig. 4.1D illustrates a Bayesian two-sample B-versus-A treatment comparison based on a binary response. If 10/40 responses are observed with treatment A versus 18/40 responses with treatment B, then the posterior probability that B has a higher response rate than A is Pr(θ A < θ B | data) = .97. In contrast, from a frequentist viewpoint, based on these data values, a two-sided Fisher’s Exact test of the null hypothesis H 0 : θ A = θ B has p-value = .10. Conventionally, this is considered “nonsignificant” because it is larger than .05, so the test would not lead to rejection of H 0. Thus, the Bayesian analysis and the conventional frequentist test lead to opposite conclusions. Discussions of problems with how p-values are often misused or misinterpreted in practice are given by Gelman and Stern, Wasserstein and Lazar, and in Chapter 5 of Thall, among many others.
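As an illustration of the comparison in Fig. 4.1D, the following sketch approximates Pr(θ A < θ B | data) by Monte Carlo and computes the corresponding Fisher’s Exact test. The beta(.5, .5) priors used here are an assumed noninformative choice (ESS = 1); the chapter does not state the exact prior used for this calculation.

```python
# A minimal sketch (with assumed beta(.5, .5) priors) of the B-versus-A
# comparison in Fig. 4.1D: 10/40 responses with A and 18/40 with B.
import numpy as np
from scipy.stats import fisher_exact

y_A, n_A = 10, 40
y_B, n_B = 18, 40

rng = np.random.default_rng(1)
M = 200_000
theta_A = rng.beta(0.5 + y_A, 0.5 + n_A - y_A, size=M)  # posterior draws for theta_A
theta_B = rng.beta(0.5 + y_B, 0.5 + n_B - y_B, size=M)  # posterior draws for theta_B
print("Pr(theta_A < theta_B | data) =", (theta_A < theta_B).mean())  # approximately .97

# Frequentist comparison: two-sided Fisher's Exact test of H0: theta_A = theta_B
odds_ratio, p_value = fisher_exact([[y_A, n_A - y_A], [y_B, n_B - y_B]])
print("Fisher's Exact p-value =", p_value)  # approximately .10
```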

Fig. 4.1, Illustrations of beta distributions with various means and ESS values. A. gives three distributions, B. gives three distributions with different means but the same ESS = 40, the three overlapping shaded areas in C. are Pr(θ > .30) for a beta (8,12), beta(16, 24), or beta (40, 60) distribution, and D. gives two distributions with ESS = 40 but different means. ESS is Effective sample size.

To see how the binomial-beta Bayesian model may be used in practice, suppose that an experiment with a binary “success or failure” outcome is performed repeatedly, a total of N times, with all repetitions independent of each other. Let Y = number of successes out of the N repetitions, and denote the probability of success for any given experiment by θ. It follows that Y follows a binomial distribution with parameters N and θ, which may be written formally as [Y | θ] ~ binom (N, θ). If it is assumed that θ follows a beta prior distribution, written θ ~ beta (a,b), as noted earlier, a noninformative prior is obtained if the ESS = a + b is a small value. A useful property of the binomial-beta model is that, given observed data Y and N, the posterior distribution is [θ | Y, N] ~ beta(a + Y, b + N–Y). This shows that the beta is a conjugate prior for the binomial distribution, since the posterior belongs to the same family of distributions as the prior. In practice, conjugacy greatly facilitates computing posterior quantities. This posterior beta distribution has mean (a + Y)/(a + b + N), which can be written as the weighted average


[(a + b)/(N + a + b)] × [a/(a + b)] + [N/(N + a + b)] × [Y/N]

of the prior mean a/(a + b) and the sample mean Y/N. This form shows that the posterior mean may be thought of as “shrinking” the sample mean toward the prior mean. For large sample size N and small or moderate prior ESS = a + b, since (a + b)/(N + a + b) is small and N/(N + a + b) is close to 1, the posterior mean is close to the sample mean Y/N, so the data dominate the prior, but for small N, the prior mean plays a more prominent role. For example, if θ ~ beta (.5, .5), N = 4, and Y = 1 success is observed, then the posterior is beta (1.5, 3.5), which has mean 1.5/5 = .30, while the empirical mean is Y/N = .25.
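The conjugate update and the shrinkage interpretation of the posterior mean can be verified directly. The following sketch reproduces the beta (.5, .5), N = 4, Y = 1 example above.

```python
# A minimal sketch of the conjugate binomial-beta update and the weighted-average
# form of the posterior mean, using the example beta(.5, .5) prior with N = 4, Y = 1.
a, b = 0.5, 0.5                      # prior hyperparameters: mean a/(a+b) = .50, ESS = 1
N, Y = 4, 1                          # observed data: 1 success in 4 trials

a_post, b_post = a + Y, b + N - Y    # posterior is beta(a + Y, b + N - Y) = beta(1.5, 3.5)
w_prior = (a + b) / (N + a + b)      # weight on the prior mean
w_data = N / (N + a + b)             # weight on the sample mean
posterior_mean = w_prior * a / (a + b) + w_data * Y / N

print(a_post, b_post)                # 1.5 3.5
print(posterior_mean)                # 0.30: the sample mean .25 is shrunk toward the prior mean .50
```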

Advances in computers and numerical algorithms have greatly facilitated practical application of more complex Bayesian models in numerous areas, including medical research, behavioral science, ecology, finance, marketing, artificial intelligence, machine learning, and professional sports. The Bayesian paradigm provides a natural framework for making interim decisions sequentially based on accumulating data, which is especially useful for monitoring clinical trials. Bayes’ Law may be applied repeatedly, using the posterior computed from the accumulated data after each stage of the trial as the prior for the next stage, a process known as Bayesian learning. This can be done for sequentially adaptive dose-finding, applying early stopping rules for futility or safety, or for comparing treatments in randomized group sequential trials.

For example, consider a standard therapy, S, for a particular disease, and define θ S = Pr(response) for a patient treated with S. If asked to specify prior knowledge about θ S, a physician might reply, “between 20% and 40%.” This may be made more precise by saying that the physician is 95% sure, which is formalized by the Bayesian equation Pr(0.20 < θ S < 0.40) = 0.95. The limits 0.20 and 0.40 are called a “95% credible interval (ci)” for θ S. This ci corresponds to the assumption that θ S follows a beta (23, 54) prior, which has mean 23/77 = .30 and ESS 23 + 54 = 77, and would correspond to observing 23 responders and 54 nonresponders in 77 patients treated with S. Suppose that θ E is the response probability with an experimental treatment, E, to be studied in a single-arm phase II trial. To reflect little knowledge about E, one might assume the prior θ E ~ beta (0.30, 0.70), which also has mean .30 but ESS = 1. Thus, the prior on θ E is noninformative, while the prior on θ S is informative. When interim response data on E have been observed, Bayes’ Law may be applied to compute the posterior distribution, posterior(θ E | data). This quantifies what has been learned from the data, and may be used to make statistical inferences about θ E and to construct sequential monitoring rules for use during the trial of E.

In this example, a Bayesian rule to stop the trial of E because of futility might take the form Pr(θ S + .20 < θ E | data) < .04. This says to stop accrual if a desired .20 improvement of E over S in response probability is unlikely, given the observed data. The decision cut-off .04 was chosen by doing preliminary computer simulations to obtain a rule with desirable frequentist operating characteristics (OCs). Such simulations are done repeatedly while assuming two or more different fixed “true” parameter values. This Bayesian posterior decision criterion accounts for both the average value and the variability of θ S, rather than assuming that θ S is a fixed constant, as would be done in frequentist hypothesis testing. For maximum sample size 50 and cohort size 10, simple numerical computations show that this rule says to stop accrual to the trial if [number of responses]/[number of patients evaluated] is less than or equal to 2/10, 5/20, 9/30, or 13/40. The OCs of this rule may be summarized by saying that it stops the trial early with probability .78 and median sample size 20 if θ E true = .30, and it stops with probability .08 and median sample size 50 if θ E true = .50. That is, the rule is likely to stop the trial early if the true response probability of E is the undesirably low value .30, which is the mean of θ S, and it is unlikely to stop the trial early if the true response probability is the desirably high value .50, which is the mean of θ S plus the desired improvement .20. Thus, the rule has desirable OCs. These stopping probabilities are different from, and should not be confused with, the Type I error probability or power of a frequentist test of hypotheses. In this design, there are early stopping rules, but no hypotheses are being tested.
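A minimal sketch of how the futility criterion Pr(θ S + .20 < θ E | data) < .04 might be evaluated is given below. The Monte Carlo approach, simulation size, and random seed are illustrative assumptions; the chapter describes the boundaries as obtained by simple numerical computations without specifying the algorithm.

```python
# A minimal sketch (assumed Monte Carlo implementation) of the futility criterion
# Pr(theta_S + .20 < theta_E | data) < .04, with theta_S ~ beta(23, 54) and a
# beta(0.30, 0.70) prior on theta_E updated by the interim data on E.
import numpy as np

rng = np.random.default_rng(2)
M = 500_000
theta_S = rng.beta(23, 54, size=M)  # informative prior on the standard therapy S

def pr_improvement(responses: int, n: int) -> float:
    """Posterior probability that theta_E exceeds theta_S by at least .20."""
    theta_E = rng.beta(0.30 + responses, 0.70 + n - responses, size=M)
    return float((theta_S + 0.20 < theta_E).mean())

# Evaluate the criterion at an interim look, e.g., 5 responses in 20 patients;
# accrual stops if the printed value is below the cut-off .04.
print(pr_improvement(5, 20))
```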

The next three sections give examples of Bayesian clinical trial designs, followed by three Bayesian data analyses. All examples arise from stem cell transplantation, cellular therapy, or biotherapy.

Monitoring in a Single-Arm Trial of CTLs for Viruses Due to Immunosuppression

A single-arm phase II trial was conducted of the experimental treatment, E = steroid-resistant cytotoxic T-lymphocytes (CTLs), at a fixed dose of 2 × 10⁵ cells per kg, as treatment for viral infections (cytomegalovirus, adenovirus, or BK virus) that may occur because of immunosuppression following an allogeneic hematopoietic cell transplantation (allo-HCT). There were two coprimary endpoints. Toxicity was defined as a grade 3, 4, or 5 toxicity of any kind attributable to the CTLs within 42 days of infusion, graft-versus-host disease within 40 days of infusion, or cytokine release syndrome (CRS) within 14 days of infusion requiring transfer to intensive care. Response was defined as the patient being alive and in remission at day 30 post CTL infusion. Secondary outcomes included PFS time, OS time, and response at day 100.

The protocol specified a maximum of 120 patients to be treated in up to eight cohorts of 15 patients each. This sample size ensures that if, for example, 36 patients (30%) have a Response with E, then, assuming a beta(.30, .70) prior for θ R = Pr(Response), a posterior 95% ci for θ R would be .22 - .38. Similarly, if 48 patients (40%) experience Toxicity, then, assuming a beta(.40, .60) prior for θ T = Pr(Toxicity), a posterior 95% ci for θ T would be .31 - .49. The method of Thall et al. was used to construct two early stopping rules, one for unacceptably low θ R and one for excessively high θ T, with the two rules applied simultaneously. This design generalizes the approach of Thall by monitoring two events, rather than one. To monitor both Response and Toxicity, a fixed upper limit on θ T was set at .40, and a fixed lower limit on θ R was set at .30. Because Response includes 30-day survival as a subevent, monitoring the parameter θ R includes monitoring the early death rate. In this sense, the early stopping rule for unacceptably low θ R, given below, is both a futility rule and a safety rule.
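These interval statements follow directly from the conjugate beta posteriors. The short sketch below reproduces them using scipy; the equal-tail form of the credible interval is an assumption, since the chapter does not state how the intervals were computed.

```python
# A minimal sketch verifying the quoted posterior 95% credible intervals:
# a beta(.30, .70) prior with 36/120 Responses, and a beta(.40, .60) prior
# with 48/120 Toxicities.
from scipy.stats import beta

post_R = beta(0.30 + 36, 0.70 + 120 - 36)   # posterior for theta_R: beta(36.3, 84.7)
post_T = beta(0.40 + 48, 0.60 + 120 - 48)   # posterior for theta_T: beta(48.4, 72.6)
print(post_R.ppf([0.025, 0.975]))           # approximately [.22, .38]
print(post_T.ppf([0.025, 0.975]))           # approximately [.31, .49]
```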

The four possible elementary events may be denoted by A 1 = [Response and Toxicity], A 2 = [Response and No Toxicity], A 3 = [No Response and Toxicity], and A 4 = [No Response and No Toxicity]. Since Response = [A 1 or A 2] and Toxicity = [A 1 or A 3], these events share the elementary event A 1, which may be considered to be an outcome that is both good and bad. Denote the probabilities of the four elementary events with E by θ E = (θ E1, θ E2, θ E3, θ E4), so θ E,R = θ E1 + θ E2 and θ E,T = θ E1 + θ E3. As a comparator, a historical standard treatment distribution was used, with probabilities θ S = (θ S1, θ S2, θ S3, θ S4), so θ S,R = θ S1 + θ S2 and θ S,T = θ S1 + θ S3. It was assumed that θ E followed a noninformative Dirichlet prior distribution, Dirichlet (.12, .18, .28, .42), which has ESS = 1. This implies that θ E,R followed a beta (.30, .70) prior, and that θ E,T followed a beta (.40, .60) prior. It was assumed that θ S followed a Dirichlet (120, 180, 280, 420) distribution, which has the same four means as θ E but ESS = 1000, so it is highly informative. This implies that θ S,R followed a beta (300, 700) distribution and θ S,T followed a beta (400, 600) distribution. The trial would be stopped early for unacceptably low Response probability θ E,R if Pr(θ E,R < θ S,R | data) > .99, and stopped early for unacceptably high Toxicity probability θ E,T if Pr(θ E,T > θ S,T | data) > .99. These two posterior probability criteria imply that the trial would be stopped early if either [number of patients with Response]/[number of patients evaluated] was less than or equal to 0/15, 3/30, 6/45, 9/60, 13/75, 16/90, or 20/105, or if [number of patients with Toxicity]/[number of patients evaluated] was greater than or equal to 11/15, 19/30, 27/45, 34/60, 41/75, 48/90, or 55/105. This design’s OCs are summarized in Table 4.1. In Scenario 1 of the table, both of the fixed values θ E,R true and θ E,T true are acceptable. In Scenario 2, θ E,R true = .10 is too low. In Scenario 3, θ E,T true = .60 is too high. In Scenario 4, both θ E,R true and θ E,T true are unacceptable. Thus, the only scenario where E has desirable properties is Scenario 1. The design has good OCs, because it has a low incorrect early stopping probability of .06 and a large median sample size of 120 in Scenario 1, and high stopping probabilities, with median sample sizes ranging from 30 to 45, in each of Scenarios 2, 3, and 4.
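Because the beta margins of a Dirichlet distribution are themselves conjugate, each monitoring criterion depends on the interim data only through the corresponding marginal count. The sketch below shows one possible Monte Carlo evaluation of the two stopping criteria; the simulation size and random seed are illustrative, and values at borderline looks may differ slightly from the published boundaries.

```python
# A minimal sketch (assumed Monte Carlo implementation) of the two monitoring rules:
# stop for futility if Pr(theta_E,R < theta_S,R | data) > .99, and stop for toxicity
# if Pr(theta_E,T > theta_S,T | data) > .99.
import numpy as np

rng = np.random.default_rng(3)
M = 500_000
theta_S_R = rng.beta(300, 700, size=M)  # informative historical prior on Response
theta_S_T = rng.beta(400, 600, size=M)  # informative historical prior on Toxicity

def stop_for_low_response(responses: int, n: int) -> bool:
    theta_E_R = rng.beta(0.30 + responses, 0.70 + n - responses, size=M)
    return (theta_E_R < theta_S_R).mean() > 0.99

def stop_for_high_toxicity(toxicities: int, n: int) -> bool:
    theta_E_T = rng.beta(0.40 + toxicities, 0.60 + n - toxicities, size=M)
    return (theta_E_T > theta_S_T).mean() > 0.99

# Example: evaluate both rules after 30 patients (the second cohort of 15)
print(stop_for_low_response(3, 30), stop_for_high_toxicity(19, 30))
```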

Table 4.1
Operating Characteristics, for Maximum Sample Size 120, of the Two Monitoring Rules Applied After Interim Sample Sizes of 15, 30, 45, 60, 75, 90 and 105 in the Trial of CTLs for Viruses Following allo-HCT
Assumed True Outcome Probabilities With E = Steroid-Resistant CTLs

            Joint Probabilities                              Marginal Probabilities
Scenario    θ E,1 true   θ E,2 true   θ E,3 true   θ E,4 true   θ E,R true   θ E,T true   Pr(Stop Early)   Sample Size Quartiles
1           .12          .18          .28          .42          .30          .40          0.06             120, 120, 120
2           .05          .05          .35          .55          .10          .40          1.00             30, 30, 45
3           .12          .18          .48          .22          .30          .60          0.96             30, 45, 60
4           .05          .05          .55          .35          .10          .60          1.00             15, 30, 30

Allo-HCT, Allogeneic hematopoietic cell transplantation; CTLs, cytotoxic T-lymphocytes.

At the end of the trial, planned data analyses included estimating unadjusted distributions of OS and PFS using the method of Kaplan and Meier, and evaluating the relationships of OS and PFS to prognostic covariates by fitting Bayesian piecewise exponential survival regression models.

Optimizing CAR-T Cell Dose for Hematologic Malignancies Using Efficacy-Toxicity Trade-Offs
