General texts concerning statistics ( ; ; ; ; ; ; ; ; ) are available; however, this chapter has been written as a practical guide to common statistical problems encountered when using differing research strategies in toxicologic pathology, and to provide the methodologies available to solve them. First, core issues in mathematical decision-making are discussed, followed by details concerning some of the principles used in making mathematical inferences, and finally by discussions of why a particular procedure or interpretation is recommended. The assumptions that are necessary for procedures to be valid are enumerated, and problems often seen in the practice of toxicology and toxicologic pathology are discussed. A glossary of terms can be found at the end of the chapter.
Since 1960, as it has evolved, the field of toxicologic pathology has become increasingly complex and controversial in both its theory and its practice. As in all other sciences, toxicologic pathology started as a descriptive science, where pathologists described morphological changes seen following accidental exposure to xenobiotics. The need to understand the dose response for adverse effects of specific xenobiotics on tissues and organisms is a core function of predictive toxicologic pathology and safety assessment.
Prediction of adverse effects in humans is most often made when animals are dosed with (or exposed to) chemical or physical agents and the resultant adverse effects are observed at a specific dose. With the accumulation of these results, it is possible to infer, and study, underlying mechanisms of action. Toxicologic pathology has now developed to the mechanistic stage, where active contributions to the field encompass both descriptive and mechanistic studies.
Statistical analysis applied in the field of toxicology has also evolved during the last 50+ years. However, it is imperative that statistical approaches be performed in a competent and ethical manner in order to ensure transparent practices, reproducibility in results, and valid interpretations ( ). Statistical approaches are used to describe the data, conduct hypothesis testing, model observed responses, and perform dimension reduction. Attention to statistical detail should be employed at every aspect of a study, including experimental design, data collection and processing, statistical analysis, interpretation, and reporting of results ( ). These considerations will be described in detail for practitioners of toxicologic pathology throughout the course of this chapter.
Studies continue to be designed and executed to generate results in the form of measurements or observations (data), which can be statistically analyzed giving a mathematical probability of significance. The toxicologic pathologist is well trained to assist in interpretation of the biological importance of analyzed data from inferential experiments. Mathematical significance is irrelevant if there is no biological significance. In addition, the peculiarities of toxicologic pathology data need to be understood before procedures are selected and employed for analysis. There are characteristics of toxicology experiments in animals that impact their extrapolation to human populations. First, relatively small sample sets of data are collected from the members of an experimental animal population, which is not actually the human or target animal population of interest. Second, sample data are often censored on a basis other than by the investigator's design. Censoring occurs when the data points are not obtained as planned. This censoring usually results from biological factors, such as death or morbidity of an animal, or from a logistical factor (e.g., equipment failure or a tissue not collected during necropsy). Third, the conditions under which experiments are conducted are extremely varied. In pharmacology, the possible conditions by which a chemical or physical agent may interact with a person are limited to a small range of doses given via a single route of exposure over a short course of treatment to a defined patient population. In toxicologic pathology, however, the investigator seeks to control all test variables, such as dose, route, time span, and subject population. Finally, time frames available to identify, assess, and give opinions regarding issues are limited by practical and economic factors. This frequently means that there is no time to repeat a critical study. Therefore, a true iterative trial-and-error approach to toxicologic pathology is not possible.
Observations and measurements are essential to all aspects of toxicologic pathology. Even when detailing a new case, descriptive observations can be collected regarding lesion number and distribution as well as sample color, size, and texture. These observations are then summarized as to severity and distribution, and the disease process affecting a specific tissue can be defined as a morphological diagnosis.
All measurements and observations produce scalar measurements, ordered severities, or categories. Each of these data types has implications for the type of statistical analysis that should be undertaken and for how inference can be made using that analysis. We shall describe three types of outputs, an understanding of each of which is necessary to select the correct method for deriving inference regarding the relationship between a treatment and an observed effect. Data are collected on the basis of their association with a treatment, intended or otherwise, as an effect (a property) that is measured in the experimental subjects of a study. These identifiers, i.e., treatment and effect, are termed variables. Treatment variables previously selected and controlled by the researcher are termed independent variables. Effect variables, such as measurements and observations of weight, life span, and number of neoplasms that are believed to depend on the treatment being studied, are termed dependent variables.
All possible measures of a given set of variables in all the possible subjects that exist are termed the population for those variables. Such a population of variables cannot be truly measured. For example, one would have to obtain, treat, and measure the weights of all the Fischer 344 rats that were, are, or ever will be. Instead, we deal with a representative group (sample). If the sample of data is appropriately collected and of sufficient size, it serves to provide good estimates of the characteristics of the parent population from which it is drawn.
Regardless of the type of data, optimal design and appropriate interpretation of experiments require that the researcher understands both the biological and technological underpinnings of the system being studied and of the data being generated. From the point of view of the statistician, it is vitally important that the experimenter both knows and is able to communicate the nature of the data and understands its limitations. The types of data generated include quantal (e.g., dead or alive), categorical (e.g., morphological diagnosis), ordinal (e.g., lesion severity), and continuous interval data (e.g., clinical pathology data).
Categories, also referred to as classes , are quintessential to toxicologic pathology. The most common category with which the toxicologic pathologist is familiar is the definitive diagnosis. Here, there may be a number of categories (diagnoses) into which the toxicologic pathologist places his or her observations (e.g., hepatocellular carcinoma, chronic interstitial nephritis). The categories only allow accumulation of “whole” events. It is not possible to have a partial diagnosis or make an “average” diagnosis. Either the diagnosis is made, or it is not. These data are grouped on the basis of name or category; hence they are usually termed “nominal” or “categorical” data.
A special type of category results in only one of two outcomes: yes or no. Typical examples include live or dead, pregnant or not, etc. These data are referred to as dichotomous , quantal , or binary , and are often seen in safety studies. Again, there is only one choice, as partial choices cannot be made (i.e., an animal cannot be both dead and alive).
Toxicologic pathologists are also familiar with the use of scales to express degrees of severity. These qualitative scales are not measured, but the scale shows an internal relationship. For example, severity may be scored as mild (+), which is less than moderate (++), which is less than severe (+++). Because the severity is ordered, the resulting data are often called “ordinal” in nature. Nonparametric statistical analysis of such data is required to make inferences with respect to cause and effect.
Measurements of a scalar quantity use a recognized scale (e.g., grams) to define the places of various data points. Measurements can be either continuous , such as length and weight, or discontinuous , like white blood cell (WBC) numbers and other hematological parameters. Continuous measurements are those where any value within the scale may be assigned (e.g., for weight, the scale can be registered in grams, or ever-smaller fractions of grams), while discontinuous measurements produce whole number values. For practical purposes, some discontinuous measurements can be treated as if they were continuous numbers. A partial white cell cannot be recorded, but because there are so many cells, the data approximate a continuum. Situations where discontinuous data can be considered as a continuum for the purpose of analysis will be discussed later.
Data can be recorded as proportions, ratios, and rates. Examples include the percentage of subjects affected with a specific tumor, the male:female ratio, and the incidence of occurrence. These data are continuous, as fractional values are possible. However, these quantities are seldom directly measured and are not usually normally distributed. These data require distribution-free (i.e., nonparametric) statistical analysis, unless a specified mathematical distribution can be approximated for the data set. Analyses of these data will be discussed later in this chapter.
Proportions can be misleading without reference to the absolute values from which they were calculated. For example, a “50% affected” rate may represent one of two animals affected, or one million affected in a population of two million. Misleading claims regarding effectiveness can be made using proportions without reference to the numbers contained in the study population.
A summary of data types is presented in Table 16.1 .
| Classification | Type | Example |
|---|---|---|
| Continuous scale | Scalar | Body weight |
| | Ranked (ordinal) | Severity of a lesion |
| Discontinuous scale | Scalar | Weeks until the first observation of a tumor in a carcinogenicity study |
| | Ranked (ordinal) | Clinical observations |
| | Attributes (nominal) | Eye colors in fruit flies |
| | Quantal (nominal) | Dead or alive; present or absent |
| Frequency distribution | Normal | Body weights |
| | Bimodal | Some clinical chemistry parameters |
| | Others | Time to capacitation |
Continuous variables are those which can, at least theoretically, assume any of an infinite number of values between any two fixed points (such as measurements of body weight between 2.0 and 3.0 kg). Discontinuous variables, meanwhile, are those that can have only certain fixed values, with no possible intermediate values (such as a count of five or six dead animals, but never a value in between).
Statistical methods are based on specific assumptions. Parametric statistics, those that are most familiar to the majority of scientists, have more stringent underlying assumptions than do nonparametric statistics. Among the underlying assumptions for many parametric statistical methods (such as the analysis of variance [ANOVA]) is that the data are continuous. Parametric statistical analyses assume that the data come from populations with distributions that can be modeled with a predetermined set of parameters. As a rule, the normal distribution is assumed for parametric statistical analyses. Independence of cases and equality of variances across groups are also required for parametric approaches. Nonparametric techniques should be used whenever the requirements for parametric tests cannot be verified or are not reasonably expected to be true. For example, continuous data such as cadmium concentrations in kidney tissue from an exposed versus an unexposed population could be studied using nonparametric statistics when the assumptions for parametric analysis cannot be met ( ; ). When their assumptions are satisfied, parametric tests are more powerful than nonparametric tests and therefore require smaller sample sizes.
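The choice between the two approaches can be illustrated with the cadmium example above. The following is a minimal sketch, not taken from this chapter; the data values, sample sizes, and group names are invented for illustration. It simply contrasts a two-sample t-test with its distribution-free counterpart, the Mann-Whitney U test, on skewed (log-normal) data.

```python
# Minimal sketch: parametric vs. nonparametric comparison of two groups.
# Values, sample sizes, and group labels are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Skewed (log-normal) data violate the normality assumption of the t-test.
unexposed = rng.lognormal(mean=0.0, sigma=0.6, size=12)   # kidney Cd, ug/g
exposed = rng.lognormal(mean=0.7, sigma=0.6, size=12)

# Parametric approach: two-sample t-test (assumes normality, equal variances).
t_stat, t_p = stats.ttest_ind(exposed, unexposed)

# Nonparametric (distribution-free) alternative: Mann-Whitney U test.
u_stat, u_p = stats.mannwhitneyu(exposed, unexposed, alternative="two-sided")

print(f"t-test:       P = {t_p:.4f}")
print(f"Mann-Whitney: P = {u_p:.4f}")

# By convention, H0 (no treatment effect) is rejected when P < 0.05.
```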
Biological variation is central to all our lives. Diversity in our own species is recognized not only as visible characteristics, such as height, but also as functional characteristics, such as biotransformation abilities. Unfortunately, biological diversity interferes with efforts to test treatment effects, even when the experiment is designed and controlled a priori. No matter how inbred study animals are, and consequently how alike their physiological responses are likely to be, there is always a range of response displayed in measurements made on these animals. This fact has been confirmed in monozygotic human twins and in genetically identical organisms ( ).
The normal or Gaussian distribution is an essential underpinning of many commonly used statistical analyses. This distribution is described as a bell-shaped curve ( Figure 16.1 ). It is the background of “noise” against which observations are made. Mathematics can help clarify whether the results seen in an experiment are a result of biological noise or of a treatment-related signal. Just as the experimenter cannot be sure that the treatment did have an effect, statistical analyses do not give a definite yes or no answer, but rather render a probability statement regarding the likelihood that the treatment is responsible for inducing the effect.
The mathematics used in the analysis results in a probability that the variability in results is caused by a biological variation (i.e., by chance) and not by the treatment ( Figure 16.2 ). The experimenter can then use statistical testing to evaluate the evidence against a predetermined null effect. This decision point is referred to as rejecting the null hypothesis ( H 0 ), where the H 0 is that the variability of effects is due to normal biological variation and not the treatment, and hence that the groups are the same. By convention, we reject this H 0 when the probability of making a false rejection is 5% or less ( P < .05). Hypothesis testing will be discussed in more detail later.
It is essential that a professional who firmly understands statistical concepts interpret any analysis of study results. These concepts include the nature and value of different types of data, and the difference between biological significance and statistical significance .
To illustrate the importance of biological versus statistical significance, we shall consider the four possible combinations of these two different types of significance, which produces the relationship shown in Table 16.2 .
| | Statistical significance: No | Statistical significance: Yes |
|---|---|---|
| Biological significance: No | Case I | Case II |
| Biological significance: Yes | Case III | Case IV |
Cases IV and I give us no problems, for the answers are the same statistically and biologically. However, cases II and III present problems. In Case II (the false positive), we have a circumstance where there is statistical significance in the measured difference between treated and control groups, but there is no true biological significance to the finding. This is not an uncommon happening, as shown by the case of clinical chemistry parameters with values falling just outside the statistically defined range. This existence of statistical significance in the absence of biological relevance is called a Type I error by statisticians, and the probability of this happening is called the alpha level. When this type of error occurs, the H 0 (i.e., no difference) is rejected when it is true.
In Case III (the false negative), we have no statistical significance, but the differences between groups are nonetheless biologically/toxicologically significant. Statisticians call this situation a Type II error , and the probability of such an error happening by random chance is called the beta level. In this situation, the H 0 is accepted when it is false. An example of this second situation is when a very rare tumor type is observed in a few treated animals.
In both Case II and Case III, numerical analysis, no matter how well done, is no substitute for professional judgment. Along with this, however, one must have a feeling for the different types of data and for the value or relative merit of each. Note that the two error types interact, and in determining sample size, we need to specify both α and β levels ( ).
The power of a statistical test, which is also known as the sensitivity , is the probability that a test results in rejection of a null hypothesis, H 0 , when some other hypothesis, H 1 , is valid. In other words, power is the probability that the test will reject H 0 when H 0 is actually false (i.e., the likelihood of not committing a Type II error [making a false-negative decision]). The probability of a Type II error (i.e., the false-negative rate) is β, while power is equal to 1 − β. In general, power is a function of the possible distributions, often determined by a parameter, under the alternative hypothesis. Increasing the power decreases the chance of making a Type II error.
Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size (see, for example, Table 16.3 ; a computational sketch follows the table). In addition, the concept of power is useful for comparing different statistical testing procedures, such as a parametric and a nonparametric test of the same hypothesis. The larger the power required, the larger the necessary sample size. Stated another way, small sample sizes give a lower power and heighten the chance of not detecting a true difference. Conventionally, power should be at least 80%.
| Background tumor incidence | P^a | 0.95 | 0.90 | 0.80 | 0.70 | 0.60 | 0.50 | 0.40 | 0.30 | 0.20 | 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.30 | 0.90 | 10 | 12 | 18 | 31 | 46 | 102 | 389 | | | |
| | 0.50 | 6 | 6 | 9 | 12 | 22 | 32 | 123 | | | |
| 0.20 | 0.90 | 8 | 10 | 12 | 18 | 30 | 42 | 88 | 320 | | |
| | 0.50 | 5 | 5 | 6 | 9 | 12 | 19 | 28 | 101 | | |
| 0.10 | 0.90 | 6 | 8 | 10 | 12 | 17 | 25 | 33 | 65 | 214 | |
| | 0.50 | 3 | 3 | 5 | 6 | 9 | 11 | 17 | 31 | 68 | |
| 0.05 | 0.90 | 5 | 6 | 8 | 10 | 13 | 18 | 25 | 35 | 76 | 464 |
| | 0.50 | 3 | 3 | 5 | 6 | 7 | 9 | 12 | 19 | 24 | 147 |
| 0.01 | 0.90 | 5 | 5 | 7 | 8 | 10 | 13 | 19 | 27 | 46 | 114 |
| | 0.50 | 3 | 3 | 5 | 5 | 6 | 8 | 10 | 13 | 25 | 56 |
a P = Power for each comparison of treatment group with background tumor incidence.
b Column headings (0.95–0.10) are treatment group tumor incidences; the numbers in the table body indicate the minimum treatment group size necessary to detect a statistical difference from the background tumor incidence for the given power and alpha level.
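As a rough illustration of how such group sizes arise, the following is a minimal sketch of a standard normal-approximation sample-size calculation for comparing two proportions. The one-sided alpha of 0.05 and the formula shown are assumptions for illustration; this is not necessarily the method used to generate Table 16.3, so exact agreement with the table is not expected.

```python
# Minimal sketch: normal-approximation sample size per group for detecting an
# increase from a background tumor incidence p0 to a treatment incidence p1.
# Alpha, power, and the one-sided formulation are illustrative assumptions.
import math
from scipy.stats import norm

def n_per_group(p0: float, p1: float, alpha: float = 0.05, power: float = 0.90) -> int:
    """Two-proportion comparison, one-sided normal approximation."""
    z_a = norm.ppf(1 - alpha)   # critical value for the false-positive rate
    z_b = norm.ppf(power)       # quantile corresponding to 1 - beta
    p_bar = (p0 + p1) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    return math.ceil(num / (p1 - p0) ** 2)

# Example: background incidence 5%, treatment incidence 30%, power 0.90.
# Compare with the corresponding region of Table 16.3 (approximate only).
print(n_per_group(0.05, 0.30, alpha=0.05, power=0.90))
```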
By now, the reader may be puzzled as to how we agree that a result is the truth. Even when it is established that an effect is of biological importance, the illustration above shows that there are always some false positive and negative outcomes based on the set false-positive level, the sample size, and the resulting false negative probability. In addition to uncertainties of biological importance based on clinical judgment and statistical probability, there are other sources of uncertainty that should be recognized.
The generation of data through laboratory tests (particularly immunologic, hematologic, and enzymatic tests) inherently produces some false-negative and false-positive results. These outcomes may occur because of the test itself, or because of the disease classification criteria used for defining the normal and abnormal ranges for the parameter of interest. For example, an immunocytochemical stain may not label all tissues adequately or may produce too much background noise. In this example, false negatives and false positives occur not as a result of statistical error, but as a result of error inherent to the test. The concept of false negatives and false positives can be followed through the reagents used in the test, and so on.
The toxicologic pathologist requires knowledge of true positives and true negatives to interpret the relevance of statistical significance to biological outcomes. The true-positive rate, the probability of the test demonstrating positive findings in diseased subjects, is termed sensitivity. The true-negative rate, the probability of a negative test result in the absence of disease, is termed specificity. In addition, it is important to know the ability of the test to give the same answer over a number of runs (precision) and the ability to give data that correspond to the true value of a measured parameter (accuracy).
If one reexamines the table given for the combinations of biological and statistical significance, the relationship between test results and disease state may be illustrated effectively.
Using the information in Table 16.4 , one can now define sensitivity, specificity, and other test predictions using a series of simple equations:
| Test result | Total | Disease present: No | Disease present: Yes |
|---|---|---|---|
| No | (a + b) | Case I (a) | Case II (b) |
| Yes | (c + d) | Case III (c) | Case IV (d) |
| Total | (a + b + c + d) | (a + c) | (b + d) |
Determination of the test sensitivity (probability of a positive test when the disease is present): Sensitivity = d / (b + d) ( Eq. 16.1 ), where b is the number of diseased subjects with a negative test result and d is the number of diseased subjects with a positive test result.
Determination of the test specificity (probability of a negative test when the disease is absent): Specificity = a / (a + c) ( Eq. 16.2 ), where a is the number of subjects without disease with a negative test result and c is the number of subjects without disease with a positive test result.
Determination of the test false-positive rate (probability of a positive test when the disease is absent): False-positive rate = c / (a + c) ( Eq. 16.3 ), where a is the number of subjects without disease with a negative test result and c is the number of subjects without disease with a positive test result.
Determination of the test false-negative rate (probability of a negative test when the disease is present): False-negative rate = b / (b + d) ( Eq. 16.4 ), where b is the number of diseased subjects with a negative test result and d is the number of diseased subjects with a positive test result.
Determination of the test positive predictive value (probability that the disease is present if the test is positive): Positive predictive value = d / (c + d) ( Eq. 16.5 ), where c is the number of subjects without disease with a positive test result and d is the number of diseased subjects with a positive test result.
Determination of the test negative predictive value (probability that the disease is absent given a negative test): Negative predictive value = a / (a + b) ( Eq. 16.6 ), where a is the number of subjects without disease with a negative test result and b is the number of diseased subjects with a negative test result.
As precision and accuracy require data other than those given above in Table 16.2 , they will not be addressed further in this chapter (See Clinical Pathology Testing , Vol 1, Chap 10 ) .
Although all these values have utility, it is the positive predictive value ( Eq. 16.5 ) that gives the experimenter the most information concerning the reliability of the data.
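A minimal sketch of these calculations follows; the counts assigned to a, b, c, and d are hypothetical values chosen only for illustration.

```python
# Minimal sketch: diagnostic test metrics from the 2 x 2 table of Table 16.4.
# The counts below are hypothetical and chosen only for illustration.
a = 85   # no disease, negative test (true negatives)
b = 5    # disease,   negative test (false negatives)
c = 10   # no disease, positive test (false positives)
d = 90   # disease,   positive test (true positives)

sensitivity = d / (b + d)          # Eq. 16.1
specificity = a / (a + c)          # Eq. 16.2
false_positive_rate = c / (a + c)  # Eq. 16.3
false_negative_rate = b / (b + d)  # Eq. 16.4
ppv = d / (c + d)                  # Eq. 16.5, positive predictive value
npv = a / (a + b)                  # Eq. 16.6, negative predictive value

for name, value in [("Sensitivity", sensitivity), ("Specificity", specificity),
                    ("False-positive rate", false_positive_rate),
                    ("False-negative rate", false_negative_rate),
                    ("Positive predictive value", ppv),
                    ("Negative predictive value", npv)]:
    print(f"{name:>26s}: {value:.3f}")
```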
Confirmation that treatment causes an effect requires an understanding of the underlying mechanism and proof of its validity. At the same time, it is important that we realize that not finding a good mathematical correlation or suitable significance associated with a treatment and an effect does not prove that the two are not associated. In other words, failure to show a statistical correlation does not mean that a treatment does not cause an effect. At best, it gives us a certain level of confidence that under the conditions of the current test, the treatment and the effects were not associated. These points will be discussed in greater detail in the Assumptions Table ( Table 16.30 ) at the end of this chapter, along with other common pitfalls and shortcomings associated with the method.
Many aspects of experimental design are specific to the practice of toxicologic pathology. Before we look at general concepts for the development of experimental designs, the following aspects should first be considered.
Frequently, the data gathered from specific measurements of animal characteristics are such that there is wide variability in the data. Often such wide variability is not present in a control or low-dose group, while variance may be inflated in an intermediate dose group. That is, the standard deviation (SD) associated with the measurements from this intermediate group may be larger than that noted for the low-dose cohort. Unequal variance across dose groups is referred to as heteroscedasticity. In the face of such a data set, concluding that there is no biological effect based on a failure to show a statistically significant effect, without accounting for heteroscedasticity, might well be erroneous. When working with novel endpoints, scientists should perform pilot studies in order to characterize data variability. These preliminary studies can be used to inform the primary study design. However, caution should be used when interpreting outcomes from a new endpoint when the underlying variability in the data is not well understood.
In designing experiments, one should keep in mind the potential effect of involuntary censoring on sample size. In other words, though a study might start with five subjects per group, this small size provides no margin for error should any die before the study is ended, thus precluding sample collection and data analysis. Beginning a study with “just enough” experimental units per group frequently leaves too few at the end to allow meaningful statistical analysis. Therefore, study designs should have allowances to ensure that there will be an adequate number of subjects in each group when establishing group sizes.
It is certainly possible to pool the data from several identical toxicological studies to permit a global statistical evaluation. One approach to this is metaanalysis , considered in detail later in this chapter. For example, after first having performed an acute inhalation study where only three treatment group animals survived to the point at which a critical measure (e.g., analysis of blood samples) is performed, there would be insufficient data to perform a meaningful statistical analysis. The protocol would then have to be repeated with new control and treatment group animals from the same source using the same experimental design and test article. At the end, after assuring (by statistical calculations) that the two sets of data are comparable, the data from survivors of the second study could be combined, or pooled, with those from survivors of the first experiment. However, this approach would require a greater degree of effort than a single up-front study with larger groups. In addition, combining samples across data sets leads to increased variability in the experimental groups and decreased statistical power compared to a single study with larger group sizes.
Another frequently overlooked design option in toxicology is the use of an unbalanced design, where various group sizes are employed for different levels of treatment. There is no requirement that each group in a study, be it control, low dose, intermediate dose, or high dose, have an equal number of experimental units assigned to it. Indeed, there are frequently good reasons to assign more experimental units to some groups than to others.
All the major statistical methodologies have provisions to adjust for such inequalities, within certain limits. This change in the number of subjects within one or more experimental groups is done either to compensate for losses due to possible deaths during the study, to give more sensitivity in detecting subtle effects at levels close to an effect threshold, or to provide more confidence in the assertion that no effect exists. It is always good practice to make sure that there is adequate statistical power to detect an effect size of interest before conducting the experiment. Scientists should pay close attention to statistical power when using unbalanced designs for groups with small sample sizes, to ensure that enough samples will be available for a proper statistical analysis at the end of the experiment.
We are frequently confronted with the situation where an undesired variable influences our experimental results in a nonrandom fashion. Such a variable is called a confounding variable . Its presence, as discussed earlier, makes the clear attribution and analysis of effects at best difficult, and at worst impossible. Sometimes, such confounding variables are the result of conscious design or management decisions, such as the use of different instruments, personnel, facilities, or procedures for different test groups within the same study ( ). Occasionally, however, such a confounding variable is the result of unintentional factors or actions, in which case it is sometimes called a lurking variable . Such variables are almost always the result of standard operating procedures (SOPs) being violated. Common examples include waterers that are not connected to a rack of animals over a weekend, a set of racks not cleaned as frequently as others, or provision of rations from a contaminated batch of feed. A discussion of confounding variables can be found in Study Design and Conduct Considerations that optimize Pathology Data Generation, Reporting, and Overall Study Outcome (Vol 1, Chap 28 ).
The experimental unit in toxicology encompasses a wide variety of possibilities. It may be cells, plates of microorganisms, individual animals, litters of animals, etc. The importance of clearly defining the experimental unit is that the number of such units per group is the n, which is used in statistical calculations or analyses, because the value of n critically affects such calculations. The experimental unit is the unit that receives the treatment and yields a response that is measured and becomes a datum. A distinction should be made between biological replicates and technical replicates. Biological replicates refer to the smallest experimental unit that independently receives the treatment and are therefore sometimes referred to as “true replicates.” Technical replicates represent repeated measurements of the same unit that describe the variability of the experimental protocol. In many cases, the technical replicates can be averaged over the same biological sample in order to produce the values for the biological replicates for statistical analysis.
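As a minimal sketch of this last point, technical replicates can be averaged per animal so that the animal, not the repeated measurement, is the unit of analysis. The column names and values below are invented for illustration.

```python
# Minimal sketch: average technical replicates so each biological replicate
# (here, each animal) contributes a single value to the statistical analysis.
# Column names and values are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "animal_id": ["A1", "A1", "A1", "A2", "A2", "A2", "B1", "B1", "B1"],
    "group": ["control"] * 6 + ["treated"] * 3,
    "alt_u_per_l": [31.2, 29.8, 30.5, 42.0, 41.1, 43.3, 88.4, 90.2, 87.9],  # triplicate assays
})

# One row per animal: technical replicates collapsed to their mean.
per_animal = (raw.groupby(["animal_id", "group"], as_index=False)["alt_u_per_l"]
                 .mean())
print(per_animal)
```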
Toxicological experiments generally are designed to answer two questions. The first question is whether or not an agent results in an effect on a biological system. The second question, never far behind, is how much of an effect is present. It has become increasingly desirable that the results and conclusions of studies aimed at assessing the effects of environmental agents and biopharmaceuticals be as clear and unequivocal as possible. It is essential that every experiment and study yield as much information as possible, and that the results of each study have the greatest possible chance of answering the questions for which the experiment was conducted. The statistical aspects of such efforts, so far as they are aimed at structuring experiments to maximize the possibilities of success of answering the questions above, are called experimental design.
The five basic statistical principles of experimental design are control, replication, randomization, concurrent (local) control, and balance ( ). The goal of the five principles of experimental design is statistical efficiency and the economizing of resources . The single most important initial step in achieving such an outcome is to define clearly the objective of the study, so that a clear statement can be made regarding what questions are being asked. The five principles may be summarized as follows.
Control in experimentation is central to determining treatment effect. Control is the term used to describe efforts made by the researcher to remove any known systematic influence on the experiment. That is, all variables except for the independent variables under study are removed. Failure to control for other systematic influences on the dependent variables means that the researcher cannot determine whether the effect was due to the experimental independent variable(s) or some other source of systematic variation . It should be noted that there are numerous sources of systematic variation ranging from the obvious, such as gender of the test species, to those that are easy to overlook, such as differences in handling by two different animal care attendants.
All inferential experiments have one or more control groups. These groups are assumed to be free of all systematic sources of variation, including the independent variable(s). It may be necessary to have both negative and positive controls, where in addition to the group lacking the independent variables, another group is treated with an agent having a known positive outcome for the purposes of comparison. Multiple negative controls may be used where a systematic source of variation is expected. For example, a negative and a pair-fed control may be used when treatment-related anorexia is known from previous studies. It is of interest that the French term for control is témoin, which translates as “witness.” Obviously, a witness should not be biased to the outcome of an investigation.
Any treatment must be applied to more than one experimental unit (animal, plate of cells, litter of offspring, etc.). This provides more accuracy in the measurement of a response than can be obtained from a single observation, since underlying experimental errors and biological variability tend to be averaged over the replicates. It also supplies an estimate of the experimental error derived from the variability between each of the measurements taken (or replicates). In practice, this means that an experiment should have enough experimental units in each treatment group (that is, a large enough number, or n ) so that reasonably sensitive statistical analysis of data can be performed. The estimation of sample size is addressed in detail later in this chapter.
A distinction needs to be made between replication and duplication. Replication is the method of using different experimental units to increase the experimental number. An example of replication is giving the same severity rank (diagnosis, or score) to similar lesions from two mice. On the other hand, duplication is characterized by repeated measurements on the same experimental unit (e.g., coded [“blind”] reanalysis of lesions followed by giving a second severity rank to the same lesion). The purpose of duplication is to gain an understanding of precision of the measurements, in this case the severity rank. The distinction between replication and duplication is important to make when the toxicologic pathologist discusses the need—or not—to reassess sections of a study in a blinded manner.
Randomization is practiced to ensure that every treatment shall have its fair share of results from among the spectrum of possible outcomes. It also serves to allow the toxicologic pathologist to proceed as if the assumption of independence is valid, meaning that there is no avoidable (i.e., known) systematic bias in how one obtains data. Animals are often randomized by body weight in most general toxicology studies. However, mechanistic studies with focused hypotheses for particular endpoints may require a more sophisticated randomization scheme based on multiple endpoints, such as resting glucose levels and bone density. More specific aspects of randomization are discussed in detail in Section B below.
Comparisons between treatments should be made to the maximum extent possible between experimental units from the same closely defined population. Therefore, animals used for the control group should come from the same source, lot, age, gender, etc., as test group animals. Except for the treatment being evaluated, test and control animals should be maintained and handled in exactly the same manner.
A true concurrent control is one that is identical in every manner with the treatment groups except for the presence of the treatment being evaluated. This means that all manipulations, including gavage with equivalent volumes of vehicle or exposure to equivalent rates of air exchanges in an inhalation chamber, should be duplicated in control groups just as they occur in treatment groups.
When several different factors are being evaluated simultaneously, the experiment should be laid out in such a way that the contributions of the different factors can be separately distinguished and estimated. Different types of experimental design help clarify the importance and interaction of the various factors. Statistical testing is less sensitive to unequal group variance, and power is greatest, when group sizes are similar. It may be tempting to place more animals in the treated group to better see the effect; after all, we know that the untreated animals will be normal. However, such uneven weighting among control and treatment groups weakens statistical analysis of the experiment. Counterbalancing refers to the procedure of avoiding confounding among variables by including every possible sequence of a factor in a study. For designs in which the same subject is presented with multiple conditions over time (within-subjects or repeated-measures designs), it is important to control for the effects of nuisance variables through counterbalancing. For example, in a study of the effects of two treatments on a particular outcome, half of the subjects can receive the first treatment before the second, while the other half receive the treatments in the reverse order, so that treatment order is not a confounding factor in the study (a brief sketch of such an assignment follows).
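The following is a minimal sketch of such a counterbalanced assignment; the subject identifiers, group size, and random seed are invented for illustration.

```python
# Minimal sketch: counterbalance the order of two treatments (A, B) across
# subjects so that half receive A then B and half receive B then A.
# Subject IDs and the seed are hypothetical.
import random

subjects = [f"rat_{i:02d}" for i in range(1, 13)]   # 12 subjects
random.seed(7)
random.shuffle(subjects)

orders = {}
for i, subj in enumerate(subjects):
    orders[subj] = ("A", "B") if i % 2 == 0 else ("B", "A")

for subj in sorted(orders):
    print(subj, "->", " then ".join(orders[subj]))
```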
There are multiple facets of any study that may affect its ability to detect an effect of a treatment. The most important with respect to interpretation of toxicologic pathology data are considered here.
It is important to have enough animals in the study so that a rigorous biological and statistical evaluation can be conducted. Ideally, the responses of interest should be rare in untreated and vehicle-treated control animals but should be evoked with reasonable ease by appropriate treatments. However, in practice, it is not uncommon to find a range of baseline responses in any given study. Common examples discerned by pathology evaluation include tissue degeneration and neoplasia. Some species or specific strains, perhaps because of inappropriate diets or gender-specific factors, have high background incidences of certain nonneoplastic and neoplastic conditions (e.g., chronic progressive nephropathy of F344 rats, hepatic tumors in mice) which make increases both difficult to detect and problematic to interpret. Guidelines from the Organization for Economic Co-operation and Development (OECD) recommend at least 20 animals per sex for each dose group along with a concurrent control for chronic toxicity studies. For carcinogenicity studies, there should be at least 50 animals per sex for each group ( ).
Sampling is an essential step upon which any meaningful experimental result depends. Sampling may involve the selection of which individual data points will be collected, which animals to collect tissue samples from, or taking a sample of a diet mix for chemical analysis.
There are three assumptions about sampling that are common to most of the statistical analysis techniques used in toxicology: the sample is collected without bias, each member of a sample is collected independently of the others, and members of a sample are collected with replacement. Precluding bias, both intentional and unintentional, means that at the time a sample is selected from a population, each portion of that population has an equal chance of being selected. Independence means that the selection of any portion of the sample is not affected by, and does not affect, the selection of any other portion. Finally, sampling with replacement means that, in theory, after each portion is selected and measured, it is returned to the total sample pool and thus has the opportunity to be selected again. This last assumption is a corollary of the assumption of independence. Violation of this assumption, which is almost always the case in toxicologic pathology and all the life sciences (where tissue samples cannot be reattached), does not have serious consequences if the total pool from which samples are selected is sufficiently large (30 or greater) that the chance of reselecting that portion is relatively small.
There are four major types of sampling methods: random, stratified, systematic , and cluster .
Random sampling is by far the most commonly employed sampling method. It stresses fulfillment of the assumption of avoiding bias. When the entire pool of possibilities is mixed (or randomized), then the members of the group are selected in the order that they are drawn from the pool.
Stratified sampling is performed by first dividing the entire pool of data into subsets (or strata) and then conducting randomized sampling from within each stratum. This method is employed when the total pool contains subsets that are distinctly different but within which the members are similar. An example is a large batch of a powdered pesticide in which it is desired to determine the nature of the particle size distribution: larger pieces or particles are on the top, progressively smaller particles have settled lower in the container and are at the bottom, and the material has been packed and compressed into aggregates. To determine a representative answer to whether there is a particle size distribution as hypothesized, appropriate samples from each subset (in this case, each layer) should be selected, mixed, and randomly sampled. This method is used quite commonly in diet studies.
In systematic sampling , a sample is taken at set intervals. For example, every fifth container of reagent is sampled, or a sample is collected from a fixed sample point in a flowing stream at regular time intervals. This approach is most commonly employed in quality assurance and quality control procedures.
In cluster sampling , the pool is already divided into numerous separate groups, such as bottles of tablets. Small sets from these groups (such as several bottles of tablets) are selected, and a few individual units (i.e., tablets) from each group (i.e., bottle) are selected for analysis. The result is a cluster of measures from several groups. Like systematic sampling, this method is commonly used in quality control or in environmental studies when the effort and expense of physically collecting a small group of units is significant.
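The four approaches can be contrasted in a brief sketch; the population of containers, the strata (lots), the sampling interval, and the seed below are invented for illustration.

```python
# Minimal sketch: simple random, stratified, systematic, and cluster sampling
# from a hypothetical pool of 100 numbered containers split into 4 lots.
import random

random.seed(11)
containers = [(f"lot_{lot}", f"unit_{lot}_{i:02d}")
              for lot in range(1, 5) for i in range(1, 26)]

# 1. Simple random sampling: every unit has an equal chance of selection.
simple = random.sample(containers, k=10)

# 2. Stratified sampling: randomize within each lot (stratum) separately.
stratified = []
for lot in ("lot_1", "lot_2", "lot_3", "lot_4"):
    stratum = [c for c in containers if c[0] == lot]
    stratified.extend(random.sample(stratum, k=3))

# 3. Systematic sampling: every 5th unit from a random starting point.
start = random.randrange(5)
systematic = containers[start::5]

# 4. Cluster sampling: pick 2 whole lots, then a few units from each.
chosen_lots = random.sample(["lot_1", "lot_2", "lot_3", "lot_4"], k=2)
cluster = [c for lot in chosen_lots
           for c in random.sample([x for x in containers if x[0] == lot], k=4)]

print(len(simple), len(stratified), len(systematic), len(cluster))
```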
In classical studies where toxicologic pathology is used, sampling arises in a practical sense in a limited number of situations. First, sampling often occurs by selecting a subset of animals or test systems to make some measurement at intervals during a study, which either destroys or stresses the measured system or is expensive. Examples include interim necropsies in a chronic study or collecting multiple blood samples from some animals during a study. Second, samples may be taken to analyze inhalation chamber atmospheres to characterize aerosol distributions with a new generation system. Third, samples of diet to which test material has been added may be collected. Fourth, quality control samples may be performed on an analytical chemistry operation by having duplicate analyses performed on some materials. In addition, duplicates, replicates, and blanks are used to ensure that the results can be relied upon; by using such samples, the specificity, sensitivity, accuracy, precision, limit of quantitation, and ruggedness can be determined. Finally, samples of selected data may be required to audit for quality assurance purposes.
The selection of dose levels and dosing methodology is a very important and often controversial aspect of study design. In screening studies aimed at hazard identification , it is normal to test at dose levels higher than those to which humans likely will be exposed, but not at levels so high that overt toxicity occurs, in order to avoid requiring unreasonably large numbers of animals. One of the tested doses must elicit toxicity. A range of doses is usually tested to guard against the possibility of a misjudgment in selecting an appropriate high dose. A dose range is required because the metabolic pathways at high doses may differ markedly from those at lower doses. In studies aimed more at risk estimation, increased frequency of lower doses may be tested to obtain better information on the shape of the dose–response curve. Unfortunately, in practice, the shape of the curve in the very low dose range often is not known. This lack is particularly important in assessing risk to cancer where the incidence of neoplasia that may be detected in a typical rodent bioassay is on the order of 2%. For the purposes of risk assessment, risk estimates applied when setting human exposure limits are at least 1000-fold lower (i.e., 10 −5 ). For this reason, carcinogenicity studies do not have much ability to characterize the shape of the curve for an assessment at lower dose (or exposure) levels.
The sample size is obviously an important determinant of the precision of the findings. The calculation of the appropriate number depends on the size of the effect one desires to detect. Very small effects may require a larger n (number of animals) per group, while obvious effects may permit smaller group sizes. The animal number is also affected by the false-positive rate (α level or Type I error, which is the probability of an effect being detected when none exists). Similarly, the false-negative rate (β level or Type II error, or the probability of no effect being detected when one of exactly the critical size exists) influences the number of experimental units required. Finally, the variability of the animals' response influences the number required.
Tables relating the numbers of animals required to α and β values of a given size are given in many references in the Suggested Reading. Software is also available for this purpose. As a rule of thumb, to reduce the critical difference by a factor of n for a given α and β, the number of animals required will have to increase by a factor of n²; for example, halving the detectable difference requires roughly four times as many animals per group.
The duration of the toxicity study is generally driven by the nature of the clinical trial that is being supported; however, some studies or dose groups have to be terminated prior to the scheduled sacrifice due to animal care and use considerations. It is important not to terminate a study too early, especially where the incidence of the effects of interest is strongly age related. The death datum is a powerful quantal endpoint giving a definite answer (yes or no). However, it is also important not to allow a study to continue for too long (i.e., beyond the point where further time on study likely will not provide any useful incremental information). For nonfatal conditions, the ideal stopping point on average is to necropsy the animals when the prevalence of death is around 50%, as greater mortality than this often invalidates the assumptions used in the statistical analysis.
To detect a treatment difference with accuracy, it is important that the groups being compared are as homogeneous as possible with respect to all other (nontreatment) variables, whether or not such variables are known or suspected causes of the response. Unfortunately, there are a number of reasons why groups may not be homogeneous. For example, suppose that there is another known important cause of the response for which the animals vary; in other words, there are two, not one, systematic sources of variation (factors) in the experiment, even though the experiment was designed for one source (the treatment). Such a situation may arise when the group includes a mixture of hyper- and hyporesponders to the treatment. Because the randomization scheme did not account for the hyper- and hyporesponses seen in this experiment, it is possible that the treated group may have a higher proportion of hyperresponders. This bias will lead to a higher response in the treatment group regardless of whether the treatment has an effect or not. Even if the proportion of hyperresponders is the same as in the controls, it will be more difficult to detect an effect of treatment because of the increased between-animal variability.
If the second factor (degree of responsiveness) is known before the experiment begins, it should be taken into account in both the design and analysis of the study. In the design, it can be used as a blocking factor to correct for potential allocation bias . This blocking factor will ensure that animals with hyper- or hyposensitivity are allocated equally, or in the correct proportion, to control and treated groups. In the analysis, the second factor should be treated as a stratifying variable, with separate treatment-control comparisons made at each level, and the comparisons combined for an overall test of difference. This is discussed later, where the factorial design is provided as an example of a more complex experimental design to investigate the separate effects of multiple treatments.
Randomization is the arrangement of experimental units to simulate a chance distribution, reduce the interference by irrelevant variables, and yield unbiased statistical data. Randomization is a control against bias in assignment of subjects to test groups. If randomization is not carried out, one can never be sure whether or not treatment-control differences are due to the treatment or rather to confounding by other systematic sources of variation. In other words, unless randomization is used, we cannot determine whether the experimental treatment had an effect, or whether an observed effect was due to uneven allocation of the animals among experimental groups by chance.
The need for randomization applies not only to the allocation of the animals to the treatment but also to any method, person, or practice that can materially affect the recorded response. The same random number scheme that is used to assign animals to treatment groups can be used to determine cage position, order of weighing, order of bleeding for clinical chemistry, order of necropsy at termination, the choice of the technician attending and the pathologist evaluating the gross necropsy, and so on. The location of the animal in the room in which it is kept may affect the animal's response. An example is the strong relationship between incidence of retinal atrophy in albino rats and closeness to the lighting source. Systematic differences in cage position should be avoided, preferably via randomization.
Randomization is the act of assigning a number of items (e.g., plates of bacteria or test animals) to groups in such a manner that there is an equal chance for any one item to end up in any one group. A variation on randomization is censored randomization, which ensures that the groups are equivalent in some aspect after the assignment process is complete. The most common example of a censored randomization is one in which it is ensured that the body weights of test animals in each group are not significantly different from those in the other groups. This is done by analyzing group weights both for homogeneity of variance and by ANOVA after animal assignment, then randomizing again if there is a significant difference at some nominal level, such as P < .10. The process is repeated until there is no significant difference among groups. There are several methods for actually performing the randomization process. The three most commonly used are card assignment, use of a random number table, and use of a computerized algorithm.
For the card-based method, individual identification numbers for items (plates or animals, for example) are placed on separate index cards. These cards are then shuffled and placed one at a time, in succession, into piles corresponding to the required test groups. The result is a random group assignment.
The random number table method requires only that unique numbers be assigned to the test subjects and that one have access to a random number table. One sets up a table with a column for each group to which subjects are to be assigned and starts from the head of any one column of numbers in the random number table; each time the table is used, a new starting point should be chosen. As digits are found that correspond to a subject number, that subject is assigned to a group (its identifying number is entered in a column), proceeding from left to right and filling one row at a time. After a number has been assigned to an animal, any duplication of that number is ignored, and as many successive columns of random numbers are used as are required to complete the process.
The third (and now most common) method is to use a random number generator that is built into a calculator or computer program. Procedures for generating these are generally documented in user manuals.
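A minimal sketch of such a computerized assignment follows; the group names, group size, and seed are arbitrary choices for illustration.

```python
# Minimal sketch: seeded, reproducible random assignment of animals to groups.
# Group names, group size, and the seed are arbitrary illustrations.
import random

animals = [f"animal_{i:03d}" for i in range(1, 41)]   # 40 animals
groups = ["control", "low", "mid", "high"]            # 10 per group

random.seed(2024)        # record the seed so the allocation is reproducible
random.shuffle(animals)

allocation = {g: animals[i * 10:(i + 1) * 10] for i, g in enumerate(groups)}
for g in groups:
    print(g, allocation[g])
```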
While historical control data can be useful on occasion, a properly designed study demands that a relevant concurrent control group be included with which results for the test group can be compared ( ). The principle that like should be compared with like, apart from treatment, demands that control animals should be randomized from the same source as treatment animals. An experiment involving treatment of a compound in a solvent, which included only an untreated control group, would indicate that any differences observed could only be attributed to the compound-solvent combination. To determine the specific effects of the compound, a comparison group given the solvent only, by the same route of administration, would be required.
A priori selection of statistical methodology, as opposed to the post hoc approach, is as significant a portion of the process of protocol development and experimental design as any other and can measurably enhance the value of the experiment or study. Prior selection of null hypotheses and statistical methodologies is essential for proper design of other portions of a protocol such as the number of animals per group or the sampling intervals for body weight. The analysis of any set of data is dictated to a large extent by the manner in which the data are obtained.
Statistical testing relies on the probability that a particular result would be found according to chance, where the risk is referred to as the significance level. A 0.05 significance level indicates that 5% of the statistically significant results would be expected to be false positives due to chance. Therefore, it is not appropriate to repeatedly switch statistical procedures while analyzing a data set until a statistically significant or desirable result is found. This approach would have a very high probability of generating a misleading result, since it fails to acknowledge that a result obtained in this way would almost certainly be found after a large enough number of attempts. This unsound approach to data analysis is often referred to as “data dredging,” “data snooping,” or “ p hacking.”
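The following is a minimal, entirely hypothetical simulation of why repeated testing inflates the false-positive rate: when many comparisons are run on data containing no true effect, some will reach P < .05 by chance alone.

```python
# Minimal sketch: simulate 20 independent endpoints with NO true treatment
# effect and count how many reach "significance" at alpha = 0.05 by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha = 0.05
n_endpoints = 20
false_positives = 0

for _ in range(n_endpoints):
    control = rng.normal(loc=100.0, scale=10.0, size=10)
    treated = rng.normal(loc=100.0, scale=10.0, size=10)   # same distribution
    _, p = stats.ttest_ind(treated, control)
    if p < alpha:
        false_positives += 1

# The expected number of chance "hits" is roughly alpha * n_endpoints = 1.
print(f"{false_positives} of {n_endpoints} null comparisons were 'significant'")
```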
Data must be generated and recorded in an unbiased manner. It is recognized that many options exist for analysis, which will be discussed below. However, biased data generation, compiled with the knowledge of the operator, represents scientific fraud.
Good Laboratory Practices (GLPs) have been implemented to ensure that the data analyzed represent the data generated from the experiment. GLPs have three main components: requirements for personnel, requirements for facilities, and requirements to create SOPs and records of events undertaken in the study (Volume 1: Pathology and GLPs, Quality Control and Quality Assurance in a global environment). GLP practices guard against “data snooping” by requiring that investigators conform to a preplanned statistical analysis.
Observer bias can occur when the observer is aware of the treatment. This knowledge may consciously influence the observer through true bias of observation, or unconsciously as reading bias, also referred to as work-up bias. In slide reading (examination) bias, the observer increases his or her attention to a lesion once it has been noted in the sections, resulting in an effect known as diagnostic drift. In some situations, it may be necessary to reread all the slides in a blinded fashion and in random order to be sure that diagnostic drift is avoided.
Elimination bias occurs when data are eliminated for various reasons. Usually, such protocol violations will be recorded; however, a valid analysis cannot be conducted unless one can distinguish between animals that were examined and did not have the relevant response and animals that were not examined. Therefore, it is important to clearly identify which data are missing and for what reason.
In many instances, toxicologic pathologists practice an informed (nonblinded) approach to histopathological evaluation. When using an informed approach, the pathologist has full information about the dose, exposure groups, and other information pertaining to the animals in the experiment. Informed analyses may be needed in order to establish how lesions produced in response to treatment differ from the normal biology seen in concurrent controls, to conduct a study without exorbitant cost, or to complete evaluations in a timely manner. The danger of informed analysis is potential subjectivity in diagnoses through observer bias, and the practice tends to contradict commonly held principles of the scientific method and statistical theory. Nevertheless, informed analysis is widely used and has become the generally recommended practice in toxicologic pathology ( ; ; ; ).
When informed evaluations are used, it is important to utilize procedures that diminish potential observer bias and promote reproducibility in diagnoses. For instance, an organization might utilize an independent peer review process, clear nomenclature that has been published and broadly accepted, and incorporation of supporting data such as the criteria used for experimental design, body and organ weight data, and clinical pathology results. It is important that informed approaches be performed by well-trained pathologists without any conflicts of interest.
Informed analyses have great utility when they are implemented with careful planning and conduct. Nevertheless, blinded studies should be used to the extent possible, and informed analyses should be supported by blinded evaluations of subsets of the samples and by reevaluations. Blinded evaluations may be required in circumstances in which some of the conditions above cannot be met, for evaluations of subtle lesions, or when lesion severity is similar between control and treatment conditions.
Understanding the concept of censoring is essential to the design of experiments in toxicologic pathology. Censoring involves the exclusion of measurements from certain experimental units, or of the experimental units themselves, from consideration in data analysis or inclusion in the experiment. Censoring may occur prior to initiation of an experiment as a planned procedure, during the course of an experiment as an unplanned procedure (e.g., through the death of animals), or after the conclusion of an experiment, when data usually are excluded because they represent some form of outlier response.
In practice, a priori censoring in toxicology studies occurs in the assignment of experimental units, usually animals, to test groups. The most familiar example is the practice of assigning test animals among groups in acute, subacute, subchronic, and chronic studies, where the results of otherwise random assignments are evaluated for body weights of the assigned members. If the mean weights are found to differ by some preestablished criterion, such as a marginally significant ( P < .10) difference found by ANOVA, then members are reassigned, or censored, to achieve comparability across groups in terms of starting body weights. Such a procedure of animal assignment to groups is known as a censored randomization .
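A minimal sketch of such a censored randomization is given below, assuming 40 hypothetical animals, four equal groups, and the P < .10 one-way ANOVA criterion described above; the weight data, group count, and acceptance loop are illustrative assumptions.

```python
import random
from scipy.stats import f_oneway  # one-way ANOVA

def censored_randomization(weights, n_groups, p_criterion=0.10, seed=1, max_tries=1000):
    """Randomly assign animals to groups, repeating (censoring) the assignment
    until group mean body weights are comparable by the ANOVA criterion."""
    rng = random.Random(seed)
    ids = list(range(len(weights)))
    group_size = len(ids) // n_groups
    for _ in range(max_tries):
        rng.shuffle(ids)
        groups = [ids[i * group_size:(i + 1) * group_size] for i in range(n_groups)]
        _, p = f_oneway(*[[weights[i] for i in g] for g in groups])
        if p >= p_criterion:   # no marginally significant weight difference: accept
            return groups
    raise RuntimeError("no acceptable randomization found")

# Example: 40 hypothetical animals with simulated starting body weights (g)
rng = random.Random(0)
body_weights = [rng.gauss(250, 15) for _ in range(40)]
groups = censored_randomization(body_weights, n_groups=4)
```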
The first precise or calculable aspect encountered when designing an experiment is determining test and control group sizes sufficient to allow an adequate level of confidence in the results of a study. In other words, calculations are needed to ensure that the study design, together with the statistical tests used, can detect a true difference (effect) when one is present; this ability of a test to detect an effect is its power.
The power of a statistical test is the probability that the test rejects the null hypothesis, H 0 , when some other hypothesis, H 1 , is valid. This is termed the power of the test with respect to the alternative hypothesis H 1 . If there is a set of possible alternative hypotheses, the power, regarded as a function of H 1 , is termed the power function of the statistical test. When the alternatives are indexed by a single parameter, simple graphical presentation of the power function (a two-dimensional line graph) is possible; if the parameter is a vector, one needs to visualize a power surface (a chart describing a shape function in three dimensions). When the alternative coincides with the null hypothesis, the power function takes the value of the significance level. A test's power is greatest when the probability of a type II error, the probability of missing a true effect (i.e., a false negative), is lowest. Specified powers can be calculated for tests in any specific or general situation.
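For a concrete (and simplified) illustration, the sketch below tabulates an approximate power function for a two-sided, two-sample z-test with equal group sizes; the standard deviation, group size, and significance level are assumed values. Note that the power at δ = 0 equals the significance level, as described above.

```python
from scipy.stats import norm

def power_two_sample(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with n animals per group."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = delta / (sigma * (2 / n) ** 0.5)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# Power as a function of the single parameter (treatment difference) indexing H1
for delta in (0, 5, 10, 15, 20):
    print(delta, round(power_two_sample(delta, sigma=12, n=20), 2))
```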
A general rule to keep in mind is that the more stringent (lower) the significance level desired, the greater the necessary sample size. Greater protection from an incorrect conclusion requires greater effort. More subjects are needed for a 1% level test than for a 5% level test. Two-tailed tests require larger sample sizes than one-tailed tests to maintain the same power; assessing two directions at the same time requires a greater investment. The smaller the critical effect size, the larger the necessary sample size; any difference can be significant if the sample size is large enough. The larger the power required, the larger the necessary sample size. The smaller the sample size, the lower the power. The requirements and means of calculating necessary sample size depend on the desired (or practical) comparative sizes of treatment and control groups.
The sample size n can be calculated, for example, for equal-sized test and control groups, using the following equation ( Eq. 16.7 ).

Calculation of the number of animals required for an experiment:

n = 2σ²( z 1 + z 2 )²/δ²

where z 1 is the z critical value corresponding to the desired level of confidence 1 − α, given by z 1-α/2 ; z 2 is the z critical value corresponding to the probability of correctly rejecting H 0 (the power), given by z 1-β ; σ² is the common variance, typically derived from historical data; and δ is the desired treatment difference ( δ = μ 2 − μ 1 ).
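As a worked illustration of Eq. 16.7 (the σ, δ, significance level, and power values below are assumptions chosen for the example, not recommendations), the required group size can be computed directly:

```python
import math
from scipy.stats import norm

def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Animals required per group for equal-sized test and control groups (Eq. 16.7).
    sigma is the common standard deviation (sigma**2 is the common variance)."""
    z1 = norm.ppf(1 - alpha / 2)   # z critical value for confidence 1 - alpha (two-sided)
    z2 = norm.ppf(power)           # z critical value for power 1 - beta
    n = 2 * sigma**2 * (z1 + z2)**2 / delta**2
    return math.ceil(n)            # round up to whole animals

# e.g., detect a difference of 10 units when the common standard deviation is 12
print(sample_size_per_group(sigma=12, delta=10))   # about 23 animals per group
```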
There are six basic experimental design types used in toxicology: the completely randomized, randomized block , matched pairs , Latin square , factorial , and nested designs. Other designs are essentially combinations of these basic types and are only rarely employed in toxicologic pathology. Before examining these six basic types, however, we must first examine the basic concept of blocking.
Blocking is the arrangement or sorting of the members of a population, such as all test animals within the pool available for a study, into treatment groups based on certain characteristics that may, but will not necessarily, alter the experimental outcome. Characteristics frequently selected for blocking are those that may cause a treatment to produce a differential effect, such as genetic background, age, sex, and overall activity level. The process of blocking attempts to distribute the members of each blocked group evenly among the experimental groups.
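The sketch below illustrates the idea (the data and group names are assumed for the example, not taken from the text): animals are first sorted into blocks on a characteristic such as sex, and the members of each block are then distributed evenly and at random among the treatment groups.

```python
import random
from collections import defaultdict

# Hypothetical pool: 40 animals, alternating male/female for the example
animals = [{"id": f"A{i:02d}", "sex": "M" if i % 2 else "F"} for i in range(1, 41)]
groups = ["placebo", "test substance"]
rng = random.Random(7)

# Sort the pool into blocks on the blocking characteristic (here, sex)
blocks = defaultdict(list)
for a in animals:
    blocks[a["sex"]].append(a)

# Spread each block's members evenly, and in random order, over the groups
assignment = defaultdict(list)
for members in blocks.values():
    rng.shuffle(members)
    for i, animal in enumerate(members):
        assignment[groups[i % len(groups)]].append(animal["id"])
```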
A completely randomized design arranges experimental units to simulate a chance distribution. Here, animals are randomly assigned to any treatment group. There are no attempts made to evaluate the effects of any other source of variability except the treatment. This is the most common type of design and is particularly common in acute and subacute toxicologic pathology studies ( Table 16.5 ).
| Treatment | Number of animals |
| --- | --- |
| Placebo | 50 |
| Test substance | 50 |
Randomization is aimed at spreading out the effects of undetectable or unsuspected characteristics in a population of animals, or some portion of that population. The randomized block design merges the two concepts of randomization and blocking to produce the first experimental design that addresses a source of systematic variation other than the intended treatment.
This type of design requires that each treatment group contain at least one member of each recognized block (e.g., animals of a certain age), with the exact members of each block assigned in an unbiased, random fashion. In toxicologic pathology studies in which application of the treatment may take some time, the experiment may be blocked so that the first group of animals processed is randomly assigned, in equal numbers, to each and every treatment group. For example, the duration of surgery required to implant an experimental medical device may be such that blocking is necessary to account for systematic variation caused by the exact time of implantation ( Table 16.6 ).
| Gender | Treatment: Placebo | Treatment: Test substance |
| --- | --- | --- |
| Male | 250 | 250 |
| Female | 250 | 250 |
A matched pairs design is a special case of the randomized block design. It is used when the experiment has only two treatment conditions and participants can be grouped into pairs based on some blocking variable. Within each pair, participants are then randomly assigned to different treatments.
Table 16.7 shows a matched pairs design for an experiment. The 1000 participants are grouped into 500 matched pairs. Each pair is matched based on two factors, age and gender. For example, Pair 1 could be two women aged 21, Pair 2 might be two women aged 22, and so on.
| Pair | Treatment: Placebo | Treatment: Test substance |
| --- | --- | --- |
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| – | – | – |
| 499 | 1 | 1 |
| 500 | 1 | 1 |
Like the completely randomized design and the randomized block design, the matched pairs design uses randomization to control for confounding factors. In this example, however, the matched pairs design improves on both: of the three options, only the matched pairs design explicitly controls for the potential effects of two lurking (extraneous) variables, age and gender.
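A minimal sketch of such an assignment is shown below, assuming a hypothetical pool of 1000 participants; matching consecutive participants after sorting on gender and age is a simplification used only for illustration.

```python
import random

rng = random.Random(42)

# Hypothetical participants with the two matching factors, age and gender
participants = [{"id": i, "age": rng.randint(21, 60), "gender": rng.choice(["F", "M"])}
                for i in range(1, 1001)]

# Sort on the matching factors and take consecutive participants as matched pairs
participants.sort(key=lambda p: (p["gender"], p["age"]))
pairs = [participants[i:i + 2] for i in range(0, len(participants), 2)]

# Within each pair, randomly assign one member to each treatment condition
assignment = []
for pair in pairs:
    rng.shuffle(pair)
    assignment.append({"placebo": pair[0]["id"], "test substance": pair[1]["id"]})
```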