Three classes of statistics are used commonly in psychiatric research: psychometric statistics, assessing the reliability and validity of diagnostic interviews or rating scales; descriptive statistics, used to describe a group of subjects on these clinical and demographic variables; and inferential statistics, used to make probabilistic statements about the effects of treatments or other variables on groups of subjects.
The more statistical tests that are performed in a study, the greater are the chances of finding one or more that will be significant, when, in fact, there is not a true effect in the population from which the sample was drawn (that is, a false-positive result).
Researchers choose the most appropriate statistical method for their particular research question, depending on the level of measurement of their variables. The simplest method that adequately answers the research question should be chosen.
Statistical power analysis determines how many subjects are needed to minimize false-negative results in inferential statistics.
The word “statistics” derives from a term used for “numbers describing the state”; that is, the original statistics were numbers used by rulers of states to better understand their populations. Thus, the first statistics were simply counts of things (such as the population of towns, or the amount of grain produced by a particular town). Today, we call these kinds of simple counts or averages “descriptive statistics,” and these are used in almost every research study to describe the demographic and clinical characteristics of the participants in a particular study.
Modern psychiatric research also involves two additional classes of statistics: psychometric statistics and inferential statistics. Most psychiatric studies will involve all three classes of statistics.
In psychiatric research, demographic variables (such as gender and height) can be measured objectively. However, most of our studies also require the measurement of variables that are not as objective (e.g., clinical diagnoses and rating scales of psychopathology). Here, we usually cannot measure directly the characteristics we are really interested in, so instead, we rely on a subject's score on either self-report or on investigator-administered scales. Psychometrics is concerned with how reproducible a subject's score is (i.e., how reliable it is), and how closely it measures the characteristic we are really interested in (i.e., how valid it is).
Psychiatric researchers study relatively small samples of subjects, usually with the intent to generalize their findings to the larger population from which their sample was drawn. This is the realm of inferential statistics, which is based on probability theory. Researchers are reporting inferential statistics when you see the tell-tale P-values and asterisks denoting statistical significance in the text and tables of the Results sections.
All three kinds of statistics (descriptive, psychometric, and inferential) are present in most published papers in psychiatric research, and are considered in a particular order, for the following reasons. First, without reliable and valid measures, neither of the other kinds of statistics will be meaningful. For example, if we rely solely on clinicians' judgments of patient improvement, but the study clinicians rarely agree on whether a particular patient has improved, any additional statistics will be meaningless. Likewise, a measure can be very reliable, as with a patient's cell phone number, but not valid for any of the purposes of the study. Second, descriptive statistics are needed to summarize the many individual subjects' scores into summary statistics (such as counts, proportions, averages [or means], and standard deviations) that can then be compared between groups. Inferential statistics would be impossible without first having these summary statistics. Third, without inferential statistics and their computed probability values, the researcher cannot generalize any positive findings beyond the particular group being studied (and this is, after all, the usual goal of a research study).
Table 62-1 illustrates the characteristics of each class, as well as the order in which the classes must be considered, since each successive class rests on the foundation of the preceding class.
| Class of Statistic | Purpose | Examples |
|---|---|---|
| Psychometric Statistics | Measures of the reliability and validity of rating scales and other measures. Once measures are shown to have adequate reliability and validity, they can then be used as descriptive statistics. | |
| Descriptive Statistics | Statistics used to summarize the scores of many subjects in a single count or average to describe the group as a whole. After descriptive statistics have been computed for one or more samples, they can then be used to compute inferential statistics to attempt to generalize these results to the larger population from which these samples were drawn. | |
| Inferential Statistics | Statistics used to compute probability estimates that generalize descriptive statistics to the larger population from which the samples were drawn. | |
To provide a concrete example of these sometimes abstract concepts, consider a fictional study based on the simplest research design in psychiatric research: a randomized double-blind trial of a new drug versus a placebo pill for obsessive-compulsive disorder (OCD).
Figures 62-1 to 62-3 contain the annotated Method and Results sections for this fictional study, showing how the various psychometric statistics are presented in the Method section, while descriptive statistics are presented in the Method and Results sections, and inferential statistics are presented in the Results section (for definitions of terms used in these figures, refer to the section on statistical terms and their definitions ).
Researchers should test only a few carefully selected hypotheses (specified before collecting their data!) if their obtained P-values are to have any meaning. The more statistical tests you perform, the greater the chance of finding at least one significant by chance alone (i.e., a false-positive result). Table 62-2 illustrates this phenomenon.
| Number of Statistical Tests Performed at P < 0.05 | Probability of at Least One False-Positive Finding* |
|---|---|
| 1 | 0.05 |
| 2 | 0.09 |
| 3 | 0.14 |
| 4 | 0.18 |
| 5 | 0.22 |
| 6 | 0.26 |
| 7 | 0.30 |
| 8 | 0.33 |
| 9 | 0.36 |
| 10 | 0.41 |
| 15 | 0.53 |
| 20 | 0.64 |
| 30 | 0.78 |
| 40 | 0.87 |
| 50 | 0.92 |
One should not be impressed by a researcher who conducts eight t-tests, finds one significant at P < 0.05, and proceeds to interpret the findings as confirming his theory. Table 62-2 shows us that with eight statistical tests at P < 0.05, the researcher had a 33% chance of finding at least one result significant by chance alone.
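The probabilities in Table 62-2 follow from a simple rule: if each test has a 5% false-positive rate, the chance that at least one of n independent tests produces a false positive is 1 − (1 − 0.05)^n. A minimal sketch (note that a few of the table's entries appear to use slightly different rounding):

```python
def p_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests,
    each conducted at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

# With 8 tests at P < 0.05, roughly a 1-in-3 chance of a spurious "finding"
print(p_at_least_one_false_positive(8))  # ≈ 0.34 (Table 62-2 lists 0.33)
```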
The two key determinants in choosing a statistical method are (1) your research goal, and (2) the level of measurement of your outcome (or dependent) variable(s). Table 62-3 illustrates the key characteristics of the various levels of measurement and provides examples of each.
| Level of Measurement | Description of Level | Examples |
|---|---|---|
| Continuous (also known as interval or ratio) | A scale on which there are approximately equal intervals between scores | Beck Depression Scale; diastolic blood pressure; age of subject |
| Ordinal (also known as ranks) | A scale in which scores are arranged in order, but intervals between scores may not be equal | Class ranking in school; any continuous measure that has been converted to ranks |
| Nominal (also known as categorical) | Scores are simply names for different groups, but the scores do not imply magnitude. Often used to define groups based on experimental treatment or diagnosis | Diagnostic category; ethnicity; zip code of residence |
| Dichotomous (also known as binary) | A special case of a nominal variable in which there are only two possible values | Gender (M or F); survival (Y or N); response (Y or N) |
Once the level of measurement of your outcome variable has been determined, you will decide whether your research question will require you to compare two or more different groups of subjects, or to compare variables within a single group of subjects. Tables 62-4 and 62-5 will help you choose the appropriate statistical method once you have made these decisions. (Note that these tables consider only univariate statistical tests; multivariate tests are beyond the scope of this chapter.)
| Your Goal | Continuous Outcome Measure | Dichotomous Outcome Measure | Ranked Outcome Measure |
|---|---|---|---|
| Compare two groups | t-test of mean difference | 2 × 2 contingency table of proportions tested by χ² | Mann–Whitney U test of mean ranks |
| Compare three or more groups | Analysis of variance (ANOVA) | Contingency table of proportions tested by χ² | Kruskal–Wallis test of mean ranks |
| Compare two or more groups while controlling for one or more other variables measured in both groups | Analysis of covariance (ANCOVA) | Mantel–Haenszel test (not applicable for more than two groups) | N/A |
| Compare two or more groups that are stratified on some other variable | Factorial ANOVA | Mantel–Haenszel test (not applicable for more than two groups) | N/A |
| Compare two or more groups that are measured on repeated occasions | Mixed (or split-plot) ANOVA or MMRM | MMRM | N/A |
| Your Goal | Continuous Outcome Measure | Dichotomous Outcome Measure | Ranked Outcome Measure |
|---|---|---|---|
| Test association of a continuous variable with: | Pearson correlation coefficient (r) | Point-biserial correlation coefficient | N/A |
| Test association of a dichotomous variable with: | Point-biserial correlation coefficient | Phi correlation coefficient | N/A |
| Test association of a ranked variable with: | N/A | N/A | Spearman rank correlation |
| Predict value of outcome measure from one or more continuous or dichotomous predictor variables | Linear regression | Logistic regression | N/A |
| Compare two or more groups that are measured on repeated occasions | Mixed (or split-plot) ANOVA | N/A | N/A |
| Compare change in an outcome variable measured on two occasions | Dependent t-test | McNemar test | Wilcoxon test |
| Compare change in an outcome variable measured on three or more occasions | One-way repeated-measures ANOVA or MMRM | MMRM with dichotomous outcomes | Friedman test |
For example, if you want to conduct a study comparing a new drug to two control conditions, and your outcome measure is a continuous rating scale, Table 62-4 indicates that you would typically use the analysis of variance (ANOVA) to analyze your data. If you wanted to assess the association of two continuous measures of dissociation and anxiety in a single depressed sample, Table 62-5 indicates that you would usually select the Pearson correlation coefficient. (Note that the procedures listed for Ranked outcome measures are those typically referred to as “non-parametric tests.”)
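As an illustration of what the ANOVA in Table 62-4 computes, the F-statistic is the ratio of between-group variance to within-group variance. A minimal pure-Python sketch (the data are made up for illustration; in practice a statistical package would also return the P-value for the F-statistic):

```python
from statistics import mean

def one_way_anova_F(*groups):
    """F statistic for a one-way ANOVA comparing the means of several groups."""
    k = len(groups)                          # number of groups
    N = sum(len(g) for g in groups)          # total number of observations
    grand = mean(x for g in groups for x in g)
    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: how far each score is from its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)        # between-group mean square
    ms_within = ss_within / (N - k)          # within-group mean square (error)
    return ms_between / ms_within

print(one_way_anova_F([1, 2, 3], [2, 3, 4], [3, 4, 5]))  # → 3.0
```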
The final consideration in selecting a statistical procedure is whether subjects are measured on more than one occasion, as in the typical longitudinal clinical trial. In cases such as these, special statistical methods for “repeated measures” are used. Traditionally, a repeated-measures analysis of variance has been the most-used method of analyzing a study comparing two or more treatment conditions in a longitudinal design. However, when, as is commonly the case, there are missing data due to subject dropout, the preferred analysis method is a more sophisticated approach referred to as the Mixed-effects Model Repeated Measures (MMRM) analysis.
A non-significant P-value is meaningless if the researcher studied too few subjects, resulting in low statistical power. Tables 62-6 to 62-8 will help you estimate the number of subjects required to have a reasonable chance (usually set at 80%, or power = 0.80) of detecting a true effect (or, put another way, a 20% chance of a false-negative finding).
| Effect Size (Difference between Means) | Power = 0.50 | Power = 0.60 | Power = 0.70 | Power = 0.80* |
|---|---|---|---|---|
| 0.20 SD (“small”) | 193 | 246 | 310 | 393 |
| 0.50 SD (“medium”) | 32 | 40 | 50 | 64 |
| 0.80 SD (“large”) | 13 | 16 | 20 | 26 |
| 1.20 SD | 7 | 8 | 10 | 12 |
| Effect Size (“w” Statistic) | Power = 0.50 | Power = 0.60 | Power = 0.70 | Power = 0.80* | Examples of Values of “w” |
|---|---|---|---|---|---|
| 0.10 (“small”) | 384 | 490 | 617 | 785 | 45% vs. 55% |
| 0.30 (“medium”) | 43 | 54 | 69 | 87 | 35% vs. 65% |
| 0.50 (“large”) | 15 | 20 | 25 | 31 | 25% vs. 75% |
| 0.70 | 8 | 10 | 13 | 16 | 15% vs. 85% |
| Effect Size (Pearson's r) | Power = 0.50 | Power = 0.60 | Power = 0.70 | Power = 0.80* |
|---|---|---|---|---|
| 0.10 (“small”) | 385 | 490 | 616 | 783 |
| 0.30 (“medium”) | 42 | 53 | 67 | 85 |
| 0.50 (“large”) | 15 | 18 | 23 | 28 |
| 0.70 | 7 | 9 | 10 | 12 |
For example, a researcher reports that she has compared two groups of 12 depressed patients, and found that a new drug was not significantly better than placebo at P < 0.05, by t-test. However, this negative result is not informative, because Table 62-6 indicates that with only 12 subjects per group, this researcher had statistical power of less than 0.80 to detect even a “large” effect; that is, even if the drug were truly effective, this study had less than a 50/50 chance of finding a significant difference.
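The per-group sample sizes in Table 62-6 can be approximated with the standard normal-approximation formula for a two-sided, two-sample t-test, n ≈ 2[(z₁₋α/₂ + z_power)/d]², where d is the effect size in standard-deviation units. A sketch using only the Python standard library (the normal approximation runs slightly below the exact t-test values, e.g., 63 rather than 64 per group for a medium effect):

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate subjects needed per group for a two-sided two-sample t-test,
    using the normal approximation n = 2 * ((z_alpha + z_power) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ≈ 0.84 for power = 0.80
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.20))  # "small" effect → 393 per group, as in Table 62-6
```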
Power analysis is now required as part of virtually all grant and institutional review board (IRB) applications.
Analysis of covariance (ANCOVA) is a form of ANOVA that tests the significance of differences between group means by adjusting for initial differences among the groups on one or more covariates. As an example, a psychologist interested in studying the effectiveness of a behavioral weight loss program versus self-dieting includes pre-treatment weights as a covariate.
Analysis of variance (ANOVA) is the optimal test of the significance of differences among the means of three or more independent groups. As an example, if a medical researcher wants to compare the effects of three or more different drugs on a single dependent measure, he or she would compute a one-way ANOVA. The more complex, factorial ANOVA also tests for interaction effects between multiple factors. For example, if the two factors being tested were “drug/placebo” and “male/female,” the ANOVA interaction test may find that the drug is more effective than the placebo in the female subjects only. The significance of the analysis of variance is tested by the F-statistic.
Repeated-measures analysis of variance is the optimal test of significance for comparing continuous variables obtained through repeated measurements of the same subjects (because each subject's scores are usually correlated, a regular ANOVA would give results that are “too significant”). An experimenter may select a repeated-measures design because such designs are generally more sensitive to treatment effects (i.e., they have higher power), since score differences between subjects are ignored.
The Bonferroni correction is a conservative method of reducing the chance of false-positive findings by conducting each of a set of statistical tests at a more conservative P-value. The standard Bonferroni correction divides the nominal P-value (say P < 0.05) by the total number of statistical tests being conducted. For example, with 10 t-tests conducted, each would be tested at P < 0.005 (i.e., 0.05/10) to determine significance.
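For example, a sketch of applying the correction to a set of hypothetical P-values:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return which P-values remain significant after dividing the nominal
    alpha by the number of tests (the Bonferroni correction)."""
    threshold = alpha / len(p_values)  # e.g., 0.05 / 4 = 0.0125
    return [p < threshold for p in p_values]

# Four hypothetical tests: only the first survives the corrected threshold
print(bonferroni_significant([0.001, 0.020, 0.040, 0.300]))
# → [True, False, False, False]
```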
Canonical correlation is a generalization of multiple regression to the case of multiple independent variables and multiple dependent variables. It is rarely used today, except in neuroimaging studies with hundreds or thousands of correlated measurements. It is considered a multivariate statistical procedure because many inter-correlated variables are analyzed simultaneously.
Cluster analysis is a data-reduction technique used to group subjects into subgroups (or “clusters”) based on their similarities or differences on a set of variables. This technique answers questions such as, “Do my subjects fall into subgroups?” and “What variables give a profile that distinguishes subgroups of my subjects?” A simple rule of thumb is “cluster analysis groups people, while factor analysis groups variables.” It is considered a multivariate statistical procedure because many inter-correlated variables are analyzed simultaneously.
A confidence interval does more than simply report that our observed mean difference between two groups is 2.5 points, significant at P < 0.05; it is far more informative to report that the 95% confidence interval around our observed mean difference is 0.6 to 4.4. Since 0 (the null hypothesis value of the mean difference) is not included in the 95% confidence interval, we know at a glance that the difference is significant at P < 0.05, but we also learn the range of plausible values of the actual mean difference between the groups, from as low as 0.6 point to as high as 4.4 points (with 95% confidence). In the case of odds ratios, a confidence interval of 0.60 to 4.40 would not be significant at P < 0.05, because the null hypothesis would state that the odds ratio is 1.00, and this is included in the computed 95% confidence interval. Many journals require the reporting of confidence intervals instead of P-values alone.
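The arithmetic behind such an interval is simple: estimate ± z × standard error, with z ≈ 1.96 for 95% confidence. A sketch using the mean-difference example above (the standard error of 0.97 is an assumed value, chosen so the sketch reproduces the 0.6 to 4.4 interval):

```python
from statistics import NormalDist

def ci95(estimate, standard_error):
    """95% confidence interval: estimate ± 1.96 × standard error."""
    z = NormalDist().inv_cdf(0.975)  # ≈ 1.96
    return (estimate - z * standard_error, estimate + z * standard_error)

low, high = ci95(2.5, 0.97)            # mean difference 2.5, assumed SE 0.97
print(round(low, 1), round(high, 1))   # → 0.6 4.4; 0 is excluded, so P < 0.05
```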
The chi-square (χ²) test determines whether the frequencies in each cell of a contingency table differ from the proportions expected by chance. It is most commonly used on a 2 × 2 contingency table, represented as four cells forming a square.
A common use is to answer the following question: “Is there a difference between the occurrence of a given side effect in the drug group versus the placebo group?” In this case the table is arranged with drug versus placebo as the two rows, and side effect versus no side effect as the two columns. As the (squared) difference between the observed and expected frequencies in each cell increases, the chi-square statistic increases and the result becomes more significant; if all cells contain exactly the frequencies expected by chance, the chi-square statistic is zero. The significance of a given chi-square value depends on the degrees of freedom, which are determined by the size of the contingency table (since df = [# rows − 1] × [# columns − 1], a 2 × 2 table always has a single degree of freedom).
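A minimal sketch of the Pearson chi-square computation for a 2 × 2 table (the counts are made up for illustration):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Each cell paired with its row total and column total
    cells = ((a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d))
    chi2 = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n   # frequency expected by chance
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Rows: drug vs. placebo; columns: side effect present vs. absent
print(round(chi_square_2x2(10, 20, 20, 10), 2))  # → 6.67
```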
A correlation matrix is a table (or matrix) of the correlation coefficients for all variables of interest.
A covariate is a variable that the investigator believes may influence the outcome (or dependent) variable and that is to be statistically adjusted for. For example, in a study of a new antidepressant, the baseline level of depression of subjects in each of the two groups may be used as a covariate.
The dependent variable is usually the outcome variable of interest in a study; it is also called an “end-point.” In a study of a new antidepressant drug, a depression rating scale may be used as the dependent variable.
Descriptive statistics are statistics used to describe a single population. Those used commonly to summarize the central tendency of a group are the mean (or arithmetic average) for continuous measures and the median (or “middle score”) for ordinal or ranked measures. Descriptive statistics are also used to describe the variability within a group; these include the variance and its square root, the standard deviation, for continuous measures, and the interquartile range for ranked measures. Researchers look at descriptive statistics first to get a “big picture” of their data (and also to look for data-entry errors or obvious outliers).
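These summaries are the first computations done on any data set. A sketch using the Python standard library, with hypothetical rating-scale scores:

```python
import statistics as st

scores = [22, 25, 19, 30, 27, 24, 21]   # hypothetical rating-scale scores

print(st.mean(scores))    # central tendency for continuous measures → 24
print(st.median(scores))  # central tendency for ranked measures → 24
print(st.stdev(scores))   # variability: standard deviation ≈ 3.74
```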
Discriminant analysis is the optimal procedure for statistically distinguishing two or more groups on the basis of a set of discriminating variables. It is an important and under-used procedure. It is considered a multivariate statistical procedure because many inter-correlated variables are analyzed simultaneously.