Additional content is available online at Elsevier eBooks for Practicing Clinicians
Medicine is an information science. Information is being produced at an unprecedented rate and is readily accessible through electronic searches and hand-held devices, making the skills needed to parse and apply the appropriate information ever more important. Memorization of medical facts is less of a necessity, while processing knowledge and critical thinking remain essential for high-value medical care. Clinical decisions and recommendations are critical features of medicine and, in the midst of a rapid expansion of medical knowledge, have never been more challenging. This chapter summarizes core competencies for clinical reasoning that should be mastered by expert practicing cardiologists.
Excellent clinical decisions require a command of medical knowledge and a deep understanding of individual patients, including their preferences and goals. Good decisions take into account the limits of knowledge, uncertainty in measurements, and the play of chance. Clinical reasoning is informed by both experiential and formal knowledge learned through years of practice and study. The translation of medical knowledge into good patient-centered decisions is a key goal of clinical reasoning and is the hallmark of an expert clinician.
Clinical reasoning is often guided by simplified rules. Early in training, physicians are taught how to recognize specific clusters of signs and symptoms, place patients in diagnostic categories, and follow the rules that apply to those categories. For example, patients with particular findings might be labeled as having acute myocardial infarction (AMI), which would trigger treatment based on studies showing benefit from aspirin and beta-blocking agents. In this context, algorithmic tools are often used to direct actions. For example, guidelines recommend that a patient with a low ejection fraction should be considered for an automated implantable defibrillator, but only after considering the etiology of the systolic dysfunction and the timeframe of the disorder.
These rule-based algorithms are not intended to force actions, but to guide decisions. The best clinicians know when adherence to such algorithms is proper and when exceptions, based on the patient’s particular situation or preferences, can lead to divergence from these algorithms. Divergence from guidelines may be appropriate, but requires adequate justification, documentation, and transparency.
Most of medical decision-making, however, lies outside of simple algorithms and requires judgment. There are two major settings, related to diagnosis and treatment, where clinical reasoning is critical.
First, there are decisions about classifying an individual who presents with symptoms or signs of disease into the proper diagnostic category. Book chapters and other reference materials are usually organized according to categories, such as a medical diagnosis. The chapter informs the reader about how a particular condition, such as aortic stenosis, might manifest. These labels are useful for clustering patients by common disease mechanisms, prognosis, and responses to therapeutic strategies.
But patients often do not present to medical attention fitting perfectly into pre-specified general diagnostic categories. They seek attention for symptoms, which requires the clinician to reverse the order of a typical textbook and to work inductively from a patient’s signs and symptoms toward a diagnostic label before a therapeutic plan can be developed. For a patient with dyspnea on exertion and a systolic murmur, aortic stenosis is a possibility, but the diagnosis is not conclusive without further testing. In some cases, uncertainty persists. About a third of patients labeled with a principal discharge diagnosis of heart failure also receive treatments for other causes of dyspnea such as pneumonia or chronic obstructive pulmonary disease. This is the reality of current practice.
Second, there are decisions about treatments. These decisions are also challenging because they involve weighing risks and benefits, speculating about estimates for these parameters, and aligning choices with the preferences of the patient. The likelihood of benefit is often probabilistic, as people are pursuing strategies to reduce risk without knowing whether they themselves will benefit. These decisions can occur in prevention, which addresses whether to intervene in the interest of preventing future health problems, based on an estimate of prognosis. In this setting, the risks and costs occur immediately while the benefit is anticipated to be in the future. These decisions can also involve treatments to address symptoms as well as reduce the immediate risk for someone with acute or chronic disease.
Risk stratification is an important application of probability and is often used to estimate patient risk and assist in decision-making. This approach generally uses the results of statistical models that have identified prognostic factors and incorporated them into a tool that may assist clinicians. In recent years, many tools have been developed to assist in the rapid assessment of patients.
Recent decades have witnessed the emergence of cognitive psychology, a branch of psychology focused on how people make decisions. The field demonstrated that people frequently develop useful reasoning shortcuts to circumvent the need to explicitly calculate probabilities, but these shortcuts come with biases that can lead decision-makers to deviate from the rules of logic and probability in predictable ways. Thus, a good understanding of clinical reasoning requires knowledge about logic and probability as well as cognitive psychology.
Cognitive psychologists have demonstrated how people often rely on intuition to make decisions in uncertain settings. For cognitive psychologists, intuition is not merely guessing, but has a specialized meaning. The cognitive psychologist Herbert Simon described intuition by stating: “the situation has provided a cue; this cue has given the expert access to information stored in memory, and the information provides the answer. Intuition is nothing more and nothing less than recognition” (see Classic References). Expert clinicians learn to use intuition to recognize diagnoses and make clinical decisions. They learn to calibrate their intuitive judgments using scientific evidence and clinical experience. They may also be susceptible to cognitive biases that are associated with such decision-making.
Patients often present with descriptions of symptoms such as chest pain. Cues are scattered, like pieces of a jigsaw puzzle. Clinicians, like all decision-makers, often use mental shortcuts called heuristics to organize cues and to turn an unstructured problem into a set of structured decisions. They are taught to collect the cues of an unstructured clinical problem by using an organized history and physical examination. When experts take a history, they use a process known as early hypothesis generation to develop a list of 3 to 5 possible diagnoses very early in the process (see Classic References). This enables the questioning to become more direct and the clinician to become more engaged in the fact-finding exercise.
Studies show that the mechanism of diagnostic hypothesis generation varies, depending on the stage of training. Novice practitioners who lack clinical experience use causal reasoning, which tends to be slow and less accurate. As trainees gain experience, knowledge about diagnoses becomes encapsulated into illness scripts. An illness script is a schema or map that integrates conceptual information regarding a disease and links the concepts with case experience. As physicians gain further experience, they accrue experiential knowledge. One theory is that diagnostic experience is remembered through disease prototypes, which describe the typical features of a disease. Another theory is that experiential knowledge is remembered as specific instances called exemplars, which are memories of prior experiences that have been categorized and stored in long-term memory. With experience, a clinician accumulates exemplars that are automatically retrievable and represented in memory in a fashion that is unique to that clinician and not generalizable among clinicians. Memories of exemplars give the expert an intuitive sense of both the base rates for particular diagnostic categories and the relative frequencies of features for a diagnostic category.
Because clinicians start the diagnostic process by intuitively recognizing familiar phenotypes stored in memory as exemplars, it becomes important to study how symptoms combine in individuals as unique symptom phenotypes. A recent study showed wide variation in symptom phenotypes among patients with AMI, which may have important implications for how we teach learners to recognize a diagnosis. The study also showed that women exhibited significantly more unique symptom phenotypes than men. Greater phenotypic variation could lead to more missed diagnoses and this is a promising area for further research.
After collecting, sorting, and organizing clinical data, clinicians often use a problem list as a tool to list, group, and prioritize clinical findings. With additional clinical information, a problem statement can be defined more specifically. For example, shortness of breath may be an initial problem statement that is replaced by acute systolic heart failure, as further clinical information leads to a more refined problem statement that moves from symptom to diagnosis. They then use a differential diagnosis to expand the list of possibilities to avoid premature closure of the search for the true diagnosis. This step-by-step process enables the clinician to formulate a set of hypothetical diagnostic possibilities, which can then be tested using iterative hypothesis testing. Iterative hypothesis testing allows the clinician to narrow the list of possible diagnoses and focus on the most plausible hypothesis.
Understanding probability is essential for good clinical decision-making. Probability can be estimated for outcomes that are measured as continuous or categorical variables, as shown in Figure 5.1. The figure shows how probability of an outcome or event is distributed across a range of possibilities. For example, a laboratory test might be measured in a population of patients resulting in a distribution in which most patients are distributed to the middle of the range of possibilities and fewer scatter to the edges of the range, shown in the probability density curve in the left panel of Figure 5.1. The probability of categories or discrete variables can also be measured, as shown in the probability distribution graph in the right panel of Figure 5.1. If all of the diagnostic possibilities are mutually exclusive and collectively exhaustive, the probability of all of the possibilities will add up to 1, as shown by the red cumulative probability curves in Figure 5.1. Understanding cumulative probability is important for understanding sensitivity and specificity, as discussed below.
To test a diagnostic hypothesis, clinicians use conditional probability, which is the probability that something will happen, on the condition that something else happened. Conditional probability can inform the probability of a diagnosis, on the condition of some new information such as a positive test result. Bayesian reasoning is a mental process that allows clinicians to modify their perceptions by considering prior knowledge and updating that knowledge with new and evolving evidence. It enables formation of a probability estimate and revision of that estimate based on new information using conditional probability. For example, one might ask, what is the probability of coronary artery disease in a patient, given a positive stress echocardiogram? What is the probability of pulmonary embolus, given a negative D-dimer test? What is the probability of an acute coronary syndrome, given an abnormal troponin test? The post-test probability depends on a prior estimate of the probability for that particular patient, combined with the strength of the test result. Probability theory helps the clinician understand the question and calculate the answer.
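As a concrete sketch, the Bayesian update for a dichotomous test can be written in a few lines of Python. The sensitivity, specificity, and pre-test probability below are illustrative placeholders, not values for any particular assay.

```python
def post_test_probability(pre_test_p, sensitivity, specificity, test_positive=True):
    """Update a pre-test probability with a dichotomous test result via Bayes' theorem."""
    if test_positive:
        true_pos = pre_test_p * sensitivity
        false_pos = (1 - pre_test_p) * (1 - specificity)
        return true_pos / (true_pos + false_pos)
    false_neg = pre_test_p * (1 - sensitivity)
    true_neg = (1 - pre_test_p) * specificity
    return false_neg / (false_neg + true_neg)

# Hypothetical stress test (sensitivity 85%, specificity 80%) in a
# patient with a 30% pre-test probability of coronary artery disease:
p_pos = post_test_probability(0.30, 0.85, 0.80, test_positive=True)   # ~0.65
p_neg = post_test_probability(0.30, 0.85, 0.80, test_positive=False)  # ~0.07
```

With these made-up operating characteristics, a positive result raises a 30% pre-test probability to roughly 65%, while a negative result lowers it to about 7%, which is exactly the conditional-probability question posed above.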
Bayesian reasoning adds mathematical rigor to clinical thinking and requires both a prior estimate of probability and an estimate of the strength of a test result. Prior estimates can come from a clinician’s own experience, or published data on the prevalence of a disease. A classic paper by Diamond and Forrester provides estimates of the prevalence of coronary artery disease in patients depending on age, sex and symptom features, for example (see Classic References). This type of observational research can be used to provide the prior probabilities that are needed for Bayesian reasoning.
Understanding probability is essential to interpreting laboratory tests. A laboratory test might be measured in a population of presumably normal individuals to determine a distribution and to define a normal range, shown in the probability density curve in the left panel of Figure 5.2. A normal range is commonly defined as the inner 95% cumulative probability and the abnormal range is defined as values falling outside of the normal range as shown.
Another way of defining a test result is by measuring the test result in a group of subjects who are defined as normal and abnormal by another independent “gold standard” test, as shown in the right panel of Figure 5.2. Typically, subjects with and without disease will have test results that are distributed like bell-shaped curves. A line of demarcation can be drawn to define how a new test would separate patients with positive and negative test results. Because there is overlap in subjects with and without disease, there will be false-positive and false-negative test results, as shown.
Understanding how to use clinical testing is essential to good decision-making. The utility of a test result depends, in part, on the operating characteristics of a test, namely, the sensitivity and specificity. Both are rates: each is a proportion, and the two have different denominators, as discussed below. The terms “true positive rate” (TPR) for sensitivity and “true negative rate” (TNR) for specificity are alternative labels. Patients with and without disease are shown separately in Figure 5.3 to show the cumulative probabilities of a true positive result (sensitivity or the TPR) on the right and of a true negative result (specificity or the TNR) on the left. Sensitivity and specificity are usually shown in a 2 × 2 table but showing the TPR and TNR in Figure 5.3 demonstrates how these rates vary, depending on the location of the line of demarcation between positive and negative test results.
The complementary probability of the TNR is the false-positive rate (FPR), as shown in the top panel of Figure 5.4. Plotting the TPR (sensitivity) of a test on the y-axis and the FPR (1-specificity) on the x-axis creates a plot called a receiver operating characteristic (ROC) curve, as shown in the bottom panel of Figure 5.4. ROC curves are useful for determining the optimal cutoff point for the line of demarcation of a test.
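The construction of an ROC curve can be sketched by sweeping the line of demarcation across two overlapping sets of test values. The values below are invented purely to illustrate the mechanics, not drawn from any real assay.

```python
# Hypothetical test values in subjects with and without disease.
diseased = [4, 5, 6, 6, 7, 8, 9]
healthy = [1, 2, 3, 3, 4, 5, 6]

def roc_point(cutoff, diseased, healthy):
    """TPR and FPR when values at or above the cutoff are called positive."""
    tpr = sum(v >= cutoff for v in diseased) / len(diseased)
    fpr = sum(v >= cutoff for v in healthy) / len(healthy)
    return tpr, fpr

# Sweeping the cutoff from lenient to strict traces out the ROC curve:
points = [roc_point(c, diseased, healthy) for c in range(0, 11)]
```

Lowering the cutoff raises the TPR but also raises the FPR; the ROC curve displays exactly this trade-off, and the best cutoff depends on the relative costs of false positives and false negatives.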
The denominators of sensitivity and specificity are patients with the disease and people without the disease, respectively. In clinical practice, when test results are reported as positive or negative, however, the results are reported using terms with different denominators. A clinician wants to know the probability that a positive test result is truly positive, or the positive predictive value (PPV), and also the probability of disease given a negative test result, which is 1 minus the negative predictive value (NPV). When changing from sensitivity and specificity to the PPV and NPV, the denominators of these rates change, making it difficult for a clinician to estimate these probabilities intuitively. In addition, the PPV and NPV depend not only on the sensitivity and specificity of the test, but also on the prevalence of the target condition in a population of test subjects.
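The dependence of predictive values on prevalence is easy to demonstrate with a short calculation. The 90% sensitivity and specificity below are arbitrary illustrative values.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV of a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)  # (PPV, NPV)

# The same hypothetical test (90% sensitivity, 90% specificity)
# at two very different prevalences:
ppv_high, npv_high = predictive_values(0.90, 0.90, 0.50)  # PPV 0.90
ppv_low, npv_low = predictive_values(0.90, 0.90, 0.01)    # PPV ~0.08
```

With identical operating characteristics, the PPV falls from 90% at a prevalence of 50% to roughly 8% at a prevalence of 1%: at low prevalence, most positive results are false positives.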
The sensitivity and specificity are not fixed, and spectrum bias can result if the test subjects that defined the operating characteristics of the test are different from the subjects who are subsequently tested. If the operating characteristics of the test are defined in a narrowly defined population (left panel of Fig. 5.5), but the test is used in a broadly defined population and the line of demarcation remains fixed (right panel of Fig. 5.5), the specificity, or TNR, will decrease. This commonly occurs with tests such as troponin testing, where the clinical sensitivity and specificity of the test are defined in a research setting, but the test is used indiscriminately in practice. When used as a general screening test in a broadly defined population, the width of the distribution of the subjects with no disease widens, yet the line of demarcation remains fixed, which decreases the TNR, as shown. This issue has also been shown in genetic testing.
In practice, clinicians usually do not formally calculate Bayesian probabilities but, in general, use a heuristic that psychologists call “anchoring and adjusting.” Clinicians estimate a pretest probability (the anchor) and estimate the posttest probability by adjusting the anchor. For a patient with chest pain, for example, the anchor would be an estimate of the pretest probability of coronary artery disease, which would be intuitively adjusted on the basis of new information such as a stress test result to estimate a posttest probability. This is an expedient method for intuitively estimating conditional probability.
There are two potential problems when using this heuristic. One fallacy, called “anchoring,” is when the decision-maker becomes too anchored on the pre-test probability estimate and does not adequately adjust in estimating the post-test probability. The second fallacy is called “base-rate neglect,” when the decision maker overly responds to the new information to estimate a post-test probability, without regard for the pretest probability. For example, troponin tests may be positive because of renal failure or sepsis in patients with a low pretest probability of acute thrombotic myocardial infarction. Taking the test result at face value and initiating therapy such as antithrombotic drug therapy in such a patient would be an example of base-rate neglect.
Likelihood ratios are useful for Bayesian reasoning. The advantage of likelihood ratios is that, unlike sensitivity and specificity, they are dimensionless numbers, so the need for keeping track of what is in the numerator and denominator is alleviated. Likelihood ratios give a measure of the persuasiveness of a positive and negative test result, and can be used intuitively, or used to actually calculate posttest odds.
A likelihood ratio is defined as the percentage of patients with a disease who have a given test result divided by the percentage of patients without disease who have that same test result. Thus, a positive likelihood ratio is the percentage of patients with disease with a positive test result divided by the percentage of patients without disease with a positive test result (TPR/FPR, or sensitivity/[1 − specificity]). A negative likelihood ratio is the percentage of patients with disease with a negative test result divided by the percentage of patients without disease with a negative test result (FNR/TNR, or [1 − sensitivity]/specificity). It is easy to calculate the positive and negative likelihood ratios from sensitivity and specificity. Once calculated, these numbers can be used to multiply the pretest odds to calculate the posttest odds of a diagnosis. They are multipliers, so a higher positive likelihood ratio, and a lower negative likelihood ratio (which is a fraction) have stronger multiplying effects. A likelihood ratio that is close to 1 is weak because it would have very little multiplying effect, meaning it has little effect on the pre-test assessment.
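These definitions translate directly into code. The sketch below reuses a hypothetical test with 85% sensitivity and 80% specificity to show the odds-multiplication step; the numbers are illustrative only.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)  # TPR / FPR
    lr_neg = (1 - sensitivity) / specificity  # FNR / TNR
    return lr_pos, lr_neg

def update_probability(pre_test_p, lr):
    """Convert probability to odds, multiply by the LR, convert back."""
    post_odds = (pre_test_p / (1 - pre_test_p)) * lr
    return post_odds / (1 + post_odds)

lr_pos, lr_neg = likelihood_ratios(0.85, 0.80)        # 4.25 and ~0.19
p_after_positive = update_probability(0.30, lr_pos)   # ~0.65
p_after_negative = update_probability(0.30, lr_neg)   # ~0.07
```

Note that the posttest probabilities obtained this way are identical to those from a direct Bayesian calculation; the likelihood ratio is simply a repackaging of the same information as a single dimensionless multiplier.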
Figure 5.6 shows how the probability estimate of a diagnosis can shift depending on a test result. After choosing a pre-test probability estimate on the x-axis, one can trace up to either the upper curve for a positive test result or the lower curve for a negative test result, then trace over to the y-axis to read the post-test probability estimate. The diagonal line shows that there would be no change in probability for a test with a likelihood ratio of 1. A higher positive likelihood ratio or a lower negative likelihood ratio would result in positive or negative test result curves with greater deviation from the diagonal line, representing a greater shift in the post-test probability estimate based on the test result.
Some tests are asymmetrical, meaning that either their positive or negative likelihood ratio is stronger. For example, Figure 5.7, Panel A shows the probability of congestive heart failure based on congestion on a chest x-ray, which has a very strong positive likelihood ratio of 13.5 and a relatively weak negative likelihood ratio of 0.48. This reflects the fact that the chest x-ray is highly specific but not very sensitive for heart failure. In other words, congestive findings on a chest x-ray are highly suggestive of heart failure, but their absence is not strong reassurance about the lack of heart failure. Tests that are highly specific are better for ruling in a diagnosis and this can be remembered using the mnemonic “SpPin.” (Highly specific tests, if positive, are good for ruling in.)
On the other hand, Figure 5.7, Panel B shows that a D-dimer for pulmonary embolus has a very strong negative likelihood ratio of 0.09 and a modest positive likelihood ratio of 1.7. This reflects the fact that a D-dimer is highly sensitive but not very specific for a pulmonary embolus. Tests that are highly sensitive are better for ruling out a diagnosis and this can be remembered using the mnemonic “SnNout.” (Highly sensitive tests, if negative, are good for ruling out.)
The likelihood ratios, however, are only as useful as the sensitivity and specificity that are used to calculate them. They give an approximate quantitative estimate of the strength of new information that provides a mechanism for calibrating intuitive probability estimates.
Clinical reasoning should guide not only test interpretation, but also test ordering. Tests that are ordered for good reasons are more conclusive, and tests that are ordered indiscriminately can cause clinicians to make poor judgments. Ideally, a test should be used to validate or reject an articulated hypothesis—a plausible conjecture that is generated by a patient’s condition. Ideally clinicians think ahead about what they would do with test results.
To aid with test selection and avoid over-testing, the American College of Cardiology (ACC) and other organizations have developed appropriate use criteria to guide clinicians’ decisions about ordering selected cardiac tests. This effort is driven by both a need to avoid excessive false-positive test results and also the need to contain the costs of medical care. The goal of appropriate use guidelines is to reduce overuse errors and to maximize the value of diagnostic testing and procedures. The general principle of any test-ordering strategy is that a plausible hypothesis (a provisional diagnosis) should be formulated first, followed by testing. The appropriate use criteria are designed to avoid testing when the results are unlikely to improve patient care and outcomes.
Recent ACC/American Heart Association (AHA) guidelines have promoted the provision of preventive treatments according to an individual’s risk of adverse outcomes. The premise is that low-risk people have little to gain by preventive interventions, while high-risk individuals may have a lot to gain. These guideline recommendations emphasize the need to consider categories based on estimates of risk and prognosis, rather than merely diagnostic labels, such as hyperlipidemia. It is important for clinicians to understand the provenance of the risk scores and their performance, including in diverse populations, to know whether the tools are useful. After calculating the risk, the challenge for clinicians is communicating risk to patients in an understandable fashion. Investigators have provided infographics that can communicate risk and risk reduction in order to facilitate a discussion regarding long-term treatment options to diminish risk, and to compare the degree of risk reduction with potential side effects and costs of treatment (see Fig. 5.8). Because clinicians vary in their use of qualitative terms such as “high risk,” there is a need to provide clear and understandable quantitative estimates.
A preventive or therapeutic decision is a structured choice. These decisions require medical knowledge and a balanced sense of risks and benefits, as well as knowledge of patients’ preferences, to make optimal therapeutic decisions.
Clinical trials report the average risk of an outcome for patients in a treatment group and in a comparison group. There may be heterogeneity of the treatment effect, in which some patients may receive a marked benefit and others receive no benefit at all. Subgroup analysis and tests for interaction can provide hints, but usually heterogeneity of treatment effect is not readily apparent, creating a challenge for clinicians trying to personalize treatment decisions. In a key example of heterogeneity, fibrinolytic therapy was effective in the treatment of suspected AMI and subgroup analyses revealed the benefit to be substantial in patients with ST-elevation but not in those without it. The challenge is that subgroup analyses introduce the possibility that associations have occurred only by chance. In the Second International Study of Infarct Survival (ISIS-2), the authors provided perspective on subgroup analyses by demonstrating that patients born under the astrological signs of Gemini or Libra were significantly less likely to benefit from fibrinolytic therapy. Thus, subgroup analysis is capable of producing important insights, but must be interpreted with caution.
A weakness of relative benefit estimates is that they do not convey information about what is achieved for patients at varying levels of risk. A small relative reduction in risk may be meaningful for a high-risk patient, while a large relative reduction may be inconsequential for a very low-risk patient. Absolute risk reduction, the difference between two rates, varies with the risk of an individual patient. For example, a risk ratio of 2.0 does not distinguish between baseline risks of 100% and 50% and between 0.1% and 0.05%. In one case, the absolute difference is 50% (5000 per 10,000) and in the other, it is 0.05% (5 per 10,000). In one case, 1 person out of 2 benefits and in the other, 1 out of 2000 benefits. Unfortunately, absolute benefit is not emphasized adequately in many articles.
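A short calculation makes the point: the same risk ratio of 2.0 yields very different absolute benefits at different baseline risks.

```python
def arr_from_risk_ratio(baseline_risk, risk_ratio):
    """Absolute risk reduction when treatment divides the baseline risk by risk_ratio."""
    treated_risk = baseline_risk / risk_ratio
    return baseline_risk - treated_risk

arr_high = arr_from_risk_ratio(1.00, 2.0)   # 0.50: 5000 per 10,000; 1 in 2 benefits
arr_low = arr_from_risk_ratio(0.001, 2.0)   # 0.0005: 5 per 10,000; 1 in 2000 benefits
```

The risk ratio is identical in both calls; only the baseline risk, and therefore the absolute benefit, differs.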
Risk prediction is critically important for calculating the expected absolute risk reduction. In recent years, many tools have been developed to assist in the rapid assessment of patient risk, with variable uncertainty about their comparative performance.
In evaluating studies of risk prediction, it is important to consider whether the approach has been validated in populations similar to the patients to whom it is applied in practice. The predictors should be collected independently of knowledge of the outcome. The outcome and timeframe should be appropriate for clinical decisions and the value of the prediction should be clear. Appropriate risk prediction can assist in calculation of absolute benefit and put the balance of risks and benefits of an intervention in proper perspective.
Several studies have shown a risk-treatment paradox in which the higher-risk patients are least likely to receive interventions that are expected to provide a benefit. This pattern is paradoxical because the high-risk patients would be expected to have the most to gain from an intervention that reduces risk, assuming that the relative reduction in risk is constant across groups defined by their baseline risk. The source of the paradox is not known, although some have suggested that it is related to an aversion to the treatment of patients with a limited functional status, or concern for a greater degree of harm from the same therapy.
Cardiovascular drugs and procedures are often double-edged swords, having both benefit and harm. Also, patients may have strong preferences about potential benefit and harm. For example, a patient may have a strong fear of a side effect such as a cerebrovascular accident that may overwhelm other considerations about a treatment decision. It is important to engage patients and families in a discussion to explain the considerations that go into therapeutic decisions, particularly for nuanced decisions about treatments that have substantial risks in addition to potential benefits.
Absolute risk reduction is better than relative risk reduction for estimating a treatment effect. Its inverse, termed the number needed to treat (NNT), is even more intuitive.
Consider a trial with a combined event rate of 10% in the treatment group and a 15% risk in the control group, giving an absolute risk reduction of 5%. This means that 5 events are avoided for every 100 patients in the treatment group. The reciprocal of this relationship indicates that there would be 100 patients treated for every 5 events avoided. By dividing 100 by 5, which reduces the denominator to 1, there would be 20 patients treated per 1 event avoided. Thus, the NNT is 20. For NNT, the smaller the number, the better.
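This arithmetic can be checked directly; the 15% and 10% event rates are the ones in the example above.

```python
def nnt(control_event_rate, treatment_event_rate):
    """Number needed to treat: the reciprocal of the absolute risk reduction."""
    return 1 / (control_event_rate - treatment_event_rate)

nnt_example = nnt(0.15, 0.10)  # ARR 5% -> NNT 20
```

The same function recovers other scenarios: a 7.5% baseline risk with a 20% relative risk reduction (a treated risk of 6%) gives nnt(0.075, 0.06), about 67.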
NNT and absolute risk reduction depend on both the relative risk reduction and the baseline risk. For conditions with a high baseline risk, the NNT can become very small (desirable). As an extreme example, for a patient with ventricular fibrillation, the baseline risk of dying without defibrillation is 100%, making the NNT for defibrillation (if always effective) equal to 1.
Primary prevention with statin drugs has a relative risk reduction of about 20% over the several-year course of a typical prevention trial. The absolute risk reduction and NNT depend upon the baseline risk, which varies depending on a number of factors. At a baseline risk of 7.5%, the absolute risk reduction would be 1.5% and the NNT would be 67, a fairly high number, which suggests marginal benefit at this level of baseline risk.
NNT is a useful intuitive tool for comparing the efficacy of various treatment strategies. NNT is also a useful way to summarize the findings of a clinical trial in a single declarative sentence. For example, the PARTNER-3 trial had an NNT of 16, meaning one would need to treat 16 low-risk aortic stenosis patients with a transcatheter aortic valve replacement to prevent one composite endpoint of death, stroke, or rehospitalization over 1 year. The EMPEROR-Reduced trial had an NNT of 19, meaning one would need to treat 19 patients with class II to IV heart failure and an ejection fraction of ≤40% with a sodium-glucose cotransporter 2 (SGLT-2) inhibitor for 16 months to prevent one death or hospitalization for worsening heart failure. With NNT, a single sentence can provide the trial name, the magnitude of the treatment effect, the trial’s entry criteria, the study drug or intervention, the duration of the trial, and the outcome measure. NNTs of 16 and 19 suggest that these treatments are highly effective, although some patients may not consider it worth going through the treatment if 15 of 16 people have the same outcome regardless of whether they received the intervention.
NNT is also a very personal notion of the probability of a treatment effect. Imagine bringing 19 untreated patients with congestive heart failure and an ejection fraction of ≤40% into a room and saying, “If all of you start on an SGLT-2 inhibitor, over the next 16 months, one of you will experience the benefit.” Capturing the essence of a treatment effect with NNT is a useful way to intuitively convey the impact of a treatment effect. This knowledge, packaged in a way that is more intuitive, can make it easier to combine this medical knowledge with the preferences and values of individual patients to make the best therapeutic decisions.
Nevertheless, there are limitations to NNT. NNT is an index of an average treatment effect over time and does not provide information about whether the treatment effect is immediate, delayed, or highly variable. NNT also does not reveal meaningful heterogeneity of effect among subgroups: the NNT is often calculated on the assumption of a uniform relative effect of therapy, with the NNT varying only with the baseline risk.
Science is a quantitative discipline that uses numbers to measure, analyze, and explain nature. Evidence-based medicine has been defined by David Sackett as “the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.” To practice evidence-based medicine, clinicians must monitor continually for new research findings and must possess a basic knowledge of statistics to draw proper inferences from clinical research.
When using statistics to compare two groups, the standard method is to assume that there is no difference between the two groups, the so-called null hypothesis. The trial results are reported along with a p value, which is the probability of observing the difference reported in the trial, or a more extreme difference, given the assumption that the null hypothesis is true (i.e., there is no real difference between the groups). When a trial is designed, the investigators estimate the sample sizes required to avoid claiming that there is a difference between treatment groups when there really is none (a type I error or alpha error) or claiming that there is no difference between treatment groups when there really is one (a type II error or beta error). Similar to a clinical test like a stress test that can have false-positive and false-negative results, clinical trials can have false-positive results (alpha errors) and false-negative results (beta errors). A trial with adequate sample size and rigorous statistical methods should allow investigators to avoid these errors.
When a trial is designed, the alpha level is usually set at 0.05. If the p value of the observed data is less than 0.05, one can conclude that a very improbable event occurred, a less than 1-in-20 event, assuming the null hypothesis is valid. According to the frequentist notion of statistics, one imagines that repeating a trial many times would create a distribution of possible trial results. The p value tells us where the observed results of a particular trial would sit in that imaginary distribution of trial results.
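The frequentist idea of an imaginary distribution of repeated trials can be sketched with a small simulation (all numbers here are hypothetical, not drawn from any real trial): generate many two-arm trials under the null hypothesis of a shared event rate, then ask how often chance alone produces a difference at least as large as an observed one.

```python
# Minimal frequentist sketch: simulate many two-arm trials under the null
# (both arms share the same true event rate) and locate a hypothetical
# observed difference within that null distribution of trial results.
import random

random.seed(0)
N, EVENT_RATE = 500, 0.20       # patients per arm; shared true event rate
OBSERVED_DIFF = 0.06            # hypothetical observed absolute difference

def simulated_diff() -> float:
    """Absolute difference in event proportions between two arms under the null."""
    a = sum(random.random() < EVENT_RATE for _ in range(N))
    b = sum(random.random() < EVENT_RATE for _ in range(N))
    return abs(a - b) / N

null_diffs = [simulated_diff() for _ in range(2000)]
p_value = sum(d >= OBSERVED_DIFF for d in null_diffs) / len(null_diffs)
print(f"simulated two-sided p ≈ {p_value:.3f}")
```

The estimated p value is simply the fraction of null-hypothesis trials whose difference equals or exceeds the observed one, which is exactly where the observed result "sits" in the imaginary distribution described above.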
Because the p value is so commonly used in clinical research, clinicians need to be aware of several key issues. First, the threshold of 0.05 for statistical significance is arbitrary. A p value of 0.04 implies that data this extreme would occur 4% of the time if the null hypothesis were true, whereas a p value of 0.06 implies they would occur 6% of the time. Is the difference between 6% and 4% enough to reject the null hypothesis in one case and accept it in the other? Clinicians should understand that p values are continuous values and are just one piece of information needed to assess a trial. Second, p values do not inform clinical importance. A large study sample can produce a small p value despite a clinically inconsequential difference between groups. Clinicians need to examine the size of the effects in addition to the statistical tests of whether the results could have occurred by chance.
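The second point can be illustrated with a standard two-proportion z-test (trial sizes and event rates here are hypothetical): the same clinically trivial 1% absolute difference is nowhere near significant in a modest trial yet highly significant in a very large one.

```python
# How sample size drives the p value: a fixed 1% absolute difference in
# event rates (20% vs. 19%) tested at two different per-arm sample sizes,
# using a conventional two-proportion z-test.
import math

def two_proportion_p(p1: float, p2: float, n: int) -> float:
    """Two-sided p value comparing event proportions in two arms of size n."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)  # standard error of the difference
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail probability

print(f"n=500 per arm:    p = {two_proportion_p(0.20, 0.19, 500):.3f}")
print(f"n=50,000 per arm: p = {two_proportion_p(0.20, 0.19, 50_000):.6f}")
```

The effect size is identical in both cases; only the precision changes, which is why the magnitude of the effect must be judged separately from its statistical significance.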