The key take-home messages from this chapter are how to:
Gather existing evidence on outcomes
Interpret evidence on outcomes
Design new studies to add evidence on outcomes
Synthesize existing evidence, as in systematic reviews and meta-analysis
Recognize the importance of an essential dimension of evidence: the patient’s and the third-party payer’s perspective
Consider the cost dimension in achieving outcomes
Understand the challenges and approaches in translating good evidence into daily clinical practice.
Physicians must make decisions on difficult and sometimes complex problems under conditions of uncertainty – often with incomplete, inaccurate, or outdated information. The goal of outcomes research is to improve the information available to make these complex decisions. It is important to understand that measuring outcomes is not just about the final result of our interventions. More importantly, it is a means by which we gather evidence to improve decision-making, the processes of care, and the systems in which we work. Maximizing patient care and surgical outcomes requires continuous efforts to identify information needs and to produce the best evidence to fulfill these needs.
“Evidence-based medicine” is defined as the integration of the best available research evidence with clinical expertise and patient values. There are two ways to obtain best evidence for decision-making to improve quality of care ( Fig. 11.1 ). The first is to evaluate existing data and identify those reports that are convincing based on a good study design and clinically meaningful outcomes. To find the best available evidence, one must be proficient in performing an effective literature search. The authors recommend consulting a health sciences librarian at the start of any research project to ensure the search for the available evidence is comprehensive, the appropriate search terms are used, concepts are structured appropriately, and the necessary databases are included. When the current evidence is inadequate, the second choice is to produce new credible evidence. This section will describe a structure for evaluating existing literature.
Best evidence requires three things: (1) a study design suited to the research question; (2) an appropriate statistical analysis; and (3) clinically meaningful outcome measures. In a thorough review, Offer and Perks enumerated the challenges that plastic surgeons face when trying to practice evidence-based medicine, citing the lack of quality evidence in our literature as a major factor. Using Levels of Evidence (LOE) as a surrogate for quality, this finding was substantiated by Burns, Rohrich, and Chung, who in 2011 reported that the majority of articles published in Plastic and Reconstructive Surgery over a 20-year period were Level IV or V evidence (i.e., case series and case reports). The authors concluded that while the level of evidence was improving, there remained a long way to go. In 2019, Sugrue, Joyce, and Carroll confirmed that the quality of plastic surgery research continues to improve: by their analysis, the percentage of higher LOE studies (Levels I and II) increased from 6.6% in 2008 to 15.7% by the end of their study period. However, lower LOE studies (Levels IV and V) still made up half (49.6%) of publications. These Level IV and Level V studies often represent descriptions of procedures by individual surgeons and the “outcomes” of those procedures in that surgeon’s hands. Known as case reports or case series, they are unstructured, uncontrolled, and non-randomized, and have limited applicability to other practices. Not only is this weak evidence on which to base medical decisions; it is also an unacceptable basis for such decisions in this era of healthcare reform and pay-for-performance (P4P). As depicted in Fig. 11.1, the stronger the evidence, the better the care one is able to deliver.
Practicing evidence-based medicine involves: (1) converting the need for information about a diagnosis, prognosis, or intervention into an answerable question, (2) searching for the best available research to answer that question, (3) critically appraising the evidence for its validity, (4) integrating critical appraisal with clinical expertise given a patient’s unique values and circumstances to guide management, and (5) evaluating our efficiency and effectiveness in steps 1 through 4 and seeking ways to improve. Many clinical questions remain unanswered because of problems formulating relevant questions, insufficient access to information resources, and a lack of search skills. Today, there are a variety of strategies and web-based resources that allow searches of the relevant literature to answer many clinical questions. Some examples are provided below.
The US National Library of Medicine provides access to more than 32 million citations through PubMed, which draws references from MEDLINE and directly from journals. Information from PubMed searches has been shown to significantly improve both patient care and health outcomes.
Clinical Queries is a feature of PubMed that can help identify citations with the study design of interest. It can link the type of question (e.g., intervention, diagnosis, natural history, and outcome) to a search strategy that specifies the desired study design. As the best evidence is most likely found within systematic reviews/meta-analyses, the authors recommend that researchers go directly to these study designs when attempting to answer a clinical or research question.
The critical first step in the pursuit of evidence-based medicine is to ask a well-formulated question. Without sufficient focus and specificity, an otherwise relevant and important clinical question can be mired in irrelevant evidence. When the PICO framework is used with the PubMed Clinical Queries (see above), it has been shown to improve the efficiency of the literature search. The acronym PICO stands for Patient problem, Intervention, Comparison, and Outcome, and is a strategy to pose a well-formulated question. The PICO framework is often expanded to include a Time horizon (T) component. Specifically, this is a standardized time point at which the investigator plans to measure the outcome of interest. In a literature search for evidence in plastic surgery, for example, a question framed using the PICOT approach might be expressed this way: In patients undergoing breast reduction (P), does preoperative antibiotic prophylaxis (I) versus no prophylaxis (C), reduce the rate of postoperative infections (O) at 4 weeks postoperatively (T)?
Systematic reviews are evaluations of the literature conducted according to clearly stated, scientific research methods that are designed to minimize the risk of bias and the errors associated with traditional literature reviews. The statistical components of a systematic review (i.e., summary effect estimates, tests of heterogeneity and publication bias) are referred to as a meta-analysis. The review process is based on a comprehensive and unbiased search of the literature using defined criteria and includes a thorough evaluation of the quality and validity of the studies identified in the search process.
The best-known source for systematic reviews is the Cochrane Database of Systematic Reviews. The Cochrane Group also publishes a handbook for systematic reviews of interventions that outlines the core methods for preparing a review. Systematic reviews are influential tools in supporting evidence-based practice, and some consider them to provide stronger evidence than randomized controlled trials. Moreover, they are essential for summarizing existing data, thereby avoiding the wasted effort of unnecessarily duplicating previous studies. However, a drawback of systematic reviews is that they can produce evidence that appears reliable at first glance but, on closer inspection, is at high risk of bias if the included individual studies are of poor methodological quality. This is particularly relevant in systematic reviews of observational data, where the results of individual studies are often not adjusted for potential confounding variables. To see examples of rigorously conducted (high-quality) systematic reviews, the reader is referred to the online Cochrane reviews database.
Conceptually, a meta-analysis is the statistical combination of multiple quantitative studies addressing a particular research question in order to increase the effective sample size, thereby increasing the overall power and strengthening the conclusions beyond those that can be drawn from any individual study. This analysis allows researchers to reach a more reliable conclusion even when individual studies report conflicting results. Ideally, a meta-analysis is conducted using the highest level of evidence, corresponding to the least potential for bias, such as randomized controlled clinical trials. Although meta-analysis can also be conducted on cohort studies and even case series, the quality of the evidence and, therefore, the conclusions are weaker. Methods to evaluate risk of bias within meta-analyses are highlighted in the Cochrane handbook for systematic reviews of interventions and include the Cochrane Risk of Bias 2 (RoB 2) tool and the ROBINS-I tool for randomized and non-randomized studies of interventions, respectively.
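To make the pooling step concrete, below is a minimal sketch of a fixed-effect, inverse-variance meta-analysis in Python. The study estimates and standard errors are purely hypothetical (they are not taken from any published trial); the point is only to show how each study's weight is the inverse of its variance, so that larger, more precise studies contribute more to the summary estimate.

```python
import math

# Hypothetical per-study log odds ratios and standard errors
# (illustrative numbers only, not data from any real trial).
log_or = [-0.45, -0.20, -0.60, -0.35]
se = [0.30, 0.25, 0.40, 0.20]

# Fixed-effect (inverse-variance) weights: w_i = 1 / SE_i^2.
weights = [1 / s ** 2 for s in se]

# Pooled log odds ratio: weighted average of the study estimates.
pooled = sum(w * y for w, y in zip(weights, log_or)) / sum(weights)

# Standard error of the pooled estimate shrinks as studies accumulate.
pooled_se = math.sqrt(1 / sum(weights))

# Back-transform to the odds-ratio scale with a 95% confidence interval.
or_pooled = math.exp(pooled)
ci = (math.exp(pooled - 1.96 * pooled_se), math.exp(pooled + 1.96 * pooled_se))
print(f"Pooled OR {or_pooled:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```

Note how the pooled confidence interval is narrower than that of any single contributing study, which is precisely the gain in power described above.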
The meta-analysis has some disadvantages. First, meta-analyses have been criticized for both overly broad inclusion criteria and, conversely, for overly strict inclusion criteria. Either may result in degradation of the results. Inclusion criteria that are overly broad may lead to inclusion of studies of lesser quality or with less reliable results. Very strict inclusion criteria may mean that the studies that are included have limited generalizability. Another disadvantage is that meta-analyses require a significant time commitment from the research team.
The technique of meta-analysis may have limited applicability in plastic surgery, where there are few randomized trials and the outcomes from one study to another may vary considerably. An example of a meta-analysis can be seen in Fig. 11.2. This figure demonstrates an analysis of randomized trials evaluating the impact of tamoxifen on survival. As the reader can observe, there are multiple studies, with the findings of each plotted on a central axis. The summary analysis, which incorporates data from all of these studies, appears at the bottom of the vertical axis and suggests a benefit for the use of tamoxifen in early breast cancer. As can be seen, the 95% confidence interval for that summary measure is quite narrow, implying a more precise estimate of the true benefit of tamoxifen. Also note that each study's weight in computing the summary estimate is determined by its precision, which depends largely on its size.
More recently, Norman et al. conducted a Cochrane systematic review and meta-analysis of randomized controlled trials comparing the incidence of surgical site infections (SSIs) in patients who receive negative pressure wound therapy (NPWT) versus standard dressings for surgical wounds healing by primary intention. The authors identified 31 studies (6204 patients) that evaluated this outcome across various surgical incisions. When these studies were pooled in a meta-analysis, the results demonstrated a moderate reduction in the incidence of SSI with NPWT versus standard dressings (RR 0.66, 95% CI 0.55 to 0.80), corresponding to 43 fewer SSIs per 1000 people. Overall, the authors concluded that “NPWT probably decreases the incidence of surgical site infection compared with standard dressings”. To see additional examples of good-quality meta-analyses, the reader is referred to the online Cochrane reviews.
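As an aside, the “43 fewer SSIs per 1000 people” figure is simple arithmetic once a baseline (control-group) risk is assumed. The sketch below assumes a baseline risk of about 126 SSIs per 1000 with standard dressings, a value chosen for illustration because it reproduces the review's absolute effect; the review's own assumed baseline may differ slightly.

```python
# Converting a relative risk into an absolute effect per 1000 patients.
# The baseline risk is an assumption for illustration only.
baseline_risk = 126 / 1000  # assumed SSI risk with standard dressings
rr = 0.66                   # pooled risk ratio, NPWT vs. standard dressings

risk_with_npwt = baseline_risk * rr
fewer_per_1000 = (baseline_risk - risk_with_npwt) * 1000
print(f"{fewer_per_1000:.0f} fewer SSIs per 1000 people")  # ~43
```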
Traditional systematic reviews seek to compare two interventions (e.g., placebo vs. intervention, or intervention A vs. intervention B) in the form of a pairwise meta-analysis. This technique is particularly useful when comparing a new intervention to placebo or to the standard of care using direct head-to-head clinical trials. Notably, a limitation of this analysis is that it does not permit the comparison of several interventions to determine their relative effectiveness. This is particularly relevant in the field of plastic surgery, where several competing techniques may be used to address the same problem. For example, when comparing interventions for the treatment of keloid scarring, a surgeon may wish to compare the relative effectiveness of various interventions on the incidence of lesion recurrence, including: (1) surgical excision alone, (2) surgical excision with skin grafting, (3) surgical excision with pressure dressing, (4) surgical excision with radiation, and (5) surgical excision with adjuvant steroid. Traditional pairwise meta-analyses do not allow conclusions regarding the relative effectiveness of these treatments unless they have been compared directly in head-to-head trials. Unfortunately, direct comparisons of multiple interventions are infrequently reported within the surgical literature.
Another form of meta-analysis, known as the “network meta-analysis”, was developed to address this limitation. Specifically, network meta-analyses seek to provide an effect size estimate for all possible pairwise comparisons, irrespective of whether these interventions have been compared directly within traditional randomized or observational head-to-head study designs. This technique proposes that one can estimate the relative effect of two interventions (i.e., treatment A and treatment B), even if they have not been compared directly, provided both have been directly compared to a common third intervention (i.e., treatment A vs. treatment C; treatment B vs. treatment C). These techniques rely on the assumption that all relevant studies have similar key characteristics (i.e., patient inclusion criteria, outcome measurement, and risk of bias) such that they can be combined to make both direct and indirect comparisons. For example, a network meta-analysis evaluating the effect of various interventions on keloid scar recurrence by Siotos et al. concluded that the odds ratios (OR) of keloid recurrence relative to no excision were as follows: excision with pressure (OR 0.18 [95% CI, 0.01–7.07]); excision with 2 adjuvant drugs (OR 0.47 [95% CI, 0.02–12.82]); excision with radiation (OR 0.39 [95% CI, 0.04–3.31]); excision and skin grafting (OR 0.58 [95% CI, 0.00–76.10]); excision with 1 adjuvant drug (OR 1.76 [95% CI, 0.17–21.35]); and excision alone (OR 2.17 [95% CI, 0.23–23.95]).
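The arithmetic behind an indirect comparison can be illustrated with the classic Bucher approach, which network methods generalize: on the log scale, the indirect effect of A versus B is the difference between the direct effects of A versus C and B versus C, and their variances add. The numbers below are hypothetical and are not drawn from the Siotos analysis.

```python
import math

# Hypothetical direct comparisons against a common comparator C
# (log odds ratios and standard errors are illustrative only).
log_or_ac, se_ac = -0.70, 0.35  # treatment A vs. C (direct)
log_or_bc, se_bc = -0.20, 0.30  # treatment B vs. C (direct)

# Bucher indirect comparison: (A vs. B) = (A vs. C) - (B vs. C).
log_or_ab = log_or_ac - log_or_bc
se_ab = math.sqrt(se_ac ** 2 + se_bc ** 2)  # variances add for a difference

or_ab = math.exp(log_or_ab)
ci = (math.exp(log_or_ab - 1.96 * se_ab), math.exp(log_or_ab + 1.96 * se_ab))
print(f"Indirect OR, A vs. B: {or_ab:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```

Because the variances add, the indirect estimate is less precise than either direct comparison, which is one reason the confidence intervals in network meta-analyses (as in the keloid example above) are often wide.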
By including effect estimates from direct and indirect comparisons, network meta-analyses enable researchers and clinicians to simultaneously compare multiple interventions and rank treatments. While guidance on how to perform a network meta-analysis remains beyond the scope of this chapter, we refer readers to the “Undertaking Network Meta-analyses” chapter in the Cochrane Handbook for Systematic Reviews of Interventions. We strongly recommend consultation with a health research methodologist prior to embarking on this study design, as these techniques can be challenging to undertake and interpret without formal training.
A cornerstone of evidence-based medicine remains the hierarchical system of classifying evidence, commonly referred to as the “Levels of Evidence”. Physicians are encouraged to find the highest level of evidence, ranked according to the probability of bias, to address their research questions. However, a study design’s position atop the level-of-evidence pyramid does not always translate to better quality research. Rather, it is the clinical research question that determines the selection of the study design. For some research questions (e.g., evaluating harm), higher level-of-evidence study designs (i.e., randomized controlled trials) are not feasible, and investigators have no choice but to implement a study design lower on the pyramid (i.e., a case–control study). Thus, research quality should be judged by the efforts made to minimize bias, irrespective of the study design used.
Broadly, study designs can be divided into experimental and observational studies. Experimental studies, which include randomized controlled clinical trials, test a hypothesis by examining the impact of an intervention on the outcome and often occupy the top rungs of the level of evidence pyramid. In contrast, observational studies, which include cohort studies, case–control studies, case series, and case reports, describe the natural history or incidence of disease or analyze associations between risk factors and the outcomes of interest. These study designs are often influenced by both known and unknown potential confounders that may lead to bias in the results. As one might expect, there is generally a direct correlation between the complexity of a study design and the quality of the resulting data ( Table 11.1 ).
Table 11.1 Levels of evidence

| Level of evidence | Type of studies |
| --- | --- |
| I | High-quality single- or multicenter randomized trials (i.e., adequate power, proper allocation concealment, sufficient blinding of outcome assessors), or systematic reviews of these studies |
| II | Low-quality randomized controlled trials (i.e., insufficient power, no or improper allocation concealment, unblinded outcome assessors), prospective cohort studies, or systematic reviews of these studies |
| III | Retrospective cohort studies, case–control studies, or systematic reviews of these studies |
| IV | Case series |
| V | Expert opinion or case reports |
The randomized controlled trial (RCT) is perhaps the most complex of the experimental study designs and is viewed as the gold standard. The primary advantage of a trial over other study designs is “randomization”. Specifically, randomly assigning an intervention minimizes the influence of known and unknown confounding variables that may bias the study results. Furthermore, “blinding” functions to preserve the benefits of randomization and prevents bias in the adjudication of the intended outcome. The design of the RCT evolved over the course of many years and each component of the design attempts to minimize the influence of study bias and confounders on the results. The goal is to create a sample population that is truly representative of the whole, often referred to as the “external validity” of a study. In this way, the results of a limited trial can be generalized to the population at large with confidence.
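For readers curious about the mechanics of allocation, assignments in a two-arm trial are often generated in advance as a permuted-block sequence, which keeps group sizes balanced as enrollment proceeds. The sketch below is illustrative only; real trials use validated randomization systems with concealed allocation.

```python
import random

def permuted_block_sequence(n_patients, block_size=4, seed=2024):
    """Generate a 1:1 allocation list using permuted blocks.

    Within each block, half the slots are 'A' and half are 'B',
    shuffled so upcoming assignments are unpredictable while group
    sizes stay balanced after every completed block.
    """
    rng = random.Random(seed)  # fixed seed here only for reproducibility
    sequence = []
    while len(sequence) < n_patients:
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_patients]

print(permuted_block_sequence(12))  # e.g., ['A', 'B', 'B', 'A', ...]
```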
According to the National Institutes of Health, clinical trials are conducted in four phases, each serving a different purpose and helping researchers answer different questions ( Table 11.2 ). In phase I, researchers evaluate an experimental drug or intervention in a small number of test subjects (20–80) to evaluate its safety, determine a dose–response curve, and identify potential side-effects. In phase II, the experimental drug or intervention is given to a larger group of subjects (100–300) to test its effectiveness and further evaluate its safety profile. In phase III, the experimental drug or intervention is given to a much larger group of subjects (1000–3000) in order to “confirm its effectiveness, monitor side effects, compare it to commonly used treatments, and collect information that will allow the experimental drug or treatment to be used safely”. In phase IV, post-marketing surveillance is continued to identify additional information on the risks, benefits, and optimal use of the intervention.
Table 11.2 The four phases of clinical trials

| Phase | Purpose |
| --- | --- |
| I | To test an experimental drug or treatment in 20–80 people to evaluate safety, determine a safe dosage range, and identify side effects |
| II | To test the experimental drug or treatment in 100–300 people to determine effectiveness and evaluate its safety |
| III | To test the experimental drug or treatment in 1000–3000 people to confirm effectiveness, monitor side effects, compare it to commonly used treatments, and collect information to allow for its safe use |
| IV | Post-marketing studies to learn more about the drug’s or treatment’s risks, benefits, and optimal use |
Unfortunately, RCTs are expensive and sometimes impractical for answering questions that compare one surgical intervention to another. While patients are often willing to be randomized to a pill versus a placebo, far fewer are willing to be randomized to one of two surgical procedures or to a sham procedure, leading to smaller sample sizes. Further, surgeons often have a strong preference for one type of surgery over another and are therefore unwilling to randomize patients. Finally, the ethical basis for an RCT requires genuine uncertainty as to whether a novel intervention is beneficial or harmful compared to placebo or the standard of care, a condition termed clinical “equipoise”.
Expertise-based RCTs can overcome major challenges unique to RCTs involving surgical interventions. When comparing surgical interventions, keeping the surgeon blinded to the surgical intervention is, not surprisingly, quite challenging. In a systematic review of 173 plastic surgery RCTs comparing surgical interventions, blinding the surgeon was found to be impractical in almost every RCT (97%). In addition, when a novel surgical intervention is being evaluated, the learning curve for the novel technique may be difficult for all participating surgeons to overcome.
To address these challenges, the expertise-based randomized controlled trial (EB-RCT) study design can be implemented. Two distinct groups of surgeons, each with comparable expertise in one of the two surgical interventions (surgery A or surgery B), participate, and each group is assigned to its respective intervention. Thus, patients are randomized to receive surgery A (from the group of surgeons with expertise in surgery A) or surgery B (from the group of surgeons with expertise in surgery B). This method was first proposed in 1980 by Van der Linden but has been applied sparingly since.
The EB-RCT design ensures comparable skill in performing the surgical technique in question and may also improve participation and compliance among surgeons, many of whom have a strong preference for one surgical technique over another. Furthermore, by obviating the learning-curve issue, investigators can be more confident that differences between treatment groups are not attributable to surgeon inexperience. The design also avoids the potential bias introduced by unblinded surgeons in conventional RCTs: because each surgeon performs only the technique in which they are expert, a surgeon’s personal belief in the superiority of a specific procedure can no longer bias the trial, consciously or subconsciously.
Multicenter trials can potentially overcome another common challenge for plastic surgery RCTs: small sample size. In a systematic review of 173 plastic surgery RCTs comparing surgical interventions, the majority (141 of 173, 82%) randomized fewer than 100 patients, and the median number of patients randomized was 43 (mean = 73). In other words, half of the RCTs in this systematic review had fewer than 22 patients in each treatment arm. The distinguishing feature of an RCT, and what makes it the gold standard for comparing interventions, is the randomization of patients to an intervention. With small sample sizes, however, randomization is unlikely to serve its function, which is to balance both known (e.g., patient comorbidities) and unknown prognostic factors evenly between treatment groups. Small RCTs can therefore be misleading; when underpowered and poorly designed, they can even be hazardous, because the mere association with the RCT study design lends them unearned credibility.
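A quick power calculation shows why such small trials are problematic. The sketch below applies the standard two-proportion sample-size formula (two-sided α = 0.05, 80% power); the 20% versus 10% complication rates are hypothetical effect sizes chosen for illustration.

```python
import math
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients per arm to detect a difference between
    proportions p1 and p2 with a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2                      # pooled proportion
    n = ((z_a + z_b) ** 2 * 2 * p_bar * (1 - p_bar)) / (p1 - p2) ** 2
    return math.ceil(n)

# Halving a complication rate from 20% to 10% requires roughly 200
# patients per arm, far more than the median of 43 randomized in total.
print(n_per_arm(0.20, 0.10))
```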
When it is anticipated that enrolling an adequate number of patients at a single site will be unlikely, investigators should endeavor to collaborate with other sites and use a multicenter trial design to increase the sample size. A multicenter RCT carries the same requirements as a single-center RCT, and each site must follow the same protocol, with identical inclusion and exclusion criteria, randomization strategies, interventions, and methods for collecting and evaluating the outcomes. Given that plastic surgery RCTs comparing surgical interventions have generally had small sample sizes, there is a great opportunity for the field to engage in further collaboration and produce larger, more definitive multicenter trials.
RCTs are not appropriate for answering clinical questions about harm. Plastic surgeons often encounter patients who face potentially harmful exposures, whether surgical or medical interventions or environmental agents. For example, do all types of textured breast implants cause breast implant-associated anaplastic large cell lymphoma (BIA-ALCL)? There are three reasons why RCTs are not ideal for determining whether a potentially harmful intervention truly has deleterious effects. First, it would be unethical to randomize patients to an intervention that we suspect has a harmful effect. Second, using an RCT to study a harmful event that occurs at a rate of less than 1 in 100 patients is logistically difficult and prohibitively expensive, owing to the extraordinarily large sample size and long-term follow-up required. Third, RCTs often fail to adequately report information on harm.
Given the limitations of RCTs in answering questions about harm, alternative strategies involving observational study designs are used to minimize bias. The two main types of observational studies are cohort and case–control studies. The cohort design is similar to an RCT but without randomization; instead, patient or physician preference, or sometimes happenstance, determines whether the patient received the exposure of interest. Although there is a widespread belief that observational studies are less valid and often overestimate the magnitude of treatment effects, several reports suggest that well-conducted observational studies can provide the same level of internal validity as randomized trials.
Cohort studies evaluate a group, or cohort, at risk for disease. The evaluation may occur over time or at a single point in time; the latter is described as a cross-sectional cohort study and is often used to collect data on the prevalence of disease. The second common use of cohort studies is to analyze the exposures that put subjects at risk for disease or the interventions that reduce that risk. This requires an evaluation of the cohort at a minimum of two points in time. The study may be designed prospectively (Fig. 11.3) or retrospectively (Fig. 11.4). In the prospective cohort study, the investigator develops a hypothesis about variables that may impact the outcome under investigation, identifies exposed and nonexposed groups of patients (each being a cohort), collects data about their risk factors, and then follows the cohorts forward in time, monitoring for development of the outcome of interest. For example, a prospective cohort study was used to compare the incidence of lymphedema after mastectomy in women with or without breast reconstruction. The results support the conclusion that breast reconstruction does not increase the risk of lymphedema at 10 years of follow-up.
While prospective studies lend credence to causal associations, they are expensive and generally require years of follow-up. Rare outcomes or those that take a long time to develop affect the feasibility of cohort studies. Retrospective studies have similar goals, but identify the cohort at risk in the present and investigate the risk factors or exposures that occurred in the past. This is often done by subject recall or chart analysis, both of which are flawed when compared to prospective data collection. Data collected in this manner are more likely to be incomplete, inaccurate, inconsistent, or subject to recall bias. On the other hand, the major advantages of retrospective studies are that they can be done in a relatively short timeframe and are much less costly than prospective studies.
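The core analysis of a cohort study is straightforward: compare the incidence of the outcome in exposed and unexposed groups as a relative risk. A minimal sketch with hypothetical counts:

```python
# Hypothetical prospective cohort: incidence of an outcome in
# exposed vs. unexposed patients (all counts are illustrative).
exposed_events, exposed_total = 24, 200
unexposed_events, unexposed_total = 12, 200

risk_exposed = exposed_events / exposed_total        # 0.12 (12%)
risk_unexposed = unexposed_events / unexposed_total  # 0.06 (6%)
relative_risk = risk_exposed / risk_unexposed
print(f"RR = {relative_risk:.1f}")  # 2.0: exposure doubles the risk
```

Because a cohort starts from exposure and observes incidence directly, the relative risk is interpretable here, something the case–control design discussed next cannot provide.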
Case–control studies also assess associations between exposures and outcomes. Case–control studies (Fig. 11.5) are an alternative to cohort studies and rely on the initial identification of cases. The first group (cases) is selected by the presence of the disease (or outcome) of interest, and the second group (controls) comprises individuals who do not have the disease (or outcome). The major strength of the case–control study is in the investigation of rare conditions or outcomes that take an extended time to occur. Its weaknesses lie in its inability to assess the incidence or prevalence of disease and in its increased susceptibility to bias. Controls may be concurrent or historic, matched (on key variables such as age and sex) or unmatched.
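Because the investigator fixes the numbers of cases and controls by design, a case–control study cannot yield incidence or a relative risk; the exposure–outcome association is instead expressed as an odds ratio from the 2×2 table. A minimal sketch with hypothetical counts:

```python
import math

# Hypothetical case-control 2x2 table (all counts are illustrative):
#               cases  controls
# exposed        a=30     b=70
# unexposed      c=10     d=90
a, b, c, d = 30, 70, 10, 90

odds_ratio = (a * d) / (b * c)                # cross-product ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # Woolf's formula
ci = (math.exp(math.log(odds_ratio) - 1.96 * se_log_or),
      math.exp(math.log(odds_ratio) + 1.96 * se_log_or))
print(f"OR {odds_ratio:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```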
As defined above, cases are simply a collection of subjects identified by the presence of a disease or outcome. In our literature, these are frequently reported based on a specific intervention. If the collection of subjects is large and consecutive (a case series), it may provide valuable information about indications and contraindications for surgery and about expected outcomes. If the cases are nonconsecutive (case reports), concerns arise about sampling bias, such as patient self-selection or surgeon recall bias, which may dilute the study’s ability to capture the true indications and outcomes. The value of case reports lies in their ability to present a new idea, a new surgical approach, or a refinement of existing surgical techniques, and to communicate rare adverse events.
From a practical point of view, although most researchers would like to jump directly to an RCT as the best study design, we are constrained by the previous work that has been done and the evidence that exists. If there is no available evidence on a topic, investigators should start with (i) a case report. This should then progress to (ii) a case series and, if the new innovation is adopted by a large number of surgeons, further research should eventually lead to (iii) comparative studies (i.e., cohort studies or, preferably, RCTs).