Data, statistics and clinical trials


Statistics is the science of learning from data – from collection and organisation through to analysis, presentation and dissemination. Like all sciences, it has its own vocabulary and can sometimes appear somewhat impenetrable to the uninitiated. This chapter gives an overview of statistical processes and methods, but readers are advised to consult more detailed texts on medical statistics for further information.

Whenever data are collected, in a more or less systematic fashion, statistics can be produced: How many things? What size? How old? The science of statistics is concerned with turning this information into something useful. Generally, this is either to describe the things we are measuring or to make some inference or prediction from them. Often within medicine there is a question attached – commonly of the form, ‘Is one group somehow different from another group?’

Types of data

The type of data collected makes a big difference to what can be done with them using statistics.

At the most basic level, it is fairly straightforward to count things. How many patients died? How many were sick after surgery? How many of the patients who were sick were women? Sometimes these are simple categories with no order or value. Apples and oranges are not usually described as better or larger than one another; they are just categories, or names, of fruit. Male/female and blood groups A/B/O are all simple categorical or nominal data. The categories may be somewhat arbitrarily defined, with the possibility of overlap. Clear rules are therefore needed to define what goes in which category.

Sometimes the categories may have an order – mild/moderate/severe pain has a natural order as does easy/difficult/impossible mask ventilation. These are called ordinal data. Some of these data lend themselves to having numbers attached, but these numbers are no more than labels for ordinal data. The Glasgow Outcome Scale has five categories, from 1 (dead) through to 5 (good recovery). Clearly, 1 is a worse outcome than 5, but there is no suggestion that the intervals between the numbers are the same. However, because the data have an order, the median value has some meaning. The special cases of rating scales (such as pain and anxiety scales) are discussed later.

Data that correspond to measurements of physical constructs are usually amenable to the use of interval and ratio scales. An interval scale is an ordered sequence of numbers in which there is a constant interval between each point in the scale. For instance, the difference in temperature between 1°C and 2°C is the same as between 101°C and 102°C. However, the zero point is arbitrary, so ratios are not appropriate – 100°C is not twice as hot as 50°C. A ratio scale is a type of interval scale where there is a true zero – negative numbers cannot exist. For example, there is no temperature below 0 Kelvin, and there is no such thing as a negative length. This does allow ratios to be used; 100 Kelvin is twice as hot as 50 Kelvin, and someone who is 2 m tall is twice the height of someone who is 1 m tall. Interval and ratio data can be described using the mean, although this may not always be appropriate.

Summarising data

When describing data, it is often helpful to have some idea of a representative value – the average, in common parlance. In statistical terms, this is a value that describes the central tendency of a set of data. In general there are three types of average: mean, median and mode. Within this chapter the term average is used commonly and deliberately to encompass any of these.

  • Mode is simply the commonest value. It is used for categorical data.

  • Median is the middle value (or halfway between the two middle values if there is an even number of data points). It is used for ordinal data.

  • Mean is the representative value used for interval data. The arithmetic mean is most commonly used, but it is only one of three Pythagorean means, the other two being the geometric and harmonic means.

Arithmetic mean (AM) is the sum of all the values divided by the number (n) of values (x_1 … x_n). In algebraic notation:


$$AM = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

The geometric mean (GM) is found by multiplying all the numbers together and then taking the nth root.


$$GM = \sqrt[n]{\prod_{i=1}^{n} x_i} = \sqrt[n]{x_1 x_2 \cdots x_n}$$

The geometric mean is used when factors have a multiplicative effect and we want to find the average effect of these. It is most commonly used in finance (average interest rates) but is used in medicine when the mean of logarithmically transformed values is used.

The harmonic mean (HM) has a rather wordy definition: it is the reciprocal of the arithmetic mean of the reciprocals of the values.


$$HM = \frac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$

It is the most appropriate mean for comparing rates (such as speeds) or ratios. It is also used when calculating the effect of parallel resistances.

There are some other special means, of which the root mean square (RMS) or quadratic mean (QM) is perhaps the most widely quoted.


$$QM = \sqrt{\frac{1}{n}\left(x_1^2 + x_2^2 + \cdots + x_n^2\right)}$$

This is used to describe the average value of a varying quantity such as a sine wave. Most notably, it is the method used to describe the average voltage of an alternating current (AC) supply.
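As an illustration, the short sketch below (Python, standard library only; the values are arbitrary examples rather than real data) computes the four means described above.

```python
# Minimal sketch of the four means described above; the values are invented for illustration.
import math

values = [2.0, 4.0, 8.0]
n = len(values)

am = sum(values) / n                             # arithmetic mean
gm = math.prod(values) ** (1 / n)                # geometric mean: nth root of the product
hm = n / sum(1 / x for x in values)              # harmonic mean: n over the sum of reciprocals
qm = math.sqrt(sum(x ** 2 for x in values) / n)  # quadratic (root mean square) mean

print(f"AM = {am:.2f}, GM = {gm:.2f}, HM = {hm:.2f}, QM = {qm:.2f}")
# For any set of positive values, HM <= GM <= AM <= QM, with equality only if all values are identical.
```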

The means give some idea of the typical value, but it is usually helpful to have some idea of the spread of the values. At the simplest level, the range, or maximum and minimum, give an idea of the spread. Similarly, the interquartile range (25th and 75th centiles) describes the middle 50% of the dataset. The standard deviation (SD) is a useful description of the variation around the mean because it can be manipulated in statistical tests. It is defined as the square root of the variance. If we have data from the whole population (which is relatively uncommon), then the population SD is:


$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

where N is the number of items in the population, and µ is the population mean.

If we only have a sample, then the sample SD is given by:


$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$$

where x̄ is the sample mean.

The use of N − 1 rather than N is known as Bessel's correction. It is needed because the spread within a sample tends to underestimate the spread within the whole population, and dividing by N − 1 corrects this bias. Clearly, as the sample becomes larger, the effect of using N − 1 becomes smaller.
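A short sketch (assuming Python with the NumPy library; the weights are invented) shows the two formulas side by side.

```python
# Population SD (divide by N) versus sample SD with Bessel's correction (divide by N - 1).
import numpy as np

sample = np.array([70.0, 82.0, 65.0, 90.0, 74.0])  # hypothetical weights (kg)

population_sd = np.std(sample, ddof=0)  # divides by N: appropriate only for a complete population
sample_sd = np.std(sample, ddof=1)      # divides by N - 1: Bessel's correction for a sample

print(f"sigma (divide by N): {population_sd:.2f}   s (divide by N - 1): {sample_sd:.2f}")
# The sample formula always gives the slightly larger value; the difference shrinks as N grows.
```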

Sampling

It is relatively unusual to measure the whole population of interest because it is usually impractical, expensive and unnecessary. In most situations a sample is taken from the population and the items of interest recorded. Generally it is hoped that the sample represents the total population as closely as possible. If the sample is a truly random selection from the population, then quite robust inferences about the whole population can be made. The randomness of the selection is very important; if the sample is not truly random, then the statistical models used generally do not work well.

Probability

If an event or measurement is variable, then whenever we measure it, it could have one of a range of values. Probability is simply the proportion of times that the value (or range of values) occurs. Probability is the chance of something occurring and always lies between 0 (never occurs) and 1 (always occurs). If one out of every hundred people is allergic to penicillin, then the probability of meeting someone allergic to penicillin is 1 in 100, or 0.01.

The probabilities of exclusive events are additive, and the sum of all mutually exclusive probabilities must always equal 1. In other words, if the probability of the patients on the emergency theatre list being from gynaecology wards is 0.3 and the probability of them being from general surgical wards is 0.5, then the probability of them being from neither specialty is 0.2 ( 1 – ( 0.3 + 0.5 )). If probabilities are independent (i.e. the probability of one event is not affected by the outcome of another), then they can be multiplied. For instance, the probability of a general surgical patient being female might be 0.6, in which case the probability of meeting a female general surgical patient on the list would be 0.5 × 0.6 = 0.3. Sometimes we are interested in relative probabilities; what is the chance of something occurring in one group compared with the chance of something occurring in another group? There are various methods used to describe this, which are discussed later in this chapter.
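The arithmetic in this example is easy to check; a trivial sketch (plain Python, using the probabilities quoted above) is shown below.

```python
# Mutually exclusive probabilities add; the joint probability is found by multiplication.
p_gynae = 0.3
p_general_surgery = 0.5

p_neither = 1 - (p_gynae + p_general_surgery)        # mutually exclusive specialties sum to 1
p_female_given_general = 0.6
p_female_general_patient = p_general_surgery * p_female_given_general

print(f"P(neither) = {p_neither:.2f}, P(female general surgical patient) = {p_female_general_patient:.2f}")
# Prints 0.20 and 0.30, matching the figures in the text.
```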

Data distributions

Often, there is apparently random variation in events or processes. If we toss a coin, it falls randomly either heads or tails; similarly, a dice will land on any number between one and six at random. If we measure the weights of children attending for surgery, they will vary at random around some central value. The probability of any one value, or range of values, is described by the probability distribution. Although medicine often refers to the normal, or Gaussian, distribution, this is only one of many possible probability distributions.

Uniform distributions

The mean value from a six-sided dice throw is 3.5; if you throw the dice many times and divide the total score by the number of throws, the result will be close to 3.5. However, the probability of throwing numbers close to the mean (3 or 4) is the same as the probability of throwing numbers at the extremes (1 or 6). In fact, the probability of any number is the same (1 in 6) – a uniform distribution. Uniform distributions are used to generate random numbers – the chance of any particular value is the same. In theory the probability of being on call on a particular day is also a uniform distribution, provided there are no special rules.

Non-uniform distributions

Most biological data come from non-uniform distributions. Often, these are centred on the average, and the probability of a particular value is greater if it is closer to the average. Extreme values are possible but less likely. The most well-known of these non-uniform distributions is the normal distribution – so called because most normal events or processes approximate to it (or can be transformed in some way to approximate it). It is also known as the Gaussian distribution after the polymath Carl Friedrich Gauss (also of Gauss lines in MRI). This distribution is defined mathematically by two values (parameters) – the mean and the standard deviation. The probability frequency distribution is a bell-shaped curve. The mode, median and mean are identical, and the degree of spread is governed by the SD. It is important to understand that the normal distribution is only one of an infinite number of bell-shaped curves. A bell-shaped probability distribution is not necessarily normal.

The normal distribution has some useful properties, not least that it is possible to calculate the probability of finding a range of values based solely on the SD ( Fig. 2.1 ). The 97.5th centile of the normal distribution is 1.96 SD away from the mean. Therefore the probability of finding a value more extreme than this (there are two sides to the distribution) is 5%. Conversely, 95% of values, if selected at random, would be expected to be found within ±1.96 SD of the mean. Similarly, 68.3% of values would be expected to be within 1 SD on either side of the mean.

Fig. 2.1, A normal distribution curve, with a mean of zero and standard deviation of one. The unshaded areas which are greater than 1.96 standard deviations above and below the mean each encompass 2.5% of the population. The shaded area (mean ± 1.96 standard deviations) therefore covers 95% of the population.
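These percentages can be checked numerically; the sketch below (assuming Python with the SciPy library) evaluates the corresponding areas under the standard normal curve.

```python
# Areas under the standard normal curve, matching the figures quoted in the text.
from scipy.stats import norm

within_196_sd = norm.cdf(1.96) - norm.cdf(-1.96)  # area between -1.96 and +1.96 SD
within_1_sd = norm.cdf(1.0) - norm.cdf(-1.0)      # area between -1 and +1 SD

print(f"within 1.96 SD of the mean: {within_196_sd:.3f}")  # ~0.950
print(f"within 1 SD of the mean:    {within_1_sd:.3f}")    # ~0.683
```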

Sometimes, distributions are skewed – the mean, median and mode are not identical. If the long tail is to the right, this is termed a right (positive) skew; if the long tail is pulled to the left, it is a left (negative) skew. In general, for a left-skewed distribution, the mean is less than the median and both are less than the mode. The converse holds for right-skewed distributions (Fig. 2.2).

Fig. 2.2, Different types of frequency distribution curve. Solid line: right/positive skew distribution.

Skew distributions are not so easy to handle with statistical tests, but often the data can be transformed to create a normal distribution. Commonly, taking the logarithm of the data will transform mildly skewed data into a normal distribution.
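As a rough illustration (assuming Python with NumPy and SciPy; the data are simulated, not real measurements), a log transform pulls a right-skewed sample towards symmetry.

```python
# Log-transforming simulated right-skewed data reduces the skewness towards zero.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=3.0, sigma=0.8, size=1000)  # right-skewed sample (e.g. lengths of stay)

print(f"skewness before transform:    {skew(raw):.2f}")          # clearly positive
print(f"skewness after log transform: {skew(np.log(raw)):.2f}")  # close to zero
```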

Chi-squared distribution

The chi-squared distribution (χ²) is most commonly encountered as the basis for comparing observed with expected numbers of events. However, it is also fundamental to the t-test and analysis of variance (ANOVA). It is defined by a single parameter, k, the number of degrees of freedom in the data.

Inferring information from a sample

If we were to measure some aspect of a complete population (e.g. the weight of every member of the anaesthetics department), we would be able to state with absolute confidence what the average and spread of that value were. If we measured everyone but one, we would be very confident, but there would be some error in our estimate of the average. If we measured only a few (selected at random), then we would still have an estimate of the average, but we would be less certain still about exactly what it was. Using various statistical tests (see later) we can quantify the degree of confidence that we have in our estimate of the population average.

Bias

All of the previously stated assumptions about sampling from a population assume that the sample is taken at random and is therefore representative of the whole population. However, there are many sources of bias which can invalidate this assumption. Some may be caused by the design of the experiment, some by the behaviour (conscious or unconscious) of the investigator or the subject of the investigation (e.g. a patient or volunteer). No study has ever been conducted without bias somewhere. The role of investigators, regulators, the research community, funders and end users of research is to minimise this bias and account for it as far as possible.

Selection bias

Topic selection

The questions asked by researchers are a complex function of their interests and skills, the resources available and their ability to attract sufficient funding. It is widely recognised that there is a distortion of research funding. Pharmaceutical companies have a legitimate interest in research in their product areas, but these may not be the most beneficial for patients overall; charities target their resources, and government funders may sometimes follow political rather than healthcare imperatives.

Population selection

Some patient groups are easier to study than others, but the findings in one patient population may not be applicable to others. Within anaesthesia, this is perhaps most evident in the relative lack of pharmacological studies in the very young and the very old. Similarly, most clinical studies are run from large teaching hospitals, whereas most patients are treated in smaller hospitals. Outcomes are not necessarily the same for these groups – although not always better in the larger hospitals.

Inclusion/exclusion bias

Even if the appropriate population is studied, there is always a risk that the sample itself will be unrepresentative of the population. Some individuals or groups of patients may be more likely to be approached for involvement in a research study, and some may be more likely to consent or refuse.

Methodological bias

Head-to-head comparisons may be deliberately or accidentally set up to favour one group over another. A study of an adequate dose of a new oral opioid compared with a small dose of paracetamol (acetaminophen) is likely to demonstrate better analgesia with the opioid. Other more subtle biases are common in many anaesthesia and pain research studies.

Outcome bias

Detection bias

If an investigator (or patient) has an opinion about the relationship between group membership and outcome, then an outcome may be sought, or reported, more readily in one group than another. This may be conscious or unconscious. Most anaesthetists believe that difficult intubation is more common in pregnant women. A simple survey of difficult intubation is likely to reinforce this finding because this is the group in which cases are most likely to be sought. Similarly, because of preconceptions about the relative effectiveness of regional anaesthesia, a patient who has received regional anaesthesia may be more likely to report good analgesia than one allocated to receive oral analgesics.

Missing outcomes

It is not possible to measure every outcome in a study, so the investigator has to make a decision about which ones to choose. If an important variable is not measured, this may lead to a biased perception of the effects of a treatment.

Reporting bias

Research with positive results is more likely to be published in high-quality journals, and negative studies are less likely to be published at all. Some of this is bias from the journals, and some of it is bias by researchers who choose not to submit negative findings. Occasionally, commercial organisations restrict publication of studies which do not portray a favourable view of their product. The effect of this is that there is a bias in the literature in favour of positive studies. To use a coin tossing example, if researchers only ever published data when they got six or more heads in a row, the literature would soon be awash with data suggesting that the coins were biased. Subtler is the non-reporting of measured outcomes. A study which finds a positive effect in a relatively minor outcome may fail to report a neutral or negative effect on a major outcome. This is extremely hard to detect because it relies on transparency from investigators about what they measured. Outright fraud is still thought to be relatively rare, but there are several high-profile cases of researchers fabricating or manipulating data to fit their beliefs and even publishing studies that never took place.

Testing

Within medicine and anaesthesia, professionals strive to achieve the best outcomes possible for patients. It is therefore very common to ask a question of the form ‘Is the outcome in group A better than in group B?’ We already know that if we take a sample from a population (e.g. the total population of group A) we will be able to estimate the true population average. If we do the same for group B, this will provide an estimate for population B. Because of simple random variation, the average value for A and B will always be different, provided we measure them with sufficient precision. What we really want to know is how confident we are that any differences that we see are not just due to chance. As shown in Fig. 2.3 , if the degree of separation of the two groups is small or the spread of either is relatively large, then there is a reasonable chance that the average estimated for group B could have come from population A. Conversely, if the separation is larger or the spread is smaller, the chance of the estimated average for B being found in population A is small (but never zero).

Fig. 2.3, The effect of changing mean and standard deviation on overlap of frequency distributions. (A) Solid line – mean 10, standard deviation 1; dotted line – mean 20, standard deviation 1. (B) Solid line – mean 10, standard deviation 4; dotted line – mean 20, standard deviation 4. (C) Solid line – mean 10, standard deviation 1; dotted line – mean 12, standard deviation 1.

This is the fundamental principle behind most statistical testing: What is the probability that the result found has occurred simply by chance? Note that this does not mean that the result could definitely not have occurred by chance, just that it is sufficiently unlikely to support the hypothesis that the groups really are different. By way of a simple example, the probability of tossing six heads in a row with an unbiased coin is 0.5⁶ (0.0156), or 1 in 64. This is by definition unlikely, so you would be suspicious of the coin being biased towards heads. However, it is clearly not impossible and does occur (1 in 64 times, on average).
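The coin-toss figure is simple to reproduce; the two lines below (plain Python) show the calculation.

```python
# Probability of six heads in a row with a fair coin.
p_six_heads = 0.5 ** 6
print(p_six_heads, 1 / p_six_heads)  # 0.015625, i.e. 1 in 64
```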

An important principle with statistical testing is the concept of paired tests. There is inherent variability between things being measured – people, times, objects. This variability between things increases the spread of values we might measure, making it harder to demonstrate a difference between groups. However, if we measure the difference within an individual, then the difference may be easier to find. To take a trivial example, if we want to see whether fuel consumption is better with one fuel compared to another, we might take 20 cars with one fuel and compare them with 20 cars with another – an unpaired test. However, the 40 cars will all have other factors influencing fuel consumption, and this variability will hinder our ability to detect a difference. If we take 20 cars and test them with both fuels (choosing at random which fuel to test first), then we have a much better chance of demonstrating a difference. A paired test is one in which the results of some intervention are tested within the individuals in the group, not between groups. Within anaesthesia, paired tests are relatively unusual because we do not normally have situations in which we can test more than one thing on one individual. There are some examples, such as physiological experiments studying the effects of drugs during anaesthesia.
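The car-fuel example can be simulated; the sketch below (assuming Python with NumPy and SciPy, and using the paired and unpaired t-tests described later in this chapter; the consumption figures are invented) shows how the between-car variability swamps an unpaired comparison but not a paired one.

```python
# The same small fuel effect is much easier to detect with a paired test than an unpaired one.
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(1)
baseline = rng.normal(40, 8, size=20)                # each car's own typical consumption (very variable)
fuel_a = baseline + rng.normal(0, 1, size=20)        # the 20 cars tested on fuel A
fuel_b = baseline + 1.5 + rng.normal(0, 1, size=20)  # the same cars on fuel B: a small, consistent improvement

print(f"unpaired P = {ttest_ind(fuel_a, fuel_b).pvalue:.3f}")  # often non-significant
print(f"paired P   = {ttest_rel(fuel_a, fuel_b).pvalue:.4f}")  # usually far smaller
```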

When data (or their transforms) can be legitimately modelled as coming from a defined probability distribution, they are described as being from a parametric distribution. Most commonly, this is the normal distribution, but binomial, Poisson, and Weibull are all defined distributions. Data from undefined probability distributions are non-parametric. In general, parametric statistical tests are more powerful than non-parametric tests and so should be used if appropriate. Non-parametric tests rely on far fewer assumptions and are considered more robust. They can also be used on parametric data.

Before performing any statistical tests, there are a few simple rules and questions which reduce the likelihood of applying the wrong test or misinterpreting the results.

  • What question do you want to answer?

  • What type of data do you have? Categorical data require a different approach to ordinal or interval data. Survival analysis, comparison of measurement techniques, and others will require different approaches.

  • Plot the data using scatter plots and frequency histograms. Summary statistics may completely hide a skewed distribution or bimodal data.

  • Are there any obvious erroneous data? Transcription errors are fairly common, so always check that the data are accurate.

  • Are the data paired or unpaired?

  • Can a parametric test be used? If not, can the data be transformed so that a parametric test can be used?

  • Is there an element of multiple testing?

A flow chart suggesting rules to guide selection of tests is shown in Fig. 2.4 . A brief summary of the tests described is given in the next section. The flow chart is not an exhaustive list; there are many other tests and situations not covered, but researchers should always explain if they feel the need to use more obscure approaches.

Fig. 2.4, Flow chart suggesting rules to guide selection of statistical tests. ANOVA, Analysis of variance; ROC, receiver operating characteristic.

Chi-squared test

The χ² test compares the expected number of events with the actual numbers. Data are tabulated in a contingency table; 2 × 2 tables are the simplest, but larger tables can be used. In general, the number of degrees of freedom is (the number of columns in the table − 1) × (the number of rows − 1); for a 2 × 2 table, the degrees of freedom would be (2 − 1) × (2 − 1) = 1.


$$\chi^2 = \sum_{i=1}^{n}\frac{(O_i - E_i)^2}{E_i}$$

If the expected frequencies are not known in advance (e.g. 50/50 for male/female), the expected value for each cell (E_i) is determined by the number of observed events in that cell's row multiplied by the number of events in that cell's column, divided by the total number of events. An example is shown in Table 2.1. The expected frequency for cell A is (A + B) × (A + C)/N.

Table 2.1
Example of a contingency table for calculation of χ²

                    With treatment    Without treatment    Row marginals
Good outcome        A                 B                    E = A + B
Bad outcome         C                 D                    F = C + D
Column marginals    G = A + C         H = B + D            N (total) = A + B + C + D

The calculated χ² statistic is then compared with a table of probability values for χ² for each degree of freedom. This gives a probability that the observed frequencies came from the same population as the expected frequencies.

A worked example is shown in Table 2.2. The expected columns are inserted to show the calculations. The basic table is 2 × 2. χ² is the sum of the (O − E)²/E values for each cell:


$$\frac{(15 - 10.9)^2}{10.9} + \frac{(40 - 44.1)^2}{44.1} + \frac{(6 - 10.1)^2}{10.1} + \frac{(45 - 40.9)^2}{40.9} = 4.0$$

Table 2.2
A worked example of a contingency table for calculation of χ²

                    WONDER DRUG                        PLACEBO                            Row marginals
                    Observed   Expected                Observed   Expected
Lived               15         10.9                    6          10.1                    21
                               = (21 × 55)/106                    = (21 × 51)/106
Died                40         44.1                    45         40.9                    85
                               = (85 × 55)/106                    = (85 × 51)/106
Column marginals    55                                 51                                 106

The critical value from the χ² tables for P = .05 and one degree of freedom is 3.84. The χ² is greater than this, making it unlikely that the distribution of data comes from a single population.

There are some standard assumptions with χ² tests. There should be sufficient samples; the expected cell counts should all be >5 in 2 × 2 tables, or >5 in at least 80% of cells in larger tables. If these conditions are not satisfied, then alternative approaches are used, such as Yates’ continuity correction or Fisher's exact test.
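A short sketch (assuming Python with SciPy) reproduces the worked example from Table 2.2; note that correction=False is needed to match the hand calculation, because the function applies Yates’ correction to 2 × 2 tables by default.

```python
# Chi-squared test on the Table 2.2 data (wonder drug vs placebo, lived vs died).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[15, 6],    # lived: wonder drug, placebo
                     [40, 45]])  # died:  wonder drug, placebo

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi-squared = {chi2:.2f}, degrees of freedom = {dof}, P = {p:.3f}")  # ~4.0, 1, just under .05
print(expected)  # matches the expected counts shown in Table 2.2
```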

Rank tests

If data are ranked in order, then if two groups are sufficiently different, we would expect the sum of these ranks to be much smaller for one group than the other. The bigger the groups, the smaller the difference in the sum of these ranks which we would accept. This concept is the basis for the Wilcoxon signed rank and Mann–Whitney U tests. Essentially the sum of the ranks, corrected for the sample sizes, is compared with a table of probabilities calculated from the U distribution. The Wilcoxon signed rank and Mann–Whitney U (or Wilcoxon rank sum) tests are non-parametric tests used for paired or independent (unpaired) samples, respectively. Because these rank tests do not perform any statistical tests on the values themselves, they are robust to the presence of outliers. It does not matter how extreme a particular value is; only its rank is important.
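A minimal sketch of both tests (assuming Python with SciPy; the pain scores are invented purely for illustration) is shown below.

```python
# Mann-Whitney U for independent groups; Wilcoxon signed rank for paired measurements.
from scipy.stats import mannwhitneyu, wilcoxon

group_a = [3, 5, 4, 6, 2, 7, 5, 4]  # e.g. pain scores with treatment A
group_b = [6, 8, 7, 5, 9, 6, 8, 7]  # e.g. pain scores with treatment B

print(mannwhitneyu(group_a, group_b))  # unpaired: two separate groups of patients
print(wilcoxon(group_a, group_b))      # paired: the same patients measured under both treatments
```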

t-Tests

Provided that the data approximate closely enough to a normal distribution, t-tests provide a powerful method for assessing differences between two groups. Unlike the rank tests, the t distribution uses the data values themselves – the calculation uses the means, standard deviations and sample sizes of the two groups. Although it can be done by hand, it is more robust to use one of the myriad statistical software packages available.

Rating scales are extremely popular in anaesthetics research – pain, nausea, satisfaction and anxiety are all commonly measured using verbal, numerical or visual analogue scales. The safest way to analyse these is using non-parametric statistical tests because these avoid any assumptions about the interval between points. However, in practice, many researchers assume the data behave as though normally distributed and analyse them using t-tests.

Multiple testing

Sometimes we may want to look for differences between multiple groups or at multiple times within groups. In this case the obvious answer might be to perform several tests. However, this leads to a problem about probability. If we had three groups (A, B, C), there would be three comparisons: A–B, A–C and B–C. The chance of obtaining a positive finding purely by chance is now rather greater. As the number of comparisons increases, the number of chance findings will inevitably increase. If we perform 20 experiments, we would expect one of these to be positive with P < .05 purely by chance.

There are numerous approaches to this problem. One is simply to adjust the P values for the individual tests so that the overall probability of a false-positive finding remains correct. This is the principle of the Bonferroni correction. It is simple to calculate (the significance threshold for each comparison is approximated by the overall threshold divided by the number of comparisons), but it may be too conservative in some situations. An alternative approach, which is widely used (and abused), is to employ the family of statistical tests known as ANOVA (ANalysis Of VAriance). When properly constructed, ANOVA can provide information about the likelihood of significant variation between or within groups. ANOVA requires various assumptions about the distribution of the data, including normality. The non-parametric equivalent is the Kruskal–Wallis test. When applied to two groups, ANOVA is equivalent to a t-test and the Kruskal–Wallis test to a Mann–Whitney U test. If these tests suggest a significant difference, there are various post hoc tests used to identify where the difference lies without falling foul of the multiple testing issues described earlier. Tukey's honestly significant difference test is commonly used in medical research.
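The sketch below (assuming Python with NumPy and SciPy; the data are simulated) illustrates the ideas in this section: a Bonferroni-adjusted threshold for the three pairwise comparisons, a one-way ANOVA across all three groups and its non-parametric equivalent, the Kruskal–Wallis test.

```python
# Multiple comparisons on three simulated groups: Bonferroni-adjusted t-tests, ANOVA and Kruskal-Wallis.
import numpy as np
from scipy.stats import f_oneway, kruskal, ttest_ind

rng = np.random.default_rng(2)
a = rng.normal(10, 2, size=30)
b = rng.normal(10, 2, size=30)
c = rng.normal(12, 2, size=30)  # group C genuinely differs from A and B

alpha = 0.05 / 3                # Bonferroni: three pairwise comparisons
for name, (x, y) in {"A-B": (a, b), "A-C": (a, c), "B-C": (b, c)}.items():
    p = ttest_ind(x, y).pvalue
    print(name, f"P = {p:.4f}", "significant" if p < alpha else "not significant")

print(f"one-way ANOVA  P = {f_oneway(a, b, c).pvalue:.4f}")
print(f"Kruskal-Wallis P = {kruskal(a, b, c).pvalue:.4f}")
```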
