Statistical methodologies in laboratory medicine: Analytical and Clinical Evaluation of Laboratory Tests


Abstract

Background

The careful selection and evaluation of laboratory tests are key steps in the process of implementing new measurement procedures in the laboratory for clinical use. Method evaluation in the clinical laboratory is complex and in most countries is a regulated process guided by various professional recommendations and quality standards on best laboratory practice.

Content

This chapter deals with the statistical aspects of both analytical and clinical evaluations of laboratory assays, tests, or markers. After a short overview on basic statistics, aspects such as accuracy, precision, trueness, limit of detection, and selectivity are considered in the first part. After dealing with comparison of assays in detail, including using difference plots and regression analysis, the focus is on quantification of the (added) diagnostic value of laboratory assays or tests. First, the evaluation of tests in isolation is outlined, which corresponds to simple diagnostic scenarios, when only a single test result is decisive (e.g., in the screening context). Subsequently, the chapter addresses the more common clinical situation in which a laboratory assay or test is considered as part of a diagnostic workup and thus a test’s added value is at issue. This involves use of receiver operating characteristic (ROC) areas, reclassification measures, predictiveness curves, and decision curve analysis. Finally, principles for considering the clinical impact of diagnostic tests on actual decision making and patient outcomes are discussed.

Assay selection overview

The introduction of new or revised laboratory tests, markers, or assays is a common occurrence in the clinical laboratory. Test selection and evaluation are key steps in the process of implementing new measurement procedures ( Fig. 2.1 ). A new or revised test must be selected carefully and its analytical and clinical performance evaluated thoroughly before it is adopted for routine use in patient care (see later in this chapter and Chapter 10 ). Establishment of a new or revised laboratory test may also involve evaluation of the features of the automated analyzer on which the test will be implemented. When a new test is to be introduced to the routine clinical laboratory, a series of technical or analytical evaluations is commonly conducted. Assay imprecision is estimated, and comparison of the new assay versus an existing one is commonly undertaken. The allowable measurement range is assessed with estimation of the lower and upper limits of quantification. Interferences and carryover are evaluated when relevant. Depending on the situation, a limited verification of manufacturer claims may be all that is necessary, or, in the case of a newly developed test or assay, a full validation may be carried out. Subsequent subsections provide details for all these test evaluations. With regard to evaluation of reference intervals or medical decision limits, readers are referred to Chapter 9 .

FIGURE 2.1, A flow diagram that illustrates the process of introducing a new assay into routine use.

Evaluation of tests, markers, or assays in the clinical laboratory is influenced strongly by guidelines and accreditation or other regulatory standards. The Clinical and Laboratory Standards Institute (CLSI, formerly the National Committee for Clinical Laboratory Standards [NCCLS]) has published a series of consensus evaluation protocols for clinical chemistry laboratories and manufacturers to follow when evaluating methods (see the CLSI website at http://www.clsi.org ). The International Organization for Standardization (ISO) has also developed several documents related to method evaluation. In addition, meeting laboratory accreditation requirements has become an important aspect in the evaluation process, with accrediting agencies placing increased focus on the importance of total quality management and assessment of trueness and precision of laboratory measurements. An accompanying trend has been the emergence of an international nomenclature to standardize the terminology used for characterizing laboratory test or assay performance.

This chapter presents an overview of considerations in and methods for the evaluation of laboratory tests. This includes explanation of graphical and statistical methods that are used to aid in the test evaluation process; examples of the application of these methods are provided, and current terminology within the area is summarized. Key terms and abbreviations are listed in Box 2.1 .

BOX 2.1
Abbreviations and Vocabulary Concerning Technical Validation of Assays

Abbreviations
CI Confidence interval
CV Coefficient of variation (= SD/xm, where xm is the mean concentration)
CV% = CV × 100%
CV A Analytical coefficient of variation
CV G Between-subject biological variation
CV I Within-subject biological variation
CV RB Sample-related random bias coefficient of variation
DoD Distribution of differences (plot)
ISO International Organization for Standardization
IUPAC International Union of Pure and Applied Chemistry
OLR Ordinary least-squares regression analysis
SD Standard deviation
SEM Standard error of the mean (= SD/√N)
SD A Analytical standard deviation
SD RB Sample-related random bias standard deviation
xm Mean
xmv Weighted mean
WLR Weighted least-squares regression analysis

Vocabulary a

Analyte Compound that is measured.

Bias Difference between the average (strictly the expectation) of the test results and an accepted reference value (ISO 3534-1). Bias is a measure of trueness.

Certified reference material (CRM) Reference material, one or more of whose property values are certified by a technically valid procedure, accompanied by or traceable to a certificate or other documentation that is issued by a certifying body.

Commutability Ability of a material to yield the same results of measurement by a given set of measurement procedures.

Limit of detection The lowest amount of analyte in a sample that can be detected but not quantified as an exact value. Also called lower limit of detection or minimum detectable concentration (or dose or value).

Lower limit of quantification (LLOQ) The lowest concentration at which the measurement procedure fulfills specifications for imprecision and bias (corresponds to the lower limit of determination mentioned under Measuring interval ).

Matrix All components of a material system except the analyte.

Measurand The “quantity” that is actually measured (e.g., the concentration of the analyte). For example, if the analyte is glucose, the measurand is the concentration of glucose. For an enzyme, the measurand may be the enzyme activity or the mass concentration of enzyme.

Measuring interval Closed interval of possible values allowed by a measurement procedure and delimited by the lower limit of determination and the higher limit of determination. For this interval, the total error of the measurements is within specified limits for the method. Also called the analytical measurement range.

Primary measurement standard Standard that is designated or widely acknowledged as having the highest metrologic qualities and whose value is accepted without reference to other standards of the same quantity.

Quantity The amount of substance (e.g., the concentration of substance).

Random error Arises from unpredictable variations in influence quantities. These random effects give rise to variations in repeated observations of the measurand.

Reference material (RM) A material or substance, one or more properties of which are sufficiently well established to be used for the calibration of a method or for assigning values to materials.

Reference measurement procedure Thoroughly investigated measurement procedure shown to yield values having an uncertainty of measurement commensurate with its intended use, especially in assessing the trueness of other measurement procedures for the same quantity and in characterizing reference materials.

Selectivity or specificity Degree to which a method responds uniquely to the required analyte.

Systematic error A component of error that, in the course of a number of analyses of the same measurand, remains constant or varies in a predictable way.

Traceability “The property of the result of a measurement or the value of a standard whereby it can be related to stated references, usually national or international standards, through an unbroken chain of comparisons all having stated uncertainties.” This is achieved by establishing a chain of calibrations leading to primary national or international standards, ideally (for long-term consistency) the Système International (SI) units of measurement.

Uncertainty A parameter associated with the result of a measurement that characterizes the dispersion of values that could reasonably be attributed to the measurand. More briefly, uncertainty is a parameter characterizing the range of values within which the value of the quantity being measured is expected to lie.

Upper limit of quantification (ULOQ) The highest concentration at which the measurement procedure fulfills specifications for imprecision and bias (corresponds to the upper limit of determination mentioned under Measuring interval ).

a A listing of terms of relevance in relation to analytical methods is displayed. Many of the definitions originate from Dybkær with statement of original source where relevant (e.g., International Organization for Standardization document number). Others are derived from the Eurachem/Citac guideline on uncertainty. In some cases, slight modifications have been performed for the sake of simplicity.

Medical need and quality goals

The selection of the appropriate clinical laboratory assays is a vital part of rendering optimal patient care. Advances in patient care are frequently based on the use of new or improved laboratory tests or measurements. Ascertainment of what is necessary clinically from a new or revised laboratory test is the first step in selecting the appropriate candidate test. Key parameters, such as desired turnaround time and necessary clinical utility for an assay, are often derived by discussions between laboratorians and clinicians. When new diagnostic assays are introduced, for example, reliable estimates of their diagnostic performance (e.g., predictive values, sensitivity, and specificity) must be considered. With established analytes, a common scenario is the replacement of an older, labor-intensive test with a new, automated assay that is more economical in daily use. In these situations, consideration must be given to whether the candidate assay has sufficient precision, accuracy, analytical measurement range, and freedom from interference to provide clinically useful results (see Fig. 2.1 ).

Analytical performance criteria

In evaluation of a laboratory test, (1) trueness (formerly termed accuracy), (2) precision, (3) analytical range, (4) detection limit, and (5) analytical specificity are of prime importance. The sections in this chapter on laboratory test evaluation and comparison contain detailed outlines of these concepts. Estimated test performance parameters should be related to analytical performance specifications that ensure acceptable clinical use of the test and its results. For more details related to the recommended models for setting analytical performance specifications, readers are referred to Chapters 6 and 8 . From a practical point of view, the “ruggedness” of the test in routine use is of importance and reliable performance, when used by different operators and with different batches of reagents over long time periods, is essential.

When a new laboratory analyzer is at issue, various instrumental parameters require evaluation, including (1) pipetting, (2) specimen-to-specimen carryover, (3) reagent lot-to-lot variation, (4) detector imprecision, (5) time to first reportable result, (6) onboard reagent stability, (7) overall throughput, (8) mean time between instrument failures, and (9) mean time to repair. Information on most of these parameters should be available from the instrument manufacturer; the manufacturer should also be able to furnish information on what studies should be conducted in estimating these parameters for an individual analyzer. Assessment of reagent lot-to-lot variation is especially difficult for a user, and the manufacturer should provide this information.

Other criteria

Various categories of laboratory tests may be considered. New tests may require “in-house” development. (Note: Such a test is also referred to as a laboratory-developed test [LDT].) Commercial kit assays, on the other hand, are ready for implementation in the laboratory, often in a “closed” analytical system on a dedicated instrument. When prospective assays are reviewed, attention should be given to the following:

  • 1.

    Principle of the test or assay, with original references

  • 2.

    Detailed protocol for performing the test

  • 3.

    Composition of reagents and reference materials, the quantities provided, and their storage requirements (e.g., space, temperature, light, humidity restrictions) applicable both before and after the original containers are opened

  • 4.

    Stability of reagents and reference materials (e.g., their shelf lives)

  • 5.

    Technologist time and required skills

  • 6.

    Possible hazards and appropriate safety precautions according to relevant guidelines and legislation

  • 7.

    Type, quantity, and disposal of waste generated

  • 8.

    Specimen requirements (e.g., conditions for collection and transportation, specimen volume requirements, the necessity for anticoagulants and preservatives, necessary storage conditions)

  • 9.

    Reference interval of the test and its results, including information on how such interval was derived, typical values obtained in both healthy and diseased individuals, and the necessity of determining a reference interval for one’s own institution (see Chapter 9 for details on how to generate a reference interval of a laboratory test.)

  • 10.

    Instrumental requirements and limitations

  • 11.

    Cost-effectiveness

  • 12.

    Computer platforms and interfacing with the laboratory information system

  • 13.

    Availability of technical support, supplies, and service

Other questions concerning placement of the new or revised test in the laboratory should be taken into account. They include:

  • 1.

    Does the laboratory possess the necessary measuring equipment? If not, is there sufficient space for a new instrument?

  • 2.

    Does the projected workload match the capacity of a new instrument?

  • 3.

    Is the test repertoire of a new instrument sufficient?

  • 4.

    What is the method and frequency of (re)calibration?

  • 5.

    Is staffing of the laboratory sufficient for the new technology?

  • 6.

    If training the entire staff in a new technique is required, is such training worth the possible benefits?

  • 7.

    How frequently will quality control (QC) samples be run?

  • 8.

    What materials will be used to ensure QC?

  • 9.

    What approach will be used for proficiency testing?

  • 10.

    What is the estimated cost of performing an assay using the proposed method, including the costs of calibrators, QC specimens, and technologists’ time?

Questions applicable to implementation of new instrumentation in a particular laboratory may also be relevant. Does the instrument satisfy local electrical safety guidelines? What are the power, water, drainage, and air conditioning requirements of the instrument? If the instrument is large, does the floor have sufficient load-bearing capacity?

A qualitative assessment of all these factors is often completed, but it is possible to use a value scale to assign points to the various features weighted according to their relative importance; the latter approach allows a more quantitative test evaluation process. Decisions are then made regarding the assays that best fit the laboratory’s requirements and that have the potential for achieving the necessary analytical quality for clinical use.

Basic statistics

In this section, fundamental statistical concepts and techniques are introduced in the context of typical analytical investigations. The basic concepts of (1) populations, (2) samples, (3) parameters, (4) statistics, and (5) probability distributions are defined and illustrated. Two important probability distributions—Gaussian and Student t —are introduced and discussed.

Frequency distribution

A graphical device for displaying a large set of laboratory test results is the frequency distribution, also called a histogram . Fig. 2.2 shows a frequency distribution displaying the results of serum gamma-glutamyltransferase (GGT) measurements of 100 apparently healthy 20- to 29-year-old men. The frequency distribution is constructed by dividing the measurement scale into cells of equal width; counting the number, n i , of values that fall within each cell; and drawing a rectangle above each cell whose area (and height because the cell widths are all equal) is proportional to n i . In this example, the selected cells were 5 to 9, 10 to 14, 15 to 19, 20 to 24, 25 to 29, and so on, with 60 to 64 being the last cell (range of values, 5 to 64 U/L). The ordinate axis of the frequency distribution gives the number of values falling within each cell. When this number is divided by the total number of values in the data set, the relative frequency in each cell is obtained.
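
The cell-counting procedure described above can be sketched in Python. The GGT values below are hypothetical stand-ins, not the actual data set from Fig. 2.2:

```python
from collections import Counter

def frequency_distribution(values, cell_width=5, start=5):
    """Count how many values fall into each cell of equal width.

    Cells are [start, start + w), [start + w, start + 2w), ... as in
    the GGT example (5-9, 10-14, ...). Returns {cell lower bound: count}.
    """
    counts = Counter((int(v) - start) // cell_width for v in values)
    return {start + idx * cell_width: n for idx, n in sorted(counts.items())}

# Hypothetical GGT values in U/L (illustrative only)
ggt = [7, 12, 13, 18, 21, 22, 23, 24, 27, 31, 36, 44, 52, 63]
dist = frequency_distribution(ggt)

# Dividing each count by the total number of values gives the
# relative frequency in each cell
relative = {cell: n / len(ggt) for cell, n in dist.items()}
```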

FIGURE 2.2, Frequency distribution of 100 gamma-glutamyltransferase (GGT) values.

Often, the position of the value for an individual within a distribution of values is useful medically. The nonparametric approach can be used to directly determine the percentile of a given subject. Having ranked N subjects according to their values, the n-percentile, Perc_n, may be estimated as the value of the [N(n/100) + 0.5] ordered observation. In the case of a noninteger value, interpolation is carried out between neighbor values. The 50th percentile is the median of the distribution.
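
The percentile rule above can be sketched as a minimal implementation; the rank is 1-based, as in the text:

```python
def percentile(sorted_values, n):
    """Estimate the n-th percentile as the value of the
    [N*(n/100) + 0.5] ordered observation (1-based rank),
    interpolating between neighbors for noninteger ranks."""
    N = len(sorted_values)
    rank = N * (n / 100) + 0.5
    lo = int(rank)                    # integer part of the rank
    frac = rank - lo
    if frac == 0:
        return sorted_values[lo - 1]  # exact ordered observation
    # interpolate between the lo-th and (lo+1)-th ordered values
    return sorted_values[lo - 1] + frac * (sorted_values[lo] - sorted_values[lo - 1])

# Toy data: the values 1..100; the median falls between the
# 50th and 51st ordered observations (rank 50.5)
median = percentile(list(range(1, 101)), 50)
```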

Population and sample

It is useful to obtain information and draw conclusions about the characteristics of the test results for one or more target populations. In the GGT example, interest is focused on the location and spread of the population of GGT values for 20- to 29-year-old healthy men. Thus a working definition of a population is the complete set of all observations that might occur as a result of performing a particular procedure according to specified conditions.

Most target populations of interest in clinical chemistry are in principle very large (millions of individuals) and so are impossible to study in their entirety. Usually a subgroup of observations is taken from the population as a basis for forming conclusions about population characteristics. The group of observations that has actually been selected from the population is called a sample . For example, the 100 GGT values make up a sample from the respective target population. However, a sample can be used to study the characteristics of a population only if it has been properly selected. For instance, if the analyst is interested in the population of GGT values over various lots of materials and some time period, the sample must be selected to be representative of these factors, as well as of the age, sex, and health factors of the individuals in the targeted population. Consequently, exact specification of the target population(s) is necessary before a plan for obtaining the sample(s) can be designed. In this chapter, the term sample is also used to refer to a specimen, depending on the context.

Probability and probability distributions

Consider again the frequency distribution in Fig. 2.2 . In addition to the general location and spread of the GGT determinations, other useful information can be easily extracted from this frequency distribution. For instance, 96% (96 of 100) of the determinations are less than 55 U/L, and 91% (91 of 100) are greater than or equal to 10 but less than 50 U/L. Because the cell interval is 5 U/L in this example, statements such as these can be made only to the nearest 5 U/L. A larger sample would allow a smaller cell interval and more refined statements. For a sufficiently large sample, the cell interval can be made so small that the frequency distribution can be approximated by a continuous, smooth curve, similar to that shown in Fig. 2.3 . In fact, if the sample is large enough, we can consider this a close representation of the “true” target population frequency distribution . In general, the functional form of the population frequency distribution curve of a variable x is denoted by f ( x ).

FIGURE 2.3, Population frequency distribution of gamma-glutamyltransferase (GGT) values.

The population frequency distribution allows us to make probability statements about the GGT of a randomly selected member of the population of healthy 20- to 29-year-old men. For example, the probability Pr( x > x a ) that the GGT value x of a randomly selected 20- to 29-year-old healthy man is greater than some particular value x a is equal to the area under the population frequency distribution to the right of x a . If x a = 58, then from Fig. 2.3 , Pr( x > 58) = 0.05. Similarly, the probability Pr( x a < x < x b ) that x is greater than x a but less than x b is equal to the area under the population frequency distribution between x a and x b . For example, if x a = 9 and x b = 58, then from Fig. 2.3 , Pr(9 < x < 58) = 0.90. Because the population frequency distribution provides all information related to probabilities of a randomly selected member of the population, it is called the probability distribution of the population. Although the true probability distribution is never exactly known in practice, it can be approximated with a large sample of observations, that is, test results.

Parameters: Descriptive measures of a population

Any population of values can be described by measures of its characteristics. A parameter is a constant that describes some particular characteristic of a population. Although most populations of interest in analytical work are infinite in size, for the following definitions, we shall consider the population to be of finite size N , where N is very large.

One important characteristic of a population is its central location . The parameter most commonly used to describe the central location of a population of N values is the population mean (μ):


μ = Σx_i / N

An alternative parameter that indicates the central tendency of a population is the median, which is defined as the 50th percentile, Perc 50 .

Another important characteristic is the dispersion of values about the population mean. A parameter very useful in describing this dispersion of a population of N values is the population variance σ 2 (sigma squared):


σ² = Σ(x_i − μ)² / N

The population standard deviation (SD) σ, the positive square root of the population variance, is a parameter frequently used to describe the population dispersion in the same units (e.g., mg/dL) as the population values. For a Gaussian distribution, 95% of the population of values are located within the mean ±1.96 σ. If a distribution is non-Gaussian (e.g., asymmetric), an alternative measure of dispersion based on the percentiles may be more appropriate, such as the distance between the 25th and 75th percentiles (the interquartile interval).
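
The population mean, variance, and SD definitions can be written out directly. This is a sketch for a finite population of N values; the toy values below are illustrative only:

```python
def population_mean(x):
    """mu = sum(x_i) / N for a finite population of N values."""
    return sum(x) / len(x)

def population_variance(x):
    """sigma^2 = sum((x_i - mu)^2) / N (note: divide by N, not N - 1)."""
    mu = population_mean(x)
    return sum((xi - mu) ** 2 for xi in x) / len(x)

def population_sd(x):
    """sigma: the positive square root of the variance, expressed
    in the same units as the population values themselves."""
    return population_variance(x) ** 0.5

# Toy "population" of N = 8 values (illustrative only)
pop = [2, 4, 4, 4, 5, 5, 7, 9]
```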

Statistics: Descriptive measures of the sample

As noted earlier, clinical chemists usually have at hand only a sample of observations (i.e., test results) from the overarching targeted population. A statistic is a value calculated from the observations in a sample to estimate a particular characteristic of the target population. As introduced earlier, the sample mean x m is the arithmetical average of a sample, which is an estimate of μ. Likewise, the sample SD is an estimate of σ, and the coefficient of variation (CV) is the ratio of the SD to the mean multiplied by 100%. The equations used to calculate x m , SD, and CV, respectively, are as follows:


x_m = Σx_i / N

SD = √[Σ(x_i − x_m)² / (N − 1)] = √[(Σx_i² − (Σx_i)²/N) / (N − 1)]

CV = (SD / x_m) × 100%

where x i is an individual measurement and N is the number of sample measurements.

The SD is an estimate of the dispersion of the distribution. Additionally, from the SD, we can derive an estimate of the uncertainty of x m as an estimate of μ (see later discussion).
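
A minimal sketch of the sample statistics x_m, SD, and CV, using the N − 1 denominator from the SD formula above:

```python
def sample_mean(x):
    """x_m: arithmetical average of the sample, an estimate of mu."""
    return sum(x) / len(x)

def sample_sd(x):
    """SD with the N - 1 denominator, an estimate of sigma."""
    xm = sample_mean(x)
    return (sum((xi - xm) ** 2 for xi in x) / (len(x) - 1)) ** 0.5

def cv_percent(x):
    """CV: the ratio of the SD to the mean, multiplied by 100%."""
    return sample_sd(x) / sample_mean(x) * 100
```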

Random sampling

A random sample of individuals from a target population is one in which each member of the population has an equal chance of being selected. A random sample is one in which each member of the sample can be considered to be a random selection from the target population. Although much of statistical analysis and interpretation depends on the assumption of a random sample from some population, actual data collection often does not satisfy this assumption. In particular, for sequentially generated data, it is often true that observations adjacent to each other tend to be more alike than observations separated in time.

The Gaussian probability distribution

The Gaussian probability distribution, illustrated in Fig. 2.4 , is of fundamental importance in statistics for several reasons. As mentioned earlier, a particular test result x will not usually be equal to the true value μ of the specimen being measured. Rather, associated with this particular test result x will be a particular measurement error ε = x − μ, which is the result of many contributing sources of error. Pure measurement errors tend to follow a probability distribution similar to that shown in Fig. 2.4 , where the errors are symmetrically distributed, with smaller errors occurring more frequently than larger ones, and with an expected value of 0. This important fact is known as the central limit effect for distribution of errors: if a measurement error ε is the sum of many independent sources of error, such as ε 1 , ε 2 , ..., ε k , several of which are major contributors, the probability distribution of the measurement error ε will tend to be Gaussian as the number of sources of error becomes large.
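
The central limit effect can be illustrated by simulation. This sketch sums hypothetical uniform error sources; the seed is fixed so the run is reproducible:

```python
import random

random.seed(1)  # fixed seed for reproducibility

def simulated_error(k=12):
    """One measurement error: the sum of k independent error sources,
    each uniform on (-0.5, 0.5) with mean 0 (hypothetical sources)."""
    return sum(random.uniform(-0.5, 0.5) for _ in range(k))

# With k = 12 the summed error has variance 12 * (1/12) = 1, so SD = 1;
# the distribution of the sums is close to N(0, 1)
errors = [simulated_error() for _ in range(10_000)]
mean_error = sum(errors) / len(errors)                         # close to 0
frac_within_2 = sum(abs(e) < 2 for e in errors) / len(errors)  # close to 0.95
```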

FIGURE 2.4, The Gaussian probability distribution.

Another reason for the importance of the Gaussian probability distribution is that many statistical procedures are based on the assumption of a Gaussian distribution of values; this approach is commonly referred to as parametric. Furthermore, these procedures usually are not seriously invalidated by departures from this assumption. Finally, the magnitude of the uncertainty associated with sample statistics can be ascertained based on the fact that many sample statistics computed from large samples have a Gaussian probability distribution.

The Gaussian probability distribution is completely characterized by its mean μ and its variance σ 2 . The notation N (μ, σ 2 ) is often used for the distribution of a variable that is Gaussian with mean μ and variance σ 2 . Probability statements about a variable x that follows an N (μ, σ 2 ) distribution are usually made by considering the variable z,


z = (x − μ) / σ

which is called the standard Gaussian variable . The variable z has a Gaussian probability distribution with μ = 0 and σ² = 1; that is, z is N (0, 1). The probability that x is within 2σ of μ [i.e., Pr(| x − μ| < 2σ)] is 0.9544. Most computer spreadsheet programs can calculate probabilities for all values of z .
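
Probabilities for z can be computed with the Python standard library. The μ and σ values in the worked example below are hypothetical:

```python
from statistics import NormalDist

std_gauss = NormalDist()  # standard Gaussian, N(0, 1)

def standardize(x, mu, sigma):
    """z = (x - mu) / sigma."""
    return (x - mu) / sigma

# Pr(|x - mu| < 2*sigma) is the same for every Gaussian distribution
p_within_2sd = std_gauss.cdf(2) - std_gauss.cdf(-2)  # about 0.9545

# Hypothetical GGT-like figures: mu = 27 U/L, sigma = 15.5 U/L
z_58 = standardize(58, 27, 15.5)
```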

Student t probability distribution

To determine probabilities associated with a Gaussian distribution, it is necessary to know the population SD σ. In actual practice, σ is often unknown, so we cannot calculate z . However, if a random sample can be taken from the Gaussian population, we can calculate the sample SD, substitute SD for σ, and compute the value t:


t = (x − μ) / SD

Under these conditions, the variable t has a probability distribution called the Student t distribution . The t distribution is really a family of distributions depending on the degrees of freedom (df) ν (= N − 1) for the sample SD. Several t distributions from this family are shown in Fig. 2.5 . When the size of the sample and the df for SD are infinite, there is no uncertainty in SD, so the t distribution is identical to the standard Gaussian distribution. However, when the sample size is small, the uncertainty in SD causes the t distribution to have greater dispersion and heavier tails than the standard Gaussian distribution, as illustrated in Fig. 2.5 . At sample sizes above 30, the difference between the t -distribution and the Gaussian distribution becomes relatively small and can usually be neglected. Most computer spreadsheet programs can calculate probabilities for all values of t , given the df for SD.

FIGURE 2.5, The t distribution for ν = 1, 10, and ∞.

The Student t distribution is commonly used in significance tests, such as the comparison of sample means, or in testing whether a regression slope differs significantly from 1. Descriptions of these tests can be found in statistics textbooks. Another important application is the estimation of confidence intervals (CIs). CIs are intervals that indicate the uncertainty of a given sample estimate. For example, it can be proved that x_m ± t_α(SD/√N) provides an approximate (1 − 2α) CI for the mean. A common value for α is 0.025 or 2.5%, which thus results in a 0.95 or 95% CI. Given sample sizes of 30 or higher, t_α is approximately 2. SD/√N is called the standard error (SE) of the mean. A CI should be interpreted as follows. Suppose a sampling experiment of drawing 30 observations from a Gaussian population of values is repeated 100 times, and in each case, the 95% CI of the mean is calculated as described. Then, in 95% of the drawings, the true mean μ is included in the 95% CI. The popular interpretation is that for an estimated 95% CI, there is a 95% chance that the true mean is within the interval. According to the central limit theorem, distributions of mean values converge toward the Gaussian distribution irrespective of the primary type of distribution of x . This means that the 95% CI is a robust estimate that is only minimally influenced by deviations from the Gaussian distribution. In the same way, the t -test is robust toward deviations from normality.
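
A sketch of the CI calculation for large samples, substituting the standard Gaussian quantile (about 1.96) for t_α, which is adequate for sample sizes of roughly 30 or more:

```python
from statistics import NormalDist, mean, stdev

def ci_mean(sample, alpha=0.025):
    """Approximate (1 - 2*alpha) CI for the mean: x_m +/- q * SE.

    Uses the Gaussian quantile in place of t_alpha, so the interval is
    slightly too narrow for small samples (fine for N >= 30)."""
    n = len(sample)
    xm = mean(sample)
    se = stdev(sample) / n ** 0.5           # standard error of the mean
    q = NormalDist().inv_cdf(1 - alpha)     # about 1.96 for alpha = 0.025
    return xm - q * se, xm + q * se

# Toy sample of N = 30 values (illustrative only)
lo, hi = ci_mean([10] * 15 + [14] * 15)
```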

Nonparametric statistics

Distribution-free statistics, often called nonparametric statistics, provide an alternative to parametric statistical procedures that assume data to have Gaussian distributions. For example, distributions of reference values are often skewed and so do not conform to the Gaussian distribution (see Chapter 9 on reference intervals). Formally, one can carry out a goodness-of-fit test to judge whether a distribution is Gaussian or not. A commonly used test is the Kolmogorov-Smirnov test, in which the shape of the sample distribution is compared with the shape presumed for a Gaussian distribution. If the difference exceeds a given critical value, the hypothesis of a Gaussian distribution is rejected, and it is then appropriate to apply nonparametric statistics. A special problem is the occurrence of outliers (i.e., single measurements deviating highly from the remaining measurements). Outliers may reflect biological factors and so be of real significance (e.g., in the context of estimating reference intervals), or they may be related to clerical errors. Special tests exist for handling outliers.
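
The Kolmogorov-Smirnov comparison can be sketched as follows. Only the test statistic is computed here; looking up the critical value for a given sample size is omitted, and the Gaussian is fitted from the sample mean and SD (a simplification):

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(sample):
    """Kolmogorov-Smirnov distance between the empirical cumulative
    distribution of the sample and a Gaussian fitted to the sample
    mean and SD. Larger values indicate a worse Gaussian fit."""
    xs = sorted(sample)
    n = len(xs)
    model = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = model.cdf(x)
        # compare the model CDF with the empirical CDF just before
        # and just after the i-th ordered observation
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d
```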

Given that a distribution is non-Gaussian, it is appropriate to apply nonparametric descriptive statistics based on the percentile or quantile concept. As stated under the earlier section Frequency Distribution, the n -percentile, Perc n , of a sample of N values may be estimated as the value of the [ N ( n /100) + 0.5] ordered observation. In the case of a noninteger value, interpolation is carried out between neighbor values. The median is the 50th percentile, which is used as a measure of the center of the distribution. For the GGT example mentioned previously, we would order the N = 100 values according to size. The median or 50th percentile is then the value of the [100(50/100) + 0.5 = 50.5] ordered observation (the interpolated value between the 50th and 51st ordered values). The 2.5th and 97.5th percentiles are values of the [100(2.5/100) + 0.5 = 3] and [100(97.5/100) + 0.5 = 98] ordered observations, respectively. When a 95% reference interval is estimated, a nonparametric approach is often preferable because many distributions of reference values are asymmetric. Generally, distributions based on the many biological sources of variation are often non-Gaussian compared with distributions of pure measurement errors that usually are Gaussian.

The nonparametric counterpart to the t -test is the Mann-Whitney test, which provides a significance test for the difference between median values of the two groups to be compared. When there are more than two groups, the Kruskal-Wallis test can be applied.

Categorical variables

Hitherto, the focus has been on quantitative variables. When dealing with qualitative tests, and in the context of evaluating diagnostic testing, categorical variables that take only the values positive or negative come into play. The performance is here given as proportions or percentages, which are proportions multiplied by 100. For example, the diagnostic sensitivity of a test is the proportion of diseased subjects who have a positive result. Having tested, for example, 100 patients, 80 might have had a positive test result. The sensitivity then is 0.8 or 80%. We are then interested in judging how precise this estimate is. Exact estimates of the uncertainty can be derived from the so-called binomial distribution, but for practical purposes, an approximate expression for the 95% CI is usually applied as the estimated proportion P ± 2SE, where the SE in this context is derived as:

SE = [P(1 − P)/N]^0.5

where P is here a proportion and not a percentage. In the example, the SE equals [0.8(1 − 0.8)/100]^0.5 = 0.04, and so the 95% CI is 0.72 to 0.88, or 72 to 88%. The applied approximate formula for the SE is regarded as reasonably valid when NP and N(1 − P) both are equal to or higher than 5.
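
As a quick check of this arithmetic, a minimal sketch (plain Python) for the example of 80 positives out of 100 patients:

```python
import math

def proportion_ci(p, n, z=2.0):
    """Approximate 95% CI for a proportion: p +/- z*SE, with
    SE = [p(1 - p)/n]^0.5; reasonably valid when n*p and n*(1 - p) >= 5."""
    se = math.sqrt(p * (1 - p) / n)
    return se, p - z * se, p + z * se

se, lower, upper = proportion_ci(0.8, 100)
print(round(se, 3), round(lower, 2), round(upper, 2))   # 0.04 0.72 0.88
```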

POINTS TO REMEMBER

  • Statistics as means, SDs, percentiles, proportions, and so on are computed from a sample of values drawn from a population and provide estimates of the unknown population characteristics.

  • Whereas parametric statistics rely on the assumption of a Gaussian population of values, which typically applies for measurement errors, nonparametric statistics is a distribution-free approach that applies to, for example, the asymmetric distributions often observed for biologic variables.

  • The Gaussian distribution is characterized by the mean and the SD, and other types of distributions are described by the median and the percentile (quantile) values.

  • Distributions of categorical variables are characterized by proportions or percentages and their SEs.

Technical validity of analytical assays

This section defines the basic concepts used in this chapter: (1) calibration, (2) trueness and accuracy, (3) precision, (4) linearity, (5) limit of detection (LOD), (6) limit of quantification, (7) specificity, and (8) others (see Box 2.1 for definitions).

Calibration

The calibration function is the relation between instrument signal (y) and concentration of analyte (x), that is,

y = f(x)

The inverse of this function, also called the measuring function, yields the concentration from response:

x = f⁻¹(y)

This relationship is established by measurement of samples with known quantities of analyte (calibrators). One may distinguish between solutions of pure chemical standards and samples with known quantities of analyte present in the typical matrix that is to be measured (e.g., human serum). The first situation applies typically to a reference measurement procedure that is not influenced by matrix effects; the second case corresponds typically to a routine method that often is influenced by matrix components and so preferably is calibrated using the relevant matrix. Calibration functions may be linear or curved and, in the case of immunoassays, may often take a special form (e.g., modeled by the four-parameter logistic curve). This model (logistic in log x ) has been used for immunoassay techniques and is written in several forms ( Table 2.1 ). An alternative, model-free approach is to estimate a smoothed spline curve, which often is performed for immunoassays; however, a disadvantage of the spline curve approach is that it is insensitive to aberrant calibration values, fitting these just as well as the correct values. If the assumed calibration function does not correctly reflect the true relationship between instrument response and analyte concentration, a systematic error or bias is likely to be associated with the analytical method. A common problem with some immunoassays is the “hook effect,” which is a deviation from the expected calibration algorithm in the high-concentration range. (The hook effect is discussed in more detail in Chapter 26 .)

TABLE 2.1
The Four-Parameter Logistic Model Expressed in Three Different Forms
Algebraic Form | Variables a | Parameters b
y = (a − d)/[1 + (x/c)^b] + d | (x, y) | a, b, c, d
R = R0 + Kc/[1 + exp(−{a + b log[C]})] | (C, R) | R0, Kc, a, b
y = y0 + (y∞ − y0)x^d/(b + x^d) | (x, y) | y0, y∞, b, d

a Concentration and instrument response variables shown in parentheses.

b Equivalent letters do not necessarily denote equivalent parameters.
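
To make the first algebraic form in Table 2.1 concrete, the sketch below evaluates the four-parameter logistic calibration function and its inverse, the measuring function; the parameter values are hypothetical and not taken from any particular immunoassay:

```python
def logistic4(x, a, b, c, d):
    """Four-parameter logistic calibration curve (first form of Table 2.1):
    y = (a - d)/[1 + (x/c)^b] + d."""
    return (a - d) / (1 + (x / c) ** b) + d

def logistic4_inverse(y, a, b, c, d):
    """Measuring function x = f^-1(y): back-calculate concentration
    from instrument response (valid for responses between d and a here)."""
    return c * ((a - d) / (y - d) - 1) ** (1 / b)

# Hypothetical parameters: the response falls from a = 2.0 at zero dose
# toward d = 0.1 at very high dose; c is the mid-dose, b a slope factor.
a, b, c, d = 2.0, 1.5, 50.0, 0.1
y = logistic4(25.0, a, b, c, d)         # signal for concentration 25
x = logistic4_inverse(y, a, b, c, d)    # back-calculated concentration
print(round(x, 6))                      # 25.0
```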

The precision of the analytical method depends on the stability of the instrument response for a given quantity of analyte. In principle, a random dispersion of instrument signal (vertical direction) at a given true concentration transforms into dispersion on the measurement scale (horizontal direction), as is shown schematically ( Fig. 2.6 ). The detailed statistical aspects of calibration are complex, but in the following sections, some approximate relations are outlined. If the calibration function is linear and the imprecision of the signal response is the same over the analytical measurement range, the analytical SD (SD A ) of the method tends to be constant over the analytical measurement range (see Fig. 2.6 ). If the imprecision increases proportionally to the signal response, the analytical SD of the method tends to increase proportionally to the concentration (x), which means that the relative imprecision (CV = SD/ x ) may be constant over the analytical measurement range if it is assumed that the intercept of the calibration line is zero.

FIGURE 2.6, Relation between concentration (x) and signal response (y) for a linear calibration function. The dispersion in signal response (σ y ) is projected onto the x -axis and is called assay imprecision [σ x (=σ A )].

With modern, automated clinical chemistry instruments, the relation between analyte concentration and signal can in some cases be very stable, and where this is the case, calibration is necessary relatively infrequently (e.g., at intervals of several months). Built-in process control mechanisms may help ensure that the relationship remains stable and may indicate when recalibration is necessary. In traditional chromatographic analysis (e.g., high-performance liquid chromatography [HPLC]), on the other hand, it is customary to calibrate each analytical series (run), which means that calibration is carried out daily.

Trueness and accuracy

Trueness of measurements is defined as closeness of agreement between the average value obtained from a large series of results of measurements and the true value.

The difference between the average value (strictly, the mathematical expectation) and the true value is the bias, which is expressed numerically and so is inversely related to the trueness. Trueness in itself is a qualitative term that can be expressed, for example, as low, medium, or high. From a theoretical point of view, the exact true value for a clinical sample is not available; instead, an “accepted reference value” is used, which is the “true” value that can be determined in practice. Trueness can be evaluated by comparison of measurements by the new test and by some preselected reference measurement procedure, both on the same sample or individuals.

The ISO has introduced the trueness expression as a replacement for the term accuracy, which now has gained a slightly different meaning. Accuracy is the closeness of agreement between the result of a measurement and a true concentration of the analyte. Accuracy thus is influenced by both bias and imprecision and in this way reflects the total error. Accuracy, which in itself is a qualitative term, is inversely related to the “uncertainty” of measurement, which can be quantified as described later ( Table 2.2 ).

TABLE 2.2
An Overview of Qualitative Terms and Quantitative Measures Related to Method Performance
Qualitative Concept | Quantitative Measure
Trueness (closeness of agreement of mean value with “true value”) | Bias (a measure of the systematic error)
Precision: repeatability (within run), intermediate precision (long term), reproducibility (interlaboratory) | Imprecision, SD (a measure of the dispersion of random errors)
Accuracy (closeness of agreement of a single measurement with “true value”) | Error of measurement (comprises both random and systematic influences)
SD, Standard deviation.

In relation to trueness, the concepts recovery, drift, and carryover may also be considered. Recovery is the fraction or percentage increase in concentration that is measured in relation to the amount added. Recovery experiments are typically carried out in the field of drug analysis. One may distinguish between extraction recovery, which often is interpreted as the fraction of compound that is carried through an extraction process, and the recovery measured by the entire analytical procedure, in which the addition of an internal standard compensates for losses in the extraction procedure. A recovery close to 100% is a prerequisite for a high degree of trueness, but it does not ensure unbiased results because possible nonspecificity against matrix components (e.g., an interfering substance) is not detected in a recovery experiment. Drift is caused by instrument or reagent instability over time, so that calibration becomes gradually biased. Assay carryover also must be close to zero to ensure unbiased results. Carryover can be assessed by placing a sample with a known, low value after a pathological sample with a high value, and an observed increase can be stated as a percentage of the high value. Drift or carryover or both may be conveniently estimated by multifactorial evaluation protocols (EPs).
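
The recovery and carryover calculations described here reduce to simple arithmetic; a sketch with hypothetical numbers (not from a real evaluation):

```python
def recovery_percent(baseline, spiked_result, amount_added):
    """Recovery: measured increase in concentration relative to the
    amount added, expressed as a percentage."""
    return (spiked_result - baseline) / amount_added * 100

def carryover_percent(high_value, low_expected, low_observed):
    """Carryover: increase observed in a low sample measured directly
    after a high sample, stated as a percentage of the high value."""
    return (low_observed - low_expected) / high_value * 100

print(recovery_percent(10.0, 19.5, 10.0))     # 95.0 (% recovery)
print(carryover_percent(1000.0, 5.0, 6.0))    # 0.1 (% carryover)
```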

Precision

Precision has been defined as the closeness of agreement between independent replicate measurements obtained under stipulated conditions. The degree of precision is usually expressed on the basis of statistical measures of imprecision, such as SD or CV (CV = SD/ x, where x is the measurement concentration), which is inversely related to precision. Imprecision of measurements is solely related to the random error of measurements and has no relation to the trueness of measurements.

Precision is specified as follows:

  • Repeatability: closeness of agreement between results of successive measurements carried out under the same conditions (i.e., corresponding to within-run precision)

  • Reproducibility: closeness of agreement between results of measurements performed under changed conditions of measurement (e.g., time, operators, calibrators, reagent lots). Two specifications of reproducibility are often used: total or between-run precision in the laboratory, often termed intermediate precision, and interlaboratory precision (e.g., as observed in external quality assessment schemes [EQAS]) (see Table 2.2 ).

The total SD (σ T ) may be divided into within-run and between-run components using the principle of analysis of variance of components (variance is the squared SD):

σ²T = σ²Within-run + σ²Between-run

It is not always clear in clinical chemistry publications what is meant by “between-run” variation. Some authors use the term to refer to the total variation of an assay, but others apply the term between-run variance component as defined earlier. The distinction between these definitions is important but is not always explicitly stated.

In laboratory studies of analytical variation, estimates of imprecision are obtained. The more observations, the more certain are the estimates. It is important to have an adequate number so that analytical variation is not underestimated. Commonly, the number 20 is given as a reasonable number of observations (e.g., suggested in the CLSI guideline for manufacturers). To verify method precision by users, it has been recommended to run internal QC samples in five replicates for five consecutive days. If too few replications are applied, it is likely that the analytical variation will be underestimated.

To estimate both the within-run imprecision and the total imprecision, a common approach is to measure duplicate control samples in a series of runs. Suppose, for example, that a control is measured in duplicate for 20 runs, in which case 20 observations are present with respect to both components. The dispersion of the means ( x m ) of the duplicates is given as follows:


σ²xm = σ²Within-run/2 + σ²Between-run

From the 20 sets of duplicates, we may derive the within-run SD using the following formula:


SD Within-run = [Σdᵢ²/(2 × 20)]^0.5

where d i refers to the difference between the i th set of duplicates. When SDs are estimated, the concept df is used. In a simple situation, the number of df equals N − 1. For N duplicates, the number of df is N (2 − 1) = N . Thus both variance components are derived in this way. The advantage of this approach is that the within-run estimate is based on several runs, so that an average estimate is obtained rather than only an estimate for one particular run if all 20 observations had been obtained in the same run. The described approach is a simple example of a variance component analysis. The principle can be extended to more components of variation. For example, in the CLSI EP05-A3 guideline, a procedure is outlined that is based on the assumption of two analytical runs per day, in which case within-run, between-run, and between-day components of variance are estimated by a nested component of variance analysis approach.
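
As a minimal sketch of this duplicate-based variance-component analysis (illustrative data; in practice one would use, e.g., 20 runs as described):

```python
import math

def duplicate_variance_components(pairs):
    """Within-run and between-run SD estimated from duplicate measurements
    of a control in a series of runs:
      SD_within   = [sum(d_i^2)/(2N)]^0.5, d_i = difference within i-th pair
      var_between = var(pair means) - SD_within^2 / 2
    (the variance of the pair means contains half the within-run variance)."""
    n = len(pairs)
    diffs_sq = sum((x1 - x2) ** 2 for x1, x2 in pairs)
    sd_within = math.sqrt(diffs_sq / (2 * n))
    means = [(x1 + x2) / 2 for x1, x2 in pairs]
    grand = sum(means) / n
    var_means = sum((m - grand) ** 2 for m in means) / (n - 1)
    var_between = max(var_means - sd_within ** 2 / 2, 0.0)
    return sd_within, math.sqrt(var_between)

runs = [(10.0, 12.0), (11.0, 9.0), (10.0, 10.0), (12.0, 12.0)]  # 4 runs, duplicates
sd_w, sd_b = duplicate_variance_components(runs)
print(round(sd_w, 2), round(sd_b, 2))   # 1.0 0.65
```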

Nothing definitive can be stated about the selected number of 20. Generally, the estimate of the imprecision improves as more observations become available. Exact confidence limits for the SD can be derived from the χ² distribution. Estimates of the variance, SD², are distributed according to the χ² distribution (tabulated in most statistics textbooks) as follows: (N − 1)SD²/σ² ≈ χ²(N−1), where (N − 1) is the df. Then the two-sided 95% CI is derived from the following relation:


Pr[χ²97.5%(N−1) < (N − 1)SD²/σ² < χ²2.5%(N−1)] = 0.95

which yields this 95% CI expression:


SD × [(N − 1)/χ²2.5%(N−1)]^0.5 < σ < SD × [(N − 1)/χ²97.5%(N−1)]^0.5

Example

Suppose we have estimated the imprecision as an SD of 5.0 on the basis of N = 20 observations. From a table of the χ 2 distribution, we obtain the following 2.5 and 97.5 percentiles:


χ²2.5%(19) = 32.9 and χ²97.5%(19) = 8.91

where 19 within the parentheses refers to the number of df. Substituting in the equation, we get

5.0 × (19/32.9)^0.5 < σ < 5.0 × (19/8.91)^0.5

or

3.8 < σ < 7.3
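
The worked example can be reproduced in a few lines; the χ² percentile points are taken from a table here, but could equally be obtained from, for example, scipy.stats.chi2.ppf:

```python
import math

def sd_confidence_interval(sd, n, chi2_upper, chi2_lower):
    """Two-sided 95% CI for the true SD (sigma) from an SD estimated on
    n observations:
      SD*[(n-1)/chi2_2.5%]^0.5 < sigma < SD*[(n-1)/chi2_97.5%]^0.5
    chi2_upper and chi2_lower are the 2.5% (upper) and 97.5% (lower)
    percentile points of the chi-square distribution with n-1 df."""
    lower = sd * math.sqrt((n - 1) / chi2_upper)
    upper = sd * math.sqrt((n - 1) / chi2_lower)
    return lower, upper

# Example from the text: SD = 5.0 from N = 20 observations (df = 19)
lower, upper = sd_confidence_interval(5.0, 20, chi2_upper=32.9, chi2_lower=8.91)
print(round(lower, 1), round(upper, 1))   # 3.8 7.3
```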

A graphical display of 95% CIs at various sample sizes is shown in Fig. 2.7 . For individual variance components, the relations are more complicated.

FIGURE 2.7, Relation between factors indicating the 95% confidence intervals (CIs) of standard deviations (SDs) and the sample size. The true SD is 1, and the solid line indicates the mean estimate, which is slightly downward biased at small sample sizes.

Precision profile

Precision often depends on the concentration of analyte being considered. A presentation of precision as a function of analyte concentration is the precision profile, which usually is plotted in terms of the SD or the CV as a function of analyte concentration ( Fig. 2.8 ) . Some typical examples may be considered. First, the SD may be constant (i.e., independent of the concentration), as it often is for analytes with a limited range of values (e.g., electrolytes). When the SD is constant, the CV varies inversely with the concentration (i.e., it is high in the lower part of the range and low in the high range). For analytes with extended ranges (e.g., hormones), the SD frequently increases as the analyte concentration increases. If a proportional relationship exists, the CV is constant. This may often apply approximately over a large part of the analytical measurement range. Actually, this relationship is anticipated for measurement error that arises because of imprecise volume dispensing. Often a more complex relationship exists. Not infrequently, the SD is relatively constant in the low range, so that the CV increases in the area approaching the lower limit of quantification (LLOQ). At intermediate concentrations, the CV may be relatively constant and perhaps may decline somewhat at increasing concentrations. A square root relationship can be used to model the relationship in some situations as an intermediate form of relation between the constant and the proportional case. 
The relationship between the SD and the concentration is of importance (1) when method specifications over the analytical measurement range are considered, (2) when limits of quantification are determined, and (3) in the context of selecting appropriate statistical methods for method comparison (e.g., whether a difference or a relative difference plot should be applied, whether a simple or a weighted regression analysis procedure should be used) (see the “Relative Distribution of Differences Plot” and “Regression Analysis” sections later).

FIGURE 2.8, Relations between analyte concentration and standard deviation (SD) /coefficient of variation (CV) . A, The SD is constant, so that the CV varies inversely with the analyte concentration. B, The CV is constant because of a proportional relationship between concentration and SD. C, A mixed situation with constant SD in the low range and a proportional relationship in the rest of the analytical measurement range.

Linearity

Linearity refers to the relationship between measured and expected values over the analytical measurement range. Linearity may be considered in relation to actual or relative analyte concentrations. In the latter case, a dilution series of a sample may be examined. This dilution series examines whether the measured concentration changes as expected according to the proportional relationship between samples introduced by the dilution factor. Dilution is usually carried out with an appropriate sample matrix (e.g., human serum [individual or pooled serum] or a verified sample diluent).

Evaluation of linearity may be conducted in various ways. A simple, but subjective, approach is to visually assess whether the relationship between measured and expected concentrations is linear. A more formal evaluation may be carried out on the basis of statistical tests. Various principles may be applied here. When repeated measurements are available at each concentration, the random variation between measurements and the variation around an estimated regression line may be evaluated statistically (by an F- test). This approach has been criticized because it relates only the magnitudes of random and systematic error without taking the absolute deviations from linearity into account. For example, if the random variation among measurements is large, a given deviation from linearity may not be declared statistically significant. On the other hand, if the random measurement variation is small, even a very small deviation from linearity that may be clinically unimportant is declared significant. When significant nonlinearity is found, it may be useful to explore nonlinear alternatives to the linear regression line (i.e., polynomials of higher degrees).

Another commonly applied approach for detecting nonlinearity is to assess the residuals of an estimated regression line and test whether positive and negative deviations are randomly distributed. This can be carried out by a runs test (see “Regression Analysis” section). An additional consideration for evaluating proportional concentration relationships is whether an estimated regression line passes through zero or not. The presence of linearity is a prerequisite for a high degree of trueness. A CLSI guideline suggests procedure(s) for assessment of linearity.
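
The core of the runs test mentioned above is simply counting runs of same-sign residuals around the fitted line; a minimal sketch (a full test would compare the observed count with its null distribution, which is omitted here):

```python
def count_sign_runs(residuals):
    """Count runs of consecutive same-sign residuals; too few runs
    suggests a systematic (nonlinear) deviation from the fitted line.
    Residuals of exactly zero are skipped."""
    signs = [1 if r > 0 else -1 for r in residuals if r != 0]
    if not signs:
        return 0
    runs = 1
    for prev, cur in zip(signs, signs[1:]):
        if cur != prev:
            runs += 1
    return runs

print(count_sign_runs([0.2, -0.1, 0.3, -0.2, 0.1]))   # 5: alternating, looks random
print(count_sign_runs([-0.3, -0.2, -0.1, 0.1, 0.2]))  # 2: suggests curvature
```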

Analytical measurement range and limits of quantification

The analytical measurement range (measuring interval, reportable range) is the analyte concentration range over which measurements are within the declared tolerances for imprecision and bias of the method. Taking drug assays as an example, there exist (arbitrary) requirements of a CV% of less than 15% and a bias of less than 15%. The measurement range then extends from the lowest concentration (LLOQ) to the highest concentration (upper limit of quantification [ULOQ]) for which these performance specifications are fulfilled.

The LLOQ is medically important for many analytes. Thyroid-stimulating hormone (TSH) is a good example. As assay methods improved, lowering the LLOQ, low TSH results could be increasingly distinguished from the lower limit of the reference interval, making the test increasingly useful for the diagnosis of hyperthyroidism.

The LOD is another characteristic of an assay. The LOD may be defined as the lowest value that confidently exceeds the measurements of a blank sample. Thus the limit has been estimated on the basis of repeated measurements of a blank sample and has been reported as the mean plus 2 or 3 SDs of the blank measurements. In the interval from LOD up to LLOQ, one should report a result as “detected” but not provide a quantitative result. More complicated approaches for estimation of the LOD have been suggested.
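
The simple blank-based LOD estimate described above (mean of blank replicates plus 2 or 3 SDs) can be sketched as follows, with hypothetical blank readings:

```python
import statistics

def limit_of_detection(blank_results, k=3):
    """LOD as mean + k*SD of repeated blank measurements (k = 2 or 3)."""
    return statistics.mean(blank_results) + k * statistics.stdev(blank_results)

blanks = [0.10, 0.12, 0.09, 0.11, 0.10, 0.08, 0.13, 0.11]  # hypothetical blanks
print(round(limit_of_detection(blanks), 3))
```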

Analytical sensitivity

The LLOQ of an assay should not be confused with analytical sensitivity, which is defined as the ability of an analytical method to assess small differences in the concentration of analyte. The smaller the random variation of the instrument response and the steeper the slope of the calibration function at a given point, the better is the ability to distinguish small differences in analyte concentrations. In reality, analytical sensitivity depends on the precision of the method. The smallest difference that will be statistically significant equals 2 SD A at a 5% significance level. Historically, the meaning of the term analytical sensitivity has been the subject of much discussion.

Analytical specificity and interference

Analytical specificity is the ability of an assay procedure to determine the concentration of the target analyte without influence from potentially interfering substances or factors in the sample matrix (e.g., hyperlipemia, hemolysis, bilirubin, antibodies, other metabolic molecules, degradation products of the analyte, exogenous substances, anticoagulants). Interferences from hyperlipemia, hemolysis, and bilirubin are generally concentration dependent and can be quantified as a function of the concentration of the interfering compound. In the context of a drug assay, specificity in relation to drug metabolites is relevant, and in some cases, it is desirable to measure the parent drug, as well as metabolites. A detailed protocol for evaluation of interference has been published by the CLSI.

POINTS TO REMEMBER

  • Technical validation of analytical methods focuses on (1) calibration, (2) trueness and accuracy, (3) precision, (4) linearity, (5) LOD, (6) limit of quantification, (7) specificity, and (8) others.

  • The difference between the average measured value and the true value is the bias , which can be evaluated by comparison of measurements by the new test and by some preselected reference measurement procedure, both on the same sample or individuals.

  • The degree of precision is usually expressed on the basis of statistical measures of imprecision, such as SD or CV (CV = SD/ x, where x is the measurement concentration).

  • The measurement range extends from the lowest concentration (LLOQ) to the highest concentration (ULOQ) for which the analytical performance specifications are fulfilled (imprecision, bias).

  • Analytical specificity is the ability of an assay procedure to determine the concentration of the target analyte without influence from potentially interfering substances or factors in the sample matrix.

Qualitative methods

Qualitative methods, which currently are gaining increased use in the form of point-of-care testing (POCT), are designed to distinguish between results below and above a predefined cutoff value. Note that the cutoff point should not be confused with the detection limit. These tests are assessed primarily on the basis of their ability to correctly classify results in relation to the cutoff value.

Diagnostic accuracy measures

The probability of classifying a result as positive (exceeding the cutoff) when the true value indeed exceeds the cutoff is called sensitivity. The probability of classifying a result as negative (below the cutoff) when the true value indeed is below the cutoff is termed specificity. Determination of sensitivity and specificity is based on comparison of test results with a gold standard. The gold standard may be an independent test that measures the same analyte, but it may also be a clinical diagnosis determined by definitive clinical methods (e.g., radiographic testing, follow-up, outcomes analysis). Determination of these performance measures is covered later on in the diagnostic testing part. Sensitivity and specificity may be given as a fraction or as a percentage after multiplication by 100. SEs of estimates are derived as described for categorical variables. The performance of two qualitative tests applied in the same groups of nondiseased and diseased subjects can be compared using McNemar's test, which is based on a comparison of paired values of true and false-positive (FP) or false-negative (FN) results.

One approach for determining the performance of a test in terms of sensitivity and specificity is to determine the true concentration of analyte using an independent reference method. The closer the concentration is to the cutoff point, the larger the error frequencies are expected to be. Actually, the cutoff point is defined in such a way that for samples having a true concentration exactly equal to the cutoff point, 50% of results will be positive, and 50% will be negative. Concentrations above and below the cutoff point at which repeated results are 95% positive or 95% negative, respectively, have been called the “95% interval” for the cutoff point for that method, which indicates a grey zone where the test does not provide reliable results ( Fig. 2.9 ).

FIGURE 2.9, Cumulative frequency distribution of positive results. The x- axis indicates concentrations standardized to zero at the cutoff point (50% positive results) with unit standard deviation.

Agreement between qualitative tests

As outlined previously, if the outcome of a qualitative test can be related to a true analyte concentration or a definitive clinical diagnosis, it is relatively straightforward to express the performance in terms of clinical specificity and sensitivity. In the absence of a definitive reference or “gold standard,” one should be cautious with regard to judgments on performance. In this situation, it is primarily agreement with another test that can be assessed. When replacement of an old or expensive routine assay with a new or less expensive assay is considered, it is of interest to know whether similar test results are likely to be obtained. If both assays are imperfect, however, it is not possible to judge which test has the better performance unless additional testing by a reference procedure is carried out.

In a comparison study, the same individuals are tested by both methods to prevent bias associated with selection of patients. Basically, the outcome of the comparison study should be presented in the form of a 2 × 2 table, from which various measures of agreement may be derived ( Table 2.3 ). An obvious measure of agreement is the overall fraction or percentage of subjects tested who have the same test result (i.e., both results negative or positive):

Overall percent agreement = (a + d)/(a + b + c + d) × 100%

TABLE 2.3
2 × 2 Table for Assessing Agreement Between Two Qualitative Tests
                TEST 1
              +        −
Test 2   +    a        b
         −    c        d
Total       a + c    b + d

If agreement differs with respect to diseased and healthy individuals, the overall percent agreement measure becomes dependent on disease prevalence in the studied group of subjects. This is a common situation; accordingly, it may be desirable to separate this overall agreement measure into agreement concerning negative and positive results:

Percent agreement given test 1 positive: a/(a + c) × 100%
Percent agreement given test 1 negative: d/(b + d) × 100%

For example, if there is a close agreement with regard to positive results, overall agreement will be high when the fraction of diseased subjects is high; however, in a screening situation with very low disease prevalence, overall agreement will mainly depend on agreement with regard to negative results.

A problem with the simple agreement measures is that they do not take agreement by chance into account. Given independence, the expected proportions in the fields of the 2 × 2 table are obtained by multiplying the fractions of negative and positive results for each test. Concerning agreement, it is excess agreement beyond chance that is of interest. More sophisticated measures have been introduced to account for this aspect. The most well-known measure is kappa, which is defined generally as the ratio of observed excess agreement beyond chance to maximum possible excess agreement beyond chance. We have the following:

Kappa = (I o − I e )/(1 − I e )

where I o is the observed index of agreement and I e is the expected agreement from chance. Given complete agreement, kappa equals +1. If observed agreement is greater than or equal to chance agreement, kappa is larger than or equal to zero. Observed agreement less than chance yields a negative kappa value.

Example

Table 2.4 shows a hypothetical example of observed numbers in a 2 × 2 table. The proportion of positive results for test 1 is 75/(75 + 60) = 0.555, and for test 2, it is 80/(80 + 55) = 0.593. Thus by chance, we expect the ++ pattern in 0.555 × 0.593 × 135 = 44.44 cases. Analogously, the −− pattern is expected in (1 − 0.555) × (1 − 0.593) × 135 = 24.45 cases. The expected overall agreement proportion by chance, I e , is (44.44 + 24.45)/135 = 0.51. The observed overall agreement proportion is I o = (60 + 40)/135 = 0.74. Thus we have

Kappa = (0.74 − 0.51)/(1− 0.51) = 0.47

TABLE 2.4
2 × 2 Table With Example of Agreement of Data for Two Qualitative Tests
                TEST 1
              +       −     Total
Test 2   +    60      20      80
         −    15      40      55
Total         75      60     135
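
The kappa computation for Table 2.4 can be verified with a short function (cell labels follow Table 2.3: a = both positive, b = test 1 negative/test 2 positive, c = test 1 positive/test 2 negative, d = both negative):

```python
def kappa_2x2(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table:
    kappa = (Io - Ie)/(1 - Ie), with Io the observed agreement and
    Ie the agreement expected by chance."""
    n = a + b + c + d
    i_o = (a + d) / n                    # observed overall agreement
    p1_pos = (a + c) / n                 # proportion positive, test 1
    p2_pos = (a + b) / n                 # proportion positive, test 2
    i_e = p1_pos * p2_pos + (1 - p1_pos) * (1 - p2_pos)  # chance agreement
    return (i_o - i_e) / (1 - i_e)

print(round(kappa_2x2(60, 20, 15, 40), 2))   # 0.47
```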

Generally, kappa values greater than 0.75 are taken to indicate excellent agreement beyond chance, values from 0.40 to 0.75 are regarded as showing fair to good agreement beyond chance, and values below 0.40 indicate poor agreement beyond chance. An SE for the kappa estimate can be computed. Kappa is related to the intraclass correlation coefficient, which is a widely used measure of interrater reliability for quantitative measurements. The considered agreement measures, percent agreement, and kappa can also be applied to assess the reproducibility of a qualitative test when the test is applied twice in a given context.

Various methodological problems are encountered in studies on qualitative tests. An obvious mistake is to let the result of the test being evaluated contribute to the diagnostic classification of subjects being tested (circular argument). This is also termed incorporation bias. Another problem is partial as opposed to complete verification. When a new test is compared with an existing, imperfect test, a partial verification is sometimes undertaken, in which only discrepant results are subjected to further testing by a perfect test procedure. On this basis, sensitivity and specificity are reported for the new test. This procedure (called discrepant resolution) leads to biased estimates and should not be accepted. The problem is that for cases with agreement, both the existing (imperfect) test and the new test may be wrong. Thus only a measure of agreement should be reported, not specificity and sensitivity values. In the biostatistical literature, various procedures have been suggested to correct for bias caused by imperfect reference tests, but unrealistic assumptions concerning the independence of test results are usually put forward.

Assay comparison

Comparison of measurements by two assays is a frequent task in the laboratory. Preferably, parallel measurements of a set of patient samples should be undertaken. To prevent artificial matrix-induced differences, fresh patient samples are the optimal material. A nearly even distribution of values over the analytical measurement range is also preferable. In an ordinary laboratory, comparison of two routine assays is the most frequently occurring situation. Less commonly, comparison of a routine assay with a reference measurement procedure is undertaken. When two routine assays are compared, the focus is on observed differences. In this situation, it is not possible to establish that one set of measurements is the correct one and thereby know by how much measurements deviate from the presumed correct concentrations. Rather, the question is whether the new assay can replace the existing one without a systematic change in result values. To address this question, the dispersion of observed differences between paired measurements by the two assays may be evaluated. To carry out a formal, objective analysis of the data, a statistical procedure with graphics display should be applied. Various approaches may be used: (1) a frequency plot or histogram of the distribution of differences (DoD) with measures of central tendency and dispersion (DoD plot), (2) a difference (bias) plot, which shows differences as a function of the average concentration of measurements (Bland-Altman plot), or (3) a regression analysis. In the following, a general error model is presented, and some typical measurement relationships are considered. Each of the statistical approaches mentioned is presented in detail along with a discussion of their advantages and disadvantages.
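The core statistics behind the DoD and Bland-Altman displays can be sketched as follows. The paired values below are made up for illustration, and the 1.96 factor assumes approximately normally distributed differences:

```python
# Sketch: summary statistics for a method comparison of paired results,
# x (existing assay) and y (new assay). Data are illustrative only.
import statistics as st

x = [5.1, 7.8, 10.2, 12.5, 15.0, 17.9, 20.3, 22.8]
y = [5.3, 7.6, 10.6, 12.1, 15.4, 18.3, 20.0, 23.4]

diffs = [yi - xi for xi, yi in zip(x, y)]        # y-axis of a Bland-Altman plot
means = [(xi + yi) / 2 for xi, yi in zip(x, y)]  # x-axis of a Bland-Altman plot

mean_diff = st.mean(diffs)          # central tendency of the DoD
sd_diff = st.stdev(diffs)           # dispersion of the DoD
loa = (mean_diff - 1.96 * sd_diff,  # 95% limits of agreement
       mean_diff + 1.96 * sd_diff)

print(round(mean_diff, 4), round(sd_diff, 4))
```

Plotting `diffs` against `means` with horizontal lines at `mean_diff` and the two `loa` values yields the familiar Bland-Altman display.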

Basic error model

The occurrence of measurement errors is related to the performance characteristics of the assay. It is important to distinguish between pure, random measurement errors, which are present in all measurement procedures, and errors related to incorrect calibration and nonspecificity of the assay. Whereas a reference measurement procedure is associated only with pure, random error, a routine method, additionally, is likely to have some bias related to errors in calibration and limitations with regard to specificity. Whereas an erroneous calibration function gives rise to a systematic error, nonspecificity gives an error that typically varies from sample to sample. The error related to nonspecificity thus has a random character, but in contrast to the pure measurement error, it cannot be reduced by repeated measurements of a sample. Although errors related to nonspecificity for a group of samples look like random errors, for the individual sample, this type of error is a bias. Because this bias varies from sample to sample, it has been called a sample-related random bias. In the following section, the various error components are incorporated into a formal error model.

Measured value, target value, modified target value, and true value

Taking into account that an analytical method measures analyte concentrations with some random measurement error, one has to distinguish between the actual, measured value and the average result we would obtain if the given sample were measured an infinite number of times. If the assay is a reference assay without bias and nonspecificity, we have the following simple relationship:

x_i = X_True_i + ε_i

where x_i represents the measured value, X_True_i is the average value for an infinite number of measurements, and ε_i is the deviation of the measured value from the average value. If we were to undertake repeated measurements, the average of ε_i would be zero and the SD would equal the analytical SD (σ_A) of the reference measurement procedure. Pure, random measurement error will usually be Gaussian distributed.

In the case of a routine assay, the relationship between the measured value for a sample and the true value becomes more complicated:

x_i = X_True_i + Cal-Bias + Random-Bias_i + ε_i

The Cal-Bias term (calibration bias) is a systematic error related to the calibration of the method. This systematic error may be a constant for all measurements corresponding to an offset error, or it may be a function of the analyte concentration (e.g., corresponding to a slope deviation in the case of a linear calibration function). The Random-Bias_i term is a bias that is specific for a given sample, related to nonspecificity of the method. It may arise because of codetermination of substances that vary in concentration from sample to sample. For example, a chromogenic creatinine method codetermines some other components with creatinine in serum. Finally, we have the random measurement error term ε_i.

If we performed an infinite number of measurements of a specific sample by the routine method, the average of the random measurement error term ε_i would be zero. The Cal-Bias and the Random-Bias_i, however, would be unchanged. Thus the average value of an infinite number of measurements would equal the sum of the true value and these bias terms. This average value may be regarded as the target value (X_Target_i) of the given sample for the routine method. We have:

X_Target_i = X_True_i + Cal-Bias + Random-Bias_i

As mentioned, the calibration bias represents a systematic error component in relation to the true values measured by a reference measurement procedure. In the context of regression analysis, this systematic error corresponds to the intercept and the slope deviation from unity when a routine method is compared with a reference measurement procedure (outlined in detail later). It is convenient to introduce a modified target value (X′_Target_i) for the routine method that carries only this systematic calibration bias, so that:

X′_Target_i = X_True_i + Cal-Bias

Thus for a set of samples measured by a routine method, the X_Target_i values are distributed around the respective X′_Target_i values with an SD, which is called σ_RB.

If the assay is a reference method without bias and nonspecificity, the target value and the modified target value equal the true value, that is,

X_Target_i = X′_Target_i = X_True_i

The error model is outlined in Fig. 2.10 .
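The error model can be illustrated by a small simulation. The numerical values below (σ_A = 2, σ_RB = 3, Cal-Bias = 1.5, true concentration 100) are arbitrary choices for illustration, not recommendations:

```python
# Sketch: simulate the basic error model
#   measured value = true value + Cal-Bias + Random-Bias_i + epsilon_i
# and show that averaging replicates converges on the target value,
# not on the true value. All parameter values are illustrative.
import random

random.seed(1)
sigma_A, sigma_RB, cal_bias = 2.0, 3.0, 1.5
x_true = 100.0

# The sample-specific random bias is drawn once and stays fixed
# however many times this sample is remeasured.
random_bias = random.gauss(0, sigma_RB)
x_target = x_true + cal_bias + random_bias    # X_Target_i

# Replicate measurements differ only by pure random error epsilon_i.
reps = [x_target + random.gauss(0, sigma_A) for _ in range(20000)]
mean_rep = sum(reps) / len(reps)

print(round(mean_rep - x_target, 3))  # near 0: replication removes epsilon only
```

Note that no amount of replication shrinks the distance between `mean_rep` and `x_true`; that gap is the combined Cal-Bias and Random-Bias_i, which is why sample-related random bias cannot be reduced by repeated measurement.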

FIGURE 2.10, Outline of basic error model for measurements by a routine assay. A, The distribution of repeated measurements of the same sample, representing a normal distribution around the target value (X_Target_i) (vertical line) of the sample with a dispersion corresponding to the analytical standard deviation, σ_A. B, Schematic outline of the dispersion of target value deviations from the respective true values for a population of patient samples. A distribution of arbitrary form is displayed. The standard deviation equals σ_RB. The vertical line indicates the mean of the distribution. C, The distance from zero to the mean of the target value deviations from the true values represents the calibration bias (mean bias = Cal-Bias) of the assay.

Calibration bias and random bias

For an individual measurement, the total error is the deviation of x_i from the true value, that is,

Total error of x_i = Cal-Bias + Random-Bias_i + ε_i

Estimation of the bias terms requires parallel measurements between the method in question and a reference method, as outlined in detail later. With regard to calibration bias, one should be aware of the possibility of lot-to-lot variation in analytical kits. The manufacturer should provide documentation of this lot-to-lot variation because it is often not possible for an individual laboratory to investigate a sufficient number of lots to assess it. Lot-to-lot variation shows up as a calibration bias that changes from lot to lot.

The previous exposition defines the total error in somewhat broader terms than is often seen. A traditional total error expression is:

Total error = Bias + 2 SD_A

which often is interpreted as the calibration bias plus 2 SD_A. If a one-sided statistical perspective is taken, the expression is modified to Bias + 1.65 SD_A, indicating that 5% of results are located outside the limit. If a lower percentage is desired, the multiplication factor is increased accordingly, supposing a normal distribution. Interpreting the bias as identical with the calibration bias may lead to an underestimation of the total error.
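The relationship between the desired coverage and the multiplication factor follows from the normal quantile function. A minimal sketch, with an illustrative bias of 1.0 and SD_A of 2.0:

```python
# Sketch: one-sided total error limits, Bias + z * SD_A, for different
# coverage levels, assuming normally distributed analytical error.
# The bias and SD values are illustrative only.
from statistics import NormalDist

bias, sd_a = 1.0, 2.0
totals = {}
for coverage in (0.95, 0.99):
    z = NormalDist().inv_cdf(coverage)   # one-sided factor: ~1.645, ~2.326
    totals[coverage] = bias + z * sd_a

print({c: round(t, 2) for c, t in totals.items()})
```

Raising the coverage from 95% to 99% increases the factor from about 1.65 to about 2.33, widening the total error limit accordingly.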

Random bias related to sample-specific interferences may take several forms. It may be a regularly occurring additional random error component, perhaps of the same order of magnitude as the analytical error. In this context, it is natural to quantify the error in the form of an SD or CV. The most straightforward procedure is to carry out a method comparison study based on a set of patient samples in which one of the methods is a reference method, as outlined later. Krouwer formally quantified sample-related random interferences in a comparison experiment of two cholesterol methods and found that the CV of the sample-related random interference component exceeded the analytical CV. Another form of sample-related random interference is more rarely occurring gross errors, which typically are seen in the context of immunoassays and are related to unexpected antibody interactions. Such an error usually shows up as an outlier in method comparison studies. A well-known source is the occurrence of heterophilic antibodies. Outliers should not just be discarded from the data analysis procedure. Rather, outliers must be investigated to identify their cause, which may be an important limitation in using a given assay. Supplementary studies may help clarify such random sample-related interferences and may provide specifications for the assay that limit its application in certain contexts (e.g., with regard to samples from certain patient categories).
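When screening a method comparison for such gross errors, a naive mean ± 3 SD rule can fail because a large outlier inflates the SD and masks itself. A robust cut based on the median and the median absolute deviation (MAD) avoids this; the data and the 3-robust-SD threshold below are illustrative choices, not a prescribed procedure:

```python
# Sketch: flag candidate outliers among paired-method differences using a
# robust center (median) and scale (1.4826 * MAD, consistent with the SD
# under normality). Flagged samples should be investigated, not discarded.
import statistics as st

diffs = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 5.0, -0.3, 0.1, 0.0]

med = st.median(diffs)
mad = st.median([abs(d - med) for d in diffs])
robust_sd = 1.4826 * mad

outliers = [(i, d) for i, d in enumerate(diffs) if abs(d - med) > 3 * robust_sd]
print(outliers)  # -> [(6, 5.0)]
```

With these data, the classical mean ± 3 SD rule would not flag the value 5.0 at all, because that single point roughly quadruples the ordinary SD; the robust version isolates it cleanly.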
