Background
Psychological assessment is a professional consultation service aimed at providing clinicians with a more complete and empirically-based picture of their patients across a number of relevant clinical domains (e.g., intellectual and neurocognitive functioning, psychopathology and clinical diagnosis, personality style).
The evolution of psychological assessment over the last century has resulted in improved reliability and clinical validity, which in turn has enhanced patient care and clinical outcomes.
History
Psychometrics is a broad field of study devoted to the development and application of new instruments designed to measure different psychological constructs.
Historically, there have been multiple approaches to psychological assessment, using both rational and empirically-driven models for the development of psychological instruments.
Psychological assessment is a complex process that incorporates a multi-method approach to data collection and integration, as a means of maximizing clinical utility.
Clinical and Research Challenges
Psychological instruments must first undergo rigorous evaluations of reliability and validity prior to implementation, which should also be replicated across settings and populations.
Practical Pointers
One of the benefits of psychological assessment is its ability to quantify and clarify confusing psychiatric presentations.
It is increasingly common for patients to receive feedback about the test findings directly from the evaluating psychologist.
Psychological assessment with patient feedback has been demonstrated to facilitate the treatment process.
Psychological assessment is a consultation service that has great potential to enhance and improve clinicians' understanding of their patients and facilitate the treatment process. In spite of this, psychological assessment consultations are underutilized in the current mental health care environment. This is unfortunate given strong evidence that psychological testing generally produces reliability and validity coefficients similar to those of many routine diagnostic medical tests. This chapter will provide a detailed review of what a psychological assessment comprises and discuss the potential benefits of an assessment consultation. This will be accomplished by reviewing the methods used to construct valid psychological instruments, the major categories of psychological tests (including detailed examples of each category), and the application and utility of these instruments in clinical assessment. Issues relating to the ordering of psychological testing and the integration of information from an assessment report into the treatment process will also be presented.
Psychometrics is a broad field of study devoted to the development and application of new instruments designed to measure different psychological constructs (e.g., depression, impulsivity, personality style). Traditionally, there have been three general test development strategies employed to guide test construction: rational, empirical, and construct validation methods.
Rational test construction relies on a theory of personality or psychopathology (e.g., the cognitive theory of depression) to guide the construction of a psychological test. The process of item and scale development is conducted in a fashion to operationalize the important features of a theory. The Millon Clinical Multiaxial Inventory (MCMI) is an example of a test that was originally developed using primarily a rational test construction process.
Empirically-guided test construction, in contrast, begins with a large number of items (called an item pool) and then employs various statistical methods to determine which items differentiate known clinical groups of subjects (a process termed empirical keying). The items that successfully distinguish one group from another are organized to form a scale without regard to their thematic content or “face validity.” The Minnesota Multiphasic Personality Inventory (MMPI) is an example of a test developed using this method.
The construct validation method combines aspects of both the rational and the empirical test construction methodologies. Within this framework, a large pool of items is written to reflect a theoretical construct (e.g., impulsivity); then these items are empirically evaluated to determine whether they actually differentiate subjects who are expected to differ on the construct (impulsive vs. non-impulsive subjects). Items that successfully differentiate known clinical groups and that meet other psychometric criteria (e.g., adequate internal consistency) are retained for the scale. In addition, if theoretically-important items do not differentiate between known groups, this finding may lead to a revision in the theory. The construct validation methodology is considered the most sophisticated strategy for test development. The Personality Assessment Inventory (PAI) is an example of a test developed with a construct validation approach.
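As a purely illustrative sketch of the empirical item-selection logic described above (the data, the 0.20 retention threshold, and the item counts are invented for the example and are not drawn from any actual instrument), the following Python snippet retains items that differentiate a simulated clinical group from a comparison group and then checks the internal consistency of the resulting scale:

```python
# Illustrative sketch of empirical item selection; all data are simulated.
import numpy as np

rng = np.random.default_rng(0)

n_items, n_per_group = 20, 200
# Simulated true/false item endorsements: the clinical group endorses the
# first 8 items more often than the comparison group.
p_clinical = np.r_[np.full(8, 0.65), np.full(12, 0.30)]
p_control = np.full(n_items, 0.30)
clinical = rng.random((n_per_group, n_items)) < p_clinical
control = rng.random((n_per_group, n_items)) < p_control

# Retain items whose endorsement rates differ between groups by a chosen
# margin (a stand-in for a formal significance test).
diff = clinical.mean(axis=0) - control.mean(axis=0)
retained = np.where(np.abs(diff) > 0.20)[0]
print("Retained items:", retained)

# A simple internal-consistency check (Cronbach's alpha) on the retained scale.
scale = np.concatenate([clinical, control])[:, retained].astype(float)
k = scale.shape[1]
alpha = k / (k - 1) * (1 - scale.var(axis=0, ddof=1).sum() / scale.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha for retained items: {alpha:.2f}")
```

In actual test development, much larger item pools, formal significance testing, and cross-validation samples are used; the snippet only mirrors the overall logic of retaining items that both separate known groups and hang together internally.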
To be meaningfully employed in research and/or clinical contexts, psychological tests must meet the minimum psychometric standards for reliability and validity. Reliability represents the repeatability, stability, or consistency of a subject's test score, and it is usually represented as some form of a correlation coefficient (ranging from 0 to 1.0). Research instruments can have reliability scores as low as .70, whereas clinical instruments should have reliability scores in the high .80s to low .90s. This is because research instruments are interpreted aggregately as group measures, whereas clinical instruments are interpreted for a single individual and thus require a higher level of precision. A number of reliability statistics are available for evaluating a test: internal consistency (the degree to which the items in a test perform in the same manner), test-retest reliability (the consistency of a test score over time, which typically ranges from a few days to a year), and inter-rater reliability (as seen on observer-judged rating scales). The kappa statistic is considered the best estimate of inter-rater reliability, because it reflects the degree of agreement between raters after accounting for chance agreement. Error that lowers reliability can be introduced by variability in the subject (changes in the subject over time), in the examiner (rater error, rater bias), or in the test itself (e.g., when it is administered with different instructions).
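To make the inter-rater statistic concrete, here is a minimal Python sketch of Cohen's kappa; the two raters and their diagnostic ratings are hypothetical and are not taken from the chapter:

```python
# Minimal sketch of Cohen's kappa for two raters; ratings are hypothetical.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal category proportions.
    pa, pb = Counter(rater_a), Counter(rater_b)
    expected = sum((pa[c] / n) * (pb[c] / n) for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

# Two clinicians rating the same 10 patients as "dep" (depressed) or "not".
rater_a = ["dep", "dep", "not", "dep", "not", "not", "dep", "not", "dep", "not"]
rater_b = ["dep", "dep", "not", "not", "not", "not", "dep", "not", "dep", "dep"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # 0.60 here, despite 80% raw agreement
```

The example shows why kappa is preferred over raw agreement: the two raters agree on 8 of 10 patients, yet kappa is only .60 once chance agreement is removed.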
Validity is a more difficult concept to understand and to demonstrate than is reliability. The validity of a test reflects the degree to which the test actually measures the construct it was designed to measure (also known as construct validity). This is often demonstrated by comparing the test in question with an already established measure (or measures). As with reliability, validity measures are usually represented as correlation coefficients (ranging from 0 to 1.0). Validity coefficients are typically squared (reported as R²) to reflect the amount of variance shared between two or more scales. Multiple types of data are needed before a test can be considered valid. Content validity assesses the degree to which an instrument covers the full range of the target construct (e.g., a test of depression that does not include items covering disruptions in sleep and appetite would have limited content validity). Predictive validity refers to how effective a test is in predicting future occurrences of the construct, while concurrent validity shows how well it correlates with other existing measures of the same construct. Convergent validity and divergent validity refer to the ability of scales that use different methods (interview vs. self-report) to measure the same construct (convergent validity), while also having low or negative correlations with scales that measure unrelated traits (divergent validity). Taken together, the convergent and divergent correlations indicate the specificity with which the scale measures the intended construct. It is important to realize that, no matter how much affirmative data have accumulated, psychological tests are not themselves considered valid; rather, it is the scores from tests that are valid in specific situations for making specific decisions.
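The convergent/divergent logic can be illustrated with a small simulation. The scales, sample size, and noise levels below are hypothetical, chosen only to show how the correlations and the squared validity coefficient (R²) are computed:

```python
# Hypothetical illustration of convergent and divergent validity: a new
# self-report depression scale should correlate highly with an established
# depression interview and weakly with an unrelated trait. Data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 100
true_depression = rng.normal(size=n)

new_scale = true_depression + rng.normal(scale=0.5, size=n)   # new self-report measure
interview = true_depression + rng.normal(scale=0.5, size=n)   # established interview measure
extraversion = rng.normal(size=n)                             # unrelated construct

r_convergent = np.corrcoef(new_scale, interview)[0, 1]
r_divergent = np.corrcoef(new_scale, extraversion)[0, 1]

# Squaring a validity coefficient gives the proportion of shared variance.
print(f"convergent r = {r_convergent:.2f}, shared variance R^2 = {r_convergent**2:.2f}")
print(f"divergent  r = {r_divergent:.2f}")
```

A high convergent correlation paired with a near-zero divergent correlation is the pattern that supports the claim that the new scale measures depression specifically, rather than distress or responding style in general.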
There are a myriad of techniques available to facilitate making a psychiatric diagnosis and informing treatment planning, but they do not necessarily qualify as psychological tests. A psychological test is defined as a measurement tool that is made up of a series of standard stimuli (i.e., questions or visual stimuli), which are administered in a standardized manner. Responses to the stimuli are then recorded and scored according to a standardized methodology (ensuring that a given response is always scored the same way) and the patient's test results are interpreted against a representative normative sample.
Alfred Binet (1857–1911) is credited with developing the first true measure of intelligence. Binet and Theodore Simon were commissioned by the French School Board to develop a test to identify students who might benefit from special education programs. Binet's 1905 and 1908 scales form the basis of our current intelligence tests. In fact, it was the development of Binet's 1905 test that marked the beginning of modern psychological testing. His approach was practical and effective, as he developed a group of tests with sufficient breadth and depth to separate underachieving children with normal intellectual ability from those who were underachieving because of lower intellectual ability. In addition to mathematical and reading tasks, Binet also tapped into other areas (such as object identification, judgment, and social knowledge). About a decade later at Stanford University, Lewis Terman translated Binet's test into English, added additional items, and made some scoring revisions. Terman's test is still in use today and is called the Stanford-Binet Intelligence Scales.
David Wechsler, who had helped assess recruits during World War I, later combined what were essentially the Stanford-Binet verbal tasks with his own tests to form the Wechsler-Bellevue test (1939). Unlike the Stanford-Binet test, the Wechsler-Bellevue test produced a full-scale intelligence quotient (IQ) score, as well as separate measures of verbal and non-verbal intellectual abilities. The use of three scores for describing IQ became popular with clinicians, and the Wechsler scales were widely adopted. To this day, the Wechsler scales continue to be the dominant measure of intellectual capacity used in the United States.
Intelligence is a hard construct to define. Wechsler wrote that “intelligence, as a hypothetical construct, is the aggregate or global capacity of the individual to act purposefully, to think rationally, and to deal effectively with the environment.” This definition helps clarify what the modern IQ tests try to measure (i.e., adaptive functioning) and why intelligence or IQ tests can be important aids in clinical assessment and treatment planning. If an IQ score reflects aspects of effective functioning, then IQ tests measure aspects of adaptive capacity. The Wechsler series of instruments for assessing intellectual functioning covers the majority of the human age range, beginning with the Wechsler Preschool and Primary Scale of Intelligence (ages 4–6 years), followed by the Wechsler Intelligence Scale for Children-IV (6–16 years) and the Wechsler Adult Intelligence Scale-IV (16–90 years). More recently, the Wechsler Abbreviated Scale of Intelligence-II (WASI-II; Wechsler, 2011) was developed to provide briefer (yet reliable) measures of overall intelligence, as well as of verbal and non-verbal abilities.
Over time, the Wechsler series has evolved from providing three over-arching measures of intellectual functioning (the Full Scale IQ, Verbal IQ, and Performance IQ) into a more nuanced model of cognitive functioning (Verbal Comprehension, Perceptual Reasoning, Working Memory, and Processing Speed). As with their predecessors, the current Wechsler scales are scored to have a mean of 100 and a standard deviation (SD) of 15 in the general population, which allows a patient to be compared against a normative standard. This approach to scoring also allows clinicians to note any meaningful discrepancies between verbal and non-verbal functioning; in many cases a difference of about 15 points (or one standard deviation) can be considered both statistically significant and clinically meaningful. Table 7-1 presents an overview of IQ categories.
Full-Scale IQ Score | Intelligence (IQ) Category | Percent of Normal Distribution (%) |
---|---|---|
≥130 | Very superior | 2.2 |
120–129 | Superior | 6.7 |
110–119 | High average | 16.1 |
90–109 | Average | 50.0 |
80–89 | Low average | 16.1 |
70–79 | Borderline | 6.7 |
≤69 | Extremely low | 2.2 |
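Because the Full-Scale IQ is scaled to a mean of 100 and an SD of 15, the percentile rank of any score follows directly from the normal distribution. The short sketch below assumes only that scaling (the specific scores printed are arbitrary examples) and approximately reproduces the percentages in Table 7-1:

```python
# Percentile rank of an IQ score under the stated normal(100, 15) scaling.
from math import erf, sqrt

MEAN, SD = 100.0, 15.0

def iq_percentile(iq):
    """Percentile rank of an IQ score in a normal(100, 15) distribution."""
    z = (iq - MEAN) / SD
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

for iq in (69, 85, 100, 115, 130):
    print(f"IQ {iq:>3} -> {iq_percentile(iq):5.1f}th percentile")

# Approximate percent of the population at or above the "very superior" cutoff:
print(f"Percent at or above 130: {100 - iq_percentile(130):.1f}%")
```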
IQ scores do not represent a patient's innate, unchangeable intelligence. Rather, it is most accurate to view IQ scores as representing a patient's ordinal position, or percentile ranking, on the test relative to the normative sample at any given time. In other words, a score at the 50th percentile is higher than the scores of 50% of the individuals in the patient's age bracket. Clinically, IQ scores can be thought of as representing the patient's current level of adaptive function. Furthermore, because IQ scores contain some degree of measurement and scoring error, they should be reported with confidence intervals indicating the range of scores in which the subject's true IQ is likely to fall.
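One conventional way to construct such a confidence interval uses the standard error of measurement, SEM = SD × √(1 − reliability). The sketch below applies that standard psychometric formula; the reliability value of .97 is illustrative rather than a figure reported in this chapter:

```python
# Confidence interval for a true score, given the test's reliability.
from math import sqrt

def iq_confidence_interval(observed_iq, reliability, sd=15.0, z=1.96):
    """95% confidence interval for the true score around an observed IQ."""
    sem = sd * sqrt(1 - reliability)          # standard error of measurement
    return observed_iq - z * sem, observed_iq + z * sem

low, high = iq_confidence_interval(105, reliability=0.97)  # reliability is illustrative
print(f"Observed IQ 105, 95% CI: {low:.0f}-{high:.0f}")
```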
The Wechsler IQ tests are composed of 10 to 15 subtests designed to measure more discrete domains of cognitive functioning including: Verbal Comprehension (VCI: Similarities, Vocabulary, Information, Comprehension); Perceptual Reasoning (PRI: Block Design, Matrix Reasoning, Visual Puzzles, Figure Weights, Picture Completion); Working Memory (WMI: Digit Span, Arithmetic, Letter-Number Sequencing); and Processing Speed (PSI: Symbol Search, Coding, and Cancellation). Subtests are scored to have a mean of 10 and standard deviation of 3, which again allows for different interpretations to be made about someone's level of functioning based on score variability. It is also important to note here that all Wechsler scores are adjusted for age.
One of the initial strategies for interpreting a patient's WAIS performance is to review the consistency of scores. For example, an IQ of 105 falls within the average range and by itself would not raise any “red flags.” However, this score may occur in situations where all composite index scores fall in the average range (reflecting minimal variability), or in cases where verbal scores are quite high and non-verbal scores are quite low (indicating much greater variability in functioning). The clinical implications in these two scenarios are quite different and would lead to very different interpretations, and thus an examination of discrepancies is essential when interpreting the profile. However, the existence of a discrepancy does not always indicate an abnormality. In fact, small to medium discrepancies are not uncommon even in the general population. Typically, discrepancies of 12 to 15 points are needed before they can be considered significant, and such discrepancies should be noted in the report.
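As a simple illustration of this discrepancy check (the index scores and the 12-point threshold below are hypothetical examples chosen for the sketch, not normative cutoffs from a test manual), one might flag pairs of composite scores that differ by a potentially meaningful amount:

```python
# Flag composite-index discrepancies that reach a chosen threshold.
from itertools import combinations

INDEXES = {"VCI": 118, "PRI": 96, "WMI": 104, "PSI": 101}  # hypothetical composite scores
THRESHOLD = 12  # points; roughly the lower bound of the 12-15 point range noted above

for (name_a, score_a), (name_b, score_b) in combinations(INDEXES.items(), 2):
    diff = abs(score_a - score_b)
    if diff >= THRESHOLD:
        print(f"{name_a} vs {name_b}: {diff}-point discrepancy (worth noting in the report)")
```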
In sum, although all measures of intelligence are highly intercorrelated, intelligence is best thought of as a multifaceted phenomenon. In keeping with Binet's original intent, IQ tests should be used to assess individual strengths and weaknesses relative to a normative sample. Too often, mental health professionals become overly focused on the Full-Scale IQ score and fall into the proverbial problem of missing the trees for the forest. To counter this error, knowledge of the subtests and indexes of the WAIS-IV is essential to understanding the complexity of an IQ score.
Modern objective personality assessment (more appropriately called self-reports) has its roots in World War I when the armed forces turned to psychology to help assess and classify new recruits. Robert Woodworth was asked to develop a self-report test to help assess the emotional stability of new recruits in the Army. Unfortunately, his test, called the Personal Data Sheet, was completed later than anticipated and it had little direct impact on the war effort. However, the methodology used by Woodworth would later influence the development of the most commonly used personality instrument, the MMPI.
Hathaway and McKinley (1943) published the original version of the MMPI at the University of Minnesota. (Although the original version of the MMPI was produced in 1943, the official MMPI manual was not published until 1967.) The purpose of the test was to be able to differentiate psychiatric patients from normal individuals, as well as to accurately place patients in the proper diagnostic group. A large item pool was generated, and hundreds of psychiatric patients were interviewed and asked to give their endorsement on each of the items. The same was done with a large sample of people who were not receiving psychiatric treatment. The results of this project showed that while the item pool did exceptionally well in differentiating normals from clinical groups, differentiating one psychiatric group from another was more difficult. A major confounding factor was that patients with different conditions tended to endorse the same items; this led to scales with a high degree of item overlap (i.e., items appeared on more than one scale). This method of test development, known as empirical keying (described earlier), was innovative for its time because most personality tests preceding it were based solely on items that test developers theorized would measure the construct in question (rational test development). The second innovation introduced with the MMPI was the development of validity scales that were intended to identify the response style of test takers. In response to criticisms that some items contained outdated language and that the original normative group was considered a “sample of convenience,” the MMPI was revised in 1989. The MMPI-2 is the result of this revision process, and it is the version of the test in most common use today.