INTRODUCTION

Global health measures are patient-reported evaluations that are used to assess a patient’s overall perspective on his or her health. They contrast with domain-specific patient-reported outcome measures, which target narrower health concepts, such as fatigue, anxiety, or mobility. Given their brevity and broad scope, global health measures are often used by health services researchers. Global health perceptions are predictive of future health care utilization and mortality.1,2 Inclusion of global health is particularly important in comparative effectiveness research (CER) studies of patients with multiple chronic conditions.3 Examples of these measures include the Patient Reported Outcomes Measurement Information System Global Health Scale (PROMIS Global Health)4 and the Veterans RAND 12-Item Health Survey (VR-12).5–10

Administering global health measures allows clinical investigators to measure health change beyond the direct target of the intervention. When treating back pain, for example, it may be useful to know whether global perceptions of mental health also improved and by how much (relative to physical health). In addition, because global health measures are typically brief, their routine administration allows clinicians to compare self-reported health across diverse patient groups.

A problem arises, however, when we wish to compare patient scores on different global health instruments. Comparison is challenging because of differences in the question (or item) content, the response options, and scoring rules. We need a common metric to compare scores. Converting scores from each sample to percentile ranks or z-scores produces scores with comparable units, but the resulting metric would be sensitive to the peculiarities of each sample, such as the size, restricted range, and the distribution of scores (e.g., skewed vs. normal).11 The instruments can also differ in terms of coverage with respect to the levels of health and dysfunction being assessed. For these reasons, it is desirable to conduct a “linking” study in which a large sample of participants responds to questions from both instruments. With such a design, it may be possible to align the scores on different measures to a common metric with greater accuracy.12

Recognizing this issue, the National Cancer Institute funded the PROsetta Stone® project (www.prosettastone.org) to align scores from many outcome instruments.13 We report here the results of a study conducted as part of this project to link the VR-12 scores to the PROMIS Global Health metric. PROMIS Global Health is a 10-item questionnaire that assesses self-reported overall health.4 It was recently developed as part of the PROMIS initiative (www.nihpromis.org), which utilized state-of-the-science qualitative and quantitative methods.14,15 An example of a PROMIS Global Health item is, “In general, how would you rate your mental health, including your mood and your ability to think?”

The VR-12 is a short-form version of the Veterans RAND 36-Item Health Survey (VR-36), which was developed and modified from the RAND-36 v1.0 (MOS SF-36).5–8,10,16 The VR-12 is freely available from the principal developers and is included in an ongoing evaluation of the CMS Medicare Advantage program.17 The VR-12 is also included in the Ambulatory Care Survey of Healthcare Experiences of Patients (SHEP), sponsored by the Veterans Health Administration.10 The VR-12 and the longer VR-36 have been administered approximately 7 million times over the past 15 years, and are represented in the literature by over 150 published articles.9

We linked items from the VR-12 to the PROMIS Global Health scale. Once linked, cross-walk charts for these measures were created such that scores on one instrument were matched with corresponding scores on the other.

METHOD

Measures

PROMIS Global Health

The PROMIS Global Health instrument produces two scores—the Global Physical Health (GPH) and Global Mental Health (GMH)—which are each based on four items. The GPH scale comprises items on physical health, physical functioning, pain intensity, and fatigue. The GMH scale includes items on overall quality of life, mental health, satisfaction with social activities and relationships, and emotional problems. GPH and GMH items were calibrated using the graded response model (GRM).18 The GRM estimates parameters representing item location (level of health) and discrimination (ability to distinguish people at different levels of health). The scores were centered on the 2000 US Census with respect to age, sex, education, and race-ethnicity.4,19 The scores are on a T-score metric that has a mean of 50 and standard deviation of 10. Two PROMIS Global items (general health and social roles) are not used to score the GPH or GMH.
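As an illustration of how the GRM turns an item's discrimination and location parameters into response probabilities, the following is a minimal sketch of the standard graded response model. The function name and the parameter values used in any call are assumptions for illustration only; they are not the published PROMIS calibrations.

```python
import numpy as np


def grm_category_probs(theta, a, b):
    """Category probabilities for one item under the graded response model.

    theta : latent health level on a standard (mean 0, SD 1) metric
    a     : item discrimination
    b     : ordered thresholds (locations), one fewer than the number
            of response categories
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probability of responding in category k or higher.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Bracket with P(X >= lowest) = 1 and P(X > highest) = 0.
    cum = np.concatenate(([1.0], p_star, [0.0]))
    # Adjacent differences give the probability of each category.
    return cum[:-1] - cum[1:]
```

A higher discrimination makes the category probabilities change more sharply as theta moves past each threshold, which is what allows an item to distinguish people at nearby levels of health. A theta value can be placed on the T-score metric as 50 + 10 × theta.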

Veterans RAND 12-Item Health Survey (VR-12)

The VR-12 is a patient-reported instrument from which physical and mental health component summary scores (PCS and MCS) are derived. The VR-12 items assess physical functioning, role limitations due to physical or mental health problems, pain, energy, mental health, social functioning, and general health. The VR-12 modified some of the role limitation questions to improve the instrument’s performance and to make it easier to administer; there are formulas available to convert VR-12 scores to MOS SF scores.7,8 PCS and MCS scores are derived using an algorithm that is referenced to a metric centered at 50.0 using the 2000–2002 US Medical Expenditure Panel Survey population.10 The algorithm imputes missing responses using a modified regression estimate. The model is based on extraction of component scores and an orthogonal rotation in order to minimize the correlation between subscale scores. Responses to each item are used to create the PCS and MCS summary scores. The item weights account for differences in the strength of relationships between individual items and the PCS and MCS. As with PROMIS Global Health, VR-12 scores are standardized using a T-score metric with a mean of 50 and a standard deviation of 10.
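The T-score metric shared by both instruments is a linear rescaling against a reference population. The sketch below shows only that final standardization step; it does not reproduce the VR-12 weighting and imputation algorithm, and the function and argument names are illustrative assumptions.

```python
def to_t_score(score, ref_mean, ref_sd):
    """Rescale a score so the reference population has mean 50 and SD 10.

    ref_mean and ref_sd describe the reference population on the
    original scale; they are inputs here, not values from the article.
    """
    return 50.0 + 10.0 * (score - ref_mean) / ref_sd
```

Under this convention a score one reference standard deviation above the reference mean maps to a T-score of 60.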

Sample

Participants were recruited and data collected by an Internet survey company (www.op4g.com) that maintains a panel of respondents from the general population. To ensure adequate demographic representation, we imposed minimum enrollment requirements for age, gender, race, ethnicity, and education. Only those who indicated that they were 18 years of age or older were allowed to complete the survey. In addition to providing sociodemographic and clinical information and responding to questions on other health domains, participants completed the items of the PROMIS Global Health and the VR-12. Table 1 shows the demographic characteristics of the sample of 2025 participants who completed both measures: 49 % were male, and the mean age was 46 years (SD = 18, range 18 to 100). The sample mean score on the VR-12 was 45.2 (SD = 10.0) for the physical component and 46.6 (SD = 11.1) for the mental component. On PROMIS Global Health, the mean score was 48.3 (SD = 9.0) for physical health and 48.5 (SD = 10.0) for mental health. To check the robustness of the analyses, the sample was split into training and validation halves and analyzed separately, but the results were very similar to those of the entire sample, so only the entire sample results are presented.

Table 1 Demographic Characteristics of Participants Linking VR-12 to PROMIS Global Health (N = 2025)

Analyses

In the PROsetta Stone project, each pair of instruments was typically subjected to multiple linking methods.13,20 This approach includes methods based on IRT and one commonly used non-IRT method (equipercentile linking).21,22 IRT is a family of mathematical models that estimate properties of each item response relative to a single dimension that is measured by all the items.23,24 Previous studies in several health domains have demonstrated that results are highly congruent across methods.13,20,25 Here we report only the results of the fixed IRT calibration, consistent with other PROsetta reports. We fit the data to the GRM,18 the standard IRT model for the calibration of PROMIS instruments.26 Details on methods are provided in Appendix A; we report on the accuracy of linking in Appendix C.

Linking to VR-12 Algorithmic and Sum Scores

Most available linking methods, including all IRT-based links, use individual item scores (or parameters) as the basis for the link. By contrast, equipercentile linking is based on scale scores.21,22 In the present context, there are both strengths and weaknesses to item score and scale score linking. The VR-12 item-weighted scale scores are amenable to equipercentile linking but not to item score-based methods. The VR-12 can also be linked using individual item scores. In order to use the cross-walk tables derived from such links, however, researchers would need access to item-level response data. We elected to conduct linking both on PCS/MCS summary scores and item-level response data. Fixed-parameter IRT linking was conducted using the GRM,18 and incorporated the established PROMIS calibrations.4
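To make the scale-score approach concrete, the following is a minimal, unsmoothed sketch of equipercentile linking in a single-group design: a score on one instrument is matched to the score on the other instrument that has the same percentile rank in the same sample. It is an assumption-laden illustration, not the procedure used in this study; operational equipercentile linking typically adds presmoothing or postsmoothing, which is omitted here.

```python
import numpy as np


def equipercentile_link(x_scores, y_scores, x_value):
    """Map x_value on instrument X to the Y score with equal percentile rank.

    x_scores, y_scores : scores observed on each instrument in the
                         linking sample
    x_value            : the X score to convert
    """
    x_scores = np.asarray(x_scores, dtype=float)
    y_scores = np.asarray(y_scores, dtype=float)
    # Mid-percentile rank: proportion below plus half the proportion tied.
    below = np.mean(x_scores < x_value)
    tied = np.mean(x_scores == x_value)
    rank = below + 0.5 * tied
    # Y score at the same rank (linear interpolation between order statistics).
    return float(np.quantile(y_scores, rank))
```

Because this mapping depends only on the two score distributions, it can be applied to algorithm-derived summary scores without any item-level data.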

VR-12 MCS scores were linked to PROMIS GMH scores; VR-12 PCS were linked to PROMIS GPH scores. For each component (mental or physical), we provide two sets of score cross-walk tables. One cross-walk table associates VR-12 MCS and PCS algorithmic summary scores with PROMIS GMH and GPH T-scores; the other table associates the summed item scores for each VR-12 subscale with PROMIS GPH and GMH T-scores.

Linking Assumptions

There are a number of assumptions made prior to linking, although the single-group design obviates these to some extent. The first assumption is that the linked instruments are measuring essentially the same concept.12,27 We tested this by inspecting item content, calculating the item–total correlations and estimating the proportion of general factor variance of the combined set of items. A second linking assumption is that scores of the measures to be linked are highly correlated. We estimated correlations between PROMIS Global Health and VR-12 scores (algorithm- and sum-score based). In addition to linking assumptions, we tested the unidimensionality assumption of IRT using both confirmatory and exploratory factor analyses. Since IRT calibrations require that the combined item response data are essentially unidimensional, we conducted these analyses only on the combined items (e.g., all physical health items from PROMIS and VR-12). For details, see Appendices B and D.
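One of the checks described above, the item–total correlation, can be sketched as follows. This version uses the "corrected" form, in which each item is correlated with the sum of the remaining items; whether the corrected or uncorrected form was used in the study is not stated here, so treat that choice, and the function name, as assumptions.

```python
import numpy as np


def corrected_item_total(items):
    """Corrected item-total correlations for a respondents-by-items matrix.

    items : 2-D array, one row per respondent, one column per item,
            with all items scored in the same direction
    """
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    correlations = []
    for j in range(items.shape[1]):
        # Remove the item from the total so it is not correlated with itself.
        rest = total - items[:, j]
        correlations.append(np.corrcoef(items[:, j], rest)[0, 1])
    return np.array(correlations)
```

Applied to the combined PROMIS and VR-12 items for one component, low values would flag items that do not track the shared concept. The second assumption, a high correlation between scale scores, can be checked with `np.corrcoef` on the two score vectors.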

Score Cross-walk Tables

While cross-walk tables are readily derived from equipercentile results, an extra step is necessary following IRT-based linking. We used the item parameter estimates derived from the fixed-parameter calibration to construct a cross-walk table by applying expected a posteriori (EAP) summed scoring.28,29 Cross-walk tables map simple raw summed scores from each legacy instrument to T-score values on the PROMIS Global Health metric.
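The extra step can be sketched as below: given GRM parameters for each item, a recursion over items yields the likelihood of every possible summed score at each point on a theta grid, and the posterior mean (EAP) for each summed score is then rescaled to a T-score. This is a generic illustration assuming a standard normal prior, an evenly spaced quadrature grid, and item responses scored from 0; the item parameters in any call are hypothetical and are not the calibrations from this study.

```python
import numpy as np


def eap_sum_score_table(item_params, n_quad=81):
    """T-score for each possible summed score under the GRM.

    item_params : list of (a, b) pairs, with a the discrimination and
                  b the ordered thresholds for one item
    Returns an array indexed by summed score (items scored from 0).
    """
    theta = np.linspace(-4.0, 4.0, n_quad)
    prior = np.exp(-0.5 * theta ** 2)
    prior /= prior.sum()

    # Likelihood of each summed score at each theta; start with zero items.
    like = np.ones((1, n_quad))
    for a, b in item_params:
        b = np.asarray(b, dtype=float)
        p_star = 1.0 / (1.0 + np.exp(-a * (theta[None, :] - b[:, None])))
        cum = np.vstack([np.ones(n_quad), p_star, np.zeros(n_quad)])
        probs = cum[:-1] - cum[1:]  # one row per response category
        n_cat = probs.shape[0]
        n_prev = like.shape[0]
        updated = np.zeros((n_prev + n_cat - 1, n_quad))
        # Adding category k of this item shifts every earlier score by k.
        for k in range(n_cat):
            updated[k:k + n_prev] += like * probs[k]
        like = updated

    posterior = like * prior
    eap_theta = (posterior * theta).sum(axis=1) / posterior.sum(axis=1)
    return 50.0 + 10.0 * eap_theta
```

Each entry of the returned array is one row of a cross-walk table: the index is the raw summed score and the value is the corresponding T-score on the target metric.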

RESULTS

Item Content Overlap

We found substantial overlap between the VR-12 and the PROMIS Global Health item content, but also some differences (see Table 2). Each mental health scale includes an item that targets social activities (an additional PROMIS question on social roles is not used to score any subscale). Both assess feelings or emotion: PROMIS has one question, while the VR-12 has two. While PROMIS GMH has two questions that reference “mental health” and “quality of life,” this wording is not included in the VR-12. Unlike PROMIS, the VR-12 MCS has two questions on how emotional problems interfere with personal productivity.

Table 2 Mental and Physical Health Items (Questions) that were Combined in Order to Create a Cross-Walk Table of Scores

Each physical health scale has a question on pain intensity. Both measures assess physical function: PROMIS has one question, while the VR-12 has two. The VR-12 has an item on “energy,” while PROMIS has an analogous “fatigue” question. Although both measures have a general health question, the VR-12 version contributes mostly to the PCS, while the PROMIS general health question is not used to score any subscale. Finally, unlike PROMIS, the VR-12 PCS also has two questions that ask about role limitations due to physical health.

Correlations and Item Statistics

Pearson correlations were generally below the ideal range for linking (r < 0.80).13,20,30 The correlations between the PROMIS Global Health scales and the VR-12 raw sum scores were higher than correlations between the PROMIS Global Health scales and the VR-12 algorithmic scores (0.80 vs. 0.69 for physical health and 0.69 vs. 0.63 for mental health). Factor analytic results were mixed, with some evidence of multidimensionality, and yet also the presence of a large general factor. Despite less than optimal conditions, we proceeded, because our single sample design allowed us to directly test the accuracy of the link by comparing linked scores to actual scores. For details, please see Appendix B (factor analysis) and Appendix E (Table 1, correlations).

Score Cross-walk Tables

To construct cross-walk tables for IRT-based links, we used the item parameter estimates derived from the fixed-parameter calibrations.28,29 Tables 3, 4, 5 and 6 show four cross-walk tables that associate VR-12 component scores with PROMIS Global Health scores. For both mental and physical health, we provide a summed VR-12 and an algorithmic VR-12 cross-walk.

Table 3 Sum-Based VR-12 Mental Health Component Scores Associated with PROMIS Global Mental Health T-Scores
Table 4 Algorithm-Based VR-12 Mental Health Component Scores Associated with PROMIS Global Mental Health T-Scores
Table 5 Sum Score-Based VR-12 Physical Health Component Scores Associated with PROMIS Global Physical Health T-Scores
Table 6 Algorithm-Based VR-12 Physical Health Component Scores Associated with PROMIS Global Physical Health T-Scores

Inspection of the algorithmic cross-walk tables shows that the population-based mean T-scores are very close for both components. A VR-12 MCS score of 50 is paired with a 50.3 on PROMIS GMH, while a VR-12 PCS score of 50 is paired with a score of 50.6 on PROMIS GPH. This result is expected and confirming, since both measures were centered on a US general population and both use the T-score metric. Given the different times when the metrics were set (approximately a decade apart), small differences were expected. The linking relationships are approximately linear until a T-score of about 55 (see Appendix E, Fig. 7), at which point the slope decreases. This may be due to the different methodologies used in instrument development, differences in item content, or lack of precision in measuring respondents at upper ranges.
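For readers who want to apply a cross-walk table programmatically, the following is a minimal lookup sketch. The table rows in the example are placeholders: only the pairing of a VR-12 MCS score of 50 with a PROMIS GMH score of 50.3 comes from the text above, and the other two rows are made-up values that must be replaced with entries from the published tables. Linear interpolation between rows is also an assumption, not a documented part of the tables.

```python
import numpy as np


def crosswalk_lookup(score, table_scores, table_t):
    """Convert a score using a cross-walk table, interpolating between rows.

    table_scores : source-instrument scores in increasing order
    table_t      : linked T-scores, one per row of table_scores
    """
    table_scores = np.asarray(table_scores, dtype=float)
    table_t = np.asarray(table_t, dtype=float)
    if score < table_scores[0] or score > table_scores[-1]:
        raise ValueError("score is outside the range covered by the table")
    return float(np.interp(score, table_scores, table_t))


# PLACEHOLDER rows for illustration only; 40 -> 41.0 and 60 -> 58.0 are
# invented and do NOT come from Tables 3 to 6.
PLACEHOLDER_MCS = [40.0, 50.0, 60.0]
PLACEHOLDER_GMH = [41.0, 50.3, 58.0]
```

Refusing to extrapolate beyond the table is a deliberate choice in this sketch, since the text notes that the linking relationship changes slope at the upper end of the range.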

DISCUSSION

This study provides tables that allow clinical researchers to convert VR-12 scores to PROMIS Global scores and vice versa. We did this by administering both measures to a large sample (N = 2025) and then co-calibrating the item responses. These links permit the conversion of the VR-12 regardless of whether the scoring is based on the algorithmic or the sum score methods. We favor the latter approach, as the sum score linking tables do not require complex computer algorithms or specialized software, and they are easily adapted. For users who have item-level data, we recommend summing VR-12 items (according to Table 2) and obtaining linked PROMIS T-scores from Tables 3 and 5. The correlations between the VR-12 sum scores and PROMIS Global Health scores were higher, and their deviations smaller, than those obtained after linking based on algorithm-derived component scores. This was particularly the case for the mental health scales.

There are several important clinical applications of this work. For example, our tables can be used to migrate performance improvement data to a single reporting metric. Clinical decision-makers at medical centers where the VR-12 is routinely administered may wish to replace it with the newer PROMIS Global Health instrument. Using our conversion tables, they can take advantage of historical patient data collected with the VR-12. Likewise, individual clinicians or departments may wish to integrate PROMIS Global Health into their practice, given its brevity and simplicity.4,31 Our results allow users to leverage the accumulated evidence in support of a “legacy” instrument (such as the VR-12) to provide benchmark data and reference points,10 though further validation of such points would be required.

Our results also facilitate CER. In CER, the accurate aggregation of results from multiple outcome studies is important. The individual studies that such meta-analyses comprise often use different instruments. Our results will improve the accuracy of such pooling, because individual results can be converted from the VR-12 to the PROMIS metric (and vice versa). Ultimately, this will allow clinical researchers to better understand the effects of particular treatments on global health.

While these results make linking possible and justifiable for group-level data, they are not as robust as results from similar work with questionnaires measuring single concepts like depression13 and fatigue.25 The raw score Pearson correlations (0.63 to 0.80) were lower than suggested thresholds for linking.13,30 This is likely due to the breadth of the concept of global health, and the observed differences in item content, wording, and format. In addition, the scoring of these questionnaires is more complex than the more typical approach based upon raw sum scoring. PROMIS Global Health is scored using IRT-based parameters as applied to each of the four items that contribute to the total score. Unlike the PROMIS Global Health scales, the VR-12 components are weighted according to a factor analytic technique that forces the components to be orthogonal (unrelated), which will lead to inconsistent results compared to simple sum scores.32 The impact of this difference in scoring between PROMIS Global Health and the VR-12 is evident in the generally weaker justification for linking scores based on the algorithmic approach.

There are some general limitations inherent in the linking process that should be noted. First, scores linked to the PROMIS metric based on the VR-12 may have more error than scores obtained directly from the PROMIS Global Health measure and vice versa (i.e., linking error is added to measurement error). Note, however, that the impact of this linking error can be reduced when users convert scores for a sample and compute its mean. For example, if one converts the VR-12 scores of 50 patients into PROMIS T-scores, the mean of those T-scores will be much more likely to reflect the “true” mean T-score relative to the conversion of 10 patient scores or a single one (see Appendix C). Also, our linking tables should be used with the knowledge that concordance between the scores of any two instruments may be sensitive to population differences.12 This potential limitation, however, is mitigated by our relatively large sample size.
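The benefit of averaging can be illustrated with the usual standard error of a mean. The sketch assumes that individual linking errors are independent and have no systematic bias, which is a simplification; the error standard deviation passed in is a hypothetical input, not an estimate from Appendix C.

```python
import math


def linking_error_of_mean(error_sd, n):
    """Standard error of a group mean of linked scores.

    error_sd : standard deviation of individual linking error
               (hypothetical input, in T-score units)
    n        : number of patients whose converted scores are averaged
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return error_sd / math.sqrt(n)
```

Under these assumptions the error in a mean of 50 converted scores is about 0.45 times the error in a mean of 10, and about 0.14 times the error in a single converted score.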

The linking tables provided in this article are an important tool for clinical researchers who rely on the new PROMIS Global Health measures but are interested in comparing findings of contemporary work with many studies previously conducted with the VR-12. Data collected previously using the VR-12 can also be translated for comparison with current PROMIS Global Health scores. Future linking will advance the field even further by bringing together other measures of global health into one common metric.