
Appropriateness criteria: a useful tool for the cardiologist
  1. Paul G Shekelle
  1. Dr P G Shekelle, The RAND Corporation, 1776 Main Street, Santa Monica, CA 90407, USA; shekelle{at}rand.org


An increasing number of publications have described appropriateness criteria for the use of diagnostic and therapeutic procedures in cardiology. In this editorial I will discuss how appropriateness criteria are developed, and the evidence for their usefulness in the practice of cardiology.

The Appropriateness Method was developed as a pragmatic solution to the problem of assessing for which patients certain surgical and medical procedures are “appropriate”. In this context, “appropriate” means that the benefits sufficiently exceed the risks so that the procedure is worth doing. Twenty-five years ago geographical variations in the use of procedures were considerable. A hypothesis at that time was that any rate of use of a procedure higher than the lowest identified rate probably represented “inappropriate” overuse of the procedure. Investigators at RAND and UCLA set out to test this hypothesis. They hoped to compare the rate of appropriate and inappropriate use at high- and low-use geographical sites, assuming that “appropriate” use for a procedure could be determined from a review of published medical reports. Having been disabused of this notion after examining published reports on six procedures, the investigators were left with the problem of how to make such assessments.

Several fundamental concepts helped to shape their solution:

  • Medical publications alone are insufficient for judgments to be made about appropriateness for most potential indications for any procedure. Clinical judgment is needed to “fill in the gaps”.

  • All the clinical disciplines involved in the care of a certain condition have something to contribute to the determination of appropriateness. Clinical input should come from a multidisciplinary group.

  • Patients need to be described in sufficient clinical detail such that what is being rated is sufficiently homogeneous with respect to risks and benefits that the rating can be meaningfully applied to all people who meet the criteria.

  • The results should be comprehensive and applicable to most clinical situations for which a patient may seek, or be offered, the procedure. Thus, a very large number of clinical situations needs to be considered.

  • Resources needed to apply the method must be available.

DEVELOPMENT OF THE RAND–UCLA APPROPRIATENESS METHOD

With these concepts in mind, and incorporating elements of established group process methods, such as the Delphi and the Nominal Group Technique, the investigators developed the RAND–UCLA Appropriateness Method. Originally, a nine-member mixed panel of clinical specialists carried out the rating process, but currently panels comprise between six and 15 members. The panel members are provided with a “state-of-the-art literature review” describing what is known and not known about the risks and benefits of the procedure. On a nine-point scale (1 being lowest and 9 highest) they rate the “appropriateness” of performing the procedure for a comprehensive set, often many hundreds, of different specific clinical scenarios. “Appropriateness” is defined as meaning “the health benefits exceed the health risks by a sufficiently wide margin that the procedure is worth doing”. A rating of 5 means that the risks and benefits are about equal. Panel members are asked to use the clinical literature and their best clinical judgment to assess the appropriateness of the procedure being performed by an average clinician. Additionally, panellists are given a set of specific definitions for any terms with potentially ambiguous definitions, such as “a positive exercise treadmill test”, or “having failed medical therapy”. This is done for two reasons—first, so that all the panellists use the same frame of reference, and second so that the results can be applied to real cases in a reproducible manner.

Panellists rate the various clinical scenarios for appropriateness using the nine-point scale and then attend a group meeting. At this meeting there is a discussion of ratings led by a moderator, concentrating on those clinical scenarios where there is a wide range of scores reflecting disagreement among the panellists. Each panellist has before them the first round ratings, so for each clinical scenario the panellist can see what the group ratings were and also an indication of what his or her own rating was. Panellists are given the opportunity to change their ratings after the discussion. Disagreement usually stems from one of three sources: (a) the definitions of clinical terms were not clear or not understood by all panel members; (b) there are new studies about which only some panel members are aware; (c) panellists’ interpretation of the studies or their own clinical experience, or both. The moderator seeks to resolve differences in the first two circumstances while not seeking to force agreement on the third.

The ratings that emerge from the group meeting are used for the analysis. The first determination is whether or not the panel “disagreed” about the rating for a specific scenario. Different definitions have been proposed and used for disagreement, but the most commonly used definition (for a nine-member panel) is three panellists rating a scenario in the lowest tertile of appropriateness (1-2-3) and three panellists rating the same scenario in the upper tertile (7-8-9). All scenarios for which there is disagreement are classified as of “uncertain” appropriateness. This is usually <5% of all scenarios. In the absence of disagreement, the median panel rating is used to make the classification. Median panel ratings in the lowest tertile are classified as “inappropriate”, those in the upper tertile are classified as “appropriate”, and those with a median rating of 4-5-6 (together with all for which there is “disagreement”) are classified as “uncertain”. An additional step is sometimes taken, using a third round of ratings, to further divide the “appropriate” scenarios into those which are “necessary”, meaning that it would be unacceptable not to offer the procedure, and “appropriate but not necessary”, meaning that it is acceptable to perform the procedure but not improper not to perform it. Analyses in research studies frequently assess only the extremes of these scales—those cases that are classified as “inappropriate” (for procedures that were delivered but should not have been), and those classified as “necessary” for which patients were not offered the procedure (for procedures that should have been delivered but were not).
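The classification rule just described is mechanical enough to express in code. The sketch below is a minimal Python rendering for a nine-member panel, using the most commonly cited definition of disagreement; the function name and structure are my own, not part of the published method.

```python
def classify_scenario(ratings):
    """Classify one clinical scenario from nine panel ratings (scale 1-9).

    Returns "appropriate", "uncertain", or "inappropriate" per the
    RAND-UCLA rule: disagreement forces "uncertain"; otherwise the
    median panel rating determines the category.
    """
    assert len(ratings) == 9 and all(1 <= r <= 9 for r in ratings)

    # Disagreement (nine-member definition): at least three ratings in
    # the lowest tertile (1-3) AND at least three in the upper tertile (7-9).
    low = sum(1 for r in ratings if r <= 3)
    high = sum(1 for r in ratings if r >= 7)
    if low >= 3 and high >= 3:
        return "uncertain"  # disagreement overrides the median

    # Otherwise classify by the median panel rating.
    median = sorted(ratings)[4]  # middle value of nine
    if median <= 3:
        return "inappropriate"
    if median >= 7:
        return "appropriate"
    return "uncertain"  # median of 4, 5 or 6
```

For example, ratings of [1, 1, 2, 9, 9, 8, 5, 5, 5] are classified as “uncertain” by disagreement even though the median alone would say otherwise.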

CONCERNS ABOUT THE APPROPRIATENESS METHOD EXAMINED

Given the somewhat arbitrary nature of key elements of the method and the “globally implicit” nature of the decision-making process, it is no surprise that the Appropriateness Method has been viewed with considerable suspicion by many observers.13 These critiques have centred on the potential variability in the process due to the composition of the panel, the role of the moderator, the possibility of misclassification bias of individual scenarios, a lack of specificity about what outcomes are being considered for individual scenarios and worry that the ratings reflect nothing more than codifying existing clinical dogma. A substantial amount of methodological research has been done to try to assess these concerns, and much of it is relevant to the practice of cardiology.

Appropriateness criteria resemble diagnostic tests, in that they are used to classify patients into categories with different expected responses to treatment. Important attributes of diagnostic tests include reliability, sensitivity and specificity. Most of the concern about reliability centres on the influence of the selection of panel members on the outcome. It is well established that the specialty of the panellists significantly influences the outcome, with panels comprising all the same specialty rating procedures as more appropriate than panels with a mix of specialties.4 But even with the composition fixed, there is concern about the influence of the particular panellists chosen. To investigate this, we conducted an experiment in which we randomised experts to serve on one of three different panels to determine appropriateness criteria for coronary revascularisation.5 Each panel had the same composition: three cardiac surgeons, three interventional cardiologists, one non-interventional cardiologist and two primary care practitioners. Each panel had a separate moderator, but otherwise they received the same publications and rated the same scenarios. A comparison of the results of the three panels with the actual care received by several thousand cases in New York state showed that the κ statistic (a measure of reliability that varies from 0 to 1.0) was between 0.4 and 0.6 for the determination of inappropriate overuse (of revascularisation), while κ was between 0.82 and 0.85 for the determination of inappropriate underuse (of revascularisation). Values of κ between 0.4 and 0.6 are usually considered to represent “moderate” agreement, while values above 0.8 are considered to represent “near-perfect” agreement.

For comparison, the κ statistic for the interpretation of a thallium scintigram treadmill test has been estimated at 0.45 and 0.66,6 7 while the κ statistic for the interpretation of coronary angiography has been estimated at 0.53.8 So, while the use of appropriateness criteria to identify potentially inappropriate overuse of revascularisation is far from perfect, it is about as reliable as the diagnostic tests commonly used in cardiology, while the use of appropriateness criteria to identify potentially inappropriate underuse of revascularisation has a measure of reliability uncommon in medicine.
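The κ statistic quoted throughout these comparisons corrects observed agreement for the agreement expected by chance. A minimal sketch of Cohen’s κ for two raters (for example, two panels each labelling the same cases as “inappropriate” or not) is shown below; this is the standard formula, though the cited studies may have used related variants.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' labels on the same cases.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement),
    where chance agreement comes from each rater's marginal label rates.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from the two raters' marginal rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```

Two raters who agree no better than chance yield κ = 0; perfect agreement yields κ = 1, which is the scale on which the 0.4–0.6 (“moderate”) and above 0.8 (“near-perfect”) values in the text are read.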

We later used these same data to estimate the sensitivity and specificity of the appropriateness criteria at identifying inappropriate underuse and overuse of care.9 The sensitivity of identifying inappropriate underuse was estimated as 94% with a specificity of 97%, while the sensitivity of identifying inappropriate overuse was estimated as 68% with a specificity of 99%. Again, for comparison purposes, the sensitivity and specificity of thallium scintigraphy has been reported as 84% and 87%, respectively.10
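Sensitivity and specificity here have their usual confusion-matrix definitions: of the cases the reference judgment deems truly inappropriate (under- or overuse), what fraction do the criteria flag, and of the truly acceptable cases, what fraction do they leave unflagged. A small illustrative sketch, with names and data of my own invention:

```python
def sensitivity_specificity(flagged, truth):
    """Compute (sensitivity, specificity) from parallel boolean lists.

    flagged: True where the appropriateness criteria flag the case.
    truth:   True where the reference judgment says the case is
             genuinely inappropriate care.
    """
    tp = sum(f and t for f, t in zip(flagged, truth))          # correctly flagged
    fn = sum((not f) and t for f, t in zip(flagged, truth))    # missed cases
    tn = sum((not f) and (not t) for f, t in zip(flagged, truth))  # correctly cleared
    fp = sum(f and (not t) for f, t in zip(flagged, truth))    # false alarms
    return tp / (tp + fn), tn / (tn + fp)
```

On this scale, the reported 68% sensitivity with 99% specificity for overuse means the criteria miss roughly a third of truly inappropriate procedures but almost never flag an acceptable one.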

Probably, most cardiologists are interested in the predictive validity of appropriateness ratings—will patients fare better if they are treated according to the ratings than if they are not? Here, the studies are all observational. It is probably unethical to randomise patients to receive or not receive an interventional procedure when a panel of experts has concluded that for that particular clinical presentation the benefits exceed the risks to an extent suggesting that the procedure should be done. In other words, the criterion for “clinical equipoise” is absent in studies of appropriateness ratings, and data on predictive validity will need to come from observational studies.

Richard Kravitz and colleagues assessed the predictive validity of appropriateness ratings for identifying underuse of coronary revascularisation by assessing mortality and chest pain outcomes in 671 patients who had undergone coronary angiography and who were rated appropriate for revascularisation.11 One hundred and sixty-eight (25%) patients who did not receive revascularisation and who were treated medically had 16.7% mortality at 1 year compared with 9.7% mortality for patients who received coronary artery bypass grafting (CABG) (p = 0.04). No survival advantage was seen for patients who were rated necessary for percutaneous transluminal coronary angioplasty (PTCA) and who received PTCA (as compared with those treated medically).

In a similar study, Selby and colleagues assessed variation and outcomes in the use of coronary angiography in 1189 patients admitted with acute myocardial infarction to one of 16 Kaiser hospitals in northern California.12 Of these patients, 440 met the criteria for necessary coronary angiography, and among these, the actual receipt of angiography was associated with markedly lower death from heart disease or any heart disease event at 1–4 years of follow-up in comparison with patients not receiving necessary angiography, and after adjusting for a number of clinical variables (hazard ratio (HR) = 0.29 (95% CI 0.15 to 0.55) and HR = 0.36 (95% CI 0.26 to 0.49), respectively).

Hemingway and colleagues performed a study where they (a) developed their own appropriateness criteria for the use of coronary revascularisation; (b) used the criteria to judge the appropriateness of care for 2552 patients presenting to London hospitals and undergoing coronary angiography; (c) prospectively followed up these patients for a median of 30 months, and assessed the relationship between patient outcomes and treatment received.13 Their key findings were that 34% and 26% of patients with clinical presentations judged appropriate for percutaneous coronary revascularisation or CABG were instead treated medically, with adverse consequences for symptoms and, in the case of CABG, for coronary outcomes as well. Indeed, a particularly notable feature of this study was the strong, graded relationship seen between appropriateness score and improvement in outcome for patients judged appropriate for CABG who were treated with CABG instead of medical treatment (fig 1).

Figure 1 Reproduced with permission of the copyright holder from Hemingway et al.13 Copyright © 2001 Massachusetts Medical Society. All rights reserved.

Recently, Hemingway and colleagues performed a similar study assessing the appropriateness of coronary angiography. For this study, two panels independently created appropriateness criteria for coronary angiography. Agreement between panels was found to have a κ of 0.58, similar to that reported earlier for agreement on criteria for the overuse of coronary revascularisation. The ratings were used to judge the care of 9356 patients with recent onset of chest pain in whom stable angina was suspected. Underuse of angiography was common, being 57% and 71% depending on which panel’s ratings were used. Again, as seen in all prior studies, not receiving the procedure when it was judged appropriate was associated with worse outcomes, in this case a hazard ratio of about 2.6 for the composite outcome of death from coronary heart disease, admissions to hospital owing to acute myocardial infarction or unstable angina.14

Drawing inferences about causality from observational studies can be risky, but I interpret the size, consistency and dose–response nature of the data to mean that it is more likely that appropriateness criteria are identifying candidates for coronary angiography and revascularisation who will benefit from the procedure than the alternate hypothesis that appropriateness criteria have no discriminative ability and the observed effects have all been due to unmeasured confounders.

THE FUTURE FOR APPROPRIATENESS CRITERIA

Can the methods for appropriateness criteria be improved? Certainly. Every diagnostic test can be improved. Better specification of outcome, more use of the rich observational cardiology datasets to augment our understanding of how clinical presentation and treatment choice influence outcomes, working out how to incorporate costs, and use of information technology to help keep appropriateness criteria current are four ways I have advocated for improving the method.

Are existing criteria ready for use today? I believe the data presented above indicate that a wider use of existing appropriateness criteria would lead to better outcomes. How should existing criteria be used? I believe the correct use is as an additional piece of information that the cardiologist should consider when evaluating a patient and formulating a treatment plan. Will there be situations where the appropriateness criteria indicate a treatment plan the cardiologist thinks is inappropriate? Of course, just as there are cases where patients with two-vessel coronary disease might be inappropriate for revascularisation and patients with a raised B-type natriuretic peptide are inappropriate for treatment with pharmacotherapy for heart failure. No diagnostic test can replace the reasoned thought of an experienced clinician. There are numerous reasons why a clinician may recommend a different course of treatment—examples include comorbidities, patient preference, or additional clinical details of a particular patient that are not captured in the appropriateness criteria. However, I do believe that the appropriateness criteria category for a particular patient probably represents the “default” treatment option, and that should the cardiologist decide on a different course it is incumbent on the doctor to be explicit about why this choice is made—and to document it.

REFERENCES

Footnotes

  • Competing interests: The author has conducted research on the Appropriateness Method and continues to use it in his own research, but has no financial interest in the method.