Abstract
Background Depression is more likely in patients with chronic physical illness, and is associated with increased rates of disability and mortality. Effective treatment of depression may reduce morbidity and mortality. The use of two stem questions for case finding in diabetes and coronary heart disease is advocated in the Quality and Outcomes Framework, and has become normalised into primary care.
Aim To define the most effective tool for use in consultations to detect depression in people with chronic physical illness.
Method The following data sources were searched: CENTRAL, CINAHL, Embase, HMIC, MEDLINE, PsycINFO, Web of Knowledge, from inception to July 2009. Three authors selected studies that examined identification tools and used an interview-based ICD (International Classification of Diseases) or DSM (Diagnostic and Statistical Manual of Mental Disorders) diagnosis of depression as reference standard. At least two authors independently extracted study characteristics and outcome data and assessed methodological quality.
Results A total of 113 studies met the eligibility criteria, providing data on 20 826 participants. It was found that two stem questions, PHQ-9 (Patient Health Questionnaire), the Zung, and GHQ-28 (General Health Questionnaire) were the optimal measures for case identification, but no method was sufficiently accurate to recommend as a definitive case-finding tool. Limitations were the moderate-to-high heterogeneity for most scales, the fact that few studies used ICD diagnoses as the reference standard, and the variety of methods used to determine DSM diagnoses.
Conclusion Assessing both validity and ease of use, the two stem questions are the preferred method. However, clinicians should not rely on the two-questions approach alone, but should be confident to engage in a more detailed clinical assessment of patients who score positively.
INTRODUCTION
Depression is one of the leading causes of disability and disease burden.1 It is associated with the most years lost to disability of all diseases worldwide. Identifying depression in patients with chronic physical health problems is important for several reasons. First, a number of studies suggest depression is approximately two to three times as prevalent in such populations, including patients with cancer,2 chronic heart disease,3,4 and chronic obstructive pulmonary disease (COPD).5 Secondly, there appears to be greater disease burden, in terms of healthcare use and functional disability, in people with comorbid depression compared with those with physical health problems alone.6,7 Thirdly, mortality is greater in several medical conditions when depression is present — heart disease,8 COPD,9 stroke,10 cancer11 — and in medically ill older adults.12 Furthermore, morbidity and mortality may diminish with effective treatment of depression.13,14
There is convincing evidence that many cases of depression go unrecognised in the general population and in primary care.15–17 Reasons for under-recognition include a low rate of mood problems as the presenting complaint, infrequent specific enquiry from clinicians, and uncertainty about diagnostic criteria.18,19 Identifying depression in people with chronic physical health problems may be even more complex, and primary care physicians may be less likely to diagnose depression in this population.20,21 Reasons for difficulties in raising the issue of depression in consultations are complex.22 In addition, depressed individuals presenting with somatic complaints are less likely to be detected.23–26
Improving case identification for depression has received much attention. For example, the US Preventive Services Task Force recommended screening for depression for all people in primary care (whether they had a physical illness or not), along with the necessary treatment resources for those subsequently identified.27 In the UK, through the Quality and Outcomes Framework (QOF), GPs are incentivised to ask the case-identification questions of people with diabetes and coronary heart disease.28 This approach is also advocated in the National Institute for Health and Clinical Excellence (NICE) guidelines.29 However, there is much debate in the literature concerning the effectiveness of screening and case identification.30 Gilbody and colleagues have shown untargeted screening was not effective in improving the recognition of depression in primary care and general hospital settings.30 There is also much debate concerning the terminology used in the field. The present study proposes to separate overall accuracy (case identification) into more clinically understandable rule-in and rule-out performance. Rule-in accuracy (positive predictive value) is the ability to correctly identify those with the disorder with minimal false positives, whereas rule-out accuracy (negative predictive value) is the ability to correctly identify those without the disorder with minimal false negatives (missed cases). In order to differentiate from untargeted screening approaches, which appear to be ineffective, this data synthesis will focus on case identification in a population at higher risk of depression (that is, people with chronic physical health problems). This is vital before further case finding is advocated by the QOF for patients with other physical problems.
How this fits in
There is strong evidence that the prevalence of depression is raised among patients with long-term conditions and that this comorbidity is associated with adverse outcomes. Inadequate and inaccurate identification of depression has been documented in both primary care and general medical settings. This meta-analysis provides evidence that several brief and feasible depression case-finding approaches can be used as a first assessment for patients with chronic physical health problems, and that two stem questions referring to core depression features appear the most efficient initial approach.
There are a large number of scales used both in clinical practice and in research studies, few of which have been originally developed for the physically ill. In addition, there are no existing definitive meta-analyses across a comprehensive range of measures. Therefore, a diagnostic accuracy meta-analysis was conducted to assess the sensitivity and specificity of the most widely used case-identification instruments in people who are physically ill.
METHOD
Data sources and searches
The full review protocol can be found in the guideline on depression in people with chronic physical health problems, which was commissioned by NICE.31 Briefly, a search for studies assessing the validity of case-identification instruments was made using seven electronic bibliographic databases (CENTRAL, CINAHL, Embase, HMIC, MEDLINE, PsycINFO, Web of Knowledge). Each database was searched from inception to October 2009. Additional papers were found by searching the references of retrieved articles, tables of contents of relevant journals, previous systematic reviews and meta-analyses of case identification for depression, written requests to experts, and suggestions made by the members of the Guideline Development Group (comprising clinicians, academics, and service users with expertise in depression and chronic physical health problems).
Study selection
The study included validation studies of mood questionnaires agreed by the authors (see Appendix 1 for further details). The reference standard was diagnoses according to the Diagnostic and Statistical Manual of Mental Disorders (DSM) of the American Psychiatric Association (for example DSM-IV)32 or International Classification of Diseases (ICD) (for example ICD-10)33 of the World Health Organization criteria. Studies that did not clearly state the comparator to be DSM or ICD diagnosis of depression, or that did not provide sufficient data to be extracted in the meta-analysis were excluded.
Data extraction and quality assessment
All published studies that met the eligibility criteria were assessed for methodological quality using the Scottish Intercollegiate Guidelines Network (SIGN) checklist for diagnostic studies.29 Data were extracted independently by at least two authors, and 2×2 tables were constructed, from which the primary outcomes were calculated: that is, sensitivity, specificity, and likelihood ratios.
To maximise the available data, the most consistently reported and recommended cut-off points were extracted for each of the scales. There are limitations to this approach, as noted by Furukawa and colleagues,34 who found that the optimal cut-offs for the General Health Questionnaire (GHQ)-12 and GHQ-28 differed according to the prevalence of depression, and it is likely there are similar problems for most other scales. However, a Bayesian approach makes allowance for variations according to prevalence (see below), and so seeks to take this potential limitation into account.
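As an illustration of how the primary outcomes follow from a 2×2 table, the sketch below computes sensitivity, specificity, and the two likelihood ratios from the four cell counts. The class name, function name, and the example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Cell counts from cross-tabulating a scale result against the reference diagnosis."""
    tp: int  # scale positive, depressed on the reference standard
    fp: int  # scale positive, not depressed on the reference standard
    fn: int  # scale negative, depressed on the reference standard
    tn: int  # scale negative, not depressed on the reference standard


def accuracy_measures(table: TwoByTwo) -> dict:
    """Return sensitivity, specificity, and positive/negative likelihood ratios."""
    sensitivity = table.tp / (table.tp + table.fn)
    specificity = table.tn / (table.tn + table.fp)
    # Likelihood ratios are undefined when the denominator is zero;
    # infinity is returned here as a simple convention for this sketch.
    lr_positive = (
        sensitivity / (1 - specificity) if specificity < 1 else float("inf")
    )
    lr_negative = (
        (1 - sensitivity) / specificity if specificity > 0 else float("inf")
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
    }


# Hypothetical counts, for illustration only.
example = TwoByTwo(tp=40, fp=30, fn=10, tn=120)
measures = accuracy_measures(example)
```

With these illustrative counts, 40 of the 50 depressed participants and 120 of the 150 non-depressed participants are classified correctly by the scale.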
Data synthesis and analysis
A bivariate diagnostic accuracy meta-analysis was conducted using Stata (version 10) with the metandi35 commands, to obtain pooled estimates of sensitivity, specificity, and likelihood ratios. This method was originally developed as a mixed effects regression model for meta-analysis of trials, and modified more recently for studies of diagnostic accuracy.36,37 Between-study heterogeneity was assessed using the I² statistic.38 In addition, publication bias was assessed by visual inspection of funnel plots, and formal use of Egger's test.39
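For readers unfamiliar with the I² statistic, the following is a minimal sketch of its usual definition from Cochran's Q, assuming per-study effect estimates (for example, logit sensitivities) and their variances are available. It is not the Stata procedure used in this review; the function name and inputs are hypothetical.

```python
def i_squared(estimates, variances):
    """Percentage of variability across studies attributable to heterogeneity.

    Uses inverse-variance weights and Cochran's Q:
        I^2 = max(0, (Q - df) / Q) * 100, with df = number of studies - 1.
    """
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0:
        return 0.0  # identical estimates: no observed heterogeneity
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical inputs, for illustration only.
heterogeneity = i_squared([0.0, 2.0], [1.0, 1.0])
```

In this toy example the pooled estimate is 1.0 and Q is 2.0 on one degree of freedom, so half of the observed variability is attributed to between-study differences.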
A Bayesian curve analysis was also undertaken; this plots post-test conditional probabilities from all possible pre-test probabilities (prevalence). The area under the Bayesian curve (AUC) for positive results can be used as a statistical comparison of rule-in success, and 1 − AUC for negative results as an indicator of rule-out success. An area of more than 0.75 can be interpreted as ‘satisfactory’ and more than 0.80 interpreted as ‘good’. If a test achieved more than 0.90 in a rule-in capacity, this was considered sufficient for a recommendation that this tool could be used on its own for case finding.
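The Bayesian curve can be sketched as follows, under one plausible reading of the description above: the positive curve plots the probability of depression given a positive result against prevalence, the negative curve plots the probability of depression given a negative result, and areas are approximated with the trapezoidal rule. The function names, the step count, and the sensitivity and specificity values are illustrative assumptions, not the authors' implementation.

```python
def post_test_positive(prevalence, sensitivity, specificity):
    """Probability of depression given a positive result (Bayes' theorem)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


def post_test_negative(prevalence, sensitivity, specificity):
    """Probability of depression given a negative result (Bayes' theorem)."""
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    total = false_neg + true_neg
    return false_neg / total if total > 0 else 0.0


def curve_area(curve, sensitivity, specificity, steps=1000):
    """Trapezoidal area under a post-test curve over prevalence 0 to 1."""
    width = 1.0 / steps
    area = 0.0
    for i in range(steps):
        left = curve(i * width, sensitivity, specificity)
        right = curve((i + 1) * width, sensitivity, specificity)
        area += (left + right) * width / 2
    return area


# Hypothetical accuracy values, for illustration only.
rule_in_area = curve_area(post_test_positive, 0.8, 0.8)
rule_out_area = 1 - curve_area(post_test_negative, 0.8, 0.8)
```

Under this reading, an uninformative test (sensitivity and specificity both 0.5) gives an area of 0.5 for both measures, and the areas rise towards 1 as accuracy improves, which is the scale on which the 0.75, 0.80, and 0.90 thresholds would be applied.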
Additional meta-regression analyses were planned to assess differences in diagnostic accuracy for disease groups. Such analyses were conducted on a scale when there were a minimum of four studies for at least two disease groups.
RESULTS
A total of 113 studies on 20 826 participants met the eligibility criteria of the review (see Figure 1 for full details on study flow information). These studies were both on populations specifically targeted for a chronic physical health problem (such as cancer, heart disease, and stroke), and in general medical settings where all were physically ill and a substantial proportion had a chronic physical health problem. In total, 83 studies specifically targeted people with chronic physical health problems in any setting (Appendix 2). The mean prevalence of depression was 0.25 (95% confidence interval [CI] = 0.05 to 0.61). A further 30 studies were on people in general medical settings, with a mean prevalence of depression of 0.24 (95% CI = 0.04 to 0.52).
Figure 1 Study flow diagram.
Studies recruiting for chronic physical health problem
Sensitivity and specificity
Table 1 provides an evidence summary for the various scales on people recruited for specific chronic physical health problems. There was moderate to high sensitivity for most scales. The tools with the highest sensitivity were the two stem questions (0.98; 95% CI = 0.85 to 0.99), followed by the GHQ-28, Patient Health Questionnaire (PHQ)-9, Beck Depression Inventory (BDI), and BDI non-somatic (Table 1). Sensitivity was lowest for the one-item measure.
Table 1 Evidence summary of scales in studies recruiting for chronic physical illness
The Zung Self Rating Depression Scale had the highest specificity, at 0.92 (95% CI = 0.68 to 0.98). This was followed by the two stem questions, the Hamilton Depression Rating Scale (HDRS), PHQ-9, and the Centre for Epidemiologic Studies Depression Scale (CES-D); all had high specificity. The lowest specificity was found for the one-item measure and the GHQ-12.
Rule-in (positive predictive value) and rule-out accuracy (negative predictive value)
When Bayesian plots of conditional probabilities were used to examine rule-in and rule-out performance, only three tools had less than satisfactory rule-in performance, namely the single question, the Geriatric Depression Scale (GDS-30), and GHQ-12. The optimal single tool was the Zung, although it did not reach the a priori standard for recommendation when applied alone. For rule-out performance, four methods were not satisfactory. These were the single question, the Hospital Anxiety and Depression Scale (HADS), GDS-30, and GHQ-12. The optimal tools were the two stem questions and GHQ-28. Overall accuracy was best for the two stem questions, Zung, PHQ-9, and GHQ-28. However, it should be noted that data for the Zung scale were based on just four studies and a relatively small total sample size (n = 190).
Meta-regression comparing the diagnostic accuracy for different disease groups was only possible for the BDI and HADS-D. There was no evidence of difference in sensitivity (beta = 0.93, P = 0.34) and specificity (beta = 1.56, P = 0.35) of the HADS between stroke and cancer patients. There was no evidence of difference in sensitivity (beta = 1.49, P = 0.60), but some evidence for differences in specificity (beta = 1.20, P = 0.02) of the BDI between heart disease and cancer patients.
Studies in general medical settings
Table 2 summarises the results for general medical settings. Only three scales provided sufficient data for analysis. All of these scales performed as well in this setting as in populations specifically targeted for chronic physical health problems, with a large overlap in confidence intervals.
Table 2 Evidence summary of scales in general medical settings
Sensitivity and specificity
Sensitivity was relatively high in all measures but particularly high in the GDS-15 (0.89; 95% CI = 0.84 to 0.92). Specificity was very similar for the GDS-30, GDS-15, and HADS when used in general medical settings (Table 2).
Rule-in and rule-out accuracy
When the same methodology was applied to each measure in general medical settings, correcting for prevalence using a Bayesian analysis, the GDS-15 was most successful and the HADS least successful. No method came close to the a priori standard for rule-in performance when applied alone. For rule-out accuracy, the HADS was significantly less accurate than the GDS-15 (Area HADS = 0.71, 95% CI = 0.68 to 0.74 versus Area GDS-15 = 0.78, 95% CI = 0.75 to 0.82).
DISCUSSION
Most of the scales performed adequately as case-identification measures for depression, with modest differences in validity coefficients. Most studies targeted chronically physically ill populations rather than general medical settings such as primary care. In order to detect depression in those with chronic physical ill health, the most sensitive instruments appear to be two stem questions, PHQ-9, and GHQ-28. The most specific measure was the Zung. Overall, optimal accuracy was achieved by the two stem questions, Zung, PHQ-9, and GHQ-28. However, it should be noted that estimates on the Zung and GHQ-28 analysis were based on a relatively small sample size; therefore, it is possible that conclusions regarding these scales may change with further data. No method came close to the a priori standard for case-finding recommendation when applied alone.
Another important factor to consider when comparing the different measures is the ease of implementation. The Zung is a 20-item scale and therefore is more resource intensive and less likely to be implemented in primary care compared to shorter measures. Taking into account both the psychometric properties and ease of implementation, it would appear the two stem questions may be the preferred measure for case identification in patients with chronic physical health problems. From these data, the authors do not recommend relying upon a single question alone, and recommend two questions as a minimum initial enquiry. This is consistent with previous pooled data in primary care40 and cancer settings.41
In general medical settings, there were fewer studies, and analysable data were only available for the GDS, GDS-15, and HADS-D. Specificity was similar for all three scales but sensitivity was highest in the GDS-15. Further research is needed to confirm whether the optimal tools in the chronically ill (two stem questions, PHQ-9, and the Zung) perform equally well in general medical samples.
There are several limitations to the results of this systematic review. First, there was moderate to high heterogeneity for most measures. Secondly, there is a paucity of validity studies using the ICD-10 as the criterion standard compared with the DSM-IV, which may favour tools using DSM items, and therefore the authors recommend future examination using this outcome. Thirdly, there were widely used or potentially useful scales that had few or no studies in the physically ill; these include the Montgomery–Asberg Depression Rating Scale (MADRS),42 and the Clinically Useful Depression Outcome Scale (CUDOS).43 Further research is needed on these scales for people with chronic physical health problems. Fourthly, there were a number of different semi-structured methods used to determine the interview-based diagnosis, including the Schedules for Clinical Assessment in Neuropsychiatry (SCAN),44 the Composite International Diagnostic Interview (CIDI),45 the Structured Clinical Interview for DSM-III-R (SCID),46 and the Diagnostic Interview Schedule (DIS),47 all of which may vary in diagnostic accuracy. A further limitation is the lack of cost-effectiveness analyses assessing the cost impact of false positives associated with the use of case-identification measures. However, it should be noted that the cost-effectiveness of case identification is very complex to model and requires a number of assumptions concerning probabilities assigned to events in the depression treatment care pathway, and explicit values of treatment outcomes.48 Therefore, such issues were considered beyond the scope of this paper.
It should also be acknowledged that the use of case-identification tools may not be translated into real benefit in clinical practice. Case identification may bring limited benefit if there are no effective assessment and treatment services in place, as professionals may be reluctant to make a diagnosis of depression if they have limited resources on which to call.49 The aim of the NICE guideline for which this review was conducted31 is to promote the commissioning of such services. The impact of case finding on the individual consultation may be important, since the use of the PHQ-9 severity questionnaire can cause a tension within the consultation, with GPs struggling to manage formal assessment versus personal enquiry.50
From this data synthesis, it appears that there are a number of instruments for the case identification of depression in the medically ill that have similar accuracy. A consideration of both accuracy and acceptability suggests that the two stem questions may be the most efficient initial method, although further validation is needed. We do not recommend the use of a single question alone. GPs and practice nurses should not rely on the case-finding questions alone; they should be confident to complete an assessment of the patient's mental state and risk, and a pathway within the practice should be in place (particularly when it is the practice nurse who has done the case finding). Resources within the practice should be available to support patients who have depression and a chronic physical health problem, and primary care practitioners should have well-defined links with local primary care mental health services, which should offer appropriate interventions for such patients, including a collaborative care approach as recommended by NICE.31
Acknowledgments
Thank you to the NICE guideline development groups on Depression and Chronic Physical Health Problems and Depression in Adults for their input during the development of this systematic review.
Appendix
Appendix 1. Full list of instruments considered
Beck Depression Inventory (BDI):
Patient Health Questionnaire (PHQ):
Two stem questions:55
General Health Questionnaire:56
Centre for Epidemiologic Studies Depression Scale (CES-D)57
Geriatric Depression Scale (GDS)
Zung Self Rating Depression Scale60
Hospital Anxiety and Depression Scale - Depression61
Hamilton Depression Rating Scale (HDRS)62
Montgomery-Asberg Depression Rating Scale (MADRS)63
Clinically Useful Depression Outcome Scale64
One-item measures of depression
Edinburgh Postnatal Depression Scale178
Appendix 2 Summary characteristics of included studies
Notes
Funding
Stephen Pilling received financial support from the National Institute for Health and Clinical Excellence.
Additional information
The Bayesian plots of conditional probabilities for each scale are available on request from the authors.
Provenance
Freely submitted; externally peer reviewed.
Competing interests
David Goldberg developed the General Health Questionnaire. The other authors have declared no competing interests.
- Received November 19, 2010.
- Revision received January 13, 2011.
- Accepted March 21, 2011.
- © British Journal of General Practice 2011