Abstract
Background The 2004 National Institute for Health and Clinical Excellence (NICE) guidelines highlight the importance of assessing severity of depression in primary care.
Aim To assess the psychometric properties of the Patient Health Questionnaire (PHQ-9) and the depression subscale of the Hospital Anxiety and Depression Scale (HADS-D) for measuring depression severity in primary care.
Design of study Psychometric assessment.
Setting Thirty-two general practices in Grampian, Scotland.
Method Consecutive patients referred to a primary care mental health worker completed the PHQ-9 and HADS at baseline (n = 1063) and at the end of treatment (n = 544). Data were analysed to assess reliability, robustness of factor structure, convergent/discriminant validity, convergence of severity banding, and responsiveness to change.
Results Both scales demonstrated high internal consistency at baseline and end of treatment (PHQ-9 α = 0.83 and 0.92; HADS-D α = 0.84 and 0.89). One factor emerged each for the PHQ-9 (explaining 42% of variance) and HADS-D (explaining 52% of variance). Both scales converged more with each other than with the HADS anxiety (HADS-A) subscale at baseline (P<0.001) and at end of treatment (P = 0.01). Responsiveness to change was similar: effect size for PHQ-9 = 0.99 and for the HADS-D = 1. The HADS-D and PHQ-9 differed significantly in categorising severity of depression, with the PHQ-9 categorising a greater proportion of patients with moderate/severe depression (P<0.001).
Conclusion The HADS-D and PHQ-9 demonstrated reliability, convergent/discriminant validity, and responsiveness to change. However, they differed considerably in how they categorised severity. Given that treatment decisions are made on the basis of severity, further work is needed to assess the validity of the scales' severity cut-off bands.
INTRODUCTION
In 2004 the National Institute for Health and Clinical Excellence (NICE) guidelines on the management of depression in primary and secondary care emphasised the importance of measuring depression severity to target the condition with an appropriate intervention.1 This runs in accordance with the stepped-care approach of managing depression, which consists of five steps, beginning with the recognition of depression in primary care by a GP or practice nurse. Following this, different interventions are advocated according to severity.
NICE guidelines recommend the use of the International Classification of Diseases (ICD-10) criteria for diagnosing and assessing severity of depression. This method involves a symptom count which then falls within progressive categories (mild, moderate, and severe [with or without psychotic symptoms]), corresponding to increased numbers of symptoms identified.2 While advocating this method, NICE also acknowledges that: ‘it is doubtful whether severity can realistically be captured in a single symptom count’, and that previous history, family history, associated disability, and availability of social support should also be considered.1
The new general medical services' Quality and Outcomes Framework provides incentives to practices for making an assessment of the severity of depression at the outset of a new diagnosis of depression.3 The rationale is that this allows the relevant treatment options to be discussed with patients, and provides a baseline from which to monitor progress. Practices are required to use a validated assessment tool for this purpose. Those endorsed are: the Patient Health Questionnaire (PHQ-9),4 the Hospital Anxiety and Depression Scale (HADS),5 and the Beck Depression Inventory, second edition (BDI-II).6 Practices are advised to choose one of these three measures.
While research has been conducted to assess the comparative validity and accuracy of questionnaires at detecting depression,7,8 the relative validity of scales at categorising severity has not been adequately assessed. With an absence of objective psychometric comparisons between measures, GPs may find it difficult to make an informed choice of measure.
As part of an audit of the Scottish Executive's ‘Doing Well by People with Depression’ programme, the authors of the current study examined the psychometric properties of the PHQ-9 and the HADS on the same sample of patients. Within the service audit, both measures were completed by patients referred to a primary care mental health worker/therapist (mental health worker) based in primary care settings in Grampian, UK. As a result, the relative reliability, validity, and responsiveness to change of the PHQ-9 and HADS depression subscale (HADS-D) can be assessed.
METHOD
Participants
A consecutive sample of adults referred by GPs to mental health workers based in 32 general practices across Grampian participated. Inclusion criteria were: adults with a mild to moderate mental health problem who GPs considered might be interested in, and able to concentrate on, a self-help approach. Exclusion criteria applied to persons: under 16 years of age; with severe or complex mental health problems (for example, psychosis, obsessive-compulsive disorder, and comorbid personality disorder); with a history of violent or threatening behaviour; admitting to suicidal ideation or recent/recurrent self-harm; who were currently misusing drugs/alcohol; or who had previously had more than one referral to clinical psychology.
Data collection was performed prospectively as part of an audit of the mental health workers' service. Unique identifier numbers were allocated to each patient by the mental health worker. The university team did not have access to information that identified individuals.
Measures
As part of a service audit, primary care patients referred to a mental health worker were asked to complete a questionnaire at baseline, which included the HADS and demographic questions. The mental health worker then conducted the PHQ-9 interview schedule at the first appointment. At the end of treatment, patients completed a further questionnaire which included the HADS and PHQ-9 self-complete version.
How this fits in
Gauging the severity of depression is an imperative in primary care if evidence-based interventions are to be offered. The Quality and Outcomes Framework of the new general medical services contract provides incentives for using one of the following depression severity assessment tools: the PHQ-9, HADS, or BDI-II. Comparison of the PHQ-9 and the HADS in this study indicates that both demonstrate acceptable reliability, convergent and discriminant validity, and responsiveness to change, but that they differ considerably in how they categorise severity. The relationship between measurement of severity using depression assessment tools and ICD-10 severity criteria remains unknown and requires further investigation.
The PHQ-9 consists of nine questions designed to correspond to the nine diagnostic criteria for major depressive disorder covered in the Diagnostic and Statistical Manual of Mental Disorders (DSM–IV).9 Items are rated from 0 to 3 according to increased frequency of experiencing difficulties in each area covered. Scores are summed and can range from 0 to 27. The score can then be interpreted as indicating either no depression, minimal, mild, moderate, moderately severe, or severe depression.
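The summation described above can be sketched in a few lines. This is a minimal illustration, not the scoring procedure used in the audit: the function names are hypothetical, and the band thresholds are an assumption based on commonly cited cut-offs, since the thresholds are not stated here. The sketch also merges the two lowest categories, because no separate threshold for 'no depression' is given.

```python
from typing import Sequence

# ASSUMPTION: (lower bound, label) pairs based on commonly cited PHQ-9
# cut-offs; the thresholds are not given in this article.
ASSUMED_PHQ9_BANDS = [
    (20, "severe"),
    (15, "moderately severe"),
    (10, "moderate"),
    (5, "mild"),
    (0, "minimal or none"),
]


def phq9_total(items: Sequence[int]) -> int:
    """Sum nine item ratings (each 0 to 3) into a total from 0 to 27."""
    if len(items) != 9:
        raise ValueError("PHQ-9 requires exactly nine item ratings")
    if any(rating not in (0, 1, 2, 3) for rating in items):
        raise ValueError("each item must be rated 0, 1, 2, or 3")
    return sum(items)


def phq9_band(total: int) -> str:
    """Return the first assumed band whose lower bound the total meets."""
    if not 0 <= total <= 27:
        raise ValueError("total must be between 0 and 27")
    for lower_bound, label in ASSUMED_PHQ9_BANDS:
        if total >= lower_bound:
            return label
    raise AssertionError("unreachable: the lowest band starts at 0")
```

Keeping the thresholds in a single table makes it straightforward to substitute different cut-offs if the endorsed bands are revised.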
The HADS consists of 14 items each rated from 0 to 3 according to severity of difficulty experienced. Eight items require reversed scoring, after which depression (HADS-D) and anxiety (HADS-A) subscale totals can be summed. Each subscale score can range from 0 to 21. The scores can then be interpreted as indicating mild, moderate, or severe difficulty.
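A subscale total with reversed scoring might be computed as in the sketch below. The function name is hypothetical, and the item allocation in the usage lines is invented purely for illustration: which items belong to each subscale, and which are reverse-keyed, must be taken from the scale's own scoring instructions.

```python
from typing import Dict, Iterable, Set


def hads_subscale_total(
    ratings: Dict[int, int],
    subscale_items: Iterable[int],
    reversed_items: Set[int],
) -> int:
    """Sum one subscale, flipping reverse-keyed items (0<->3, 1<->2)."""
    total = 0
    for item in subscale_items:
        rating = ratings[item]
        if rating not in (0, 1, 2, 3):
            raise ValueError(f"item {item} must be rated 0 to 3")
        total += (3 - rating) if item in reversed_items else rating
    return total


# HYPOTHETICAL layout for illustration only: seven items numbered 1 to 7,
# with items 2 and 5 treated as reverse-keyed.
ratings = {1: 2, 2: 0, 3: 1, 4: 3, 5: 1, 6: 0, 7: 2}
example_total = hads_subscale_total(ratings, range(1, 8), {2, 5})
```

With seven items rated 0 to 3, the total stays within the 0 to 21 range described above.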
Statistical methods
Internal consistency of both the PHQ-9 and the HADS-D was examined using Cronbach's α and item-total correlations. Principal components analysis was used to assess the homogeneity of the scales: separate principal components analyses were performed for each scale at both time points; coefficients of congruence were used to compare factor loadings across the two time points.10 Correlations of the HADS-D and the PHQ-9 with the HADS-A were calculated to assess whether the PHQ-9 and the HADS-D showed greater convergence with each other than with the HADS-A. The established severity cut-off scores for the HADS-D and the PHQ-9 were assessed for convergence using the Wilcoxon signed-rank test for related samples. Responsiveness to clinical change, from baseline to end of treatment, was measured by running paired t-tests on the HADS-D and the PHQ-9. Effect size of both measures was then calculated.
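For readers unfamiliar with Cronbach's α, the following sketch shows the standard calculation from a respondents-by-items table of ratings. It is an illustration only: the analyses reported here were run in SPSS, and the function name and data layout are assumptions.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(responses: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha, where responses[p][i] is respondent p's rating of item i.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    if any(len(row) != n_items for row in responses):
        raise ValueError("every respondent must answer every item")
    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)
```

The value approaches 1 as the items vary together and falls towards 0 as they vary independently.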
Analyses were carried out using Statistical Package for the Social Sciences (SPSS version 14) and Clinimetrics Toolkit (CMT).
RESULTS
A total of 1496 patients were referred to the mental health workers' service between February 2005 and March 2006. Subsequently, 1087 (73%) attended the service and were assessed using the PHQ-9 at the initial appointment; 1063 completed baseline HADS before or on the day of first attending; 478 (45%) patients were assessed with the PHQ-9 within 3 days of completing the HADS. To ensure scale responses referred to the same time reference, this smaller sample was used for assessing relative convergent and discriminant validity, convergence of severity banding, and responsiveness to change.
At the end of treatment, 544 patients (50%) completed the HADS and PHQ-9.
Sample characteristics
Table 1 shows demographic characteristics of service attenders. Most participants were female, employed, and educated beyond minimum school age.
Table 1 Sample characteristics.
Reliability
Cronbach's α coefficients (plus 95% confidence intervals [CIs]) and item-total correlations for the HADS-D and PHQ-9 at baseline and end of treatment are shown in Appendix 1. Coefficient α values for both scales are acceptable and comparable across the time points (range = 0.83 to 0.92). As all item-total correlations within the HADS-D and PHQ-9 exceed 0.4, these can all be considered adequate.
Appendix 1 Cronbach's α and item-total correlations of the HADS-D and PHQ-9 baseline and end of treatment samples.
Factor structure
For HADS-D scores at baseline, the first principal component explained 52% of the variance; for the PHQ-9 the corresponding figure was 42%. The item loadings for both scales are shown in Appendix 2. Most items within each scale had a substantial loading, indicating that the HADS-D and the PHQ-9 are both factorially valid. The coefficient of congruence was 0.999 for both the HADS-D and PHQ-9, indicating that the factor structure of both measures is highly robust across time (from baseline to end of treatment).
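A coefficient of congruence between two sets of factor loadings can be sketched as follows. This assumes the widely used Tucker formulation (the sum of cross-products divided by the square root of the product of the sums of squares); the exact variant used in the present analysis is not specified here, and the function name is hypothetical.

```python
import math
from typing import Sequence


def congruence_coefficient(
    loadings_a: Sequence[float], loadings_b: Sequence[float]
) -> float:
    """Congruence between two loading vectors for the same set of items."""
    if len(loadings_a) != len(loadings_b) or not loadings_a:
        raise ValueError("loading vectors must be non-empty and of equal length")
    cross_products = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm = math.sqrt(
        sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b)
    )
    return cross_products / norm
```

A value close to 1, such as the 0.999 reported above, indicates that the pattern of loadings is nearly proportional across the two time points.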
Appendix 2 Factor analysis item loadings on the HADS-D and PHQ-9.
Convergent and discriminant validity
Intercorrelations of the HADS-A, the HADS-D, and PHQ-9 at baseline and at end of treatment are shown in Appendix 3. Correlations were all significant at the 0.01 level, as would be expected between such closely-related constructs as anxiety and depression. Using Williams' test, the correlations were significantly higher between the PHQ-9 and HADS-D than between either of these measures and the HADS-A. Correlations at baseline were: HADS-D with PHQ-9 (0.68) versus HADS-D with HADS-A (0.49), P<0.001; HADS-D with PHQ-9 (0.68) versus PHQ-9 with HADS-A (0.48), P<0.001. The same pattern of results was obtained at the end of treatment: HADS-D with PHQ-9 (0.81) versus HADS-D with HADS-A (0.74), P<0.001; HADS-D with PHQ-9 (0.81) versus PHQ-9 with HADS-A (0.77), P = 0.01.
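The comparison of two dependent correlations that share one variable can be illustrated with the sketch below. It assumes the formulation of Williams' t statistic as commonly presented in the literature on comparing correlated correlations, and it assumes n = 478 (the subsample described in the Results) for the baseline comparison. The exact computation performed by the software used in this study may differ.

```python
import math
from typing import Tuple


def williams_t(r12: float, r13: float, r23: float, n: int) -> Tuple[float, int]:
    """Compare r12 with r13 when both involve the same variable 1.

    r23 is the correlation between the two non-shared variables.
    Returns the t statistic and its degrees of freedom (n - 3).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denominator = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denominator)
    return t, n - 3


# Baseline values from the text: variable 1 = HADS-D, 2 = PHQ-9, 3 = HADS-A.
# ASSUMPTION: n = 478 applies to all three correlations.
t_stat, df = williams_t(r12=0.68, r13=0.49, r23=0.48, n=478)
```

Under these assumptions the statistic is large relative to its degrees of freedom, in line with the reported P<0.001 for the baseline comparison.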
Appendix 3 Intercorrelations of the HADS-A, HADS-D, and PHQ-9.
Convergence of severity banding
Table 2 shows the distribution of scores falling within PHQ-9 and HADS-D severity cut-offs. Although both measures purport to measure severity of depressive symptoms, there is a lack of concurrence of distribution within cut-off bands. These differences are significant at baseline (P<0.001) and at end of treatment (P<0.001), indicating that PHQ-9 categorises greater severity of symptoms than HADS-D.
Table 2 Distribution of participants across the HADS-D and PHQ-9 severity ratings in baseline and end of treatment samples.
Responsiveness to change
Paired t-tests from baseline to end of treatment indicated a significant change in both the PHQ-9 and the HADS-D, reflecting a reduction in depressive symptoms. The mean score on the PHQ-9 was 12.7 (standard deviation [SD] = 6.47) at baseline, and 6.25 (SD = 6.01) at the end of treatment (95% CI for the mean change = 5.79 to 7.03). The mean HADS-D score was 8.85 (SD = 4.52) at baseline, and 4.31 (SD = 4.02) at the end of treatment (95% CI for the mean change = 4.11 to 4.97). The effect size for change on the PHQ-9 was 0.99 compared with 1.0 for the HADS-D, indicating that the scales are comparable in terms of their sensitivity to change.
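One common way to express responsiveness is the mean change divided by the baseline standard deviation. The sketch below applies that definition to the rounded means and SDs quoted above. Whether this is the exact formula used in the present analysis is an assumption, and small discrepancies from the reported values would be expected because the inputs are rounded.

```python
def standardised_change(
    mean_before: float, mean_after: float, sd_before: float
) -> float:
    """Mean reduction in score expressed in baseline standard deviation units."""
    if sd_before <= 0:
        raise ValueError("baseline standard deviation must be positive")
    return (mean_before - mean_after) / sd_before


# Rounded summary values taken from the text above.
phq9_effect = standardised_change(12.7, 6.25, 6.47)
hads_d_effect = standardised_change(8.85, 4.31, 4.52)
```

Both values come out at approximately 1, close to the reported effect sizes of 0.99 and 1.0.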
DISCUSSION
Summary of main findings
Both the HADS-D and PHQ-9 demonstrated reliability, convergent/discriminant validity, robustness of factor structure, and responsiveness to change in a sample of primary care patients referred to mental health workers. However, given that both scales purport to measure severity of depression, the level of agreement shown in this regard was disappointing. If treatment decisions are to be made on the basis of severity, this indicates that further work is needed to assess the validity of both scales' endorsed severity cut-off bands.
Strengths and limitations of the study
This study assessed the psychometric properties of two depression severity rating scales, advocated by the British Medical Association, in a UK sample of primary care patients whom GPs had identified as having a mild to moderate mental health problem. Participants had therefore been drawn from the same patient group in which depression severity measurement was intended to apply. This provides useful comparison data to allow practitioners to make a more informed choice than has previously been possible. Moreover, the present analyses are the first to report the factor structure of the PHQ-9 in a UK sample.
Ideally, the study would have included the assessment of the severity bandings of these scales against a clinical ‘gold standard’ such as the Hamilton Depression Rating Scale,11 or the Structured Clinical Interview for DSM–IV (SCID);9 however, that was beyond the scope of the present assessment where data were primarily collected for purposes of an audit.
Part of the inclusion criteria for referral to a mental health worker was the identification of a ‘mild to moderate’ mental health problem. This required GPs to make their own assessment as to whether a patient fitted this criterion before initiating the referral. The scales were only completed once referral had been made. The fact that both scales categorised some of those patients as having severe depression highlights the difficulty faced when following a clinical impression of severity alone, which has previously been shown to not always be reliable.12 However, the disparity between the measures also demonstrates that either one, or both, of these measures is categorising depression severity erroneously.
At baseline, method variance may have explained some differences in severity categorisation between the PHQ-9 and HADS-D, as the PHQ-9 was conducted as an interview at this stage while the HADS-D was administered as a self-complete questionnaire. The difference in methods arose because the mental health workers found it useful to include the PHQ-9 as an interview at the first assessment. This is acceptable for the psychometric assessment because the PHQ-9 has demonstrated concurrent diagnostic validity for self-complete and interview-administered methods.13 It should therefore be possible to use these methods interchangeably.
Method variance may also have arisen from a time delay of up to 3 days between completion of the two scales at baseline. As the time-reference point for the scales would have overlapped, it was considered acceptable to include data collected up to 3 days apart. The difference in severity categorisation remained at the end of treatment (when the timing and administration methods were the same), which makes it less likely that method variance explains the difference.
The proportion of patients completing the questionnaire at the end of treatment was only half that of patients completing at baseline. The reduced number reflects the way people disengage from services, and the fact that completion of questionnaires at this stage inevitably relies on postal return. Postal reminders and reply-paid envelopes were used. The response rate was comparable with other studies requiring postal return.14
Comparison with existing literature
In keeping with previous investigations of the PHQ-9,4,15 and HADS-D,16,17 both scales exhibited good internal consistency. The factor structure of the HADS-D also reflects investigations on the scale in general,18 and clinical samples where the depression subscale emerges within the overall HADS items.19 The factor structure of the PHQ-9 has been reported in a US sample, where the variance explained by a single factor, including all the PHQ-9 items, ranged from 39% to 49% across different ethnic groups.15 This is comparable with the current analyses where 42% of the variance was explained.
The differences found between the HADS-D and the PHQ-9 in the distribution of scores by severity banding are a concern. Lowe et al assessed the relative validity of the PHQ-9, HADS-D, and the World Health Organization Well-Being Index against the SCID, focusing on cut-off points of cases/non-cases of ‘major depression’ and of ‘any depressive disorder’.20 They did not report on the comparative severity bandings within cases against the SCID results. However, their findings do deviate from the endorsed cut-off scores for cases/non-cases: for major depression a cut-off point of ≥11 was recommended for the PHQ-9 and ≥9 for the HADS-D. For ‘any depressive disorder’ the cut-off for the HADS-D remained ≥8 but was ≥9 for the PHQ-9, suggesting some over-inclusion in the original cut-offs, particularly with regard to the PHQ-9.
Implications for future research or clinical practice
Further research is required to investigate the psychometric properties of the PHQ-9, HADS, and BDI-II in a UK sample, including a concurrent validation assessment against an ICD-10 clinical interview; from this, empirically-derived severity cut-offs could be established.
Although NICE guidelines emphasise the importance of considering other factors in addition to severity when looking at treatment options,1 the fear has been that some policy developments may favour a more prescriptive approach. In Grampian, the vast majority of practices have opted for the PHQ-9 for assessing depression severity. In the present sample, if the stepped-care model were to be applied rigidly,1 74% of patients assessed with the PHQ-9 would be offered an antidepressant. However, had the same sample been assessed with the HADS-D, only 37% would have fallen within the prescribing category. A recent Scottish Executive guide, which advocated use of the PHQ-9, indicated that patients without a history of depression and with a score <15 should not be prescribed an antidepressant.21 It is of concern that clinicians are being advised to follow such a rigid code. Presently, clinicians should exercise caution in interpreting scores according to the endorsed severity cut-offs for the HADS-D or PHQ-9.
Acknowledgments
The authors would like to thank all the patients who completed questionnaires, the nine Grampian-based mental health workers who undertook the data collection, and the 32 general practices in Grampian that participated in the ‘Doing Well by People with Depression’ pilot.
Notes
Funding body
Data were collected as part of an audit of the ‘Doing Well by People with Depression’ pilot, funded by the Centre for Change and Innovation, Scottish Executive.
Ethical approval
The Scientific Adviser of Grampian Research Ethics Committee advised that the present analysis of these audit data did not require ethical approval so long as the data were anonymous.
Competing interests
The authors have stated that there are none.
- Received March 31, 2007.
- Revision received May 25, 2007.
- Accepted July 4, 2007.
- © British Journal of General Practice, 2008.