Abstract
Background The overall clinical impression (‘clinical gestalt’) is widely used for diagnosis but its accuracy has not been systematically studied.
Aim To determine the accuracy of clinical gestalt for the diagnosis of community-acquired pneumonia (CAP), acute rhinosinusitis (ARS), acute bacterial rhinosinusitis (ABRS), and streptococcal pharyngitis, and to contrast it with the accuracy of clinical decision rules (CDRs).
Design and setting Systematic review and meta-analysis of diagnostic accuracy studies in ambulatory care.
Method PubMed and Google were searched for studies in outpatients that reported sufficient data to calculate accuracy of the overall clinical impression and that used the same reference standard. Study quality was assessed using Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2), and measures of accuracy calculated using bivariate meta-analysis.
Results The authors identified 16 studies that met the inclusion criteria. The summary estimates for the positive (LR+) and negative likelihood ratios (LR−) were LR+ 7.7, 95% confidence interval (CI) = 4.8 to 11.5, and LR− 0.54, 95% CI = 0.42 to 0.65 for CAP in adults, LR+ 2.7, 95% CI = 1.1 to 4.3 and LR− 0.63, 95% CI = 0.20 to 0.98 for CAP in children, LR+ 3.0, 95% CI = 2.1 to 4.4 and LR− 0.37, 95% CI = 0.29 to 0.46 for ARS in adults, LR+ 3.9, 95% CI = 2.4 to 5.9 and LR− 0.33, 95% CI = 0.20 to 0.50 for ABRS in adults, and LR+ 2.1, 95% CI = 1.6 to 2.8 and LR− 0.47, 95% CI = 0.36 to 0.60 for streptococcal pharyngitis in adults and children. The diagnostic odds ratios were highest for CAP in adults (14.2, 95% CI = 9.0 to 21.0), ARS in adults (8.3, 95% CI = 4.9 to 13.1), and ABRS in adults (13.0, 95% CI = 5.0 to 27.0), as were the C-statistics (0.80, 0.77, and 0.84 respectively).
Conclusion The accuracy of the overall clinical impression compares favourably with the accuracy of CDRs. Studies of diagnostic accuracy should routinely include the overall clinical impression in addition to individual signs and symptoms, and research is needed to optimise its teaching.
INTRODUCTION
The overall clinical impression, also called ‘clinical gestalt’, is an intuitive approach to decision making used by physicians to make clinical diagnoses. It takes into account multiple signs and symptoms without necessarily using an analytic approach such as a point score or algorithm, and is an inductive approach based on pattern recognition rather than a hypotheticodeductive approach. Some studies have shown that inductive pattern-recognition strategies may be more widely used and more successful than hypotheticodeductive strategies.1–3 However, proponents of evidence-based practice encourage the use of clinical decision rules (CDRs) for diagnosis, as do practice guidelines. CDRs use a formal approach such as multivariate analysis or recursive partitioning to identify signs, symptoms, and point-of-care tests that are the best independent predictors of a diagnosis or clinical outcome. They are then typically converted to a simple point score or algorithm such as the Ottawa Ankle Rules for ankle injury,4 or the Wells rule to diagnose pulmonary embolism.5 The goal of CDRs is to improve the efficiency and accuracy of clinical diagnosis and thereby reduce unnecessary testing.6
However, CDRs may be cumbersome to access and use at the point of care. As a result, CDRs are only infrequently used in real-world clinical practice.7 Instead, clinicians rely on their overall clinical impression. As the overall clinical impression can incorporate additional variables not included in the CDR, it has the potential of being more accurate. For example, while a clinical rule may categorise a patient as being at low risk for group A beta-haemolytic streptococcal (GABHS) pharyngitis, knowing that a sibling was diagnosed with GABHS pharyngitis the week before could be an important factor.
For acute respiratory tract infections, CDRs have been developed to diagnose GABHS pharyngitis,8,9 acute rhinosinusitis (ARS) and acute bacterial rhinosinusitis (ABRS),10 and community-acquired pneumonia (CAP).11 In this study, the authors performed a systematic review of the accuracy of the overall clinical impression for GABHS pharyngitis, ARS, and CAP, which has not been systematically studied before, and evaluated how its accuracy compared with that of CDRs for the same conditions.
METHOD
Search
For this systematic review, PubMed was searched for published studies using a search strategy (available from the authors), combining synonyms for overall clinical impression, the clinical diagnosis, and ambulatory care. The reference lists of all included studies were also searched to identify studies not captured by the PubMed search strategy. In addition, published systematic reviews of the clinical diagnosis of GABHS pharyngitis, CAP, and ARS or ABRS were searched for additional studies,12–16 as were the first 50 results returned by a Google search of ‘<disease> diagnosis clinical impression’ for each disease. The search was not restricted by language, country, or date of publication.
How this fits in
It is known that the overall clinical impression is widely used in clinical practice but has not been systematically studied. This study showed that in adults the overall clinical impression had good accuracy for the diagnosis of community-acquired pneumonia, for acute rhinosinusitis, and for acute bacterial rhinosinusitis. It had moderate accuracy for diagnosis of streptococcal pharyngitis and for pneumonia in children. In each case, the accuracy of the overall clinical impression was similar to or better than that for a clinical decision rule for the same conditions. Thus, the overall clinical impression has good accuracy and is an important diagnostic tool that is deserving of further study and quantification.
Inclusion and exclusion criteria
The present research was limited to prospective studies that reported diagnostic data regarding the accuracy of the overall clinical impression (clinical gestalt) to diagnose CAP, ARS, ABRS, or acute GABHS pharyngitis. ARS was defined as abnormal imaging, and ABRS as abnormal culture of antral puncture fluid. Studies were limited to the ambulatory-care setting (outpatient clinic, urgent care, or emergency department [ED]) as hospital-acquired and ventilator-associated pneumonia are separate clinical entities. All patients must have received the same acceptable reference standard: chest radiograph (CXR), lung ultrasound, or computed tomography (CT) for pneumonia; imaging or antral puncture fluid analysis for ARS; and throat culture for GABHS pharyngitis. The authors excluded studies of nosocomial infections, infections in immunocompromised persons, or studies of the diagnosis of bacteraemia or sepsis. The authors included studies of both children and adults. Studies of ARS using inspection of antral puncture fluid or bacterial culture as the reference standard were classified as also diagnosing ABRS.
Data abstraction
Each title and abstract was reviewed by two investigators to identify potential studies for inclusion. Any study identified for full-text analysis by one of the reviewers was reviewed independently by two investigators, and any discrepancies were resolved by a third reviewer (lead investigator). For studies that met the inclusion and exclusion criteria, two reviewers abstracted study characteristics, data regarding the accuracy of clinical gestalt, and study design characteristics for the quality assessment, with discrepancies resolved via consensus discussion or, if necessary, by the lead investigator. All of the included studies were reviewed a final time by the lead investigator to confirm the accuracy of data abstraction.
Where a study reported the accuracy of clinical gestalt using more than two categories (for example, ‘sure’, ‘quite sure’, and ‘unsure’), the results were collapsed into two dichotomous categories, that is, ‘sure’ versus ‘quite sure’ or ‘unsure’. The selection of category combinations was based on the combination that provided the highest diagnostic odds ratio (DOR; ratio of positive to negative likelihood ratio [LR]), a measure of discrimination. Where studies reported physician estimates of probability, >50% versus ≤50% was used. One study reported data in the form of a figure.17 The figure was enlarged, digital vertical lines drawn to determine the intercept, and a ruler was used to calculate the number of patients in each category. Data were reported separately for the three study sites in this study (Illinois, Nebraska, and Virginia), as each site enrolled a distinct population and found somewhat different sensitivity and specificity.17
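The collapsing of graded gestalt categories into a dichotomous test by maximising the DOR, as described above, can be sketched in Python. This is a minimal illustration rather than the authors' own code: the category counts are hypothetical, the function names `accuracy` and `best_dichotomy` are invented for the example, and it assumes that categories are ordered from most to least certain of disease and that no cell of the resulting 2×2 table is zero.

```python
def accuracy(tp, fp, fn, tn):
    """Return sensitivity, specificity, likelihood ratios, and DOR for a 2x2 table."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    # DOR is the ratio of the positive to the negative likelihood ratio.
    return {"sens": sens, "spec": spec, "lr_pos": lr_pos,
            "lr_neg": lr_neg, "dor": lr_pos / lr_neg}


def best_dichotomy(diseased, healthy):
    """Try every cut point and keep the one with the highest DOR.

    diseased / healthy: counts per category, ordered from most to least
    certain of disease. The first k categories count as a positive test.
    """
    best = None
    for k in range(1, len(diseased)):
        tp, fn = sum(diseased[:k]), sum(diseased[k:])
        fp, tn = sum(healthy[:k]), sum(healthy[k:])
        result = accuracy(tp, fp, fn, tn)
        if best is None or result["dor"] > best[1]["dor"]:
            best = (k, result)
    return best


# Hypothetical counts for 'sure', 'quite sure', 'unsure' (not from any included study).
diseased = [30, 15, 5]
healthy = [10, 30, 110]
cut, stats = best_dichotomy(diseased, healthy)
print(cut, round(stats["dor"], 2))
```

With these invented counts, grouping ‘sure’ and ‘quite sure’ together as a positive test gives the higher DOR, so that cut point would be selected.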
Quality assessment
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) framework was adapted to evaluate the quality of the included studies. Studies at low risk of bias for all four domains (patient selection; index test; reference standard; and patient flow and timing) were judged to be at low risk of bias overall.18 Those with a single domain at high risk of bias were judged to be at moderate risk of bias overall, and all others were judged to be at high risk of bias.
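The overall risk-of-bias rule stated above can be written as a short Python function. This is a sketch under stated assumptions: the domain keys and the function name `overall_risk` are hypothetical, and because the text does not describe how ‘unclear’ ratings were handled, the sketch accepts only ‘low’ and ‘high’ ratings.

```python
DOMAINS = ("patient_selection", "index_test", "reference_standard", "flow_and_timing")


def overall_risk(ratings):
    """Map four QUADAS-2 domain ratings ('low' or 'high') to an overall judgement."""
    levels = []
    for domain in DOMAINS:
        level = ratings[domain]
        if level not in ("low", "high"):
            # Handling of other ratings (for example 'unclear') is not
            # specified in the text, so this sketch rejects them.
            raise ValueError(f"unsupported rating for {domain}: {level}")
        levels.append(level)

    high_count = levels.count("high")
    if high_count == 0:
        return "low"        # all four domains at low risk
    if high_count == 1:
        return "moderate"   # a single domain at high risk
    return "high"           # all other combinations


example = {
    "patient_selection": "low",
    "index_test": "high",
    "reference_standard": "low",
    "flow_and_timing": "low",
}
print(overall_risk(example))
```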
Statistical analysis
The authors performed the meta-analysis using the Reitsma function in the mada package in R (version 3.4.3), which uses a bivariate model equivalent to the hierarchical summary receiver operating characteristic (HSROC) model of Rutter and Gatsonis.19 The authors used a summary receiver operating characteristic (ROC) curve to plot 95% confidence intervals for the summary estimates and calculated the area under the ROC curve (AUROCC), also called the C-statistic. Heterogeneity was evaluated using inspection of the summary ROC plots and confidence intervals, as I² is not recommended for use in diagnostic meta-analysis20 or when there is a small number of primary studies.21 To facilitate comparison with a dichotomous overall clinical impression for each diagnosis, clinical decision rules were dichotomised into low or moderate versus high risk, or low risk versus moderate or high risk depending on which approach provided the highest diagnostic odds ratio (DOR).
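For readers unfamiliar with bivariate meta-analysis, the following Python sketch shows only the per-study quantities that such a model typically takes as input: the logit-transformed sensitivity and false-positive rate of each study, with approximate within-study variances. It does not fit the bivariate model and is not the mada implementation. The function name `study_inputs`, the 0.5 continuity correction, and the example counts are assumptions made for illustration.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def study_inputs(tp, fp, fn, tn, correction=0.5):
    """Per-study logit sensitivity and logit false-positive rate.

    A continuity correction (assumed here to be 0.5 added to every cell)
    avoids division by zero when a cell of the 2x2 table is empty.
    """
    tp, fp, fn, tn = (x + correction for x in (tp, fp, fn, tn))
    sens = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return {
        "logit_sens": logit(sens),
        "logit_fpr": logit(fpr),
        # Usual large-sample variance of a logit-transformed proportion.
        "var_logit_sens": 1 / tp + 1 / fn,
        "var_logit_fpr": 1 / fp + 1 / tn,
    }


# Hypothetical 2x2 counts for one study (not taken from the review).
inputs = study_inputs(tp=10, fp=5, fn=10, tn=85)
print(inputs)
```

A bivariate model would then pool these pairs across studies while allowing for between-study variation and for correlation between sensitivity and false-positive rate.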
RESULTS
The initial search identified 2109 articles, of which 54 were evaluated as full text and 15 met the inclusion criteria. A review of references of included studies identified no additional studies for full-text review. The Google search identified no additional studies, whereas the review of previous systematic reviews identified one additional study of pharyngitis22 for a final total of 16 included studies (three acute pharyngitis, nine CAP, and four ARS or ABRS). The search is summarised in Figure 1 using the Preferred Reporting Items for Systematic Reviews and Meta-analysis (PRISMA) framework.
Figure 1. PRISMA flow diagram of study search.
Characteristics of included studies
The characteristics of the included studies are summarised in Table 1. A total of six studies took place in the US, four in Sweden, and one each in Ireland, Israel, Lesotho, Norway, Spain, and a consortium of 12 European countries. Most gathered data in either a primary care clinic or the ED or a combination of those sites. Regarding age group, 11 studies enrolled only adults, four only children, and one both adults and children. All studies of pneumonia diagnosis used chest radiography as the reference standard, all studies of pharyngitis used throat culture, and studies of rhinosinusitis used either antral puncture revealing purulent fluid23–25 or sinus radiography.26 The rhinosinusitis and pneumonia studies generally included patients where there was already some clinical suspicion for these diagnoses; an exception was the study by van Vugt and colleagues that included any patient with acute cough.11 The prevalence of pneumonia varied from 5% in the van Vugt study to 44%; the median prevalence was 15%. The pharyngitis studies had broad inclusion criteria of any patient with a sore throat, with prevalence of GABHS pharyngitis ranging from 17% to 31%.
Table 1. Characteristics of included studies
Quality assessment
The assessment of study quality using the QUADAS-2 framework is summarised in Table 2. The authors judged nine studies to be at low risk of bias, six to be at moderate risk of bias, and three to be at high risk of bias. One study reported data from three sites, two of which were judged low risk of bias and one high risk of bias.17
Table 2. Assessment of study quality using the QUADAS-2 framework
Accuracy of the overall clinical impression (‘clinical gestalt’)
The accuracy of clinical gestalt as a diagnostic test for GABHS pharyngitis, ARS, and CAP is summarised in Table 3. Due to differences in the clinical presentation of pneumonia in children and adults, as well as observed heterogeneity in the summary ROC curve, results for the accuracy of CAP in adults and children with suspected pneumonia are reported separately. The summary estimates for the positive (LR+) and negative (LR−) likelihood ratios were LR+ 7.7, 95% confidence interval (CI) = 4.8 to 11.5 and LR− 0.54, 95% CI = 0.42 to 0.65 for the diagnosis of CAP in adults; LR+ 2.7, 95% CI = 1.1 to 4.3, and LR− 0.63, 95% CI = 0.20 to 0.98 for the diagnosis of CAP in children; LR+ 3.0, 95% CI = 2.1 to 4.4 and LR− 0.37, 95% CI = 0.29 to 0.46 for ARS in adults; LR+ 3.9, 95% CI = 2.4 to 5.9 and LR− 0.33, 95% CI = 0.20 to 0.50 for ABRS in adults; and LR+ 2.1, 95% CI = 1.6 to 2.8 and LR− 0.47, 95% CI = 0.36 to 0.60 for GABHS pharyngitis in both adults and children. Based on the diagnostic odds ratio, clinical gestalt was most accurate for diagnosis of CAP in adults (DOR 14.2, 95% CI = 9.0 to 21.0), ABRS in adults (DOR 13.0, 95% CI = 5.0 to 27.0), and ARS in adults (DOR 8.3, 95% CI = 4.9 to 13.1). It was less accurate for the diagnosis of CAP in children (DOR 5.5) and GABHS pharyngitis (DOR 4.6).
Table 3. Summary estimates of diagnostic accuracy of clinical gestalt for the diagnosis of common respiratory infections
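To show what the summary likelihood ratios mean for an individual patient, the following Python sketch converts a pre-test probability into a post-test probability using odds. The likelihood ratios are the summary estimates for CAP in adults reported above. The pre-test probability of 15% is the median prevalence of pneumonia across the included studies, and applying it to an adult patient is an illustrative assumption. The function name `post_test_probability` is invented for the example.

```python
def post_test_probability(pre_test, likelihood_ratio):
    """Apply a likelihood ratio to a pre-test probability via odds."""
    if not 0 < pre_test < 1:
        raise ValueError("pre_test must be strictly between 0 and 1")
    pre_odds = pre_test / (1 - pre_test)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


PRE_TEST = 0.15   # median pneumonia prevalence in the included studies (illustrative use)
LR_POS = 7.7      # summary LR+ for CAP in adults
LR_NEG = 0.54     # summary LR- for CAP in adults

positive = post_test_probability(PRE_TEST, LR_POS)
negative = post_test_probability(PRE_TEST, LR_NEG)
print(f"gestalt positive: {positive:.2f}, gestalt negative: {negative:.2f}")
```

Under these assumptions, a clinical impression of pneumonia raises the probability from 15% to about 58%, whereas an impression of no pneumonia lowers it only to about 9%.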
The summary ROC curves are shown in Figure 2. The summary AUROCC of the overall clinical impression as a test for CAP was 0.80 in both children and adults, 0.77 for ARS in adults, 0.84 for ABRS in adults, and 0.73 for GABHS pharyngitis in adults and children. Note that the C-statistic for CAP in children was unreliable in the authors’ judgement based on the small number of studies and high heterogeneity. Inspection of the summary ROC curves in Figure 2 reveals different patterns of heterogeneity for each disease. There was good homogeneity for the diagnosis of acute pharyngitis, despite the fact that the three studies enrolled children in one, adults in another, and both in a third. For sinusitis, there was good homogeneity with regard to sensitivity (range 0.71 to 0.84) but less with regard to specificity (range 0.61 to 0.92).
Figure 2. Summary receiver operating characteristic curves (ROC) are shown for the accuracy of clinical gestalt in the diagnosis of community-acquired pneumonia (CAP) in adults, CAP in children, group A beta-haemolytic streptococcal (GABHS) pharyngitis, and acute rhinosinusitis (ARS).
For the diagnosis of CAP in adults, the ROC curve showed a pattern that was consistent with a threshold effect. That is, as sensitivity increases, specificity decreases, with the points arrayed along the ROC curve. There was also better homogeneity for studies of CAP in adults compared with studies in children, which are presented separately in the ROC curves. As noted before, most studies in this group were limited to patients with clinically suspected disease. The one study with very broad inclusion criteria of any patient with cough had the highest specificity (0.99) but among the lowest sensitivities (0.29), perhaps a consequence of the low prevalence of CAP.11
Accuracy of clinical decision rules
For comparison with the overall clinical impression, the authors determined the accuracy of CDRs for GABHS pharyngitis in children and adults,8,35 CAP,36 and acute bacterial rhinosinusitis (ABRS).10 The accuracy of the Strep Score for GABHS pharyngitis in adults and children was obtained from recent systematic reviews.13,35 The accuracy of the CDR for CAP was obtained from a large European study of outpatients with acute cough where all received a chest radiograph.36 The CDRs for ARS and ABRS were developed by the author based on a study of 175 primary care patients who all underwent CT, and antral puncture for fluid and culture if fluid was seen on CT.10 ARS was defined as abnormal CT, and ABRS as abnormal culture of antral puncture fluid, as in the clinical gestalt studies. The accuracy of the CDRs is summarised in Table 4.
Table 4. Accuracy of selected clinical decision rules for pneumonia, pharyngitis, and acute rhinosinusitis
DISCUSSION
Summary
This is the first systematic review of the accuracy of clinical gestalt or the overall clinical impression as a diagnostic test. The authors found that the overall clinical impression is an accurate diagnostic test for CAP, ARS, and ABRS in adults (DOR 14.2, 8.3, and 13.0, respectively), and is moderately accurate for the diagnosis of GABHS pharyngitis in adults and children (DOR 4.6) and for the diagnosis of CAP in children (DOR 5.5).
Clinical gestalt is more accurate than individual signs and symptoms for all three conditions, and compares well with clinical decision rules. For example, using a cut-off of three or more out of four symptoms as a positive test, the Strep Score had diagnostic odds ratios of 4.2 in adults and 2.5 in children, compared with a DOR of 4.6 for the overall clinical impression in mixed populations of adults and children. The CDR for CAP in adults had a DOR of 7.2, compared with a DOR of 14.2 for the overall clinical impression in adults. For ARS, the CDR had a DOR of 3.6 compared with 8.3 for clinical gestalt. For ABRS, the CDR had a DOR of 5.9, compared with 13.0 for clinical gestalt. In all cases, the overall clinical impression performed as well as or better than the clinical decision rule.
Patterns of heterogeneity differed between conditions. There was good homogeneity around estimates of the accuracy of gestalt for pharyngitis, for ABRS using antral puncture as the reference standard, and for CAP in adults. A threshold effect can be observed for the diagnosis of CAP. A threshold effect is the result of a trade-off between sensitivity and specificity, and may occur when different definitions of the outcome of interest are used, such as different thresholds for diagnosis of CAP. Some physicians may prioritise sensitivity at the price of specificity, and others specificity at the price of sensitivity.
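The threshold effect described above can be illustrated with a small Python sketch in which the same set of physician probability estimates is dichotomised at different cut-offs. All values are hypothetical and chosen only to show the trade-off. The function name `sens_spec` is invented for the example, and a test is counted as positive when the estimate exceeds the threshold.

```python
def sens_spec(estimates, has_disease, threshold):
    """Sensitivity and specificity when estimates above the threshold count as positive."""
    tp = sum(1 for p, d in zip(estimates, has_disease) if d and p > threshold)
    fn = sum(1 for p, d in zip(estimates, has_disease) if d and p <= threshold)
    tn = sum(1 for p, d in zip(estimates, has_disease) if not d and p <= threshold)
    fp = sum(1 for p, d in zip(estimates, has_disease) if not d and p > threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical physician estimates of the probability of CAP (not study data).
estimates = [0.9, 0.7, 0.6, 0.4, 0.2, 0.8, 0.55, 0.3, 0.2, 0.1, 0.05, 0.35]
has_disease = [True] * 5 + [False] * 7

# A lenient, a middle, and a strict diagnostic threshold.
results = {t: sens_spec(estimates, has_disease, t) for t in (0.25, 0.5, 0.75)}
for threshold, (sens, spec) in results.items():
    print(f"threshold {threshold:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this invented example, raising the threshold lowers sensitivity and raises specificity, which is the pattern of points arrayed along a single ROC curve that a threshold effect produces across studies.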
Strengths and limitations
A strength of this study is that the results for the accuracy of clinical gestalt were fairly consistent for adults with CAP, ABRS, and pharyngitis based on inspection of the summary ROC curves. Other strengths include the use of modern methods for diagnostic meta-analysis, a comprehensive search, and the fact that only three of 18 studies were judged to be at high risk of bias. This study also had several limitations. The clinical decision rules discussed above for ARS and CAP have not been prospectively validated. However, accuracy usually suffers during prospective validation, so the fact that gestalt was as accurate as these proposed CDRs is notable. There was a fairly small number of studies, several were quite old, some were at high risk of bias, and three of the four for ARS were by the same author. There was also considerable heterogeneity with regard to inclusion criteria, the age of participants, and the reference standards used. Finally, the studies of pneumonia generally included only patients in whom there was already some clinical suspicion of CAP. However, only a minority of patients in each of the nine studies had CAP diagnosed by radiography.
Comparison with existing literature
The authors conclude that clinical gestalt is either similarly accurate to or more accurate than CDRs based on usual metrics of diagnostic accuracy. Since clinical gestalt requires no calculations, no algorithm, and no computer, it is not surprising that it is far more widely used than CDRs for clinical decision making. That said, the ability to use clinical gestalt as an accurate test for pneumonia or acute rhinosinusitis is not innate. It must be developed and cultivated, as any skill, and likely requires exposure to a great many cases with a known outcome (‘patterns’) before it is fully developed and accurate. Artificial neural networks can be ‘trained’ to create a complex algorithm by exposing the network to a large number of patterns with known outcomes, eventually developing the ability to accurately make predictions for new cases.
Multivariate models and neural networks typically require several hundred or more patterns to create a predictive model. How many of these known cases or ‘patterns’ are required before the human brain is trained remains unclear. Bierema proposes a model for professional knowledge development that identifies stages of novice, beginner, competent, proficient, expert, and generative leader.37 For novice and beginner learners, CDRs can be used to hone diagnostic skills and teach them the best independent predictors of disease, providing focus and a framework for their diagnostic training. For the proficient and expert physician, the CDR moves to the background, while a physician who is a generative leader may further develop and improve CDRs.
Implications for research and practice
The authors propose that use of formal CDRs is potentially most useful for early-stage clinicians, who have not yet been exposed to a large number of patterns. As they develop their own clinical gestalt, informed by repeated use of validated CDRs, they may eventually rely less and less on the CDR. But even for experienced clinicians CDRs can serve as a back-up to their clinical gestalt. For example, if a physician judges that a patient with CAP can be treated as an outpatient, it is still worthwhile to double-check that judgement by calculating the CRB-65 prognostic score for pneumonia.38 In fact, both the clinical decision rule and clinical gestalt only identified about half of the patients with pneumonia, missing the other half. Thus, use of a CDR and clinical gestalt may be complementary and supportive of each other rather than an either/or proposition.
In conclusion, clinical gestalt is accurate for the diagnosis of CAP, ARS, and ABRS in adults, and the overall accuracy is similar to or better than that of clinical decision rules. Experienced clinicians should be confident in their use of the overall clinical impression and use clinical decision rules as a backstop to that judgement. Trainees, on the other hand, may benefit more from explicit use of CDRs until they develop their clinical skills. Further work is needed to understand how to best teach clinical gestalt to trainees.
Future studies of clinical diagnosis should routinely include an ‘overall clinical impression’ question to gather further data on the accuracy of clinical gestalt for a range of conditions, including non-infectious conditions such as chest pain, deep vein thrombosis, and pulmonary embolism. If found to be accurate and reliable for the diagnosis of a disease, the overall clinical impression could be built into guidelines regarding the evaluation of a range of conditions such as suspected sepsis, myocardial infarction, depression, and early diagnosis of cancer. It will also be important to consider how an overall judgement about the likelihood of disease fits with the threshold framework for decision making, such that a judgement of ‘disease is unlikely’ also falls below the test threshold for that disease.39