Identifying patients with suspected lung cancer in primary care: derivation and validation of an algorithm

Background Lung cancer has one of the lowest survival outcomes of any cancer because more than two-thirds of patients are diagnosed when curative treatment is not possible. The challenge is to help earlier diagnosis of lung cancer and hence improve prognosis. Aim To derive and validate an algorithm incorporating information on symptoms, to estimate the absolute risk of having lung cancer. Design and setting Cohort study of 375 UK QResearch ® general practices for development, and 189 for validation. Method Selected patients were aged 30–84 years and free of lung cancer at baseline and haemoptysis, loss of appetite, or weight loss in the previous 12 months.


INTRODUCTION
Lung cancer is the most common cancer worldwide, with 1.3 million new cases diagnosed every year. 1 It has one of the lowest survival outcomes of any cancer because over two-thirds of patients are diagnosed when curative treatment is not possible. 2 In addition to preventing lung cancer by promoting smoking cessation, the challenge is to help earlier diagnosis, since prognosis varies according to the stage of cancer at presentation. 3 Earlier diagnosis of lung cancers could be improved by a combination of systematic screening of high-risk individuals using spiral CT (computerised tomography) scanning, particularly where there is likely to be a favourable benefit-to-harm ratio, [4][5][6] and by facilitating the earlier investigation and referral of high-risk symptomatic individuals who present to their family physician. Cancer symptoms present a very real challenge for family physicians, since the symptoms can be common and nonspecific, making it difficult to reliably distinguish patients who need further investigation from those who can be reassured.
While smoking is a well-established major risk factor for lung cancer, 7-9 a significant proportion of cancers develop in non-smokers, 10 and not all long-term heavy smokers develop lung cancer, suggesting that other factors also play an important role. Evidence suggests that age, deprivation, previous diagnoses of other cancers, previous pneumonia, family history of lung cancer, and asbestos exposure also increase long-term risk independently of smoking. 8,11 In addition, 'red-flag' symptoms such as haemoptysis, loss of appetite, dyspnoea, and cough might herald an existing condition of lung cancer, 12 especially among individuals with a high baseline risk. However, an approach that focuses on individual 'red-flag' symptoms such as haemoptysis without taking account of other risk factors is likely to miss 80% of current lung cancers. 13 A variety of factors, therefore, need to be combined to develop a risk-prediction algorithm to help clinicians better assess and prioritise patients at high risk of having lung cancer, for further investigation or referral. While the case for such models is accepted, and some models that estimate long-term risk have been published, 8,14,15 there are no models that combine baseline risk and symptoms.
This study aimed to develop and validate an algorithm to estimate the individualised absolute risk of having lung cancer, incorporating both symptoms and baseline risk factors, to help identify those at highest risk for further investigation or referral. QResearch ® (a large UK primary care database) was used to develop the risk-prediction models, since it contains robust data on many of the relevant exposures and outcomes. It is also representative of the population where such a model is likely to be used and has been used successfully to develop and validate a range of prognostic models for use in primary care. [16][17][18][19][20] Once validated, the prediction models could be integrated into clinical computer systems to help systematically identify those at high risk, and alert clinicians to those who might benefit most from further assessment or interventions. 16,18

Aim
To derive and validate an algorithm incorporating information on symptoms, to estimate the absolute risk of having lung cancer.

Design and setting
Cohort study of 375 UK QResearch ® general practices for development, and 189 for validation.

Method
Selected patients were aged 30-84 years and free of lung cancer at baseline and haemoptysis, loss of appetite, or weight loss in previous 12 months. Primary outcome was incident diagnosis of lung cancer recorded in the next 2 years. Risk factors examined were: haemoptysis, appetite loss, weight loss, cough, dyspnoea, tiredness, hoarseness, smoking, body mass index, deprivation score, family history of lung cancer, other cancers, asthma, chronic obstructive airways disease, pneumonia, asbestos exposure, and anaemia. Cox proportional hazards models with age as the underlying time variable were used to develop separate risk equations in males and females. Measures of calibration and discrimination assessed performance in the validation cohort.

Results
There were 3785 incident cases of lung cancer arising from 4 289 282 person-years in the derivation cohort. Independent predictors were haemoptysis, appetite loss, weight loss, cough, body mass index, deprivation score, smoking status, chronic obstructive airways disease, anaemia, and prior cancer (females only). On validation, the algorithms explained 72% of the variation. The receiver operating characteristic (ROC) statistics were 0.92 for both females and males. The D statistic was 3.25 for females and 3.29 for males. The 10% of patients with the highest predicted risks included 77% of all lung cancers diagnosed over the subsequent 2 years.

METHOD
Study design and data source
A prospective cohort study was carried out in a large population of primary care patients from an open cohort study, using the QResearch database (version 30). The study included all practices in England and Wales that had been using their EMIS ® (Egton Medical Information Systems) computer system for at least a year. Two-thirds of practices were randomly allocated to the derivation dataset and the remaining one-third to a validation dataset. An open cohort of patients was identified aged 30-84 years, drawn from patients registered with practices between 1 Jan 2000 and 30 September 2010. The study excluded patients without a postcode-related Townsend score, patients with a history of lung cancer at baseline, and those with a first recorded 'red-flag' symptom in the 12 months prior to baseline; that is, symptoms of haemoptysis, loss of appetite, or weight loss, which might indicate lung cancer.
Entry to the cohort was the latest of the study start date (1 Jan 2000), 12 months after the patient registered with the practice and, for those patients with incident haemoptysis, loss of appetite, or weight loss, the date of first recorded onset within the study period.
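The entry rule above can be sketched in a few lines of Python. This is only an illustration of the "latest of" logic as described; the function and argument names are hypothetical, and the handling of 29 February is an assumption rather than anything stated in the paper.

```python
from datetime import date
from typing import Optional

STUDY_START = date(2000, 1, 1)  # study start date given in the text


def one_year_after(d: date) -> date:
    """Return the date 12 months after d (29 Feb maps to 28 Feb; an assumption)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def cohort_entry(registered: date, first_red_flag: Optional[date] = None) -> date:
    """Latest of: study start, 12 months after registration, and (if any)
    the first recorded red-flag symptom within the study period."""
    candidates = [STUDY_START, one_year_after(registered)]
    if first_red_flag is not None:
        candidates.append(first_red_flag)
    return max(candidates)
```

For a patient registered well before 2000 with no red-flag symptom, this returns the study start date; a recorded first onset of haemoptysis later in the study period moves entry to that date.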

Clinical outcome definition
The study outcome was incident diagnosis of lung cancer during the subsequent 2 years, recorded either on the patient's GP record using the relevant UK diagnostic codes or on their linked Office for National Statistics (ONS) cause-of-death record, using the relevant International Classification of Diseases (ICD)-9 codes or ICD-10 diagnostic codes (codes available from the authors). A 2-year follow-up was used, since this represents the period of time during which existing lung cancers are likely to become clinically manifest. 13,21 It was assumed that where lung cancer deaths occurred within 2 years, without a recorded diagnostic code in the GP record, the cancer would have been present at the start of the 2-year period.

Predictor variables
Established predictor variables were examined, focusing on those that are likely to be recorded in the patient's electronic record and that the patient themself is likely to know. Three 'red-flag' symptoms were also included (haemoptysis, loss of appetite, and weight loss) as well as other symptoms that might herald a diagnosis of lung cancer. Separate analyses were carried out in males and females, and age was accounted for by using it as the underlying time variable in the analyses. The predictor variables examined were:
• currently consulting GP with first onset of haemoptysis (yes/no); 12
• currently consulting GP with first onset of loss of appetite (yes/no); 12
• currently consulting GP with first onset of weight-loss symptom (yes/no); 12
• recently consulted GP with first onset of any of:
  - cough in the past 12 months (yes/no); 12
  - dyspnoea in the past 12 months (yes/no); 12
  - tiredness in the past 12 months (yes/no);
  - hoarseness in the past 12 months (yes/no);
• body mass index (BMI, continuous);
• chronic obstructive airways disease diagnosed ever (yes/no); 8,14
• Townsend deprivation score (continuous); 23
• family history of lung cancer (yes/no); 8,14

How this fits in
Lung cancer is the most common cancer worldwide and has poor survival, since many cancers are diagnosed late when curative treatment is not possible. Symptoms that might herald a diagnosis of lung cancer are common and non-specific, making it difficult for GPs to identify high-risk patients. The QLung ® cancer algorithm developed in this study includes age, haemoptysis, appetite loss, weight loss, cough, body mass index, deprivation score, smoking status, chronic obstructive airways disease, anaemia, and prior cancer (females only). It has good discrimination and calibration and could be used to identify those at highest risk for early referral and investigation.
Variables were included in the final model if they had a hazard ratio of <0.80 or >1.20 (for binary variables) and were statistically significant at the 0.01 level. Tests were also carried out for interactions between smoking and deprivation.
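The inclusion rule for binary variables can be written as a simple predicate. The sketch below only restates the two stated criteria; the function name is hypothetical, and "significant at the 0.01 level" is read here as p < 0.01.

```python
def retain_binary_predictor(hazard_ratio: float, p_value: float) -> bool:
    """Keep a binary variable if its hazard ratio is <0.80 or >1.20
    and it is statistically significant at the 0.01 level."""
    large_enough_effect = hazard_ratio < 0.80 or hazard_ratio > 1.20
    significant = p_value < 0.01  # assumption: strict inequality
    return large_enough_effect and significant
```

Under this rule a variable with a hazard ratio of 1.1 would be dropped however small its p-value, which keeps clinically negligible effects out of a model fitted to a very large cohort.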

Derivation and validation of the models
Multiple imputation was used to replace missing values for smoking status and BMI. 25 Fractional polynomials were used to model non-linear risk relations with BMI. 26 Cause-specific hazard models were used to account for competing risks, which involved fitting two separate Cox models, one for lung cancer and one for deaths from other causes, including the same predictor variables in both models. Patients who did not die or have lung cancer within 2 years were censored at the earliest date of deregistration with the practice, last upload of computerised data, or after 2 years.
Age was used as the underlying time function in the Cox regression, by setting the origin as the patient's date of birth, as done elsewhere, 27 and defining a delayed entry date as the study entry date. 27 The risk for each patient over 2 years was evaluated. Separate analyses were carried out for males and females.
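One standard way to turn two cause-specific Cox models on the age scale into a 2-year absolute risk is sketched below. The notation is an assumption introduced here for illustration (it does not appear in the paper): $a$ is the patient's age at study entry, $x$ the vector of predictors, $h_1$ the cause-specific hazard for lung cancer, and $h_2$ the cause-specific hazard for death from other causes.

```latex
h_k(u \mid x) = h_{0k}(u)\,\exp\!\left(\beta_k^{\top} x\right), \qquad k = 1, 2

F_1(a, a+2 \mid x) = \int_{a}^{a+2} h_1(u \mid x)\,
  \exp\!\left(-\int_{a}^{u} \bigl[h_1(s \mid x) + h_2(s \mid x)\bigr]\, ds\right) du
```

The inner exponential is the probability of reaching age $u$ free of both lung cancer and death from other causes, so a high competing mortality hazard lowers the predicted lung cancer risk, which matters most in older patients.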
In order to validate the performance of each model, the algorithms were applied to the validation cohort and measures of discrimination calculated (D statistic and R² statistic for survival data, 28 and area under the receiver operating characteristic curve [ROC statistic]), over a 2-year period. To assess the calibration, observed risks were compared with mean predicted risks within each tenth of predicted risk over 2 years, taking account of competing risks in the calculation of observed risks.
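The calibration comparison by tenth of predicted risk can be illustrated as follows. This is a simplified sketch, not the authors' procedure: it treats the outcome as a plain 0/1 indicator with complete 2-year follow-up and therefore ignores the censoring and competing-risk adjustments described above. The function name is hypothetical.

```python
def calibration_by_tenth(predicted, observed):
    """Return a (mean predicted risk, observed proportion) pair for each
    tenth of predicted risk, from lowest to highest.

    predicted: predicted 2-year risks (floats in [0, 1])
    observed:  0/1 outcome indicators, same order as predicted
    """
    pairs = sorted(zip(predicted, observed))  # order patients by predicted risk
    n = len(pairs)
    rows = []
    for k in range(10):
        group = pairs[k * n // 10:(k + 1) * n // 10]
        if not group:  # fewer than 10 patients in total
            continue
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed_risk = sum(y for _, y in group) / len(group)
        rows.append((mean_predicted, observed_risk))
    return rows
```

A well-calibrated model gives pairs that lie close to the line of equality in every tenth, which is what a plot of observed against predicted risk is meant to show.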
The validation cohort was used to determine the sensitivity and positive predictive value of strategies for identifying patients at increased risk of having a diagnosis of lung cancer in the next 2 years. Confidence intervals (CIs) for sensitivity and positive predictive values were calculated using the method described by Newcombe. 29 Strategies were compared, based on absolute risk estimates generated from the algorithms, with a strategy based on investigating current or past smokers aged 40 years and over with haemoptysis, as recommended in UK National Institute for Health and Clinical Excellence (NICE) guidance on referral for suspected cancer. 30 All the available data in the derivation cohort were used to develop the model, and all the available data from the validation cohort were used to test its performance. STATA (version 11) was used for all analyses.
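As an illustration of the interval calculation, the sketch below implements the Wilson score interval for a single proportion. That this is the specific Newcombe method used (and without continuity correction) is an assumption; the paper only cites the reference.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.959964):
    """Wilson score interval for a proportion (default: 95% two-sided)."""
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    centre = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Sensitivity at the 90th centile as reported in the results:
# 1697 of the 2196 new cases fell in the high-risk group.
lower, upper = wilson_interval(1697, 2196)
```

Unlike the simple Wald interval, the Wilson interval stays within 0 and 1 and behaves sensibly for small proportions such as positive predictive values of around 1%.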

RESULTS
Overall study population
Overall, 564 QResearch practices in England and Wales met the study inclusion criteria, of which 375 were randomly assigned to the derivation dataset, with the remainder assigned to a validation cohort. A total of 2 538 615 patients aged 30-84 years were identified in the derivation cohort. The following were excluded: 124 458 patients (4.9%) without a recorded Townsend deprivation score, 18 with missing dates for the diagnoses of lung cancer, 1490 (0.1%) with a history of lung cancer, and a further 6522 patients (0.3%) with at least one 'red-flag' symptom (haemoptysis, loss of appetite, or weight loss) recorded in the 12 months prior to entry to the study at baseline, leaving 2 406 127 patients for analysis.
A total of 1 342 329 patients aged 30-84 years were identified in the validation cohort, and the following were excluded: 70 847 patients (5.3%) without a recorded Townsend score, eight (<0.1%) without a recorded date of diagnosis of lung cancer, 713 (0.1%) with a history of lung cancer, and 3610 (0.3%) with at least one 'red-flag' symptom recorded in the 12 months prior to study entry, leaving 1 267 151 patients for analysis.
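The exclusion counts for both cohorts can be checked with simple arithmetic. The sketch below uses only the figures reported above; the dictionary keys and function name are illustrative labels, not terms from the study.

```python
def remaining_after_exclusions(identified: int, exclusions: dict) -> int:
    """Subtract each exclusion count from the number of patients identified."""
    return identified - sum(exclusions.values())


derivation = remaining_after_exclusions(
    2_538_615,
    {
        "no_townsend_score": 124_458,
        "missing_diagnosis_date": 18,
        "prior_lung_cancer": 1_490,
        "red_flag_in_prior_12_months": 6_522,
    },
)

validation = remaining_after_exclusions(
    1_342_329,
    {
        "no_townsend_score": 70_847,
        "missing_diagnosis_date": 8,
        "prior_lung_cancer": 713,
        "red_flag_in_prior_12_months": 3_610,
    },
)
```

Both results agree with the analysis populations stated in the text (2 406 127 and 1 267 151 patients).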
The baseline characteristics of each cohort were very similar, as shown in Table 1. As in previous studies, [16][17][18] the patterns of missing data supported the use of multiple imputation to replace missing values for smoking and BMI (not shown, available from the authors).

Incidence rates for 'red-flag' symptoms
Overall, 13 980 patients with incident haemoptysis were identified in the derivation cohort, 11 853 with loss of appetite, and 30 937 with weight loss. Table 2 shows the incidence rates of each symptom in males and females, and how they generally increased with age.

Incidence rates of lung cancer
Overall in the derivation cohort, a total of 3785 incident cases of lung cancer were identified, arising from 4 289 282 person-years of observation, giving a rate of 88.2 per 100 000 person-years. Of these cases of lung cancer, 2794 (73.8% of 3785) were recorded on the GP record, and the remainder were identified solely from the linked ONS cause-of-death record; 62.7% of lung cancer cases occurred in males and the mean age at diagnosis was 71 years. Of the 2794 cases identified on the GP record, 1263 (45.2%) had symptoms recorded prior to diagnosis in the GP record. Of the 991 patients only identified via the linked ONS record, 248 (25.0%) had symptoms recorded prior to the death.
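The crude incidence rates reported for the derivation and validation cohorts follow directly from the case counts and person-years. A minimal check, with a hypothetical helper name:

```python
def rate_per_100k(cases: int, person_years: int) -> float:
    """Crude incidence rate per 100 000 person-years."""
    return cases / person_years * 100_000


derivation_rate = rate_per_100k(3_785, 4_289_282)   # derivation cohort
validation_rate = rate_per_100k(2_196, 2_260_901)   # validation cohort
```

Rounded to one decimal place these reproduce the published rates of 88.2 and 97.1 per 100 000 person-years.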
In the validation cohort, 2196 incident cases of lung cancer were identified, arising from 2 260 901 person-years of observation, giving a rate of 97.1 per 100 000 person-years. Of these cases of lung cancer, 1569 (71.4% of 2196) were recorded on the GP record, and the remainder were identified solely from the linked ONS cause-of-death record. The incidence of lung cancer was higher among males than females, and rose steeply with age. The age-sex incidence rates were similar to published national UK lung cancer incidence data. 31
British Journal of General Practice, November 2011 e718

Predictor variables
Table 3 shows the predictor variables selected for the final models for females and males. The final model for females (which has age as the underlying time function) included BMI, Townsend score, smoking status, a prior diagnosis of another cancer, chronic obstructive airways disease, Hb<11 g/dl, current haemoptysis, current appetite loss, current weight loss, and recent first onset of cough in last 12 months.
The risk of lung cancer in females was significantly associated with decreasing BMI, increasing deprivation, and amount smoked each day. For example, compared with non-smokers, the risks were increased by 10.6-fold for heavy smokers, 8.3-fold for moderate smokers, 6.6-fold for light smokers, and 3.4-fold for ex-smokers. Risks were also elevated among females with current haemoptysis (26.5-fold higher), current appetite loss (4.1-fold higher), current weight-loss symptom (4.5-fold higher), cough in the last 12 months (1.9-fold higher), chronic obstructive airways disease (1.8-fold higher), recorded Hb<11 g/dl in the last year (1.6-fold higher), and a prior diagnosis of another cancer (1.3-fold higher). The other variables examined were not independent risk factors in females, so were not included in the final model.
The final model for males was similar to that for females, except that it did not include history of another cancer. Prior history of cancer was significant for males on univariate analysis (unadjusted hazard ratio = 4.3, 95% CI = 3.6 to 5.1), but not after adjustment for other factors in the model. The magnitudes of the hazard ratios were generally similar to those found for females, apart from smoking, where the hazard ratios for males were lower than those for females.

Validation
The validation statistics (Table 4) showed that the risk-prediction equations explained 71.7% (95% CI = 70.3 to 73.1) of the variation in time to diagnosis in females and 72.1% of the variation in males (95% CI = 71.0 to 73.2). The D statistic was 3.25 (95% CI = 3.15 to 3.37) for females and 3.29 (95% CI = 3.20 to 3.38) for males. The ROC statistics were 0.92 (95% CI = 0.91 to 0.93) for both females and males. Figure 1 shows the mean predicted scores and the observed risks at 2 years within each tenth of predicted risk, in order to assess the calibration of the model in the validation cohort. There was close correspondence between predicted and observed 2-year risks within each model tenth, indicating that the algorithm was well calibrated.
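The explained-variation figures are consistent with the D statistics under the usual relationship between Royston and Sauerbrei's D and R² for proportional hazards models. The sketch below assumes that this is the R² reported; the paper cites the method without stating the formula, and the function name is hypothetical.

```python
import math


def r2_from_d(d: float) -> float:
    """R-squared implied by a D statistic for a Cox model.

    Uses R2 = (D^2 / kappa^2) / (sigma^2 + D^2 / kappa^2),
    with kappa^2 = 8 / pi and sigma^2 = pi^2 / 6.
    """
    kappa_sq = 8 / math.pi
    sigma_sq = math.pi ** 2 / 6
    scaled = d * d / kappa_sq
    return scaled / (sigma_sq + scaled)


r2_females = r2_from_d(3.25)  # roughly 0.72
r2_males = r2_from_d(3.29)    # roughly 0.72
```

Both values land within rounding distance of the reported 71.7% and 72.1%, given that the D statistics are themselves quoted to two decimal places.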

Individual risk assessment and thresholds
One potential use for this algorithm is within consultations with individual patients, particularly if they present with new onset of haemoptysis or unexplained anaemia. The results could help inform the decision to undertake further investigation such as a chest X-ray or spiral CT, and/or the degree of urgency for referring the patients to secondary care. Some clinical examples are shown in Box 1.
Since this is a new algorithm, there are no established thresholds for defining high-risk groups. A range of centiles of predicted risk were calculated from the validation population, to define a high-risk group (that is, the top 0.5%, 1%, 5%, and 10% at highest risk) for males and females combined. The numbers and proportion of incident cases in the validation cohort that fell within each category of risk were then determined.
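The centile-based approach described above can be sketched as follows: flag the top fraction of patients by predicted risk, then count how many incident cases fall in the flagged group. This is an illustrative simplification with hypothetical names; it treats outcomes as 0/1 indicators and breaks ties in predicted risk arbitrarily.

```python
def evaluate_top_fraction(predicted, has_cancer, top_fraction):
    """Sensitivity and PPV when the top `top_fraction` of predicted risks
    is treated as the high-risk group.

    predicted:  list of predicted 2-year risks
    has_cancer: list of 0/1 indicators in the same order
    """
    ranked = sorted(zip(predicted, has_cancer), key=lambda pair: pair[0], reverse=True)
    n_flagged = max(1, round(len(ranked) * top_fraction))
    flagged = ranked[:n_flagged]
    true_positives = sum(y for _, y in flagged)
    total_cases = sum(has_cancer)
    return {
        "threshold": flagged[-1][0],  # lowest predicted risk still flagged
        "sensitivity": true_positives / total_cases,
        "ppv": true_positives / n_flagged,
    }
```

Lowering `top_fraction` from 0.10 to 0.005 trades sensitivity for positive predictive value, which is the trade-off a choice of referral threshold has to balance.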
The 90th centile defined a high-risk group with a 2-year risk score of >0.37% (Table 5). There were 1697 new cases of lung cancer within this group, out of 2196 new cases identified in the validation cohort over 2 years, which accounted for 77.3% of all new cases of lung cancer (sensitivity). The positive predictive value (PPV) with this threshold was 1.3%. Alternatively, using a threshold based on the top 0.5% of risk had a sensitivity of 27.4% and a PPV of 9.5%. In contrast, only 18.4% of lung cancers occurred in patients aged 40 years and over presenting with a first onset of haemoptysis, who were current or ex-smokers (in other words, the sensitivity of this approach is low and approximately 82% of cases of lung cancer would be missed). The PPV in this group was 9.7%. Only 23.0% of lung cancer cases occurred in patients with haemoptysis, and the PPV for haemoptysis was 6.4%.

DISCUSSION
This study developed and validated a new algorithm designed to estimate the absolute risk of having lung cancer, which is either currently present or likely to become manifest within 2 years. This can therefore be used as a prediction model to identify patients with an existing but as yet undiagnosed lung cancer. The algorithm is based on simple clinical variables that can be ascertained in clinical practice. The algorithm performed well in a separate validation sample, with good discrimination and calibration. It could identify 10% of the population in which over 76% of all new lung cancer cases arose over 2 years.

Strengths and limitations
Key strengths of the study include size, duration of follow-up, representativeness, and lack of selection, recall, and responder bias. The analysis accounts for competing risk of death from other causes, which is especially important in the older population. UK general practices have good levels of accuracy and completeness in recording clinical diagnoses and prescribed medications. 32 The authors consider that the study has good face validity, since it has been conducted in the setting where the majority of patients in the UK are assessed, treated, and followed-up. The algorithms have been developed in one cohort and validated in a separate cohort that is representative of the patients likely to be considered for preventative measures. While other risk-prediction models for lung cancer have been developed, none can be directly compared since none include symptoms.
Limitations of the study include lack of formally adjudicated outcomes, information bias, and missing data. The database has linked cause of death from the UK ONS and the study is therefore likely to have picked up the majority of cases of lung cancer, thereby minimising ascertainment bias. Patients who die of lung cancer in hospital will be included on the linked cause-of-death data. Patients diagnosed with lung cancer in hospital will have the information recorded in hospital discharge letters, which are sent to the GP and this information is then entered into the patient's electronic record. The incidence rate in the study population was close to published national data, with similar patterns by age and sex. 31 While the study was reliant on the accuracy of information recorded by primary care physicians, the quality of information is likely to be good since previous studies have validated similar outcomes and exposures using questionnaire data, and found levels of completeness and accuracy in similar GP databases to be good. 33,34 For example, one systematic review reported that on average 89% of diagnoses recorded on the GP electronic record are confirmed from other data sources. 33,35 However, one significant limitation of this study is that the stage of lung cancer at diagnosis is not recorded in either the GP record or the linked cause-of-death record. Additional data from cancer registries would need to be linked to the GP record. This is not currently available, although work is in progress to undertake this linkage so it will be available for future versions of this tool. Also, there is no evidence from the present study about whether use of this symptom-based tool is likely to lead to earlier identification of lung cancer at a stage when curative treatment (that is, surgery) is more likely to be possible. A cluster randomised clinical controlled trial comparing use of this tool in intervention general practices against 'usual practice' in control practices could help answer such a question.
Another limitation of the study is that recording of symptoms may be less complete or accurate than diagnostic codes, since patients might not visit their GP with mild symptoms, and may not report all symptoms to their GP when they do consult, or GPs might not record all the symptoms in the electronic health record. The effect of this information or recording bias would be to overinflate the hazard ratios if they relate to more severe symptoms (for example, major loss of appetite) or underestimate the hazard ratio if patients with the symptoms do not have them recorded. Similarly, family history of lung cancer might be under-recorded, since it is not routinely assessed and recorded in GP records. Lastly, it is possible that some patients might misreport their smoking habits to their GP. For example, smoking status was defined on the basis of self-report, and the definition of an ex-smoker is a patient whose last recorded smoking status was as an ex-smoker, regardless of when they stopped smoking. Some ex-smokers may consider themselves as a never-smoker after many years have elapsed. If this were to occur, then it would tend to bias the hazard ratios for ex-smoker towards one.

Box 1. Clinical examples
• A 78-year-old female who is an ex-smoker with a BMI of 25.7 kg/m² and has a history of chronic obstructive airways disease, who presents to the GP with haemoptysis and has had a cough and a Hb<11 g/dl recorded in the last 12 months, has an estimated risk of 37% of having existing lung cancer as yet undiagnosed. If the patient also has loss of appetite and weight loss, the estimated risk increases to 76%. Although this patient is an ex-smoker, she is at particularly high estimated risk of having lung cancer and therefore would warrant an urgent referral for further investigation.
• A 67-year-old male who is a heavy smoker with a BMI of 27.5 kg/m², a history of chronic obstructive airways disease, loss of appetite, and weight loss but who has not presented to the GP with a cough or haemoptysis, has a 29% estimated risk of having existing lung cancer as yet undiagnosed. While this patient does not have the 'red-flag' symptom of haemoptysis, the other factors that are present place him into a high-risk category likely to need urgent referral or investigation.
• A 40-year-old male with a BMI of 27.5 kg/m² who is a heavy smoker who presents with haemoptysis but no other symptoms and no evidence of anaemia, has a 0.2% estimated risk of having existing lung cancer as yet undiagnosed.
• A 50-year-old male with a BMI of 22 kg/m² who is a non-smoker and presents to the GP with haemoptysis, loss of appetite, and weight loss, and has had a cough and Hb<11 g/dl recorded in the last 12 months, has a 28% estimated risk of having existing lung cancer as yet undiagnosed.

Comparison with existing literature
The study is based on a large representative primary care population. While other studies have examined chronic risk factors, 8 or symptoms separately, 12,13,36 to the authors' knowledge, this is the first study to produce a measure of absolute risk of current lung cancer based on a combination of symptoms (haemoptysis, appetite loss, weight loss, and cough) as well as demographic information, anaemia, BMI, smoking status, chronic obstructive airways disease, and prior cancer (in females). The significance of prior cancer as a risk factor in females but not males is of interest and deserves further study. The direction and magnitude of the hazard ratios in the present study for smoking status and history of another cancer are comparable to those reported in other studies. 37,38 The algorithm performed well in a separate validation sample, with good discrimination and calibration. It can identify the 10% of the population in which approximately 76% of all new lung cancer cases are likely to be diagnosed over the next 2 years.
Comparison of published discrimination statistics suggests the new model performs well. The ROC values were 0.92 in males and females, which is substantially higher than for the model by Spitz and coworkers, with biomarkers (ROC of 0.73), 15 and the Liverpool Lung Project (ROC value of 0.71). 8 The Bach et al model is based on a person's age, sex, and smoking history but only applies for individuals aged 50-75 years who have smoked 10-60 cigarettes/day for 25-35 years. 22 The expanded Spitz et al model includes more variables (environmental tobacco smoke, family history of cancer, dust exposure, prior respiratory disease, and smoking history variables) but requires genetic testing, which is unavailable in the dataset for the present study, and unlikely to be available for routine clinical use. 15

Implications for research and practice
One practical mechanism to help improve clinical recording of family history and symptoms for future studies would be to introduce electronic templates into GP clinical systems, which are displayed when a 'red-flag' symptom is recorded in the patient's record. The template would then help structured data entry of other related symptoms including significant negative findings. Over time, this would improve the accuracy and completeness of the electronic record and hence the underlying data used for future versions of this algorithm.
The algorithm has a number of potential clinical applications. First, it could be used to help inform the revision of NICE guidance on the investigation and referral of patients with suspected cancer. 30 For example, current NICE guidance recommends an urgent referral for a chest X-ray in patients with persistent symptoms such as haemoptysis, chest pain, dyspnoea, cough, or weight loss, but not for appetite loss, although this study has demonstrated a four- to fivefold increase in risk of cancer with this symptom, independently of other factors. Urgent referral is recommended by NICE for persistent haemoptysis in smokers or ex-smokers who are aged 40 years and older, or those whose chest X-ray is suggestive of lung cancer.

Funding
There was no external funding for this study.

Ethical approval
All QResearch ® studies are independently reviewed in accordance with the QResearch ® agreement with Trent Multi-Centre Ethics Committee (UK).

Provenance
Freely submitted; externally peer reviewed.

Web calculator
Here is a simple web calculator to implement the QCancer ® (Lung) algorithm, which is publicly available alongside the paper and open source software: http://www.qcancer.org/lung.

Competing interests
Julia Hippisley-Cox is professor of clinical epidemiology at the University of Nottingham and co-director of QResearch ® -a not-for-profit organisation which is a joint partnership between the University of Nottingham and EMIS ® (leading commercial supplier of IT for 60% of general practices in the UK). Julia Hippisley-Cox is also a paid director of ClinRisk Ltd, which produces software to ensure the reliable and updatable implementation of clinical risk algorithms within clinical computer systems to help improve patient care. Carol Coupland is associate professor of medical statistics at the University of Nottingham and a consultant statistician for ClinRisk Ltd. This work and any views expressed within it are solely those of the coauthors and not of any affiliated bodies or organisations.