‘The novelties of one generation are only the resuscitated fashions of the generation before.’
George Bernard Shaw. From the preface to Three Plays for Puritans.
This quotation aptly reflects the tensions in the pursuit of a ‘Holy Grail’ ideal assessment. In the early 20th century the goal was integration. Flexner, the late 19th century American educationalist, held the firm belief that assessment must focus on a student's ability to assess in full ‘a concrete case to collect all the relevant data and to suggest the positive procedures applicable to the conditions disclosed’.1 Long cases and oral presentations were in favour. Subsequently, the logistics of ensuring fair and equitable challenge across cases and during unstructured vivas led to an increasing focus on more objective testing methodologies (some believe at the cost of being too reductionist), such as multiple choice questions (MCQs) and objective structured clinical examinations (OSCEs), using simulation rather than reality. A century later, the international focus is on the need to assess doctors' performance, highlighted by Miller's now famous pyramid,2 within the context of their work (that is, what a doctor ‘does’). The aim is to ensure that more formative processes support the individual's needs during training. Fashion has swung almost full circle.3 We are back to searching for more integrated approaches, using real patients.
Yet much of this change has lacked evidence. We know little of the psychometrics of the original long case and orals.4 The logistics of work-based assessment are still to be explored. Two contrasting papers in this month's Journal are to be welcomed. Simpson and Ballard have investigated the content validity of the more traditional oral format used in the Membership of the Royal College of General Practitioners (MRCGP) examination.5 Swanwick and Chana6 reflect on workplace assessment as it moves forwards to take an increasing role in the licensing of doctors. There is no ideal assessment method. All are flawed in some way. The current trend is to design innovative assessment packages to drive educational agendas, acknowledging the impact this has on learning. We must be realistic and remain open to the potential application of all available tools, and ensure that the assessment programme, when completed, has addressed the ultimate goals of training.7
The overriding principle underpinning any measurement of professional performance is that of ‘content specificity’; that is, the need to assess across a broad range of contexts. An individual's performance in one situation, such as the assessment of a diabetic patient, will not necessarily mirror their performance in another, such as the assessment of an asthmatic patient. It makes sense. We have all trained and worked in different ways. It is becoming increasingly apparent that knowledge, which is inevitably stronger in some areas than others, underpins much of clinical practice. Traditionally, medical training objectives have been subdivided into knowledge, skills and attitudes. Yet Flexner was right: the three are, in reality, intertwined. We must resist making assessment too atomistic. Attempts in OSCEs to assess communication skills as a separate item illustrate this. As a very frustrated surgeon once said in an undergraduate OSCE committee, ‘it is no good giving full marks to a student for “communicating bad news” if in actuality they've given the wrong news’. To judge professional competency, assessment has to be carefully designed to cover adequate content.
In the 20th century, when the need to address content specificity became apparent, there was an international divergence in trends. The US was quick to abandon long cases and orals, favouring knowledge tests, which covered broad content and were reliable and legally defensible. Elsewhere, traditional methods have only gradually moved towards the more objective models. Ironically, these changes were made with little research into the psychometrics of the original traditional tests. The reality is now dawning that it does not matter which assessment method you use, be it orals, MCQs, long cases or OSCEs, provided that the test time is long enough to ensure sufficient content is examined.8 The limiting factor is the feasibility of the test. Eight to 10 long cases could be as reliable as a 3-hour OSCE, but relatively impractical to deliver. We can learn from this original divergence. There is increasing awareness of the need for research into the validity, reliability and impact of assessment methodologies and educational developments.9
We must not lose any opportunity to produce an evidence base to support change. Internationally, there is a big drive towards new assessment tools as we strive to measure performance and improve our tests of competency. In the UK, this has been accelerated by rapidly changing frameworks for postgraduate training, such as the Department of Health's proposals for Modernising Medical Careers (MMC), where the emphasis is now placed on competency-based curricula and the establishment of clear standards by the Postgraduate Medical Education Training Board (PMETB). These changes aim to provide assessment packages that enable trainees to demonstrate competencies as and when they learn them, and to avoid examination structures that unnecessarily delay progress. Some of the change is to be welcomed: in particular, more formative assessments, robust standards and lay involvement in the quality assurance of the process.
We face rapid change that is not underpinned by clear research evidence and is overshadowed by concerns that it is driven by politics rather than educational rationale. Caution is justified. Demonstration of competence may not equate with competency acquired through experience.10 We are at risk of losing the appraisal of higher skills learnt through experience if assessment becomes a tick list of ‘can do’, be it in the workplace or in an examination. It is difficult to see how these new assessments will address the all-important issue of content specificity. Trainee-led completion of competencies in the workplace will need careful planning to ensure that trainees are tested across a suitable range of clinical contexts. This may not be feasible, given the current pressures on clinical service delivery in both secondary and primary care. Trainers will be asked not only to teach and appraise but also to judge in the workplace. There are significant tensions between these roles.11
The new tools under development for the workplace are essentially based on old formats. Four methods are proposed for the new MMC Foundation programme, the first of which is the mini-CEX – a format modified from an observed long case by John Norcini in the US12 that takes ‘snapshots’ of the integrated assessment, focusing (among other things) on observation of history taking or examination, but not on the entire process. The other methods are: case-based discussions grounded on good oral techniques; direct observation of procedural skills (DOPS) (which are a type of OSCE in the work environment); and a mini peer assessment tool modelled on 360° appraisal. The MRCGP also faces radical change to accommodate the new focus on workplace assessment. Thus, the skills of experienced examiners trained to assess videos or conduct oral examinations remain crucial but within an entirely new and challenging framework. As Tom Stoppard observed in his play Indian Ink, ‘If an idea's worth having once, it's worth having twice’. This time round, however, we must ensure these methods are adequately appraised for validity as well as reliability.
Simpson and Ballard set a good example.5 Their study highlights how tests do not always measure what they set out to measure; in this case, decision making in the orals. Swanwick and Chana raise interesting issues on the high validity of evidential, locally-based assessment, suggesting ways of enhancing its reliability.6 If we believe the published literature,11 the task in hand is challenging.
The danger of current change is the impetus with which it is taking place. The new approach may not lead us to the Holy Grail of a test that accurately predicts future unobserved practice; but we do face a major change in philosophy. Assessment programmes designed to ensure education is efficiently and appropriately delivered are replacing examinations that are known to be reliable but are at times lacking in validity. We need well-designed research to support or refute this new rationale.
© British Journal of General Practice, 2005.