Department of Psychiatry, University of Birmingham, Queen Elizabeth Psychiatric Hospital, Birmingham B15 2QZ, email: femi.oyebode{at}bsmht.nhs.uk
Birmingham and Solihull Mental Health NHS Trust
Greater Glasgow Primary Care NHS Trust
Department of Psychiatry, University of Birmingham, Queen Elizabeth Psychiatric Hospital, Birmingham
F.O. was Chief Examiner, Royal College of Psychiatrists 2002-2005.
The aim of the study was to investigate the interrater reliability of the clinical component of the MRCPsych part II examinations, namely the individual patient assessment and the patient management problems. In the study period, there were 1546 candidates and 773 pairs of examiners. Kappa scores for pairs of examiners in both these assessments were calculated.
RESULTS
The kappa scores for exact numerical agreement between the pairs of examiners in both individual patient assessment and patient management problems were only moderate (0.4-0.5). However, the kappa scores for agreement between pairs of examiners for the reclassified pass and fail categories were very good (0.8).
CLINICAL IMPLICATIONS
The poor reliability of the traditional long case and oral examinations in general is one of the most potent arguments against their use. Our finding suggests that the College clinical examinations are at least not problematic from this point of view, particularly if global pass or fail judgements rather than discrete scores are applied.
The MRCPsych part II clinical examinations currently comprise the individual patient assessment (IPA), also known as the traditional long case, and the patient management problems (PMP), which is also known as the structured oral examination. The traditional long case has been established for over 150 years as a method of examining clinical skills in medicine. In recent years its value as an assessment method has come under great scrutiny. The strengths of the traditional long case include its obvious face validity, as it evaluates the performance of a doctor in an encounter with real patients whereby information is gathered and treatment plans are developed under realistic conditions. The task for the candidate is to take a history, to structure the clinical problem, synthesise the findings and formulate an appropriate management plan. For many clinicians these skills are fundamental to the practice of medicine and the authenticity of the challenge for the candidate is an intuitively correct method of assessing clinical competence. Despite these obvious strengths, the traditional long case has inherent problems. The clinical challenges posed to candidates in the long case are not identical, equal or even similar in complexity. Furthermore, it is assumed that performance on one particular type of case is predictive of performance on other types of cases, when most clinicians know that they do not necessarily perform uniformly across all patient problem types. In addition, there is concern that examiner behaviour is not reliable (the problems of interrater and intra-rater reliability; Norcini, 2002). In short, there is a conflict between validity and reliability.
The problems that arise with the individual patient assessment also pertain to the clinical oral examination, the patient management problems. In unstructured viva voce examinations candidates are liable to be asked whatever questions the examiner chooses and there is the risk that the examiner will concentrate on their pet interests. Furthermore, there is evidence that structured viva voce examinations are more reliable than unstructured examinations (Tutton & Glasgow, 2005) and the College has made the necessary changes to accommodate this. The concerns about clinical oral examinations are also pertinent to PhD vivas (Jackson & Tinkler, 2001; Morley et al, 2002) and to job selection interviews (Wiesner & Cronshaw, 1988; McDaniel et al, 1994).
In this study we investigate the level of agreement between the two examiners in both component parts of the MRCPsych clinical examinations. This level of agreement is a measure of interrater reliability.
Data were analysed using SPSS version 12 for Windows. In this study the kappa statistic (κ) was used as the measure of strength of agreement between pairs of examiners. The generally accepted standards of strength of agreement for κ are: <0 poor, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect (Landis & Koch, 1977).
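As an illustration of the statistic described above, the following is a minimal sketch of how κ could be computed for one pair of examiners, assuming the unweighted form of Cohen's kappa. It is not the authors' SPSS procedure. The function names and the marks are hypothetical, and the marking scale is invented for the example.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same candidates."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of candidates")
    n = len(rater_a)
    # Observed agreement: proportion of candidates given the identical mark.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over every mark category that either rater used.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = count_a.keys() | count_b.keys()
    p_chance = sum(count_a[c] * count_b[c] for c in categories) / n**2
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1 - p_chance)


def landis_koch_label(kappa):
    """Strength-of-agreement label using the Landis & Koch (1977) bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label
    return "almost perfect"


# Hypothetical marks from two examiners for ten candidates (invented data).
examiner_1 = [3, 4, 5, 5, 6, 6, 7, 4, 5, 6]
examiner_2 = [3, 5, 5, 4, 6, 6, 7, 4, 6, 6]

kappa = cohen_kappa(examiner_1, examiner_2)
print(f"kappa = {kappa:.3f} ({landis_koch_label(kappa)})")
```

Because κ subtracts the agreement expected by chance, it is always lower than the raw proportion of identical marks unless the two examiners agree perfectly.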
The κ for exact agreement between the pairs of examiners for the individual patient assessment for all scores was 0.513 (P < 0.0001), for the failing candidates only it was 0.462 (P < 0.0001) and for the passing candidates it was 0.485 (P < 0.0001). When the scores were reclassified into pass and fail categories, the κ for agreement was 0.794 (P < 0.0001). The κ for exact agreement between the pairs of examiners for the patient management problems for all scores was 0.515 (P < 0.0001), for failing candidates only it was 0.562 (P < 0.0001) and for the passing candidates it was 0.475 (P < 0.0001). The κ for agreement between examiners in the patient management problems when the scores were reclassified into pass and fail categories was 0.802 (P < 0.0001).
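The contrast between exact agreement and agreement on the reclassified pass and fail categories can be illustrated with a small sketch. The agreement table below is invented for the example and is not the study data. The four mark categories, the position of the pass mark and the function names are all assumptions.

```python
def kappa_from_table(table):
    """Cohen's kappa from a square agreement table.

    Rows are the first examiner's categories, columns the second examiner's.
    """
    size = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement is the share of candidates on the diagonal.
    p_observed = sum(table[i][i] for i in range(size)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]
    # Chance agreement from the product of the marginal totals.
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (p_observed - p_chance) / (1 - p_chance)


def collapse_to_pass_fail(table, first_pass_index):
    """Merge mark categories into a 2x2 table: index 0 is fail, index 1 is pass."""
    collapsed = [[0, 0], [0, 0]]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            collapsed[int(i >= first_pass_index)][int(j >= first_pass_index)] += count
    return collapsed


# Hypothetical counts for 100 candidates across four invented mark categories:
# clear fail, borderline fail, pass, good pass.
marks_table = [
    [10, 6, 1, 0],
    [5, 14, 2, 0],
    [1, 3, 30, 9],
    [0, 0, 8, 11],
]

exact_kappa = kappa_from_table(marks_table)
pass_fail_table = collapse_to_pass_fail(marks_table, first_pass_index=2)
pass_fail_kappa = kappa_from_table(pass_fail_table)
print(f"exact marks: {exact_kappa:.3f}, pass/fail: {pass_fail_kappa:.3f}")
```

In this invented table most disagreements are between neighbouring marks on the same side of the pass mark, so they disappear once the marks are merged and κ rises.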
The poor reliability of the traditional long case and oral examinations in general is one of the most potent arguments against their use. Our finding suggests that the College clinical examinations are at least not problematic from this point of view. Norcini (2002) has argued that there are at least three ways to improve the reproducibility of scores awarded by examiners in the traditional long case: employing a statistical model to remove differences among them; training examiners; or increasing the number of examiners. It is likely that the close agreement between examiners in the College examinations is due to the training provided. New examiners receive initial training before examining and all examiners are required to attend the annual board of examiners meeting where refresher training takes place. However, Norcini (2002) argues that any improvements in reproducibility of scores, that is, in the reliability of the traditional long case, will only be modest and that the largest effect is likely to be due to increasing the number of examiners.
Several authors have proposed modifications to the long case to make it fit for purpose. These discussions have developed because of anxiety that the objective structured clinical examination (OSCE) assesses breadth of skill but at the expense of depth (Wass & van der Vleuten, 2004). The OSCE is able to assess many competencies but, because of the format, very limited time is available for the assessment of each of them. In the MRCPsych part I OSCE there are currently twelve 7-min stations. This illustrates the problem well: there is extensive coverage of novel clinical areas such as information-giving to patients, carers, doctors and other healthcare professionals, yet the actual time allocated to the evaluation of these competencies is arguably limited. Furthermore, in psychiatry, there is the risk that the OSCE promotes a disjointed acquisition of clinical skills and that the capacity to integrate a case in all its fullness may be lost in the process. In real-life situations, the discrete clinical competencies are deployed in order to serve the interest of an individual patient, and to do this satisfactorily the various aspects of the case need to be integrated into a meaningful whole. The proposals to improve the intercase reliability of the long case include:
McKinley et al (2005) suggest an innovative solution to the problem of increasing the number of patient encounters; they advocate sequential testing. This involves all students being directly observed in four consultations by a different pair of examiners for each case. Each consultation lasts 30 min. Those considered to be unlikely to fail are excused further testing; the rest, approximately a quarter of the class, are observed consulting with four more patients by another four pairs of examiners. In this system, failing candidates are examined on eight cases by eight pairs of examiners.
These proposals are resource intensive and probably impractical. In the current MRCPsych part II, the time required for each candidate to examine ten patients will amount to at least 10 h. In an examination dealing with approximately 1000 candidates annually, this will be an impossible task. The same problem pertains to any extension in the number of examiners or the introduction of wholly observed clinical examinations. The issue is whether these proposals will produce significant or merely marginal improvements, and ultimately whether they will be cost-effective.
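The scale of the resource problem can be made concrete with a rough calculation. The sketch below counts candidate contact hours only and uses the figures quoted above (ten patients at a minimum of 1 h each, approximately 1000 candidates, and four 30-min consultations followed by four more for about a quarter of candidates). Applying the sequential design of McKinley et al (2005) to a cohort of this size is an assumption made for illustration, and the function names are hypothetical.

```python
def fixed_case_hours(candidates, cases=10, hours_per_case=1.0):
    """Contact hours if every candidate examines the same number of patients."""
    return candidates * cases * hours_per_case


def sequential_testing_hours(candidates, first_round_cases=4, second_round_cases=4,
                             retest_fraction=0.25, minutes_per_case=30):
    """Contact hours if only a fraction of candidates proceed to a second round."""
    first_round = candidates * first_round_cases * minutes_per_case / 60
    second_round = candidates * retest_fraction * second_round_cases * minutes_per_case / 60
    return first_round + second_round


annual_candidates = 1000  # approximate figure quoted in the text

print("ten long cases each:", fixed_case_hours(annual_candidates), "h")
print("sequential testing: ", sequential_testing_hours(annual_candidates), "h")
```

These totals exclude examiner time. Every consultation in the sequential design is observed by a pair of examiners, so the examiner hours would be double the contact hours before patients, venues and administration are counted.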
There is evidence that assessment is returning full circle, from the desire to create methods that rely on standardised and objectified tasks in a controlled environment to the assessment of candidates in the real world of patients and the workplace (van der Vleuten & Schuwirth, 2005). The concern about the variance introduced by real patients and the emphasis on the desirability of standardised patients have lessened with the use of the Mini-Clinical Evaluation Exercise (mini-CEX) in workplace-based assessments, with limited observations of candidate encounters with real patients (Norcini et al, 1995). However, it is doubtful that the mini-CEX can be successfully applied to psychiatry without modification. What is now also clear is that the reliability of clinical examinations is not dependent solely on objectification or standardisation, but also on careful sampling across clinical content domains, which needs substantial hours of testing time (Petrusa, 2002). The reliability estimates for the long case, depending on hours of testing, are reported as 0.60 for 1 h, 0.75 for 2 h, 0.86 for 4 h and 0.90 for 8 h. These estimates are comparable with those for multiple choice question papers, oral examinations and the OSCE (van der Vleuten & Schuwirth, 2005).
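The relationship between testing time and reliability quoted above can be explored with the Spearman-Brown prediction formula. The text does not say how the reported estimates were derived, so using this formula is an assumption made here for illustration. Taking the 1 h estimate of 0.60 as the starting point, the formula matches the reported 2 h and 4 h values after rounding and gives roughly 0.92 at 8 h, slightly above the reported 0.90. The function names are hypothetical.

```python
def spearman_brown(unit_reliability, length_factor):
    """Predicted reliability when testing time is multiplied by length_factor."""
    if not 0 < unit_reliability < 1:
        raise ValueError("unit_reliability must lie strictly between 0 and 1")
    return length_factor * unit_reliability / (1 + (length_factor - 1) * unit_reliability)


def length_factor_needed(unit_reliability, target_reliability):
    """Multiple of the base testing time needed to reach target_reliability."""
    if not (0 < unit_reliability < 1 and 0 < target_reliability < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (target_reliability * (1 - unit_reliability)) / (
        unit_reliability * (1 - target_reliability)
    )


# Long case estimates quoted in the text, keyed by hours of testing.
reported = {1: 0.60, 2: 0.75, 4: 0.86, 8: 0.90}

for hours, reported_value in reported.items():
    predicted = spearman_brown(reported[1], hours)
    print(f"{hours} h: reported {reported_value:.2f}, Spearman-Brown {predicted:.2f}")

print(f"hours for 0.80: {length_factor_needed(reported[1], 0.80):.2f}")
```

On this assumption, each doubling of testing time buys a smaller gain in reliability than the last, which is consistent with the argument that substantial hours of testing are needed whatever the format.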
It is clear that the proposals to improve the traditional long case are unlikely to be efficient or cheap. However, the energy going into the process suggests an awakening to the potential risks of relying merely on tests of competence such as the OSCE. At present it is uncertain how far workplace-based assessments of clinical performance using instruments such as the mini-CEX can adequately replace the traditional long case and oral examination. Our findings show that there is a good measure of agreement between pairs of examiners in these examinations, particularly for global pass or fail judgements. In this transitional period, as assessments of clinical competence and performance evolve, whatever programme of assessments is developed and adopted, the value of the traditional long case and structured oral examination needs to be carefully considered. It is probably true to say that the unique contribution of the long case in particular is unlikely to be surpassed by simulated patients or standardised and objectified assessments.