The Future Potential of Examinations - A Personal Perspective
Russell Jones, Director of Education, Australian and New Zealand College of Anaesthetists, Melbourne
Purpose of Study
Examination results are used as an indicator of the underlying knowledge, skills and abilities of a candidate in, for example, anaesthesia, intensive care or pain medicine. An assumption is made that performance on the examination is a predictor of these underlying constructs. If an examination programme is valid, reliable, generalisable and practical then this is an appropriate assumption to make (1). An ideal hypothetical examination programme is an invaluable benchmark against which to compare the strengths and weaknesses of an existing examination programme. Accepting that such a programme is the ideal goal for the continued evolution of examinations within ANZCA, it would seem worthwhile to consider the current state of the College examinations relative to this goal, to identify the differences between contemporary reality and the desired ideal, and to determine what can, should and needs to be done to make the ideal become reality.
Methods
Examinations are typically evaluated by the extent of their strengths in each of four major characteristics: validity, reliability, generalisability and practicality.
Result
Essentially, validity is the extent to which an examination measures what it is supposed to measure; in the case of the ANZCA, JFICM and FPM examinations, this is the extent to which they actually measure candidate knowledge, skill and ability in anaesthesia, intensive care or pain medicine. There are numerous forms of validity. The ideal tool to establish content validity is to map examination content against the content specified in course curriculum documents. Such a process is called “blueprinting” and requires content to be compared against specific, well-defined learning objectives. Criterion validity, or more correctly criterion-related validity, is the extent to which examination marks or scores are related to one or more external measures (known as criteria). Concurrent validity and predictive validity are two subsets of criterion-related validity. Concurrent validity considers the strength of the relationship between performance on one examination (or part of an examination) and performance on another designed to measure the same underlying construct. An example is the strength of the correlation between a multiple-choice examination and a short-answer examination if both purport to assess anaesthesia. Theoretically it is also possible to calculate concurrent validity correlation coefficients between trainee results on the different sections of an examination and their results on In-Training Assessment (ITA). It will be possible to perform such calculations meaningfully on the new summative ITA processes once these have been implemented within the FANZCA Training Programme in 2008. Predictive validity can be calculated by establishing how well the performance of a candidate on an examination correlates with future performance. Construct validity is optimised when examinations focus on mainstream knowledge, skills and abilities, avoiding esoteric and extremely specialised aspects of anaesthesia.
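The concurrent validity coefficient described above is commonly computed as a Pearson correlation between paired candidate scores. The following is a minimal sketch only: the function name `pearson_r` and the two score lists are hypothetical illustrations, not College data or a College method.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of paired scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical marks for six candidates on two sections of one examination
mcq_scores = [62, 71, 55, 80, 68, 74]  # multiple-choice section (invented)
saq_scores = [58, 75, 50, 82, 65, 70]  # short-answer section (invented)

r = pearson_r(mcq_scores, saq_scores)
print(f"concurrent validity coefficient r = {r:.3f}")
```

A coefficient near 1 would suggest that the two sections rank candidates similarly; with so few candidates the estimate is very imprecise, so real analyses need larger cohorts and confidence intervals.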
Face validity is what an examination appears to assess at first glance. That is, what the untrained eye would suggest an examination measures “on the face of it”. Although it may be argued that face validity is not of equal importance to other forms of validity, face validity is essential in order that an examination be perceived as appropriate by examiners, candidates, the Australian Medical Council (the organisation responsible for accrediting ANZCA, JFICM and FPM as training, credentialing and accreditation institutions) and others interested in the examination process. Maintaining credibility of the credentialing examination amongst the public and medical communities is extremely important (2). Long a problem with many examination programmes, consequential validity (3), or the extent to which an examination programme has a desirable effect on learning, is becoming so important that the American Educational Research Association has recently adopted positive learning as a requirement for validity (4).
The ideal examination should be a learning experience for a candidate as well as a method of evaluation. Candidates will learn from all the information and experience they absorb before, during and after an examination. Candidates also communicate their perceived interpretations of this information and experience to other trainees. In this regard, accurate and timely feedback to candidates about their examination performance becomes extremely important. Unless a candidate receives a mark of 100 percent then it is clear they have not fully mastered all content within the examination. Providing comprehensive feedback to all candidates about their performance will ensure optimum use of the learning opportunities provided by the examinations. Furthermore, provision of feedback to candidates will aid in demystifying the examinations and enhance transparency.
Essentially, reliability is the extent to which an examination or examination process is consistent over time, on different occasions, with different candidates or using different questions. There are several different forms of reliability. Intra-rater reliability usually focuses upon the extent to which the same examiner awards consistent marks for candidates who provide similar responses. Inter-rater reliability is the extent to which different examiners award the same mark for the same answer. Differences between individual raters arise from a combination of varying expectations about acceptable and unacceptable levels of performance as well as differential focus upon various aspects of candidate performance. Inter-case reliability is the extent to which different cases presented to different candidates introduce variability into examination results. The ideal examination programme is not affected by the variety of available cases, or the degree of consistency with which patients present themselves and their symptoms to candidates.
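One widely used statistic for inter-rater reliability on categorical marks (for example pass/fail) is Cohen's kappa, which corrects the raw agreement between two examiners for the agreement expected by chance. The sketch below is illustrative only: the function name `cohens_kappa` and the examiner decisions are hypothetical, and the abstract does not state which statistic the College uses.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters marking the same answers categorically."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of marks")
    n = len(rater_a)
    # Observed proportion of answers on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical decisions by two examiners on the same eight answers
examiner_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
examiner_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(examiner_1, examiner_2)
print(f"kappa = {kappa:.3f}")
```

In this invented example the examiners agree on six of eight answers (0.75), yet kappa is lower than that because both examiners pass most candidates and so would often agree by chance alone.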
One of the fundamental purposes of most examinations is to provide information about candidate mastery of the underlying examination construct to enable generalisation from their performance on the examination to their likely performance on the underlying construct overall. Generalisability of a candidate’s performance on an examination to their likely real life performance related to anaesthesia, intensive care or pain medicine can be optimised by minimising threats to validity and reliability.
The practicality of designing, administering and marking an examination varies according to numerous factors including the specific content to be measured; the depth and breadth of the knowledge, skills and abilities to be assessed; number of candidates; number of questions; and available personnel, resources and funds. Another important practical consideration is the availability of all personnel (including examiners, timekeepers, administrative staff and data compilers/analysers) and of an adequate range of real, simulated and/or standardised patients.
Examinations also require significant physical resources including rooms, time keeping equipment, record keeping tools and computers as well as (possibly) mannequins and props. Logistical considerations are also important. These include the opportunity for examiners to communicate effectively when setting examinations; bringing examiners, candidates and, where appropriate, patients (i.e., all exam participants) together at the correct time and place; timing; security; marking; and record keeping.
When considering the structure, format, administration and marking of an examination programme it is important to first consider validity, reliability and generalisability in the design of a desired optimal programme and then to modify the hypothetical examinations according to practical constraints. If the converse process is applied, whereby practical considerations are given initial priority, then the resulting examination (although practical) will be far from optimal in the essential characteristics of validity, reliability and generalisability. As to the relative importance of each desirable examination characteristic, validity is the most important consideration within any examination programme (5), reliability sets an upper bound upon validity, and the opportunity to generalise candidate performance from an examination to performance within real world anaesthesia, intensive care or pain medicine is typically one of the principal objectives of an examination.
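The statement that reliability sets an upper bound upon validity can be written using the standard result from classical test theory. The symbols below are introduced here for illustration and do not appear in the original text: $r_{xy}$ is the criterion-related validity coefficient between examination score $x$ and criterion $y$, and $r_{xx'}$ and $r_{yy'}$ are the reliabilities of the examination and of the criterion measure.

```latex
% Validity coefficient bounded by the reliabilities of both measures
r_{xy} \le \sqrt{r_{xx'} \, r_{yy'}}

% Special case: a perfectly reliable criterion (r_{yy'} = 1)
r_{xy} \le \sqrt{r_{xx'}}
```

As a worked illustration with an assumed reliability of $r_{xx'} = 0.64$, the validity coefficient could be no greater than $\sqrt{0.64} = 0.80$ even against a perfectly reliable criterion.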
Other important considerations include speededness (the influence of candidate speed within an examination as a contributor to overall candidate success); criterion vs. norm-referenced approaches; examination formats (1); standard setting; and the training of examiners.
Conclusion
Comparison of a real examination programme with the “ideal” provides clear guidance on the direction in which examination programmes should evolve and the extent of the required evolution; offers a defining example of the eventual programme as the final goal; and, most importantly, identifies what needs to be achieved in order to make the ideal become the reality.
1. Jones RW. Medical specialist examinations: item format types and minimising error. Anaesthesia and Intensive Care 2007; 35: 80-85.
2. Norcini JJ. Examining the examinations for licensure and certification in medicine. JAMA 1994; 272: 713-714.
3. Messick S. The interplay of evidence and consequences in the validation of performance assessments. Educational Res 1995; 23: 13-23.
4. American Educational Research Association. Standards for educational and psychological testing. Author, Washington DC 1999.
5. Brown. Principles of educational and psychological testing. Holt, Rinehart and Winston, New York 1983.

