A recently published study in The BMJ tested whether artificial intelligence (AI) could pass the Fellowship of the Royal College of Radiologists (FRCR) examination.
Radiologists in the United Kingdom (UK) must pass the FRCR exam before completing their training. If AI could pass the same test, the argument goes, it might be capable of replacing radiologists. The final FRCR exam consists of three components, and candidates need a passing grade in each component to pass the exam overall.
In the rapid reporting component, candidates must analyze and interpret 30 x-rays in 35 minutes, and at least 90% must be reported correctly to pass this part of the exam. This session therefore measures candidates on both accuracy and speed. AI is often expected to excel at fast, accurate, binary (normal versus abnormal) decisions on plain radiographs, which makes the rapid reporting component an ideal setting in which to test its capabilities.
Study: Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. Image credit: SquareMotion/Shutterstock
About the study
In the present study, researchers evaluated whether an AI candidate could pass the FRCR exam and whether it would outperform human radiologists taking the same exam. Because the Royal College of Radiologists (RCR) would not share cases from previous, now retired, FRCR exams, the authors built 10 sham FRCR exams for the analysis. The radiographs were selected to match the level of difficulty of an actual examination.
Each sham examination included 30 x-rays covering all body parts in adults and children; approximately half contained pathology and the remainder showed no abnormalities. Radiologists who had passed the FRCR exam within the previous 12 months (radiologist readers) were recruited through social media, word of mouth, and email.
Radiologist readers completed a short survey that collected information on demographics and previous FRCR exam attempts. Anonymized x-ray images in Digital Imaging and Communications in Medicine (DICOM) format were provided via an online image-viewing platform. The radiologists had one month (May 2022) to record their interpretations of the ten sham examinations on an online answer sheet.
The radiologists also rated 1) how representative the sham exams were of the actual FRCR exam, 2) their own performance, and 3) how well they thought the AI would have performed. The same 300 anonymized x-ray images were made available to the AI candidate, Smarturgences, developed by the French AI company Milvue.
The AI tool was not certified to analyze abdominal and axial skeleton x-rays; nevertheless, these x-rays were presented to it for fairness to the human participants. The AI tool's score was calculated in four ways. In the first scenario, only the radiographs the AI could interpret were evaluated, and the non-interpretable radiographs were excluded. In the second, third, and fourth scenarios, the non-interpretable radiographs were scored as if the AI had called them normal, called them abnormal, or answered incorrectly, respectively.
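As a rough illustration of how these four scoring rules differ, the sketch below scores a single 30-image mock exam under each scenario. It is a minimal sketch based on the description above, not the authors' code; the function names, data layout, and handling of the 90% pass mark are assumptions made for illustration.

```python
# Hypothetical sketch of the four scoring scenarios described above.
# Each case is a dict: {"truth": "normal"|"abnormal",
#                       "ai_call": "normal"|"abnormal"|None}
# where None means the AI could not interpret the radiograph
# (e.g. abdominal or axial skeleton x-rays).

def score_exam(cases, scenario):
    """Return the percentage of cases scored as correct under one scenario."""
    correct, total = 0, 0
    for case in cases:
        call = case["ai_call"]
        if call is None:                 # radiograph the AI cannot interpret
            if scenario == 1:
                continue                 # scenario 1: excluded from scoring
            elif scenario == 2:
                call = "normal"          # scenario 2: defaulted to "normal"
            elif scenario == 3:
                call = "abnormal"        # scenario 3: defaulted to "abnormal"
            else:
                total += 1               # scenario 4: counted and scored as wrong
                continue
        total += 1
        if call == case["truth"]:
            correct += 1
    return 100.0 * correct / total if total else 0.0

def passes(cases, scenario):
    """A mock exam is passed when at least 90% of cases are reported correctly."""
    return score_exam(cases, scenario) >= 90.0
```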
Results
A total of 26 radiologists, including 16 women, were recruited, and most participants were between 31 and 40 years old. Sixteen radiologists had completed their FRCR exam within the previous three months, and most participants had passed the FRCR exam on the first attempt. In the first scenario, the AI tool would have passed two mock exams; in the second scenario, it would have passed one.
In scenarios 3 and 4, the AI candidate would not have passed any of the exams. In scenario 1, the overall sensitivity, specificity, and accuracy of the AI were 83.6%, 75.2%, and 79.5%, respectively. For the radiologists, the summary estimates of sensitivity, specificity, and accuracy were 84.1%, 87.3%, and 84.8%, respectively. The AI was the best performing candidate in one exam, but second to last overall.
Under the strictest evaluation criteria, which best reflect the real examination (scenario 4), the overall sensitivity, specificity, and accuracy of the AI were 75.2%, 62.3%, and 68.7%, respectively. In comparison, the radiologists' summary estimates for sensitivity, specificity, and accuracy were 84%, 87.5%, and 85.2%, respectively.
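For reference, the sensitivity, specificity, and accuracy figures quoted above follow the standard definitions for binary (normal versus abnormal) reporting. The snippet below is an illustrative restatement of those formulas, not code or data from the study.

```python
# Standard definitions of the summary metrics quoted above (illustrative only).
# tp/tn/fp/fn = true positive, true negative, false positive, false negative counts,
# where "positive" means a radiograph reported as abnormal.

def sensitivity(tp: int, fn: int) -> float:
    """Proportion of abnormal radiographs correctly reported as abnormal."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Proportion of normal radiographs correctly reported as normal."""
    return tn / (tn + fp)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall proportion of radiographs reported correctly."""
    return (tp + tn) / (tp + tn + fp + fn)
```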
No radiologist passed all of the sham exams. The highest ranking radiologist passed nine sham exams, while the three lowest ranking radiologists passed only one. On average, radiologists passed four mock exams. The radiologists rated the mock exams as slightly more difficult than the actual FRCR exam. On a 10-point Likert scale, they rated their own performance between 5.8 and 7.0 and the AI's performance between 6.0 and 6.6.
The researchers say: “On this occasion, the artificial intelligence candidate could not pass any of the 10 mock exams when evaluated against similarly rigorous criteria as its human counterparts, but could pass two of the mock exams if a special exemption was made by the RCR to exclude images it had not been trained on.”
Of the 42 radiographs in the dataset that the AI could not interpret, it returned a result for one, incorrectly labeling a normal abdominal radiograph as showing a basal pneumothorax. Twenty x-rays were misdiagnosed by more than half of the radiologists; of these, the AI tool misdiagnosed ten but correctly interpreted the rest. In total, 148 x-rays were correctly evaluated by almost all radiologists, 134 of which were also correctly interpreted by the AI candidate.
Conclusions
In summary, the AI passed two mock exams when a special exemption was granted, namely the exclusion of images it could not interpret; without that exemption, it would not have passed any. Although the AI did not outperform the radiologists, its accuracy remained relatively high given the difficulty and case mix of the exams.
In addition, the AI achieved the best score in one mock exam and outperformed three radiologists. Notably, it correctly diagnosed half of the x-rays that most of its human counterparts misinterpreted. Nonetheless, the AI candidate needs further training to reach the performance of an average radiologist, particularly on the types of radiographs it currently cannot interpret.