ChatGPT-3 and ChatGPT-4, OpenAI’s language processing models, flunked the 2021 and 2022 American College of Gastroenterology Self-Assessment Tests, according to a study published earlier this week in The American Journal of Gastroenterology.
ChatGPT is a large language model that generates human-like text in response to users’ questions or statements.
Researchers at The Feinstein Institutes for Medical Research asked the two versions of ChatGPT to answer questions on the tests to evaluate their abilities and accuracy.
Each test includes 300 multiple-choice questions. Researchers copied and pasted each question and its answer choices into the AI-powered platform, excluding questions that required images.
Each version of ChatGPT answered the same 455 questions: ChatGPT-3 answered 296 of the 455 correctly, while ChatGPT-4 answered 284 correctly.
To pass the test, individuals must score 70% or higher. ChatGPT-3 scored 65.1%, and ChatGPT-4 scored 62.4%.
The self-assessment test is used to determine how an individual would score on the American Board of Internal Medicine Gastroenterology board exam.
“Recently, there has been a lot of attention on ChatGPT and the use of AI across various industries. When it comes to medical education, there is a lack of research around this potential ground-breaking tool,” Dr. Arvind Trindade, associate professor at the Feinstein Institutes’ Institute of Health System Science and senior author on the paper, said in a statement. “Based on our research, ChatGPT should not be used for medical education in gastroenterology at this time and has a ways to go before it should be implemented into the healthcare field.”
WHY IT MATTERS
The study’s researchers noted that ChatGPT’s failing grades could be due to a lack of access to paid medical journals or outdated information within its system, and that more research is needed before it can be used reliably.
Still, a study published in PLOS Digital Health in February reported different results: researchers tested ChatGPT's performance on the United States Medical Licensing Exam, which consists of three exams. The AI tool passed or came close to the passing threshold for all three exams and showed a high level of insight in its explanations.
ChatGPT also provided “largely appropriate” responses to questions about cardiovascular disease prevention, according to a research letter published in JAMA.
Researchers assembled 25 questions about fundamental concepts for preventing heart disease, including risk factor counseling, test results and medication information, and posed them to the AI chatbot. Clinicians rated each response as appropriate, inappropriate or unreliable, and found that 21 of the 25 responses were appropriate while four were graded inappropriate.