TY - JOUR
T1 - ChatGPT outscored human candidates in a virtual objective structured clinical examination in obstetrics and gynecology
AU - National University of Singapore Obstetrics and Gynecology Artificial Intelligence (NUS OBGYN-AI) Collaborative Group
AU - Li, Sarah W.
AU - Kemp, Matthew W.
AU - Logan, Susan J.S.
AU - Dimri, Pooja Sharma
AU - Singh, Navkaran
AU - Mattar, Citra N.Z.
AU - Dashraath, Pradip
AU - Ramlal, Harshaana
AU - Mahyuddin, Aniza P.
AU - Kanayan, Suren
AU - Carter, Sean W.D.
AU - Thain, Serene P.T.
AU - Fee, Erin L.
AU - Illanes, Sebastian E.
AU - Choolani, Mahesh A.
AU - Rauff, Mary
AU - Biswas, Arijit
AU - Low, Jeffrey J.H.
AU - Ng, Joseph S.
AU - Gosavi, Arundhati T.
AU - Huang, Zhongwei
AU - Rajendran, Preethi
AU - Ng, Kwok Weng Roy
AU - Wong, Jeslyn J.L.
AU - Chan, Grace M.F.
AU - Lee, Kelvin Zhi Xing
AU - Yap, Jeannie J.Y.
AU - Ahmad, Nurulhuda
AU - Shanmugam, Shwetha
AU - Kathirvel, Rajeswari
AU - Mathur, Manisha
AU - Lim, Whui Whui
AU - Lim, Min Yu
N1 - Funding Information:
We acknowledge the Obstetrical and Gynaecological Society of Singapore for providing the mock objective structured clinical examination questions and anonymized historical scores. In addition, we acknowledge the Royal College of Obstetricians and Gynaecologists examinations department for providing the results of the Member of the Royal College of Obstetricians and Gynaecologists Part 3 examination for Singapore candidates over the past 2 years. Lastly, we sincerely appreciate the efforts of all who volunteered as candidates and examiners in this study. The members of the NUS OBGYN-AI Collaborative Group are as follows:
• Department of Obstetrics and Gynaecology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore: Mary Rauff, FRCOG; Arijit Biswas, FAMS; Jeffrey J.H. Low, FAMS; Joseph S. Ng, FAMS; Arundhati T. Gosavi, FRCOG; Zhongwei Huang, PhD; and Preethi Rajendran, BSc
• Department of Obstetrics and Gynaecology, National University Hospital, Singapore, Singapore: Kwok Weng Roy Ng, FAMS; Arundhati T. Gosavi, FRCOG; Jeslyn J.L. Wong, MMED; Grace M.F. Chan, MBBS; Kelvin Zhi Xing Lee, MD; Jeannie J.Y. Yap, MBChB; Nurulhuda Ahmad, MBBS; and Shwetha Shanmugam, MRCOG
• Department of Obstetrics and Gynaecology, KK Women's and Children's Hospital, Singapore, Singapore: Rajeswari Kathirvel, FRCOG, and Manisha Mathur, FRCOG
• Bia-Echo Asia Centre for Reproductive Longevity and Equality, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore: Zhongwei Huang, PhD
• Department of Obstetrics and Gynaecology, Academia, Singapore General Hospital, Singapore, Singapore: Whui Whui Lim, MRCOG
• Obstetrical and Gynaecological Society of Singapore (OGSS), Singapore, Singapore, and Singapore International Representative Committee, Royal College of Obstetricians and Gynaecologists: Min Yu Lim, FRCOG
This study received no financial support.
Publisher Copyright:
© 2023 The Author(s)
PY - 2023/8
Y1 - 2023/8
AB - Background: Natural language processing is a form of artificial intelligence that allows human users to interface with a machine without using complex code. The ability of natural language processing systems, such as ChatGPT, to successfully engage with healthcare systems requiring fluid reasoning, specialist data interpretation, and empathetic communication in an unfamiliar and evolving environment is poorly studied. This study investigated whether the ChatGPT interface could engage with and complete a mock objective structured clinical examination simulating assessment for membership of the Royal College of Obstetricians and Gynaecologists. Objective: This study aimed to determine whether ChatGPT, without additional training, would achieve a score at least equivalent to that achieved by human candidates who sat for virtual objective structured clinical examinations in Singapore. Study Design: This study was conducted in 2 phases. In the first phase, a total of 7 structured discussion questions were selected from 2 historical cohorts (cohorts A and B) of objective structured clinical examination questions. ChatGPT was examined using these questions, and its responses were recorded in a script. Of note, 2 human candidates (acting as anonymizers) were examined on the same questions using videoconferencing, and their responses were transcribed verbatim into written scripts. The 3 sets of response scripts were mixed, and each set was allocated to 1 of 3 human actors. In the second phase, the actors presented these scripts to examiners in response to the same examination questions. These responses were blind scored by 14 qualified examiners. ChatGPT scores were then unblinded and compared with historical human candidate performance scores. Results: The average score given to ChatGPT by the 14 examiners was 77.2%. The average historical human score (n=26 candidates) was 73.7%. ChatGPT demonstrated sizable performance improvements over the average human candidate in several subject domains. The median time taken for ChatGPT to complete each station was 2.54 minutes, well within the 10 minutes allowed. Conclusion: ChatGPT generated factually accurate and contextually relevant structured discussion answers to complex and evolving clinical questions based on unfamiliar settings within a very short period. ChatGPT outperformed human candidates in several knowledge areas. Not all examiners were able to distinguish between human and ChatGPT responses. Our data highlight the emergent ability of natural language processing models to demonstrate fluid reasoning in unfamiliar environments and to compete successfully with human candidates who have undergone extensive specialist training.
KW - Chat Generative Pre-trained Transformer
KW - artificial intelligence
KW - objective structured clinical examination
KW - obstetrics and gynecology
KW - postgraduate specialty training
KW - reasoning
UR - http://www.scopus.com/inward/record.url?scp=85160042351&partnerID=8YFLogxK
U2 - 10.1016/j.ajog.2023.04.020
DO - 10.1016/j.ajog.2023.04.020
M3 - Article
C2 - 37088277
AN - SCOPUS:85160042351
SN - 0002-9378
VL - 229
SP - 172.e1
EP - 172.e12
JO - American Journal of Obstetrics and Gynecology
JF - American Journal of Obstetrics and Gynecology
IS - 2
ER -