Am J Otolaryngol Head Neck Surg | Volume 8, Issue 1 | Research Article | Open Access
Lise Sogalow*, Louis Victoor, Mohamad Khalife and Jérôme R. Lechien
Department of Surgery, UMONS Research Institute for Health Sciences and Technology, Mons, Belgium
Department of Otolaryngology-Head and Neck Surgery, EpiCURA Hospital, Baudour, Belgium
Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Paris Saclay University, Paris, France
Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium
Department of Otorhinolaryngology and Head and Neck Surgery, Elsan Poitiers Hospital, Poitiers, France
*Correspondence to: Lise Sogalow
Objectives: The accuracy of Large Language Models (LLMs) as adjunctive clinical tools has been investigated in numerous studies of real otolaryngology cases recruited from Western populations. The aim of this study was to evaluate the accuracy of five AI-powered LLMs available on smartphones in the management of real clinical cases in Sub-Saharan resource-limited settings.
Methods: Demographic and clinical data of patients consulting at Iten County Hospital for otolaryngological conditions were prospectively collected from December 2024 to January 2025. Anonymized data and primary clinical examination findings were entered into the APIs of ChatGPT-4o, Gemini-2.0-Flash, Claude-Sonnet-3.7, DeepSeek-R1, and Mistral-Large2 for clinical patient management. Two practitioners independently assessed LLM recommendations using the Artificial Intelligence Performance Instrument (AIPI). The Intraclass Correlation Coefficient (ICC) was used to measure interrater agreement.
Results: Sixty-three patients were included (41.3% female; mean age: 24.2 ± 25.2 years). ChatGPT-4o achieved the highest total AIPI score (12.1 ± 2.3; p=0.036), outperforming other LLMs across differential diagnosis (2.3 ± 0.5; p=0.004), management plan (p=0.003), and diagnosis (5.4 ± 1.1; p=0.037). ChatGPT-4o and Claude-Sonnet-3.7 achieved the highest scores for treatment plan compared to other LLMs (2.0 ± 0.7 and 2.0 ± 0.5, respectively; p=0.053). The performance of LLMs for differential diagnoses was low-to-moderate, with ChatGPT-4o having the highest rate of correct differential diagnoses (25.4%; p=0.008). Management responses were most accurate with Gemini-2.0-Flash (49.2%) and Claude-Sonnet-3.7 (39.7%; p=0.002). Interrater reliability was excellent, with an ICC > 0.8 for each LLM.
Conclusion: Large Language Models demonstrate promising clinical performance in Sub-Saharan otolaryngology settings affected by a shortage of otolaryngologists. ChatGPT-4o consistently outperformed other models across key diagnostic and management tasks.
Level of Evidence: IV.
Keywords: Large language models; ChatGPT; Claude; Sub-Saharan Africa; Otolaryngology; Artificial Intelligence Performance Instrument
Sogalow L, Victoor L, Khalife M, Lechien JR. Evaluating Five AI-Powered Language Models as Otolaryngology Clinical Support Tools in Rural Kenya. Am J Otolaryngol Head Neck Surg. 2025; 8(1): 1266.