ESTRO 2025 - Abstract Book

S431

Clinical - Breast


409

Digital Poster

Evaluating large language models as decision tools for early breast cancer treatment: comparison of AI suggestions and expert recommendations

Loic Ah-thiane 1, Pierre-Etienne Heudel 2, Mario Campone 3, Marie Robert 3, Victoire Brillaud-Meflah 4, Caroline Rousseau 5, Magali Leblanc-Onfroy 1, Florine Tomaszewski 1, Stéphane Supiot 1, Tanguy Perennec 1, Augustin Mervoyer 1, Jean-Sébastien Frenel 3

1 Radiotherapy, Western cancer Institute, Nantes, France.
2 Medical oncology, Leon Berard Center, Lyon, France.
3 Medical oncology, Western cancer Institute, Nantes, France.
4 Surgery, Western cancer Institute, Nantes, France.
5 Nuclear Medicine, Western cancer Institute, Nantes, France.

Purpose/Objective: Multidisciplinary team meetings (MDTs) are essential but resource-intensive in oncology care. Large language models (LLMs) show promise in medical decision support, but their accuracy in oncology treatment planning remains understudied. This study aimed to assess the accuracy of three leading LLMs in generating treatment recommendations for early breast cancer patients and to compare them with expert MDT decisions.

Material/Methods: We conducted a retrospective analysis of 112 anonymized breast cancer cases presented at MDTs between January and April 2024. Three LLMs (Claude3-Opus, GPT4-Turbo, and LLaMa3-70B) were evaluated for their ability to suggest appropriate treatment options for each clinical case. An automated pipeline supplied the medical records to the LLMs and collected their suggestions (see Figure 1). The primary outcome was the rate of appropriate suggestions, using the expert MDT decisions as the reference. Secondary outcomes were performance metrics (F1-score and specificity) for individual treatment modalities.
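The outcome measures named above can be sketched in a few lines of Python. This is a minimal illustration of the standard definitions of F1-score, specificity, and a simple proportion, not the authors' analysis code. The names `ConfusionCounts`, `f1_score`, `specificity`, and `appropriate_rate` are hypothetical, and so are the example counts. The sketch also assumes that each treatment modality is scored as a binary decision against the MDT recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for one treatment modality, with the MDT decision as reference."""
    tp: int  # LLM suggested the modality and the MDT recommended it
    fp: int  # LLM suggested the modality but the MDT did not recommend it
    fn: int  # LLM omitted the modality although the MDT recommended it
    tn: int  # neither the LLM nor the MDT selected the modality


def f1_score(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def specificity(c: ConfusionCounts) -> float:
    """True negative rate: TN / (TN + FP)."""
    denom = c.tn + c.fp
    return c.tn / denom if denom else 0.0


def appropriate_rate(n_appropriate: int, n_cases: int) -> float:
    """Percentage of cases in which the suggestion was judged appropriate."""
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    return 100.0 * n_appropriate / n_cases


# Hypothetical counts for illustration only (not taken from the study).
example = ConfusionCounts(tp=40, fp=5, fn=3, tn=64)
print(f"F1-score:    {f1_score(example):.3f}")
print(f"Specificity: {specificity(example):.3f}")
print(f"Rate:        {appropriate_rate(90, 112):.1f}%")
```

Specificity complements the F1-score here because F1 ignores true negatives, so it says nothing about how often a model correctly withholds a modality the MDT did not recommend.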

Results: The rates of appropriate suggestions were 86.6% (97/112), 85.7% (96/112), and 75.0% (84/112) for Claude3-Opus, GPT4-Turbo, and LLaMa3-70B, respectively. No significant difference was found between Claude3-Opus and GPT4-Turbo (p=0.85), but both tended to perform better than LLaMa3-70B (p=0.027 and p=0.043, respectively) (see Figure 2). All models achieved perfect accuracy (F1-score=1.0) for endocrine therapy and anti-HER2 targeted therapy recommendations. Performance varied for other modalities: adjuvant chemotherapy (F1-scores: 0.86-0.92), radiotherapy (F1-scores: 0.83-0.94), and genomic testing recommendations. Notable limitations included
