Interdisciplinary – Education in radiation oncology
The RAG-LLM pipeline was evaluated across three question categories: 51 multiple-choice questions from American Board of Radiology (ABR) exams, 73 unique topics for which the relevant AAPM Task Group (TG) report number had to be identified, and 19 long-answer questions. To assess the impact of RAG, LLaMA-3.1-70B's performance was compared with and without RAG across these categories. Additionally, several models and temperature settings were tested on the same questions.

Results: On multiple-choice questions, RAG reduced LLaMA-3.1-70B's accuracy from 58% to 45.1%. However, it improved the model's accuracy in identifying the correct TG report number from 28.77% to 52.05%, and its performance on long-answer questions improved from 46.67% to 56.57% (Figure 1). When models were compared on multiple-choice questions, the accuracy of the large models ranged from 44% (Gemini 1.5 Flash) to 54.9% (GPT-4o-mini). The smallest model (LLaMA-3.2-3B) scored only 7.84% and frequently responded that the provided context lacked sufficient information. Lowering the model temperature improved scores by up to 5% (LLaMA-3.1-70B). Finally, for identifying report numbers, accuracy ranged from 39.73% (GPT-4o-mini) to 60.27% (LLaMA-3.1-70B), and performance on long-answer questions ranged from 30% (LLaMA-3.2-3B) to 60% (LLaMA-3.1-70B) (Figure 2).
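The abstract does not specify which retriever or embedding model the pipeline used, so the following is only a minimal sketch of the with/without-RAG comparison it describes: retrieve TG-report passages relevant to a question, prepend them to the prompt, and otherwise leave the prompt bare. TF-IDF retrieval stands in for whatever retriever the authors used, the chunk texts are hypothetical placeholders, and call_llm is a stub for any chat-completion client; none of these are the authors' implementation.

```python
# Sketch of a RAG step over AAPM TG report text (assumptions noted above).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pre-chunked passages from AAPM TG reports.
chunks = [
    "TG-51: protocol for clinical reference dosimetry of photon beams ...",
    "TG-142: quality assurance of medical linear accelerators ...",
    "TG-263: standardized nomenclature for radiation therapy ...",
]

vectorizer = TfidfVectorizer().fit(chunks)
chunk_vecs = vectorizer.transform(chunks)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the question (TF-IDF cosine)."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, chunk_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(question: str, use_rag: bool) -> str:
    """Assemble the prompt with or without retrieved context, mirroring
    the with/without-RAG comparison reported in the abstract."""
    if not use_rag:
        return f"Answer the question.\n\nQuestion: {question}"
    context = "\n".join(retrieve(question))
    return (
        "Answer using only the context below; state if it is insufficient."
        f"\n\nContext:\n{context}\n\nQuestion: {question}"
    )

def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Stub: replace with a real chat-completion call (e.g., a hosted
    LLaMA-3.1-70B endpoint). The abstract found that lowering temperature
    improved accuracy by up to 5%."""
    raise NotImplementedError

if __name__ == "__main__":
    q = "Which TG report covers linac quality assurance?"
    print(build_prompt(q, use_rag=True))
```

In this setup, scoring a question set with use_rag toggled on and off reproduces the kind of paired comparison the abstract reports for each question category.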
Conclusion: RAG systems can help medical physicists find answers to their questions without reading through entire AAPM TG reports, and this RAG-LLM system can serve as a powerful educational tool. The larger models (GPT-4o-mini, Gemini 1.5 Flash, and LLaMA-3.1-70B) showed similar performance and clearly outperformed smaller models such as LLaMA-3.2-3B.
Keywords: LLM, AI, AAPM Reports, RAG