ESTRO 2025 - Abstract Book

Interdisciplinary – Education in radiation oncology

Results: The RO and GPT-4 generated the highest-quality responses, with no significant difference between them (median composite scores: 0.90 vs. 0.86, p=0.26). However, GPT-4 often provided overly verbose answers (median concision scores: 3.00 vs. 2.80, p<0.0001). The fine-tuned GPT-3.5 model outperformed the base GPT-3.5 model in overall quality (median composite scores: 0.81 vs. 0.74, p<0.05), particularly in concision (median concision scores: 2.80 vs. 2.40, p<0.0001), but often produced overly simplistic answers that lacked nuance (median reliability scores: 4.00 vs. 4.50, p<0.05). The results for all categories are presented in Figure 2. Low-quality responses, mostly characterized by irrelevant or excessively detailed information, were rare: they occurred in 4% of GPT-generated answers (mainly from GPT-3.5) and were absent from RO responses.
