ESTRO 2025 - Abstract Book
Physics - Autosegmentation
2062
Proffered Paper

Quality assurance using out-of-distribution detection for deep learning auto-segmentation of organs at risk in head and neck cancer patients

Joëlle E. van Aalst 1, Tomas M. Janssen 2, Jelmer M. Wolterink 3, Rita Simões 2, Federica C. Maruccio 2, Peter M.A. van Ooijen 1, Johannes A. Langendijk 1, Stefan Both 1, Charlotte L. Brouwer 1

1 Department of Radiation Oncology, University Medical Centre Groningen, Groningen, Netherlands; 2 Department of Radiation Oncology, The Netherlands Cancer Institute, Amsterdam, Netherlands; 3 Department of Applied Mathematics, Technical Medical Centre, University of Twente, Enschede, Netherlands

Purpose/Objective: During retrospective quality assurance (QA) of an in-house deep learning (DL) auto-segmentation model for head and neck cancer (HNC) organs at risk (OARs), we observed inferior model performance on the oral cavity in patients with metal artefact image distortions. We hypothesise that this degradation is caused by under-representation of such cases in the training data, making them out-of-distribution (OOD). Recognising the need for patient-specific auto-segmentation QA, this study explores whether uncertainty quantification can automatically identify OOD cases in which model reliability is reduced.

Material/Methods: We evaluated a computed tomography (CT) 3D nnU-Net auto-segmentation model [1] for 19 HNC OARs [2] on 10 patients with metal artefacts (OOD) and 10 without (in-distribution, ID). To confirm that the reduced oral cavity segmentation quality in metal artefact patients was due to these cases being OOD, we trained an additional model (no-metal model) that excluded metal artefact cases, intentionally degrading OOD performance: the 360 metal artefact patients among the 610 baseline training cases were replaced with non-metal artefact cases, preserving the training-set size while training only on ID data.
Both models were tested on the 10 OOD and 10 ID cases, with Monte Carlo dropout [3] generating a patient-specific voxel-wise uncertainty map for each evaluation patient. Segmentation accuracy (surface Dice) and uncertainty calibration (Expected Calibration Error, ECE) were assessed. The number of voxels per uncertainty map above a threshold of 0.3 was compared between ID and OOD data using the Wilcoxon signed-rank test and an ROC analysis.

Results: In the baseline model, oral cavity segmentation performance was lower in patients with metal artefacts (surface Dice 0.97) than in patients without (0.99). This difference was exacerbated in the no-metal model (0.93 vs 0.98). Uncertainty maps were well calibrated for both models (ECE 0.06-0.07). Figure 1 shows larger uncertainty for OOD than for ID data with the baseline model; this difference was even more pronounced in the no-metal model. Uncertainty effectively distinguished metal artefact patients from standard patients for the no-metal model (AUC 0.89) and, more relevantly, remained effective for the baseline model (AUC 0.82). Figure 2 shows decreased segmentation performance and increased uncertainty for a metal artefact patient with the no-metal model, compared to both the baseline model and a patient without metal artefacts.
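The uncertainty-based OOD test described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the uncertainty measure (standard deviation over dropout passes), the array shapes, and all function and variable names are assumptions; only the 0.3 threshold and the count-based ROC comparison come from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_uncertainty(sample_probs):
    """Voxel-wise uncertainty from T stochastic forward passes.

    sample_probs: array of shape (T, ...) holding foreground
    probabilities predicted with dropout kept active at inference.
    The standard deviation across passes is used here as a simple
    uncertainty measure (the abstract does not specify the exact one).
    """
    return sample_probs.std(axis=0)

def voxels_above_threshold(uncertainty_map, threshold=0.3):
    """Count voxels whose uncertainty exceeds the threshold (0.3 in the abstract)."""
    return int((uncertainty_map > threshold).sum())

def roc_auc(id_scores, ood_scores):
    """ROC AUC of the score as an OOD detector: the probability that a
    random OOD score exceeds a random ID score (the Mann-Whitney U
    formulation of AUC)."""
    ood = np.asarray(ood_scores, dtype=float)[:, None]
    idv = np.asarray(id_scores, dtype=float)[None, :]
    return float((ood > idv).mean() + 0.5 * (ood == idv).mean())

# Synthetic stand-in for 10 ID and 10 OOD patients, T = 10 dropout
# passes each; OOD predictions are made noticeably less stable.
T, shape = 10, (16, 16, 16)
id_counts, ood_counts = [], []
for _ in range(10):
    id_probs = np.clip(rng.normal(0.9, 0.05, (T, *shape)), 0.0, 1.0)
    ood_probs = np.clip(rng.normal(0.6, 0.25, (T, *shape)), 0.0, 1.0)
    id_counts.append(voxels_above_threshold(mc_dropout_uncertainty(id_probs)))
    ood_counts.append(voxels_above_threshold(mc_dropout_uncertainty(ood_probs)))

auc = roc_auc(id_counts, ood_counts)  # near 1.0 for this synthetic separation
```

In the study the same count statistic was additionally compared between groups with a Wilcoxon signed-rank test; with SciPy available, `scipy.stats.wilcoxon(id_counts, ood_counts)` would provide that comparison.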