ESTRO 2024 - Abstract Book
Physics - Autosegmentation
4. A combination of the three methods mentioned above (referred to as "Complex"), and
5. Inference from 5 samples using the variational autoencoder approach of a Probabilistic Hierarchical Segmentation model (referred to as "PhiSeg") [10].
To capture both epistemic and aleatoric uncertainty, all five epistemic uncertainty estimation methods were run with TTA enabled, as in "Baseline", so that aleatoric uncertainty was also captured. We also included the conventional deterministic prediction, obtained without any uncertainty estimation, and labeled it "NO TTA". In total, we compared the 7 uncertainty estimation configurations described above.
For each method, we averaged the predicted softmax outputs over all inferences to create a probability map (P-Map). The segmentations (background, GTV-T, and GTV-N) were obtained by applying an argmax function to the P-Map, and the entropy of the P-Map was used to generate an uncertainty map (U-Map).
In the evaluation phase, we assessed the methods on the test set for both GTV-T and GTV-N, comparing them to the clinical delineations. Segmentation accuracy was measured using the Dice Similarity Coefficient (DSC). Confidence calibration was evaluated using the Expected Calibration Error (ECE), which measures the consistency between the model's predicted probabilities and its observed accuracy; a lower ECE indicates better calibration. Additionally, we assessed the overlap between uncertainty regions (>0.7 on the U-Map) and error regions using DSC, denoted as UE-DSC. Each method was compared statistically with “Baseline” using the Wilcoxon signed-rank test.
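The P-Map, U-Map, and evaluation metrics described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. All function names are hypothetical, and several details are assumptions the abstract does not specify: the entropy is divided by log(C) so that the U-Map lies in [0, 1] (which would make the 0.7 threshold meaningful), the ECE uses 10 equal-width confidence bins over all voxels, and the DSC of two empty masks is defined as 1.

```python
import numpy as np

def probability_map(softmax_samples):
    """Average softmax outputs over inferences: (S, C, ...) -> (C, ...)."""
    return np.mean(softmax_samples, axis=0)

def segment(p_map):
    """Label map from the P-Map via argmax over the class axis."""
    return np.argmax(p_map, axis=0)

def uncertainty_map(p_map, eps=1e-12):
    """Voxel-wise entropy of the P-Map.

    Assumption: normalized by log(C) so values lie in [0, 1].
    """
    n_classes = p_map.shape[0]
    entropy = -np.sum(p_map * np.log(p_map + eps), axis=0)
    return entropy / np.log(n_classes)

def dice(a, b):
    """Dice Similarity Coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumed convention when both masks are empty
    return 2.0 * np.logical_and(a, b).sum() / denom

def ue_dsc(u_map, pred, target, threshold=0.7):
    """Overlap (DSC) between uncertain voxels and misclassified voxels."""
    return dice(u_map > threshold, pred != target)

def expected_calibration_error(p_map, target, n_bins=10):
    """ECE with equal-width confidence bins (assumed binning scheme)."""
    conf = p_map.max(axis=0).ravel()
    correct = (p_map.argmax(axis=0) == target).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of voxels in bin
    return ece
```

Under these assumptions, a per-patient evaluation would stack the softmax outputs of all inferences along the first axis, call `probability_map`, and then derive the segmentation with `segment` and the U-Map with `uncertainty_map` before computing the metrics per structure.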
Results:
For segmentation on the test set (n=97), the median DSC varied only slightly across methods, ranging from 0.73 to 0.76 for GTV-T and from 0.78 to 0.80 for GTV-N. In contrast, the median ECE spanned a broader range, from 0.12 to 0.30 for GTV-T and from 0.09 to 0.25 for GTV-N. The median UE-DSC likewise showed a notable spread, ranging from 0.21 to 0.38 for GTV-T and from 0.22 to 0.36 for GTV-N. These findings are visualized as violin plots in Figure 1. The “Ensemble” method achieved the highest DSC for GTV-T segmentation, while the “Complex” method outperformed the others for GTV-N segmentation. “PhiSeg”, however, was the best calibrated, showing the lowest ECE for both GTV-T and GTV-N, which indicates more reliable confidence estimates. An example comparison of the different uncertainty estimation methods is shown in Figure 2.