ESTRO 2023 - Abstract Book

S83

Saturday 13 May

ESTRO 2023

Purpose or Objective Auto-segmentation with convolutional neural networks (CNNs) is a popular technique in radiotherapy. New models are being constantly developed and deployed, but with a limited availability of highly curated data, an often-asked question is “How much data is needed to successfully train a CNN?”. In this study, we answer that question for head and neck (HN) auto segmentation models and additionally demonstrate an ensemble technique to boost the performance of such models. Materials and Methods 1215 planning CT scans with clinical organ-at-risk (OAR) segmentations for the brainstem, parotid glands and the cervical section of the spinal cord were collected from the archives of a single institution. From these, 215 scans were reserved as a blind test set, leaving a pool of 1000 scans for training. We trained an established head and neck (HN) auto-segmentation CNN[1] from scratch using random subsets of 25, 50, 100, 250, 500, 800 and 1000 from this pool. For each dataset, a five fold cross-validation was performed (Figure 1a). We tested two inference strategies: 1) select the best of the 5 models, which is a standard approach; 2) Use an ensemble of all 5 models, where the predicted class probabilities were summed up before taking the argmax to produce the segmentation (Figure 1b). Both inference strategies were tested on the reserved blind test set. We calculated the bi-directional mean distance-to-agreement (mDTA) and used Wilcoxon signed-rank tests to determine which of the two inference approaches had superior performance.

Results The mDTA results of models for every dataset are shown in Figure 2 for each OAR. As expected, the auto-segmentation performance improves with dataset size, stabilising around 100 scans. However, the ensemble technique is significantly better for all datasets and OARs than merely selecting the best model of the five from cross-validation. The segmentation performance boost of the ensemble technique is most significant as the number of training examples decreases, and quite considerable for the smallest datasets. The 95th percentile Hausdorff distance (HD95) and Dice similarity coefficient (DSC) metrics produced analogous results but are not reported for brevity.

Conclusion

Made with FlippingBook flipbook maker