ESTRO 2024 - Abstract Book

S3082

Physics - Autosegmentation


Material/Methods:

The two DL autocontouring models, previously trained by our group, used the nnU-Net framework [1]. Both were trained on axial T2-weighted MRIs collected from five public prostate datasets [2][3][4][5][6]: one on data balanced (B) by source of domain shift and one on all available data (unbalanced, UB). The unbalanced training set comprised 385 subjects. For the balanced training set, subjects were randomly selected under the constraint of an equal proportion from each domain (scanner vendor, field strength, imaging source), which resulted in 72 subjects. For the external validation dataset, following institutional approval [Ethics approval number: 18/NW/0297], data were obtained from prostate cancer patients treated with low dose rate brachytherapy (LDR BT) between 2017 and 2022. This dataset consisted of diagnostic pelvic MRI scans, with all patients imaged at the same site and with the same scanner and field strength (Siemens 1.5T). These constraints were introduced so that no additional sources of domain shift would affect a secondary objective, which was to investigate whether model performance differed between patients of different races. Because of this secondary objective, the external validation set was limited to White and Black patients only. To control for other potential confounders, we applied a matched-pair design when selecting the White and Black cohorts, such that each pair had similar prostate volume (within ±5 cm³) and age (within ±10 years). This resulted in a total of 66 patients (50% White and 50% Black). All clinical data were anonymised using the RT treatment planning system Varian Eclipse prior to analysis. The Dice similarity coefficient (DSC) was used to compare the ground truth segmentations performed by an expert with the segmentations produced by the DL models. Mann-Whitney U tests were performed to compare DSCs between groups. MATLAB [7] was used to perform the statistical analysis.
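For readers unfamiliar with the metric, the DSC between an expert mask A and a model mask B is 2|A∩B| / (|A| + |B|). The following is a minimal Python sketch of that definition, not the analysis code used in this work (which was run in MATLAB). The function name `dice_similarity`, the flat 0/1 mask representation, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
def dice_similarity(reference, prediction):
    """Dice similarity coefficient between two binary masks.

    Both masks are flat sequences of 0/1 (or False/True) voxel labels of
    equal length, e.g. a flattened 3D segmentation volume.
    """
    if len(reference) != len(prediction):
        raise ValueError("masks must contain the same number of voxels")

    ref_size = sum(1 for r in reference if r)      # |A|
    pred_size = sum(1 for p in prediction if p)    # |B|
    overlap = sum(1 for r, p in zip(reference, prediction) if r and p)  # |A ∩ B|

    if ref_size + pred_size == 0:
        # Assumed convention: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * overlap / (ref_size + pred_size)


# Toy example with made-up voxels: 2 reference voxels, 1 predicted, 1 shared.
expert_mask = [1, 1, 0, 0]
model_mask = [1, 0, 0, 0]
print(dice_similarity(expert_mask, model_mask))
```

A DSC of 1 indicates identical masks and 0 indicates no overlap, so the metric penalises both under- and over-segmentation.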

Results:

The performance of the B and UB models on the external validation data is shown in Table 1. For the whole population, the median DSC was 0.844 (IQR 0.062) for the B model and 0.863 (IQR 0.065) for the UB model. Model performance therefore improved slightly with more training data (UB model), even though the extra data led to unequal proportions from each domain. However, the statistical test results in Table 1 show no statistically significant difference (p > 0.05) in performance between the B and UB models, except when evaluated on all data. There was also no statistically significant difference (p > 0.05) in the performance of either the B or UB model when White and Black patients were compared, suggesting no race bias.
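The group comparisons above rely on Mann-Whitney U tests on per-patient DSCs. The sketch below shows one way such a test could be computed in Python with the standard library only. It assumes a two-sided test using the normal approximation with midranks and a tie-corrected variance, and no continuity correction. This may differ from the MATLAB routine and settings actually used, so p-values need not match exactly. The function name `mann_whitney_u` and the DSC values in the usage example are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (u, p), where u is the U statistic of sample x. No continuity
    correction is applied (an assumption of this sketch).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                         # size of this group of tied values
        midrank = (i + 1 + j) / 2.0       # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0                     # all values identical: no evidence of a shift
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return u, p


# Hypothetical per-patient DSCs for two groups (illustrative values only).
dsc_group_a = [0.84, 0.86, 0.81, 0.88, 0.85]
dsc_group_b = [0.83, 0.87, 0.80, 0.86, 0.82]
u_stat, p_value = mann_whitney_u(dsc_group_a, dsc_group_b)
print(u_stat, p_value)
```

A rank-based test of this kind makes no normality assumption about the DSC distribution, which is bounded and often skewed.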

Table 1
