Interdisciplinary - Other

For qualitative assessment systems to be useful, they must produce consistent scores, and they should therefore be validated prior to use, for example using inter- and intra-rater reliability statistics [2]. However, such validation has rarely been performed. The aims of this study were to develop a novel tumour-specific qualitative evaluation tool and to assess its inter- and intra-rater reliability.

Material/Methods:

Following a literature review, a 5-point Likert scale was developed with a scoring system designed to capture the anticipated need for editing, the time-saving potential and the dosimetric impact of contouring errors. This was adapted to be specific to auto-contours for prostate cancer, covering the prostate CTV, anorectum and bladder. A pilot study with 6 cases and 10 observers was undertaken to refine the scale. Radiotherapy planning CT scans from 24 patients previously treated for prostate cancer were used, and auto-contours for the prostate, anorectum and bladder were generated using Mirada DLC Expert. Since grade prevalence can influence the magnitude of inter-rater reliability statistics, some contours were manually modified by an independent observer to ensure a similar distribution across each score. To assess intra-rater reliability, six cases were repeated under a different patient identifier, giving 30 cases in total, with the repeat cases randomly interleaved in the list. Six clinicians experienced in treating prostate cancer were recruited; each received a 1-hour tutorial on the scoring system, with the opportunity to score practice cases. Inter-rater reliability was calculated using a two-way mixed-effects, single-measures, consistency intraclass correlation coefficient (ICC), and intra-rater reliability was assessed using the weighted kappa coefficient (κ). For both statistics, 0 represents agreement no better than chance and 1 represents complete agreement. All analyses were performed using IBM SPSS Statistics (Version 29). A sample size calculation was performed using the "kappasize" package [3] in R [4], with a null hypothesis of κ = 0.4 against an alternative of κ = 0.6, based on previously published thresholds for acceptable inter- and intra-rater reliability [5, 6]. For 6 raters, assuming an equal distribution of outcomes, a minimum of 24 subjects was required.
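Although the analyses were run in SPSS, both reliability statistics can be reproduced in R. The sketch below uses the "irr" package on an illustrative ratings matrix; the values shown are hypothetical, not study data, and linear kappa weights are assumed since the abstract does not state the weighting scheme.

  library(irr)

  # Hypothetical data: 24 cases scored 1-5 by 6 raters (rows = cases, columns = raters)
  set.seed(1)
  ratings <- matrix(sample(1:5, 24 * 6, replace = TRUE), nrow = 24)

  # Inter-rater reliability: two-way, single-measures, consistency ICC
  # (for consistency ICCs, the mixed- and random-effects estimates coincide)
  icc(ratings, model = "twoway", type = "consistency", unit = "single")

  # Intra-rater reliability: weighted kappa between one rater's first and
  # repeat scores on the six duplicated cases (illustrative values)
  first_score  <- c(2, 3, 5, 1, 4, 3)
  repeat_score <- c(2, 4, 5, 1, 4, 2)
  kappa2(cbind(first_score, repeat_score), weight = "equal")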
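The sample size calculation can likewise be sketched with the "kappasize" package, here assuming its Power5Cats() function for a 5-category outcome; the significance level and power below are conventional assumptions, as the abstract does not report them.

  library(kappasize)

  Power5Cats(
    kappa0 = 0.4,          # null hypothesis for kappa
    kappa1 = 0.6,          # alternative hypothesis
    props  = rep(0.2, 5),  # equal prevalence of each of the five scores
    raters = 6,            # six clinician raters
    alpha  = 0.05,         # assumed significance level
    power  = 0.8           # assumed power
  )

With these inputs the function reports the minimum number of subjects required; the abstract arrived at 24 for this design.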

Results:

A. Development of the tool

Following the initial pilot study and feedback from raters, a simplified scale was developed (see figures).
