ESTRO 2025 - Abstract Book

S3391

Physics - Machine learning models and clinical applications

1959

Digital Poster

Tackling missing values and imbalanced structured data using imputation techniques and synthetic data in colonoscopy prioritization

Laya Rafiee Sevyeri 1, Myriam Martel 2, Charles Ménard 3, Daniel von Renteln 4, Harminder Singh 5,6, Saro Aprikian 7, Dong Hyun Kim 8, Alan N Barkun 9, Shirin Abbasi Nejad Enger 1,10

1 Medical Physics Unit, McGill University, Montréal, Canada.
2 Research Institute of the McGill University Health Centre, McGill University, Montréal, Canada.
3 Division of Gastroenterology and Hepatology, Université de Sherbrooke, Sherbrooke, Canada.
4 Division of Gastroenterology and Hepatology, Université de Montréal, Montréal, Canada.
5 Department of Internal Medicine, University of Manitoba, Manitoba, Canada.
6 Paul Albrechtsen Research Institute CancerCare Manitoba, Manitoba, Canada.
7 McGill University Health Center, McGill University, Montréal, Canada.
8 Department of Medicine, McGill University, Montréal, Canada.
9 Division of Gastroenterology, McGill University Health Center, Montréal, Canada.
10 Lady Davis Institute for Medical Research, Jewish General Hospital, Montréal, Canada.

Purpose/Objective: In Québec, Canada, colonoscopy referrals are made using the AH-702 sheet, which categorizes patients into six priority levels. We applied AI-based models to prioritize these referrals, using a large tabular dataset derived from these sheets that included demographic, pathological, and medical history features. Like most tabular datasets, this structured data contained a large proportion of missing values and highly imbalanced classes. To improve the performance of AI models in prioritizing colonoscopy referrals while addressing these challenges, we used imputation techniques and synthetic data generation.

Material/Methods: The dataset comprised 14,657 patients (7,469 females, 7,188 males) from three Québec hospitals (2018–2022), with an average waiting time of 126.5 days (maximum 1,946 days). It included demographics, waiting times, and diagnostic data, with limited fecal immunochemical test (FIT) (2.6%), blood test (28%), and imaging (6%) results. Only 20 and 308 samples were in the highest and second-highest priority levels, respectively. Approximately 34.5% of patients required semi-elective procedures (≤60 days), while the others were elective (>60 days), screening, or follow-up. The data were split into training and test sets, with the test set reserved for final evaluation.

The table initially included 165 features; after cleaning, the 21 features with the least predictive properties were removed, leaving 144 features. Of these 144 features, 54 had high rates of missing values, with at least half of their values missing (Figure 1). To allow the use of imputation techniques, all unstructured data were converted to categorical values. Five imputation techniques (Mean, MICE, Most Frequent, SoftImpute, HyperImpute) were applied to address the missing values. Additionally, two tabular synthetic data generation models based on generative adversarial networks (DP-GAN [1] and PATE-GAN [2]) were used. Decision tree (DT) and random forest (RF) models were used to evaluate the effectiveness of these techniques in prioritizing colonoscopy referrals.
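As an illustration of the two simplest imputation strategies named above (Mean and Most Frequent), the following is a minimal sketch and not the authors' implementation. The toy rows, the column meanings, and the helper names `missing_fraction`, `fit_fill_values`, and `impute` are all hypothetical. It shows how the per-feature missing rate can be computed, how features with at least half of their values missing can be flagged, and how fill values can be learned from the training set only and then applied to the held-out test set.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data; None marks a missing value.
# Assumed columns: (age, FIT result code, symptom category code).
train_rows = [
    [63, None, 2],
    [55, None, 1],
    [None, 1, 2],
    [71, None, 2],
]
test_rows = [
    [None, None, 1],
    [48, 0, None],
]


def missing_fraction(rows):
    """Return the fraction of missing values for each column."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [sum(row[j] is None for row in rows) / n_rows for j in range(n_cols)]


def fit_fill_values(rows, strategy):
    """Learn one fill value per column from the observed (non-missing) entries.

    Assumes every column has at least one observed value.
    """
    fills = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if strategy == "mean":
            fills.append(mean(observed))
        elif strategy == "most_frequent":
            # Ties are broken by first occurrence in the data.
            fills.append(Counter(observed).most_common(1)[0][0])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return fills


def impute(rows, fills):
    """Replace each missing entry with the fill value of its column."""
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


# Flag features with at least half of their values missing.
fractions = missing_fraction(train_rows)
high_missing = [j for j, frac in enumerate(fractions) if frac >= 0.5]

# Fit on the training set only, then apply to both sets to avoid leakage.
mean_fills = fit_fill_values(train_rows, "mean")
mode_fills = fit_fill_values(train_rows, "most_frequent")
train_imputed = impute(train_rows, mode_fills)
test_imputed = impute(test_rows, mode_fills)
```

Learning the fill values from the training rows alone keeps the reserved test set untouched until final evaluation, in line with the split described above. The more elaborate techniques listed (MICE, SoftImpute, HyperImpute) model relationships between features and are not covered by this sketch.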
