When AI models are deployed clinically in radiology, their generalizability in the real world is often an obstacle. Generalizability measures how well a model adapts to, and performs on, an unseen dataset that differs from the data used for training and testing. One critical cause of reduced generalizability is overfitting, which occurs when a model fails to recognize the persistent, stable patterns shared across datasets and therefore cannot produce consistent results.
What is Overfitting?
Overfitting in AI is a failure of a model to perform in practice despite successful execution during the training phase. A model is overfitted when it has learned every aspect of the training set and cannot differentiate between the real, repeatable patterns of the signal and the irrelevant, changing patterns of the noise. When such a model is applied to future datasets, the learned noise interferes with its ability to make reliable predictions. A non-overfitted model, by contrast, captures the signal accurately and can therefore identify it consistently in varied datasets.
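A toy sketch (plain Python, with made-up data) can make the signal/noise distinction concrete: a 1-nearest-neighbour "memorizer" reproduces its training labels perfectly, including the label noise, so its accuracy on new data falls short of its training accuracy.

```python
import random

random.seed(0)

def true_label(x):
    """The signal: the label depends only on the sign of x."""
    return 1 if x > 0 else 0

# Training set with 20% label noise (the "noise" a good model should ignore)
train = []
for _ in range(50):
    x = random.uniform(-1, 1)
    y = true_label(x) if random.random() > 0.2 else 1 - true_label(x)
    train.append((x, y))

def predict_1nn(x):
    """Memorize: return the label of the closest stored training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

train_acc = sum(predict_1nn(x) == y for x, y in train) / len(train)
xs_test = [random.uniform(-1, 1) for _ in range(200)]
test_acc = sum(predict_1nn(x) == true_label(x) for x in xs_test) / len(xs_test)
# train_acc is exactly 1.0 (every point is its own nearest neighbour),
# while test_acc is lower: the memorized noise is reproduced on new data
```

The memorizer looks flawless during training, yet the flipped labels it stored are passed on to nearby test points, which is exactly the training/deployment gap described above.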
Recognizing and Resolving Overfitting
In radiology practice, AI models are used to quickly detect patterns that identify a particular medical condition, so the models must be generalizable enough to discern the condition when data are resampled across different population groups. One major reason medical studies are prone to overfitting is inadequate training data: training with more data helps the model recognize the common signal of interest. However, gathering larger volumes of medical data is not always possible, owing to patient-confidentiality concerns, the costs of obtaining such data, and differing disease prevalences across populations. Strategies have therefore been devised to make better use of the data already available, such as cross-validation, in which the training set is split into multiple folds used to calibrate and fine-tune the model. Another approach, data augmentation, creates variants of existing images to artificially enlarge the training set, for example by flipping, rotating, or cropping images. These augmentations also help the model identify a condition that may present slightly different phenotypes in affected individuals.
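Both strategies can be sketched in a few lines of plain Python (the fold count, sample size, and tiny "image" below are arbitrary illustrations, not from any study):

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, val_indices) for each of k cross-validation folds."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

splits = list(k_fold_splits(10, 5))   # 5 folds, each validating on 2 samples

def flip_horizontal(image):
    """Data augmentation: mirror each row of a 2-D pixel grid."""
    return [row[::-1] for row in image]

image = [[1, 2],
         [3, 4]]
augmented = flip_horizontal(image)    # a "new" training example from an old one
```

Each fold gives the model a different held-out subset to validate against, so tuning decisions are not anchored to one fixed split; the flipped copy adds variation without collecting a new radiograph.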
Apart from dataset size, the neural architecture of the model itself can be adjusted to counter overfitting. During training, regularization can be applied so that no single feature of the image outweighs the others, for instance by limiting the magnitude of the weights assigned to input features. This reduces the model's dependence on any one image feature and instead encourages agreement among many features before the condition in question is concluded. Consequently, the model generalizes better, detecting patterns that are robust and recognizable across datasets.
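A minimal sketch of weight-magnitude regularization (an L2 penalty, often called weight decay) is shown below in plain Python with made-up data: the term `lam * w[j]` added to each gradient shrinks large weights, so raising `lam` pulls the learned weights toward zero.

```python
def train_linear(xs, ys, lam, lr=0.1, epochs=500):
    """Gradient descent for a linear model with an L2 penalty lam * sum(w^2)/2.

    xs: list of feature vectors; ys: list of targets. Returns learned weights.
    """
    n_features = len(xs[0])
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for x, y in zip(xs, ys):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(n_features):
                grad[j] += err * x[j] / len(xs)
        for j in range(n_features):
            # lam * w[j] is the derivative of the L2 penalty term
            w[j] -= lr * (grad[j] + lam * w[j])
    return w

# Two redundant features carrying the same signal
xs = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
ys = [2.0, 4.0, 6.0]
w_plain = train_linear(xs, ys, lam=0.0)  # converges near [1.0, 1.0]
w_reg = train_linear(xs, ys, lam=1.0)    # smaller weights, shared evenly
```

With the penalty on, neither weight can grow unchecked, and the prediction relies on the agreement of both features rather than on either one alone, mirroring the intuition described above.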
Overfitting in Radiology Applications
Overfitting can be further explored in the context of lung abnormalities. For example, chest radiographs are a common screening method for detecting foreign bodies in patients' lungs. AI technology has been deployed to classify identified foreign bodies, and regression analysis of these models has illustrated potential overfitting problems. Figure 1 below shows three such regression models (red lines) alongside the true underlying function (blue lines). A clear sign of overfitting appears in the third model: the polynomial regression is so complex that its error on the testing dataset is much higher than its error on the training set. This is consistent with the idea that an overfitted model memorizes every aspect of the training set and performs almost perfectly on it, but cannot generalize to new datasets and therefore produces higher error rates.
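The degree effect behind Figure 1 can be reproduced in a short NumPy sketch (synthetic data, not from the cited study): a degree-9 polynomial passes almost exactly through ten noisy training points, driving training error to nearly zero, while held-out points expose the memorized noise.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
# Noisy samples of a smooth underlying function (the "true function")
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test)

def errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    poly = np.poly1d(np.polyfit(x_train, y_train, degree))
    train_mse = float(np.mean((poly(x_train) - y_train) ** 2))
    test_mse = float(np.mean((poly(x_test) - y_test) ** 2))
    return train_mse, test_mse

train3, test3 = errors(3)   # moderate complexity
train9, test9 = errors(9)   # 10 coefficients for 10 points: interpolation
# train9 is near zero because the degree-9 model memorizes the training
# points, yet its test error stays high, since the noise it fit is absent
# from the held-out data
```

Comparing `test9` against `test3` shows the generalization gap: the more complex model wins on the training set and typically loses on the unseen one, just as in the third panel of Figure 1.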
Figure 1. Overfitting in regression models. Figure reproduced from Baltruschat, 2021.

Attempts have consequently been made to reduce the overfitting of models, with both successful and unsuccessful results. For example, data augmentation has commonly been used to mitigate overfitting in deep learning models that detect COVID-19 viral infection in the lungs. Studies found that for 15 CNN models, data augmentation on the training set improved performance by reducing overfitting, but for 17 other models, augmentation provided no added benefit and could even harm models applied to COVID-19 radiographs (Elgendi et al., 2021; Rahaman et al., 2020).
We can therefore conclude that the overfitting of models, and the methods used to reduce it, have important consequences for radiology and for identifying diseases from imaging. Before AI algorithms can be widely deployed in clinical settings, adequate validation is needed to ensure generalizability and control overfitting. High training performance alone cannot establish a model's efficacy; rather, varied external datasets must be supplied to the neural networks to ensure reliable real-world performance.
Elgendi, Mohamed, et al. "The Effectiveness of Image Augmentation in Deep Learning Networks for Detecting COVID-19: A Geometric Transformation Perspective." Frontiers in Medicine, March 1, 2021. https://doi.org/10.3389/fmed.2021.629134.
Rahaman, Md Mamunur, et al. "Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches." Journal of X-Ray Science and Technology (2020): 821–839. https://content.iospress.com/articles/journal-of-x-ray-science-and-technology/xst200715.
Baltruschat, Ivo-Matteo. Deep Learning for Automatic Lung Disease Analysis in Chest X-Rays. TUHH Universitätsbibliothek, 2021. https://doi.org/10.15480/882.3511.