10 June, 2024
Viraj Kulkarni
Artificial intelligence is showing amazing potential across many domains of healthcare, from radiology to patient care. Applications of AI in radiology include detecting tuberculosis from chest X-rays, COVID-19 and intracranial bleeds from CT scans, cancer from mammograms, and brain tumors from MRI, as well as predicting the progression of Alzheimer's disease from PET scans. At DeepTek, we have developed models that can diagnose 30+ conditions from chest X-rays alone. Beyond radiology, AI applications have been proposed in pathology, electronic health record analysis, contact tracing for pandemic response, prediction of readmissions and mortality, and many more areas. The strong research output and the large number of companies actively working in this space are a testament to AI's promise to radically transform healthcare.
Be that as it may, this promise is still largely unrealized. Many exciting machine learning techniques that have shown remarkable results on benchmark datasets lie unused in academic papers and code repositories. Valuations of AI-enabled healthcare startups are soaring while their revenues remain negligible, and only a handful have managed to turn their AI into products with real commercial adoption.
As the Principal Data Scientist at DeepTek, one of the few companies that have succeeded in achieving commercial adoption of their AI solutions, I would like to share the key challenges involved in developing machine learning models that actually work in practice. Here they are, in no particular order.
Not enough data

Yes, we have all heard this before, haven't we? Data scientists perpetually whine about not having enough data, and after a point, the rest of the team becomes deaf to this complaint. It's an inconvenient truth, but there is simply no way around it: if you don't have data, you have nothing. Machine learning projects start with data and end with data; all the innovation happens in between.
Neural networks are data hungry, and they need not just volume but also variety in the data they are fed. Tonnes of medical data exist, but they are severely fragmented: the data resides in silos across hospitals, clinics, personal computers, USB drives, and email inboxes, and a large percentage of it is not even digitized. Even if a hospital grants you access to its data, you will find it incredibly difficult to track, compile, and organize. Data privacy regulations such as HIPAA and GDPR, although very important in their own right, have made things even more difficult.
Noisy labels

Supervised machine learning needs data samples along with labels. The algorithm is first trained on matched sample-label pairs, from which it extracts patterns and distills them into a mathematical model. This model is later used to predict the labels of unseen samples. One difficulty is that these labels are often noisy.
Consider the case of radiologists annotating X-ray images as positive or negative with respect to a given pathology. As data scientists, we would like to believe that this labeling is clear, certain, and unambiguous. The reality is somewhat different. As many studies have shown [1][2], there is a high degree of inter-rater variability between radiologists evaluating the same studies, with kappa values ranging from 0.3 to 0.8. If AI models are built using radiologist annotations as the ground truth, they therefore begin from an uncertain footing where the labels themselves may not be accurate. The problem is compounded when the annotations come not from one radiologist but from a team of radiologists, which is usually the case when working with large datasets.
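To make this concrete, here is a minimal sketch of how inter-rater variability can be quantified with Cohen's kappa. The annotations below are hypothetical, invented purely for illustration.

```python
# Quantifying radiologist agreement with Cohen's kappa (illustrative data).
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotations from two radiologists on the same ten X-rays
# (1 = positive for the pathology, 0 = negative).
radiologist_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
radiologist_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

kappa = cohen_kappa_score(radiologist_a, radiologist_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.40 here: only moderate agreement
```

A kappa of 1.0 means perfect agreement and 0.0 means agreement no better than chance, so values in the reported 0.3 to 0.8 range leave considerable uncertainty in the ground truth itself.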
The open-source research datasets (NIH, MIMIC, PadChest, etc.) come with labels that have been automatically extracted from the radiology reports using NLP algorithms. The NLP extraction process may introduce errors into the labels, but there is another, subtler problem with using them. When the radiologist evaluated the image and wrote the report, he or she had access to the patient's clinical information, and the diagnosis factored in that information. This additional information, however, is not available for the AI to train on. In its absence, the AI may not be able to learn the right correspondence between the samples and the labels.
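As a toy illustration of the first issue (this is not the actual labeler used for those datasets, which is considerably more sophisticated), naive keyword matching shows how easily extraction errors creep in, for example when a report negates a finding:

```python
# A deliberately naive report labeler: keyword matching without negation
# handling, to show one way NLP-extracted labels become noisy.
import re

reports = [
    "Opacity in the right lower lobe, suspicious for pneumonia.",
    "No evidence of pneumonia or pleural effusion.",
]

for report in reports:
    # 1 if the word "pneumonia" appears anywhere, else 0.
    naive_label = int(bool(re.search(r"pneumonia", report, re.IGNORECASE)))
    print(naive_label, "<-", report)

# Both reports receive label 1, although the second explicitly rules
# the finding out.
```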
Mismatch between training and deployment data

Oh, this happens all the time! Consider the following situation. We get a large collection of X-ray images from a hospital. Our radiologists annotate this data for a given pathology, say tuberculosis. The prevalence of tuberculosis in this dataset is 25%, i.e., one-fourth of all samples are labeled as TB. We train our TB model using this dataset. As part of a new project, we deploy this model at a healthcare center conducting population screening for TB. The prevalence of TB in this screening program is, say, 5%. How will the model perform?
Not very well. The model was trained and evaluated at a particular prevalence, and it implicitly expects the same prevalence at test time. Of course, there are ways around this, but there are many more problems too. The X-ray equipment in the hospital may differ from that at the healthcare center. The patient positioning may differ. The hospital images may show tubes and chest leads. The screening images may include foreign objects like coins, pins, jewelry, and clips, since they may be taken with people keeping their clothes on. Due to these factors, the data the model was trained on may differ significantly from the data it is expected to make predictions on.
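A back-of-the-envelope calculation shows how strongly prevalence alone affects the model's usefulness. Assuming, purely for illustration, a model with 90% sensitivity and 90% specificity:

```python
# Positive predictive value (PPV) as a function of prevalence,
# for a fixed, hypothetical sensitivity and specificity.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

sens, spec = 0.90, 0.90  # illustrative assumptions, not measured values
print(f"PPV at 25% prevalence: {ppv(sens, spec, 0.25):.2f}")  # 0.75
print(f"PPV at  5% prevalence: {ppv(sens, spec, 0.05):.2f}")  # 0.32
```

At 5% prevalence, roughly two out of every three positive calls are false alarms, even though the model itself has not changed at all.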
Changing data distributions

This is related to the above problem but demands a section of its own. Processes and workflows in healthcare are in a state of constant change. Hospitals change. Doctors change; the younger generation of doctors and medical practitioners is not shy about using technology. Patients change. The net effect of all this is that data distributions change all the time. A model, once developed and deployed, is never going to work smoothly for eternity. In fact, the adoption of AI may itself lead to radical changes in the hospital workflow, causing further changes in the data distributions.
Distribution drift (also known as concept drift or dataset shift) occurs when the data distribution changes over time. For AI solutions to be deployed successfully in practice, they need to be equipped with techniques that automatically detect when drift occurs and then retrain and update the models to account for it.
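As a minimal sketch of such a check (the score distributions below are simulated placeholders), one can compare the model's output scores from validation time against recent production scores using a two-sample Kolmogorov-Smirnov test:

```python
# Detecting drift in the distribution of model output scores with a
# two-sample Kolmogorov-Smirnov test. Simulated data stands in for real scores.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
validation_scores = rng.beta(2, 5, size=1000)  # scores at validation time
production_scores = rng.beta(3, 4, size=1000)  # scores from recent deployment

stat, p_value = ks_2samp(validation_scores, production_scores)
if p_value < 0.01:
    print(f"Possible drift (KS statistic={stat:.3f}, p={p_value:.1e}): "
          "flag the model for review and retraining.")
```

In a real pipeline, the same idea can be applied on a rolling window to input statistics (pixel intensity histograms, metadata fields) as well as output scores.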
Confounding variables

Machine learning algorithms are optimized to achieve the best performance on the given dataset. Although we hope that the models learn the right set of features during training, this may not always be the case. Models are known to stealthily exploit features that may or may not be relevant to the prediction. For instance, if the positive images in your dataset predominantly come from one hospital and the negative images from another, the model may learn to differentiate the source of the image rather than the target pathology. Instead of learning to separate the classes based on visual signs of the pathology, it may pick up on contrast differences between the images, text markers outside the main region of the scan, black borders around the scan, or other such irrelevant features (also called confounders). Such a model may even demonstrate excellent performance on a benchmark set if that set carries the same confounding features.
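One way to probe for this kind of leakage, sketched below under the assumption that a hospital-of-origin label is available for each image (the arrays here are random placeholders), is to check whether a simple classifier can predict the source hospital from the images at all:

```python
# Confounder probe: if a simple model can tell which hospital an image
# came from, site-specific artifacts exist that a pathology model could
# exploit as a shortcut. Random placeholder data stands in for real images.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
images = rng.normal(size=(200, 64 * 64))  # flattened image arrays (placeholder)
hospital = rng.integers(0, 2, size=200)   # 0/1 source-hospital labels (placeholder)

accuracy = cross_val_score(
    LogisticRegression(max_iter=1000), images, hospital, cv=5
).mean()
# On real data, accuracy well above chance signals site leakage that the
# pathology model could latch onto.
print(f"Hospital-of-origin accuracy: {accuracy:.2f}")
```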
Inadequate understanding of how the AI solutions will be used in practice

In many organizations, both in research and in industry, technology developers work at arm's length from medical practitioners. Due to this gap, the solution they develop solves a problem that is qualitatively different from the one that needs to be solved to make the solution practically useful to the medical staff. Identifying the right stakeholders, eliciting through dialogue the problems they face, translating these into a set of clear requirements, formulating them objectively, and then designing a solution that fulfills these requirements is no easy task, especially once you count the number of people involved in this elongated process.
Artificial intelligence can revolutionize healthcare. By removing bottlenecks in processes, automating routine tasks, improving diagnostic accuracy, boosting human productivity, and ultimately reducing costs, AI can be a game-changer in making healthcare accessible to all. To realize this potential, we need to acknowledge the above challenges, give them the respect they deserve, and find ways to mitigate them. At DeepTek, these challenges are a part of our daily lives. To learn how we overcome them, reach out to us, and we will be happy to discuss more.
If you liked the article or want to know more, email me at viraj@berkeley.edu.