
Abnormality Detection in Musculoskeletal Radiographs Using Deep Learning

Updated: Jul 8, 2020

Rohit Lokwani

A brief overview of medical image analysis and techniques to improve your CNN’s performance on real-world datasets

Biomedical image analysis is an interdisciplinary field that draws on biology, physics, medicine and engineering. It deals with the application of image processing techniques to biological or medical problems. The medical images to be analysed contain a lot of information about the anatomical structure under investigation; this information supports a valid diagnosis and thereby helps doctors choose an adequate therapy. Doctors usually analyse these medical images manually through visual interpretation.

Photo from an-introduction-to-biomedical-image-analysis-with-tensorflow-and-dltk showing Examples of medical images

From top left to bottom right : Multi-sequence brain MRI: T1-weighted, T1 inversion recovery and T2 FLAIR channels; Stitched whole-body MRI; planar cardiac ultrasound; chest X-ray; cardiac cine MRI.

Why computer vision and machine learning?

Computer vision methods have long been employed to automatically analyze biomedical images. Deep learning has recently replaced many traditional machine learning methods because it avoids the need for hand-engineered features, removing a critical source of error from the process. Additionally, faster GPU-accelerated networks allow us to scale analysis to unprecedented amounts of data.

Abnormality Detection

The main topic of this blog is abnormality detection in bones. Diseases and injuries are the major causes of bone abnormalities. Whenever a bone is injured, the physician asks for an X-ray, so with hundreds of such patients visiting hospitals every day, a massive number of X-rays are taken on a regular basis. To be specific, musculoskeletal conditions affect more than 1.7 billion people worldwide and are the most common cause of severe, long-term pain and disability, with 30 million emergency department visits annually and increasing. An AI solution could therefore help reduce radiologists’ error rate and speed up the analysis.

Dataset Used

Stanford hosted a deep learning competition last year in which participants had to detect bone abnormalities. Its dataset is widely known as MURA. MURA is a dataset of musculoskeletal radiographs consisting of 14,863 studies from 12,173 patients, with a total of 40,561 multi-view radiographic images. Each study belongs to one of seven standard upper extremity radiographic study types: elbow, finger, forearm, hand, humerus, shoulder, and wrist. You can download the dataset from the official contest website.

Source: Stanford Contest site

As AI in the medical domain is booming, I decided to get hands-on experience with this dataset. The models I developed performed well, attaining a kappa roughly equal to that of the Stanford team and outperforming it for some bones; I’ll share the comparison table at the end of this blog. The rest of the blog focuses on the techniques I used to improve my models’ performance. You can try these techniques on other datasets as well.

To start with, I built a separate model for each of the seven study types. The models that worked for me were DenseNet169 (the Stanford team used an ensemble of five such models) and DenseNet121. My baseline models were DenseNets pretrained on the ImageNet dataset and fine-tuned on this dataset through transfer learning. Sadly, the results weren’t close to the Stanford kappa: the average kappa I achieved was around 0.5. So I started browsing the internet for tips and tricks to get more out of my CNNs.
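Since kappa is the yardstick used throughout this post, here is a minimal sketch of how Cohen's kappa can be computed for a binary classifier. The function name and the toy labels are my own illustrations, not MURA data or the contest's evaluation script.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length sequences of binary labels (0/1)."""
    n = len(y_true)
    # Observed agreement: fraction of samples where prediction and label match.
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: probability that both say 1 plus probability that
    # both say 0, if labels and predictions were independent.
    true_pos_rate = sum(y_true) / n
    pred_pos_rate = sum(y_pred) / n
    p_chance = (true_pos_rate * pred_pos_rate
                + (1 - true_pos_rate) * (1 - pred_pos_rate))
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (p_observed - p_chance) / (1 - p_chance)


# Toy labels, made up for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 3))
```

Here the accuracy is 0.8, but kappa is lower because part of that agreement would be expected by chance alone.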

The following are the techniques I used to improve my models’ performance and bring the kappa up to 0.705.

Learning Rate Schedulers

The learning rate is a hyperparameter that controls how much the model changes in response to the estimated error each time the weights are updated. Choosing it is challenging: a value that is too small may result in a long training process that could get stuck, whereas a value that is too large may result in learning a sub-optimal set of weights too fast or in an unstable training process. It therefore becomes necessary to find a good learning rate. Although libraries like fastai have built-in functions to find the ideal learning rate, here I use the traditional methods. The two techniques I used were:

1. Step Decay: This technique reduces the learning rate after a predefined number of epochs. You can write your own implementation; mine was as follows:

import math

def step_decay(epoch):
    # Multiply the learning rate by `drop` once every `epochs_drop` epochs
    initial_lrate = 0.0001
    drop = 0.1
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate

This function multiplies the learning rate by 0.1 after every 10 epochs, so each drop leaves it at one tenth of its previous value. It can be included in the callbacks as follows:

from keras.callbacks import LearningRateScheduler

lrate = LearningRateScheduler(step_decay)
callbacks = [lrate]
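To see what the schedule produces, the sketch below evaluates the same formula at a few epochs. Exposing the constants as keyword arguments and the name step_decay_at are my own changes for illustration; the defaults are the values from the function above.

```python
import math


def step_decay_at(epoch, initial_lrate=0.0001, drop=0.1, epochs_drop=10.0):
    """Same formula as step_decay, with the constants exposed as arguments."""
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))


# Keras passes zero-based epoch indices to the scheduler.
for epoch in (0, 8, 9, 18, 19):
    print(f"epoch {epoch:2d}: lr = {step_decay_at(epoch):.0e}")
```

Because of the (1 + epoch) term, the first drop happens at zero-based epoch 9, that is, the tenth epoch: the rate is about 1e-4 for epochs 0 to 8, about 1e-5 for epochs 9 to 18, and about 1e-6 from epoch 19.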

2. Loss-based Decay: This technique reduces the learning rate once the model has stopped improving for a predefined number of epochs (the patience). Keras has a built-in callback named ReduceLROnPlateau, which is shown in the code below.

from keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10,
                              min_delta=0.0001, verbose=1, min_lr=0.0000001)
callbacks = [reduce_lr]

The above callback monitors the validation loss and, once the loss has not decreased for a patience of 10 epochs, multiplies the learning rate by a factor of 0.1. Here, my model had an initial learning rate of 0.001, and this callback could lower it as far as the min_lr floor set in the code above. For me the latter technique proved to be more effective. For a detailed overview of how to optimise a model’s performance through the learning rate, you could skim through this impressive article.
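The plateau rule itself is simple enough to sketch in plain Python. This is a simplified re-implementation for illustration, not Keras's actual code: it ignores the cooldown option, and the function name and loss values are made up. The default arguments mirror the settings quoted above.

```python
def reduce_on_plateau(val_losses, initial_lr=0.001, factor=0.1,
                      patience=10, min_delta=0.0001, min_lr=0.0000001):
    """Return the learning rate in effect after each epoch's validation loss."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the last meaningful improvement
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Improvement larger than min_delta: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau: shrink the rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Made-up losses: three improving epochs followed by ten flat ones.
val_losses = [0.9, 0.8, 0.7] + [0.7] * 10
history = reduce_on_plateau(val_losses)
print(history[-2], history[-1])
```

The rate stays at 0.001 through the first nine flat epochs and drops to roughly 0.0001 on the tenth, when the patience runs out.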

Class Weights Penalisation

This is a technique I came across while reading about optimizing a model’s performance under class imbalance. It’s not that useful when the imbalance is small, but in my case the positive cases numbered approximately two thirds of the negative ones, so I used it to train the model. There are two ways you could do this:

1. Set the class weights manually for each of the classes by calculating the imbalance in the training set.

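class_weight = {0: 1, 1: 1.5}
# model is the compiled Keras model; X_train and Y_train are the training images and labels
model.fit(X_train, Y_train, epochs=50, batch_size=32, class_weight=class_weight)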

2. The other method uses scikit-learn, which comes with a function that automatically computes the class weights from the training data:

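import numpy as np
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# Keras expects a dict mapping class index to weight (assumes labels are 0, 1, ...)
class_weights = dict(enumerate(class_weights))
model.fit(X_train, y_train, class_weight=class_weights)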

An answer on Stack Exchange to a question about computing class weights for training a model explains it as follows:

Class weight penalisation is done in order to force your algorithm to treat every instance of class 1 as 1.5 instances of class 0. Basically, “treats every instance of class 1 as 1.5 instances of class 0” means that in your loss function you assign higher value to these instances. Hence, the loss becomes a weighted average, where the weight of each sample is specified by class_weight and its corresponding class.
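To make this concrete, here is a small sketch of both steps: the 'balanced' heuristic, n_samples / (n_classes * count_of_class), and a loss computed as a weighted average. The function names and toy labels are mine, and the normalisation Keras applies internally may differ, so treat this as an illustration of the idea, not of the library's exact arithmetic.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_binary_crossentropy(y_true, y_prob, class_weight):
    """Weighted average of per-sample log loss; probabilities must lie in (0, 1)."""
    total, weight_sum = 0.0, 0.0
    for t, p in zip(y_true, y_prob):
        loss = -(t * math.log(p) + (1 - t) * math.log(1 - p))
        w = class_weight[t]  # each sample is weighted by its true class
        total += w * loss
        weight_sum += w
    return total / weight_sum


# Toy labels with positives at two thirds the number of negatives.
labels = [0] * 6 + [1] * 4
weights = balanced_class_weights(labels)
ratio = weights[1] / weights[0]
print(weights, ratio)
```

With positives at two thirds of the negatives, the balanced weights come out in a ratio of 1.5 to 1, which matches the manual {0: 1, 1: 1.5} setting.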

Trying out Various Models

There are various models available in Keras. Although some of them perform better than others in most cases, you can never really pick the best one in advance. So every time you tackle a problem, the smart approach is to try several models.

For example, the Stanford team used DenseNet169 to train the models for all the bones, but I also tried DenseNet121 for all of them and found that for Finger and Hand, DenseNet121 outperformed DenseNet169 by a wide margin. The others I tried included ResNet50 and InceptionV3, but these never performed up to the mark.

A picture from a Medium article showing the overview of DenseNet architecture

Data Augmentation

This technique is used when the dataset contains few samples or the images have varied orientations. Here, the task was to build a binary classifier to determine whether a bone was abnormal or normal. Although the dataset had around 1000–2000 positive-class samples for each bone, the orientation still differed between images. Following are sample images from the Hand data.

Some pictures from the Hand data showing different rotated images and shifted ones

From such images it is easy to see that they are rotated at different angles; if you also shift and flip the images vertically and horizontally, you increase the variety of images your model sees. I tried various augmentations, and rotations and lateral inversions were the ones that helped improve my model’s performance.

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    # height_shift_range=0.2
)

You can train the model with this generator using fit_generator in Keras (recent Keras versions accept generators directly in fit). For a detailed overview of the image data generator, you can visit this link. Alternatively, you can write your own batch generator and use open source libraries like ‘imgaug’ for data augmentation.
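If you write your own generator, the core of it is a handful of geometric transforms applied at random. The sketch below shows the idea on a tiny nested-list "image" in plain Python. It is a toy illustration with hypothetical helper names, restricted to flips and 90-degree rotations, whereas a real pipeline would work on arrays and rotate by arbitrary angles.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right (lateral inversion)."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows."""
    return image[::-1]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image, rng):
    """Apply a random combination of flips and quarter-turn rotations."""
    if rng.random() < 0.5:
        image = flip_horizontal(image)
    if rng.random() < 0.5:
        image = flip_vertical(image)
    for _ in range(rng.randrange(4)):
        image = rotate_90(image)
    return image


image = [[1, 2],
         [3, 4]]
rng = random.Random(0)  # seeded so the run is repeatable
variants = [augment(image, rng) for _ in range(8)]
```

Each variant contains the same pixels in a different arrangement, so the label stays valid while the model sees more orientations.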

Various Optimizers

You can try different optimizers, such as SGD or Adam, as part of hyperparameter tuning. I have always used ‘rmsprop’ for training, whereas the Stanford team used ‘Adam’ with some specific parameters changed. I haven’t had any luck improving the model’s performance by changing optimizers, but why not try if you have the options as well as the time?


Ensembling

Ensembling is a technique that lets you make decisions by aggregating the predictions from different models. Ensembles tend to work better than single models because every model learns different features depending on its architecture. So, in this case, you could ensemble all your best performing models to get better results. There are two common ways of ensembling: one is to average the predictions from the models, and the other is to take the majority vote. The former depends solely on the probabilities, or the confidence with which the class is predicted, whereas the latter depends on the class predicted by the largest number of models in the ensemble. Neither method is considered the better of the two; it all depends on what works in your case.
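The two schemes can disagree, which a few lines of plain Python make visible. This is a minimal sketch with made-up probabilities; the function names, the 0.5 threshold and the rule that ties count as negative are my own assumptions.

```python
def average_ensemble(model_probs, threshold=0.5):
    """Average each sample's probabilities across models, then threshold."""
    n_models = len(model_probs)
    averaged = [sum(sample) / n_models for sample in zip(*model_probs)]
    return [int(p >= threshold) for p in averaged]


def majority_vote(model_probs, threshold=0.5):
    """Threshold each model first, then predict the class most models chose."""
    n_models = len(model_probs)
    votes = [sum(int(p >= threshold) for p in sample)
             for sample in zip(*model_probs)]
    # Positive only with a strict majority; ties count as negative.
    return [int(2 * v > n_models) for v in votes]


# Rows are models, columns are samples (made-up probabilities).
model_probs = [
    [0.95, 0.20, 0.60],
    [0.40, 0.30, 0.70],
    [0.45, 0.10, 0.35],
]
averaged = average_ensemble(model_probs)
voted = majority_vote(model_probs)
print(averaged, voted)
```

For the first sample, one very confident model pulls the average above 0.5 even though two of the three models vote negative, so averaging predicts positive while voting predicts negative.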

Other methods you could work with include stacking, that is, collecting the features from models and feeding them into other models to get better results. You could also tune the threshold applied to the predicted probabilities when deciding whether the model predicts the positive class. All of this should be done on the validation set and later tested on a separate test set.
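Threshold tuning can be sketched as a small search over candidate values on the validation set. The data, the candidate list and the use of accuracy as the default metric are assumptions for illustration; any metric with the same signature, such as kappa, could be passed instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_threshold(y_val, val_probs, candidates, metric=accuracy):
    """Return the candidate threshold with the highest validation score."""
    def score(threshold):
        preds = [int(p >= threshold) for p in val_probs]
        return metric(y_val, preds)
    # On ties, max() keeps the first candidate in the list.
    return max(candidates, key=score)


# Made-up validation labels and predicted probabilities.
y_val = [0, 0, 0, 1, 1, 1]
val_probs = [0.10, 0.35, 0.38, 0.42, 0.70, 0.90]
candidates = [0.3, 0.4, 0.5, 0.6]
chosen = best_threshold(y_val, val_probs, candidates)
print(chosen)
```

The chosen threshold should then be frozen and applied unchanged to the test set, so that the test score stays an honest estimate.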

Source: A paper on Semantic Scholar on Deep Learning of Feature Representation


The comparison of the Stanford team’s results with the results from my models is as follows

As you browse the internet, you will find tons of problems that deep learning could solve and umpteen organizations hosting competitions on various platforms. Just pick a dataset and try building models as solutions to real-world problems using your data science expertise.

P.S. The above techniques are not specific to the medical domain and are also useful for improving your models on other datasets. As is widely said in data science, “At the end of all of it, in a nutshell, it’s your intuition that really matters.” So build your intuition and use it in practice to get the best results.


[1] Martin Rajchl (Jul 4, 2018), An Introduction to Biomedical Image Analysis with TensorFlow and DLTK
[2] S. Renukalatha and K. V. Suresh (2018), A Review on Biomedical Image Analysis
[3] Stanford MURA Official Contest Website
[4] Pranav Rajpurkar, Jeremy Irvin, Aarti Bagul, Daisy Ding, Tony Duan, Hershel Mehta, Brandon Yang, Kaylie Zhu, Dillon Laird, Robyn L. Ball, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, Andrew Y. Ng (22 May 2018), MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs
[5] Jason Brownlee (January 25, 2019), Understand the Impact of Learning Rate on Neural Network Performance
[6] A thread on class weight penalization on Stack Exchange
[7] Keras documentation
