
Beyond AUROC: Incorporating localization information using the Free-Response ROC paradigm

-Aalhad Patwardhan

10 June, 2024

INTRODUCTION:

The Receiver Operating Characteristic (ROC) is a widely used method for assessing the performance of binary classification models. In clinical practice, ROC curves are frequently used to illustrate the trade-off between sensitivity and specificity at various threshold settings, indicating how well a model can differentiate between disease classes. For every data point, the model produces one prediction value between 0 and 1, which is compared against a threshold to build the confusion matrix for the dataset. Varying this threshold yields a range of metrics such as sensitivity (the true positive rate, TPR), specificity, and the false positive rate (FPR). The ROC curve plots TPR against FPR, which helps in choosing the best operating threshold. The area under the curve (AUC) of this plot is used as the figure of merit (FOM) of the model: the higher the AUC, the better the model is at predicting class 0 (no disease) as 0 and class 1 (disease) as 1. To adapt ROC and AUC to segmentation tasks, the pixel with the highest prediction probability can be used as the prediction value for the entire image.
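To make that last step concrete, here is a minimal Python sketch (assuming NumPy and scikit-learn) of reducing per-pixel probability maps to image-level scores and computing ROC/AUC. The prediction maps and labels below are made-up stand-ins, not real data:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Illustrative stand-ins: ten 64x64 per-pixel probability maps and
# their image-level ground-truth labels (1 = disease, 0 = no disease).
pred_maps = [rng.random((64, 64)) for _ in range(10)]
labels = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])

# Reduce each map to one image-level score: the highest pixel probability
# anywhere in the image.
scores = np.array([m.max() for m in pred_maps])

# Standard ROC machinery now applies as for any binary classifier.
fpr, tpr, thresholds = roc_curve(labels, scores)
auc = roc_auc_score(labels, scores)
print(f"AUC-ROC: {auc:.3f}")
```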

However, one downside to this method for segmentation tasks is the loss of the mask location inside the image. This additional aspect of locating desired object(s) in an image distinguishes image segmentation and object detection from standard classification tasks. AUC-ROC scores alone would not tell the reader whether the model is correctly predicting the location of the desired object(s).

Figure 1: (A) A chest X-ray of an adult male with lung nodules marked by the radiologist establishing the ground truth (indicated by square boxes). (B) Lung nodules predicted by a second radiologist in the same image. Under the ROC paradigm, this image would be considered a true positive (for the presence of lung nodules), whereas under the FROC paradigm it would count as two lesion localizations and two non-lesion localizations.

Figure 1(B) clearly shows that the second radiologist does not correctly localize all of the abnormalities. Yet, because ROC is computed from a single image-level prediction, the classification report would still record this image as a true positive, which is not the desired outcome.

Tasks in which elements of an image (such as lesions or marks) need to be located are called localization tasks. For such tasks, we would like insight into the model's localization performance in addition to its detection performance. This is where Free-response ROC plots are beneficial.

Free-response Receiver Operating Characteristic (FROC):

FROC considers the locations of the predicted marks with respect to the ground-truth marks, and it can accommodate any number (>= 0) of correct and incorrect localizations on the same image. There are four main concepts that define the FROC paradigm (Chakraborty 2013); a short code sketch applying these definitions follows the list:

  • Proximity Criterion: The criterion that determines how far a predicted mark can deviate from the ground-truth mark and still be considered an LL (see below). The most common approach is an acceptance radius: for (ideally) spherical lesions, a clinically acceptable radius extends from the centre of the lesion, and an AI mark that falls inside this region satisfies the proximity criterion.

  • Lesion Localization or True Positive (LL or TP): A correctly localized lesion, i.e., a predicted mark whose rating is above the threshold and which satisfies the proximity criterion of a ground-truth mark.

  • Non-Lesion Localization or False Positive (NL or FP): An incorrectly localized lesion, i.e., a predicted mark whose rating is above the threshold but which does not satisfy the proximity criterion of any ground-truth mark.

  • Mark-Rating: Each LL and NL is assigned a rating that represents the confidence of the observer in that localization mark.
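Here is a minimal sketch of how these definitions might be applied in code, assuming marks are point coordinates with confidence ratings and the proximity criterion is a simple acceptance radius. The helper name classify_marks and all coordinates, ratings, and the radius are illustrative:

```python
import numpy as np

def classify_marks(pred_marks, gt_centres, radius):
    """Split predicted marks into LLs and NLs by the acceptance radius.

    pred_marks: list of (x, y, rating) tuples; gt_centres: list of (x, y).
    A mark is an LL if it falls within `radius` of any ground-truth
    lesion centre, otherwise it is an NL.
    """
    lls, nls = [], []
    for x, y, rating in pred_marks:
        dists = [np.hypot(x - gx, y - gy) for gx, gy in gt_centres]
        if dists and min(dists) <= radius:
            lls.append(rating)
        else:
            nls.append(rating)
    return lls, nls

# Two ground-truth lesions; three marks, one of which misses both.
lls, nls = classify_marks(
    pred_marks=[(10, 12, 0.9), (40, 41, 0.7), (80, 5, 0.6)],
    gt_centres=[(11, 11), (42, 40)],
    radius=5.0,
)
print("LL ratings:", lls, "NL ratings:", nls)
```

Note that for simplicity this sketch would credit multiple marks to the same lesion; a strict FROC scoring implementation would count each lesion as localized at most once.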

Figure 2: Example of TP, FP, and proximity criterion. Source: Bandos et al. 2013

As with ROC, these metrics are evaluated at a range of thresholds and used to plot a curve. This curve depicts the performance of the classifier in terms of the rate of LLs against the rate of NLs.

The FROC curve plots the LLF against the NLF for different thresholds. The LLF (lesion localization fraction) is the total number of LLs across all subjects divided by the total number of lesions across all subjects. The NLF (non-lesion localization fraction) is the total number of NLs across all subjects divided by the total number of subjects.
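Concretely, a minimal sketch of computing these operating points by sweeping a threshold over pooled mark ratings; the ratings and dataset totals below are made up:

```python
import numpy as np

ll_ratings = np.array([0.9, 0.8, 0.7, 0.4])       # ratings of all LLs, pooled
nl_ratings = np.array([0.6, 0.5, 0.3, 0.2, 0.1])  # ratings of all NLs, pooled
n_lesions, n_subjects = 5, 4                      # dataset totals

# Sweep every observed rating as a threshold, from strict to lenient.
thresholds = np.sort(np.concatenate([ll_ratings, nl_ratings]))[::-1]
llf = [(ll_ratings >= t).sum() / n_lesions for t in thresholds]
nlf = [(nl_ratings >= t).sum() / n_subjects for t in thresholds]
print(list(zip(nlf, llf)))  # (x, y) points of the FROC curve, left to right
```

Note that at the most lenient threshold the NLF here reaches 5/4 = 1.25, exceeding 1; this is exactly the scaling issue discussed next.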

While ROC provides a straightforward and trustworthy figure of merit (FOM) for assessing an observer's performance, viz. the area under the curve (AUC), FROC has a drawback in this regard. Even though LLF always lies in [0, 1], NLF does not, as the total number of false positives may exceed the total number of subjects; its maximum will also differ between observers. Because the lengths of the curves don't match (see Figure 3), there is no natural FOM for FROC curves. There have been attempts to standardise an FROC FOM (e.g., Bandos et al. 2009) to address this issue, but the most commonly adopted option is to plot the Alternative FROC (AFROC) instead.

Alternative Free-response ROC (AFROC):

AFROC can be thought of as a hybrid of the ROC and FROC plots. It conveys the same information as FROC while also admitting a well-defined FOM, because the AFROC curve fits exactly inside the unit square, i.e., both the x-axis and the y-axis of the plot have a range of [0, 1]. This is achieved by plotting the LLF against the FPF (false positive fraction): the total number of false positive images, i.e., normal subjects carrying at least one NL, divided by the total number of normal subjects. This is equivalent to the FPR metric in ROC.
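The x-axis can be sketched from the highest-rated NL on each normal subject, since a normal image counts as a false positive at a threshold exactly when its top NL rating reaches that threshold. The per-subject maxima below are illustrative:

```python
import numpy as np

# Highest NL rating on each normal subject (0.0 = no NL marks at all).
max_nl_per_normal = np.array([0.6, 0.3, 0.0, 0.5])
thresholds = np.array([0.7, 0.5, 0.3, 0.1])

# FPF at a threshold: fraction of normal subjects whose top NL reaches it.
fpf = [(max_nl_per_normal >= t).mean() for t in thresholds]
print(fpf)  # x-axis values of the AFROC curve, always within [0, 1]
```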

Figure 3: Comparison of (A) FROC and (B) AFROC curves for detection of abnormalities in extracolonic soft tissue using computed tomography colonography (CTC) with SD (standard dose) and ULD (ultra-low dose) radiation. Source: Thorén et al. 2021

As shown in Fig. 3A, the curves for SD and ULD in FROC end at different points, making it difficult to define a standardized FOM for comparing them.

In the AFROC plot (Fig. 3B), because both axes are normalized to the range [0, 1], the comparison between SD and ULD becomes more intuitive, and a FOM such as AUC can be used.

The proponents of AFROC (Chakraborty and Winter 1990) have suggested that AUC is a valid FOM for AFROC, analogous to the AUC-ROC score. The diagonal line from (0,0) to (1,1) serves as a useful baseline for model performance. There are other methods to calculate the AUC under the AFROC curve, but the most commonly used one is called JAFROC (Jackknife AFROC).

Jackknife AFROC (JAFROC):

JAFROC is equivalent to the trapezoidal AUC of an AFROC curve. It essentially pairs every LL with the highest-rated NL on each normal subject, calculates a score for each pairing based on the two ratings, sums the scores over all pairings, and divides by the product of the number of lesions and the number of normal images.

The score function is:

$$\psi(x, y) = \begin{cases} 1.0 & \text{if } x > y \\ 0.5 & \text{if } x = y \\ 0.0 & \text{if } x < y \end{cases}$$

Here $x$ is the rating of an LL anywhere in the data and $y$ is the rating of the highest-rated NL on a normal case. If we define $C_{Lesion}$ as the total number of lesions and $C_{Normal}$ as the number of normal cases, then the number of comparisons, $C_{comparison}$, is given by:

$$C_{comparison} = C_{Lesion} \times C_{Normal}$$

Therefore, the JAFROC FOM $\theta$ is given by:

$$\theta = \frac{1}{C_{Lesion} \times C_{Normal}} \sum_{i=1}^{C_{Normal}} \sum_{j=1}^{C_{Lesion}} \psi(x_j, y_i)$$

where $x_j$ is the rating of the LL for lesion $j$ (taken to be lower than any NL rating if lesion $j$ was never marked) and $y_i$ is the highest NL rating on normal subject $i$.
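To make the double sum concrete, here is a minimal Python sketch of this FOM under the assumptions just stated. The names ll_ratings and max_nl_per_normal and all values are illustrative, with unmarked lesions given a rating of 0:

```python
def psi(x, y):
    # Wilcoxon-style score: full credit when the LL outrates the NL,
    # half credit for ties, none otherwise.
    return 1.0 if x > y else (0.5 if x == y else 0.0)

def jafroc_fom(ll_ratings, max_nl_per_normal):
    """Trapezoidal JAFROC FOM: mean of psi over all LL/normal-case pairs.

    ll_ratings: one rating per lesion (0.0 if the lesion was never marked).
    max_nl_per_normal: highest NL rating on each normal case.
    """
    c_lesion, c_normal = len(ll_ratings), len(max_nl_per_normal)
    total = sum(psi(x, y) for x in ll_ratings for y in max_nl_per_normal)
    return total / (c_lesion * c_normal)

# Four lesions (one never marked) and three normal cases.
theta = jafroc_fom([0.9, 0.8, 0.0, 0.7], [0.6, 0.2, 0.4])
print(f"JAFROC FOM: {theta:.3f}")  # 9 / 12 = 0.750 for these toy values
```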


Other important measures for localization performance:

In addition to FROC and AFROC, a few other methods have been proposed to measure the performance of localisation tasks. These are as follows:

  • Localization ROC (LROC): The observer is prompted to locate and rate only the single most suspicious region in the subject, unlike in FROC, where the observer is free to report as many mark-rating pairs as they suspect.
  • Region-Of-Interest ROC (ROI-ROC): The observer splits the subject into several regions based on suspicion, referred to as regions of interest (ROIs), and assigns each ROI a ROC rating. However, this is ineffective in scenarios where there is no uniform, consistent strategy for partitioning subjects into regions of interest.

Conclusion:

In conclusion, the standard ROC approach evaluates the image-level performance of the model, i.e., whether a lesion is present or not. The FROC approach, on the other hand, provides information on the location and accuracy of multiple lesions within the image. Although the FROC paradigm is sometimes considered more insightful than, or superior to, the standard ROC approach, the right choice depends on the problem at hand. Both ROC and FROC are commonly used to answer different research questions: sometimes ROC is the best-suited approach and sometimes FROC is the appropriate choice.
