10 June, 2024

Artificial Intelligence (AI) models are capable of processing unseen data to produce meaningful outputs. However, evaluating the performance of these models on all possible data points in the world is not possible. Therefore, statistical methods are employed to infer the real-world performance of AI models. Given multiple AI models, how do we compare which model performs better than others on unseen data? When radiologists use AI models in their workflow, do they detect diseases more accurately? The answers to these questions lie in statistical significance tests. Statistical tests help us make conclusions about the population using information from a sample of that population and help us rule out anomalous or chance results.

Every statistical test begins with a null hypothesis, a specific statement about a population parameter that represents a “default” situation, e.g. no effects observed. The alternative hypothesis includes all other feasible values for the population parameter besides the value stated in the null hypothesis. For instance, let’s consider a scenario where we compare the performance of two AI models. The null hypothesis, in this case, would be that the median performance level (denoted by *P*) of Model 1 is equal to that of Model 2, while the alternative hypothesis would be that the performances of the models are not equal. In simple terms, the hypotheses can be stated as follows:

Null Hypothesis: *P1=P2*

Alternative Hypothesis: *P1≠ P2*, i.e. *P1 < P2* or *P1 > P2*

These hypotheses form the basis of statistical significance tests, which allow us to draw conclusions and make informed decisions about the performance of AI models in various scenarios.

There are two types of statistical tests:

**Parametric tests**: These tests assume that the population data follows a normal distribution. When that assumption holds, parametric tests are in general more powerful (they need a smaller sample size to detect an effect) than non-parametric tests **(Chin, 2008)**. However, when the assumption is violated, these tests become more prone to Type I error (where an effect is detected when it is not present). Some popular parametric tests are as follows:

One-Sample t/Z-Test

Unpaired Two-Sample t/Z-Test

Paired Two-Sample t/Z-Test

Analysis of Variance (ANOVA)

**Non-parametric tests**: These tests do not make assumptions about the distribution of the population data. As these tests, in general, require a larger sample size than parametric tests to reject the null hypothesis, they are more prone to Type II error (where an effect is not detected even though it exists). A few examples of non-parametric tests are:

One-Sample Sign Test

Wilcoxon Signed Rank Test

Mann-Whitney U Test

Let us look at each of these tests in more detail.

**Figure 1.** A simple algorithm to determine which statistical test should be performed depending on the data.
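Figure 1 itself is not reproduced in this text, so the sketch below is only a rough approximation of such a decision procedure, assembled from the test descriptions in this article. The function name `suggest_test` and its three parameters are hypothetical, chosen for illustration.

```python
def suggest_test(n_groups: int, paired: bool = False, normal: bool = True) -> str:
    """Suggest one of the tests discussed in this article.

    n_groups: number of samples/groups being compared (assumed >= 1)
    paired:   True if the two samples have a natural pairing
    normal:   True if the normality assumption is considered reasonable
    """
    if n_groups == 1:
        return "One-sample t/Z-test" if normal else "One-sample sign test"
    if n_groups == 2:
        if paired:
            return "Matched pair t/Z-test" if normal else "Wilcoxon signed rank test"
        return "Unpaired two-sample t/Z-test" if normal else "Mann-Whitney U test"
    if n_groups > 2 and normal and not paired:
        return "ANOVA"
    # Other designs (e.g. more than two non-normal groups) are outside this article.
    return "Not covered in this article"


# Paired readings whose differences may not be normally distributed
print(suggest_test(2, paired=True, normal=False))
```

In practice the `normal` flag would come from inspecting the data (plots or a normality test) rather than being passed in by hand.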

● **One-Sample t/Z-Test:** This test compares the sample mean or proportion with the population mean or proportion proposed in the null hypothesis (Whitlock & Schluter, 1989, Analysis of Biological Data, 2nd edition). The Z-statistic (and Z-test) is used when the population is normally distributed and the population standard deviation is known, or when the sample size is greater than 30. The t-statistic (and t-test) is used when the population is normally distributed but its variance is unknown and has to be estimated from the sample (Park, Hun Myoung 2009).

● **Unpaired two-sample t/Z-Test:** It is a test to compare two different population means or population proportions. Independent samples from two populations are taken and sample means or proportions are calculated.

● **Matched pair t/Z-Test:** This test is used when the two samples have a natural pairing between them. For example, it is used to elicit the effects of a treatment by comparing samples before and after treatment. The mean of the differences before and after treatment is calculated. The null hypothesis would be that the treatment has no effect on the population. If the population standard deviation of the paired differences is known, the Z-test is used; otherwise, the t-test is used.

● **ANOVA:** ANOVA is a generalization of the unpaired two-sample t-test for means. It provides a statistical test of whether two or more population means are equal. The null hypothesis of ANOVA is that the population means are the same for all treatments/populations. Rejecting the null hypothesis in ANOVA is evidence that the mean of at least one group is different from the others. The key insight of ANOVA is that we can estimate how much variation among group means ought to be present from sampling error alone if the null hypothesis is true, and compare it with the variation among group means that is actually observed.
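To make the ANOVA idea concrete, here is a minimal sketch of the one-way F-statistic using only the Python standard library. The function name `anova_f` and the three small groups of scores are made up for illustration. Turning F into a p-value requires the F-distribution, which is left out here.

```python
from statistics import mean


def anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three groups (illustrative values only)
scores = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
]
f_stat, df_b, df_w = anova_f(scores)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F means the group means differ by more than the within-group scatter would explain through sampling error alone.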

**Figure 2.** Types of t-tests. The one sample t-test compares the sample mean with the population mean proposed in the null hypothesis. The independent samples t-test compares the means of two populations. The paired samples t-test is used to identify differences between groups containing matching pairs, e.g. a cohort before and after treatment. Image source: Datatab.net
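The three t-statistics can be sketched in a few lines of standard-library Python, along with a one-sample Z-test for the case where the population standard deviation is known. All function names are illustrative. The unpaired version uses Welch's form, which does not assume equal variances; that is an assumption made here, since the article does not say which variant is meant. Only the Z-test returns a p-value, because the t-distribution is not available in the standard library.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev, variance


def one_sample_t(sample, mu0):
    """t-statistic for H0: population mean == mu0 (df = n - 1)."""
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))


def unpaired_t(a, b):
    """Welch's t-statistic for two independent samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def paired_t(before, after):
    """Paired t-statistic: a one-sample t-test on the differences."""
    diffs = [y - x for x, y in zip(before, after)]
    return one_sample_t(diffs, 0.0)


def one_sample_z(sample, mu0, sigma):
    """Z-statistic and two-sided p-value when sigma is known."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Illustrative numbers only
before = [1.0, 2.0, 3.0, 4.0]
after = [2.0, 4.0, 5.0, 8.0]
print("paired t:", paired_t(before, after))
print("unpaired t:", unpaired_t(after, before))
print("one-sample z, p:", one_sample_z(after, 3.0, 2.0))
```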

● **One-Sample Sign Test:** This test is used to test whether the median value for a single data set is equal to a hypothesized value. Each value in the sample above the hypothesized median is given a positive sign and each value below it is given a negative sign. The one-sample sign test compares the number of negative signs with the number of positive signs. The null hypothesis is that the population median equals the hypothesized value, in which case positive and negative signs are equally likely.

● **Wilcoxon Signed Rank Test:** Analogous to the matched pair t-test, this test is used to compare the effects of a treatment on a single population when the differences between the matched pairs are not normally distributed. To perform this test, the matched pair differences are ranked by absolute value, and each rank is given the sign of its difference. The test statistic is calculated by summing the ranks of the positive differences and of the negative differences separately and taking the smaller of the two sums. The test statistic is then compared to a critical value to determine whether the difference is significant.

● **Mann-Whitney U Test:** This test is used to compare two different population medians. Independent samples from the two populations are taken, and the observations of both samples are pooled and ranked together. The sum of the ranks of each sample is calculated and converted to a U value (the rank sum minus n(n+1)/2, where n is the size of that sample); the smaller of the two U values is the test statistic.
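As a rough illustration of two of these rank- and sign-based procedures, the sketch below implements an exact two-sided sign test and the Mann-Whitney U statistic with the standard library. The function names are hypothetical. Values equal to the hypothesized median are dropped and tied observations share the average of their ranks; both are common conventions adopted here as assumptions.

```python
from math import comb


def sign_test(sample, median0):
    """Exact two-sided sign test. Returns (n_positive, n_negative, p_value)."""
    pos = sum(1 for x in sample if x > median0)
    neg = sum(1 for x in sample if x < median0)
    n = pos + neg  # values equal to median0 are dropped
    k = min(pos, neg)
    # Under H0 each sign is + or - with probability 1/2 (binomial tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return pos, neg, min(1.0, 2 * tail)


def average_ranks(values):
    """Rank values from 1; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for pos_in_order in range(i, j + 1):
            ranks[order[pos_in_order]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Smaller of the two U values for independent samples a and b."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)


# Illustrative numbers only
print(sign_test([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0))
print(mann_whitney_u([1, 3, 5], [2, 4, 6]))
```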

Let’s look at a hypothetical case study, similar to the ones we often encounter at DeepTek. Consider an AI model to identify the presence of lung nodules in Chest X-ray (CXR) scans. We need to carry out a statistical significance test to tell whether radiologists aided by the AI model can identify nodules better than when they are not aided by it. We gave ten radiologists a sample of 4500 CXR scans collected from multiple centers. They were asked to identify nodules in each scan with and without AI assistance. Radiologist performance was measured using the Area under the Receiver Operating Characteristic (AUROC) curve, one of the most common metrics for AI models with binary outputs.

ROC curves were constructed for the unaided and aided readings of each radiologist, and the corresponding AUROC values were calculated. The results are shown below.
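How the AUROC values were computed in the study is not described here. One standard way is the pairwise formulation sketched below: the AUROC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name, labels and scores are made up for illustration.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative cases.

    labels: 1 for nodule present, 0 for nodule absent
    scores: reader or model confidence that a nodule is present
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical readings for six scans (not study data)
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(f"AUROC = {auroc(labels, scores):.3f}")
```

This quadratic-time version is fine for a small example. For thousands of scans a rank-based implementation would be the more usual choice.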

**Table 1.** An example list of 10 pairs of AUROC values based on radiologists detecting nodules in chest X-rays without the AI aid (unaided) and with aid (aided).

This data is a sample input for a statistical significance test to be carried out to see whether radiologists aided with the model consistently performed better than when they were unaided. As we have paired data and the differences between each matched pair may not be normally distributed, we performed the Wilcoxon Signed Rank test on the data to assess whether the AUROC values were consistently higher for radiologists aided with the AI than the AUROC values when they were unaided.

**Table 2.** The differences between unaided and aided AUROC values in Table 1 are calculated, signed and ranked.

The sum of the ranks of all positive differences is 55. Since there are no values with negative differences, the sum of the ranks of all negative differences is 0. Therefore, the test statistic is 0.
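The signing and ranking steps can be sketched as follows. The AUROC values in this snippet are placeholders invented so that every aided value exceeds its unaided counterpart; they are not the values from Table 1. The function name is likewise illustrative. Zero differences are dropped and tied magnitudes share an average rank, which are common conventions assumed here.

```python
def wilcoxon_signed_rank(unaided, aided):
    """Return (W_plus, W_minus, W) where W = min(W_plus, W_minus)."""
    # Round to suppress floating-point noise before comparing magnitudes.
    diffs = [round(a - u, 6) for u, a in zip(unaided, aided)]
    diffs = [d for d in diffs if d != 0]  # zero differences are dropped
    magnitudes = sorted(abs(d) for d in diffs)

    def rank_of(value):
        # Average rank when several differences share the same magnitude
        first = magnitudes.index(value) + 1
        count = magnitudes.count(value)
        return first + (count - 1) / 2

    w_plus = sum(rank_of(abs(d)) for d in diffs if d > 0)
    w_minus = sum(rank_of(abs(d)) for d in diffs if d < 0)
    return w_plus, w_minus, min(w_plus, w_minus)


# Placeholder AUROC values for ten readers (NOT the values from Table 1)
unaided = [0.80, 0.78, 0.82, 0.75, 0.79, 0.81, 0.77, 0.83, 0.76, 0.74]
aided = [0.81, 0.80, 0.85, 0.79, 0.84, 0.87, 0.84, 0.91, 0.85, 0.84]

w_plus, w_minus, w = wilcoxon_signed_rank(unaided, aided)
print(w_plus, w_minus, w)
```

With ten non-zero differences that all have the same sign, the ranks 1 through 10 sum to 55 on one side and 0 on the other, whatever the individual magnitudes are.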

**Table 3.** Critical values of Wilcoxon Signed-Rank Test. Source: Openpress of The University of Saskatchewan, Canada.

As we can see, we have evidence at the 0.005 significance level that the AUROC is consistently higher for radiologists aided with the AI than those unaided, since our test statistic (0) is less than the critical value when n=10.
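As a cross-check on the table lookup, the exact null distribution of the signed-rank statistic can be enumerated for small n. Under the null hypothesis each of the n ranks is equally likely to carry a positive or a negative sign, so there are 2^n equally likely sign patterns. The sketch below counts how many of them give a rank sum at or below the observed value. The function name is illustrative, and it assumes no ties and no zero differences.

```python
def wilcoxon_exact_p(w, n):
    """One-sided exact P(rank sum <= w) under H0 for n untied differences."""
    w = int(w)
    max_sum = n * (n + 1) // 2
    # counts[s] = number of sign patterns whose chosen ranks sum to s
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Go downwards so that each rank is used at most once
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return sum(counts[: w + 1]) / 2 ** n


p_one_sided = wilcoxon_exact_p(0, 10)
print(f"one-sided p = {p_one_sided:.6f}, two-sided p = {2 * p_one_sided:.6f}")
```

For a test statistic of 0 with n=10, only one of the 1024 sign patterns is as extreme, so the one-sided p-value is 1/1024, which is below 0.005.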

In conclusion, when it comes to hypothesis testing, there is no one test that is fit for all situations. It is crucial to carefully evaluate the experimental setup, data characteristics and distribution, and the validity of assumptions, if any. Incorrect usage of statistical inference can lead to wrong conclusions and drastic consequences, ranging from overlooking life-saving remedies to releasing harmful products into the market. In the context of AI, the role of statistics is more important than ever in discerning between genuinely capable algorithms and the algorithms that just pretend to know. By employing statistical methods effectively, we can ensure that AI models are rigorously evaluated and their true potential is accurately assessed, leading to trustworthy AI solutions.

T. G. Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms," in Neural Computation, vol. 10, no. 7, pp. 1895-1923, 1 Oct. 1998, doi: 10.1162/089976698300017197.

Richard Chin, Bruce Y. Lee, in Principles and Practice of Clinical Trial Medicine, 2008

Hun Myoung Park, Comparing Group Means: T-tests and One-way ANOVA Using Stata, SAS, R, and SPSS, 2009

Zou, K. H., Fielding, J. R., Silv