Statistical Significance Tests to Assess the Validity of AI Models

-Rohan Kalluraya

May 8

Artificial Intelligence (AI) models are capable of processing unseen data to produce meaningful outputs. However, evaluating a model's performance on every possible data point in the world is impossible, so statistical methods are employed to infer the real-world performance of AI models. Given multiple AI models, how do we determine which one performs better on unseen data? When radiologists use AI models in their workflow, do they detect diseases more accurately? The answers to these questions lie in statistical significance tests, which let us draw conclusions about a population from a sample of that population and help us rule out anomalous or chance results.

Every statistical test begins with a null hypothesis: a specific statement about a population parameter that represents a "default" situation, e.g., no effect observed. The alternative hypothesis covers all other feasible values of the population parameter besides the value stated in the null hypothesis. For instance, consider a scenario where we compare the performance of two AI models. The null hypothesis in this case would be that the median performance (denoted by P) of Model 1 equals that of Model 2, while the alternative hypothesis would be that the two performances are not equal. In simple terms, the hypotheses can be stated as follows:

Null Hypothesis: P1 = P2
Alternative Hypothesis: P1 ≠ P2, i.e. P1 < P2 or P1 > P2
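
To make this concrete, here is a minimal sketch (an assumed illustration, not code from this article) of how such a test could be run in Python. It assumes we have a set of performance scores for each model and uses SciPy's Mann-Whitney U test, one common nonparametric choice for comparing two sets of scores; the score arrays below are synthetic, hypothetical stand-ins.

```python
# A minimal sketch (assumed, not from the article): comparing two AI
# models' performance scores with a two-sided Mann-Whitney U test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical performance scores for each model (synthetic stand-ins,
# e.g., accuracy measured over independent evaluation runs).
model1_scores = rng.normal(loc=0.82, scale=0.05, size=100)
model2_scores = rng.normal(loc=0.80, scale=0.05, size=100)

# Null hypothesis: the two score distributions (and hence their medians,
# under a location-shift assumption) are equal.
# Alternative hypothesis: they differ in either direction (two-sided).
statistic, p_value = stats.mannwhitneyu(
    model1_scores, model2_scores, alternative="two-sided"
)

alpha = 0.05  # conventional significance level
print(f"U statistic = {statistic:.1f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject the null hypothesis: the models' performances differ.")
else:
    print("Fail to reject the null: no significant difference detected.")
```

One design note: the Mann-Whitney U test assumes the two score samples are independent. If the two models were scored on the same evaluation cases, a paired test such as scipy.stats.wilcoxon would be the more appropriate analogue.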