When You Have Counts, Expected Value Is All You Need

Are 10 and 11 significantly different from each other? What about 10,000 and 11,000?

Comparing raw counts is unusual in the world of experimentation, where we usually compare differences in means or differences in proportions. This nudges us to define metrics in one of those two forms, either as an average or as a proportion, and we have some pretty good reasons for doing so. First, a value normalized to the user level (or any other randomization unit) is much easier to compare, e.g., average revenue per user (a mean-based metric) or the percentage of active users who made a purchase (a proportion-based metric). Second, the normal distribution and the z-statistic are well defined for this kind of comparison (thanks to the Central Limit Theorem).

But what if we would like to compare raw counts, such as the number of users assigned to the different variants? How can we figure out whether the difference we see is actually significant and not just due to random chance?

When does comparing counts make sense

Comparing counts makes sense in the following scenarios:

The metric we'd like to compare is a top-of-the-funnel metric (there's nothing to normalize against)
We are comparing invariant metrics

Top-of-the-funnel metric

Top-of-the-funnel metrics are typically counts of the randomization units themselves, e.g., the number of users assigned to each variant or the number of qualified users assigned to each variant. Any comparison at this level is therefore a validation of the randomization process itself.

We anticipate that the number of units assigned to each variant should be equal. In other words, the expected number of units in each bucket is 50% of the total units (assuming an even 50-50 split), and the units in one bucket are not expected to differ significantly from those in the other. Tests for differences in these kinds of metrics are known as Sample Ratio Mismatch (SRM) tests, and their significance level is kept at a very low value (very low tolerance for Type I error).

Invariant metrics

Invariant metrics are very similar to top-of-the-funnel metrics in that we expect the numbers in both variants to be almost the same; they just need not be top-of-the-funnel metrics. For example, we expect the number of units in each level of a covariate (a factor which we assume is not impacted by the variant or the new feature being tested) to be the same across variants.

Tests to compare counts

In such cases, we can use either the binomial test or the chi-squared test. The underlying distributions of the two are slightly different, so which should we use, and when?

Binomial test

The binomial test assesses whether the observed number of successes in a fixed number of trials deviates significantly from what would be expected by chance.

The key elements are the number of trials and the probability of success. Here, a "success" is a unit landing in one particular group, so the probability of success is typically assumed to be 50% (assuming an equal split). The "number of trials" refers to the total count of both the control and treatment groups in your experiment.

The crux of the binomial test lies in comparing the observed number of successes in one of the groups (say, the treatment group) with what we'd expect if the probability of success were truly 50%. If the observed number of successes differs significantly from the expected count, it suggests that the 50% assumption may not hold (or that some kind of Sample Ratio Mismatch has occurred, warranting further investigation).

Figure: The p-value of the binomial test indicates that the counts are indeed significantly different.
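As a minimal sketch of the binomial test described above, assuming SciPy is available and using the article's example counts of 10,000 and 11,000 (the variable names are illustrative, not the author's original code):

```python
from scipy.stats import binomtest

# Counts from the article's example; which group is "control" is an assumption.
control, treatment = 10_000, 11_000
n_trials = control + treatment  # total units across both variants

# H0: each unit lands in treatment with probability 0.5 (even split).
result = binomtest(treatment, n=n_trials, p=0.5, alternative="two-sided")
print(f"binomial test p-value: {result.pvalue:.3g}")
```

A p-value far below the chosen significance level would point to a mismatch in the split.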

The Chi-squared test

The chi-squared test assesses whether the observed distribution varies significantly from an expected distribution.

To apply the chi-squared test, we need the observed counts (the counts of treatment and control) and the expected counts. In our case, we are expecting treatment and control to have equal counts, hence the expected values can be calculated as the average of both values.

By comparing the observed and expected counts, the chi-squared test quantifies the extent of the discrepancy between the observed and expected distributions. If this discrepancy is large enough, the test will indicate that the observed counts are unlikely to have come from the expected distribution, i.e., that the difference between the groups is statistically significant.

Figure: The p-value of the chi-squared test also indicates that the counts are significantly different.
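The same comparison can be sketched with a chi-squared goodness-of-fit test. This assumes SciPy's `chisquare` function and takes the expected count for each group as the average of the two observed counts, as described above; the manual statistic is included only to show what the test computes.

```python
from scipy.stats import chisquare

observed = [10_000, 11_000]  # control and treatment counts from the example

# Under an even split, each group is expected to hold the average count.
expected_each = sum(observed) / len(observed)
expected = [expected_each] * len(observed)

# Test statistic by hand: sum over groups of (observed - expected)^2 / expected.
stat_manual = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The same statistic, plus its p-value (1 degree of freedom), from SciPy.
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared statistic: {stat:.3f}, p-value: {p_value:.3g}")
```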

When we are dealing with large numbers, both tests output almost identical p-values, even though their underlying distributions are different. In the examples above, we compared two numbers, 10,000 and 11,000, and the resulting p-values were close to 0. When the counts are the same, the resulting p-values are exactly 1.0, as expected.

Figure: For equal counts, the p-value is 1.0 for both tests.

The biggest difference shows up when the counts are very small. Let's test this for counts such as 10 and 11.


It seems that the binomial test is better suited to smaller counts: it is an exact test, and 10 and 11 could feasibly arise from the same underlying distribution. The chi-squared test is an approximation, conventionally considered applicable when the expected counts are greater than 5, which technically holds here.
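To see the small-count behaviour side by side, here is a hedged sketch that runs both tests on 10 vs 11 and on 10,000 vs 11,000. The helper `compare_counts` is a hypothetical name introduced for illustration, and SciPy is assumed.

```python
from scipy.stats import binomtest, chisquare

def compare_counts(a, b):
    """Return (binomial p-value, chi-squared p-value) for an expected 50-50 split."""
    binom_p = binomtest(a, n=a + b, p=0.5).pvalue
    # With no f_exp given, chisquare assumes equally likely categories,
    # i.e. the expected count is the mean of the observed counts.
    chi2_p = chisquare([a, b]).pvalue
    return binom_p, chi2_p

for a, b in [(10, 11), (10_000, 11_000)]:
    binom_p, chi2_p = compare_counts(a, b)
    print(f"{a:>6} vs {b:>6}: binomial p={binom_p:.3g}, chi-squared p={chi2_p:.3g}")
```

For the small pair, neither p-value comes anywhere near a conventional significance threshold, and the two tests disagree more visibly than they do for the large pair.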

Summary

In experimentation settings, we usually deal with large numbers. In those contexts, we can use either the binomial test or the chi-squared test to detect differences in counts, with almost identical results. The tests are also easy to implement and scale (think about running them for each day of the experiment to detect differences in the number of units assigned to different buckets or covariates). All you need to figure out is the number you'd expect to see in each bucket, which is usually the average of both buckets.
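As an illustration of running such a check for each day of an experiment, the sketch below flags days whose split looks off. The daily counts are made-up numbers, `srm_flags` is a hypothetical helper, and the threshold of 0.001 is an assumed stand-in for the "very low" significance level mentioned earlier, not a value from the article.

```python
from scipy.stats import chisquare

ALPHA = 0.001  # assumed strict threshold for SRM checks (illustrative)

# Hypothetical (control, treatment) assignment counts per day.
daily_counts = {
    "day_1": (5_020, 4_980),
    "day_2": (4_950, 5_050),
    "day_3": (5_000, 5_600),
}

def srm_flags(counts_by_day, alpha=ALPHA):
    """Map each day to True if its counts deviate significantly from an even split."""
    flags = {}
    for day, counts in counts_by_day.items():
        # Expected count per bucket defaults to the average of the observed counts.
        p_value = chisquare(list(counts)).pvalue
        flags[day] = bool(p_value < alpha)
    return flags

flags = srm_flags(daily_counts)
for day, flagged in flags.items():
    print(day, "possible SRM" if flagged else "ok")
```

The same loop could run over covariate levels instead of days, since only the expected count per bucket is needed.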

So, remember this: when you're working with counts, all you need is the expected count to run these tests and understand the health of your experiments.

Note: Underlying test distributions, statistics, and core assumptions were validated by our stats expert Athulya Ganapathi Kandy 🙏

Originally published on LinkedIn.