One of the most common claims in testing is also one of the greatest misconceptions. “We are 95 percent confident,” the statement often starts, “that the lift was 10 percent.” The problem isn’t that a 10 percent lift wasn’t observed. And it’s not that the confidence claim is wrong per se. Rather, it’s that the statement confuses the relationship between the two: the confidence applies to whether the challenger beat the control, not to the size of the lift. Put more precisely, the hypothetical analyst quoted above should have reported, “We are 95 percent confident this result of 10 percent lift demonstrates the challenger is performing better than the control.”
The mistake is easy to understand. After all, every optimization team would like to know what to expect when they implement their winning variation. Grasping at a specific winning margin is convenient and appealing. However, it’s important to remember that the result came from samples, not from the entire population, so the observed margin is only an estimate of what full implementation will deliver.
This nuance is far from trivial. In fact, understanding confidence intervals is an essential step in transforming results into insights. Most testing programs use “95 percent confidence” as the benchmark for a winning test. But statistical significance must be translated effectively to communicate practical importance. Therefore, instead of simply reporting 95 percent confidence in a test’s result, analysts should also report how that assessment informs expectations going forward. It’s not useful, for example, to report confidence in a lift when it’s probably so small that the cost of implementation would be greater than the benefit of the return. One of the steps we’ve taken at Brooks Bell to address this is to report confidence intervals for lift rates.
A common mistake is to interpret the conversion rate of the control as an absolute, static number, ignoring the variability inherent in any set of binomial trials (success/failure, conversion/non-conversion). Instead, we must ensure that both the control and the challenger are treated as samples. Our method, a type of Monte Carlo simulation and a bootstrapping technique, replicates the entire experiment more than one million times and has the important benefit of aligning statements of confidence with the true probabilities of random occurrence. With a large number of simulated results to analyze, this method can be extended to generate intervals through target shuffling as well.
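To make the idea concrete, here is a minimal Python sketch of resampling both arms of a test. It is an illustration of the general technique, not the actual implementation described above: the function names, the seed, and the default replication count are all assumptions chosen for readability, and the pure-Python loop is far too slow for a million replications on high-traffic data (a vectorized numerical library would be the practical choice there).

```python
import random


def draw_conversions(visitors, rate, rng):
    """Resample one arm: count conversions across `visitors` Bernoulli trials.

    For yes/no outcomes, resampling the observed visitors with replacement
    is equivalent to drawing each visitor as a conversion with probability
    equal to the observed rate.
    """
    return sum(1 for _ in range(visitors) if rng.random() < rate)


def simulate_lifts(ctrl_visitors, ctrl_conversions,
                   chal_visitors, chal_conversions,
                   replications=1000, seed=0):
    """Return a sorted list of relative lifts, one per simulated replication.

    Both arms are redrawn in every replication, so the control rate is
    treated as a sample rather than a fixed number.
    """
    rng = random.Random(seed)
    ctrl_rate = ctrl_conversions / ctrl_visitors
    chal_rate = chal_conversions / chal_visitors
    lifts = []
    for _ in range(replications):
        ctrl = draw_conversions(ctrl_visitors, ctrl_rate, rng)
        chal = draw_conversions(chal_visitors, chal_rate, rng)
        if ctrl == 0:
            continue  # lift is undefined if the resampled control has no conversions
        lifts.append((chal / chal_visitors) / (ctrl / ctrl_visitors) - 1.0)
    lifts.sort()
    return lifts


def percentile_interval(sorted_lifts, level=0.95):
    """Central interval read directly off the sorted simulated lifts."""
    tail = (1.0 - level) / 2.0
    last = len(sorted_lifts) - 1
    lower = sorted_lifts[int(round(tail * last))]
    upper = sorted_lifts[int(round((1.0 - tail) * last))]
    return lower, upper


def share_above_zero(lifts):
    """Fraction of replications in which the challenger beat the control."""
    return sum(1 for lift in lifts if lift > 0) / len(lifts)
```

Calling `simulate_lifts` with the visitor and conversion counts from a test, then passing the result to `percentile_interval` and `share_above_zero`, gives an interval for the lift and a confidence figure that both come from the same set of simulated experiments.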
Take this example from a high-traffic, low-conversion situation:
This is reported as exactly 95 percent confident by many standard tools after observing a 6.25 percent lift. However, those same tools would give a 95 percent confidence interval of -4.2 percent to +17.6 percent. How is a marketing team supposed to interpret this information? Imagine telling the product owner that we are 95 percent confident in our winning test, but the actual change in performance could be anywhere between -4 percent and +18 percent. To be useful, a 95 percent interval should not cross zero if we are truly confident in a lift.
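One plausible way such a mismatch can arise is sketched below in Python. This is an assumption about how a generic tool might work, not a description of any particular product: it uses a textbook normal approximation, computes a one-sided confidence that the challenger beats the control, and separately computes a two-sided interval. The function name and the counts in the usage note are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_approx_summary(ctrl_visitors, ctrl_conversions,
                          chal_visitors, chal_conversions, level=0.95):
    """Summarize a two-arm test with a normal approximation.

    Returns the observed relative lift, a one-sided confidence that the
    challenger rate exceeds the control rate, and a two-sided interval
    for the lift.
    """
    ctrl_rate = ctrl_conversions / ctrl_visitors
    chal_rate = chal_conversions / chal_visitors
    diff = chal_rate - ctrl_rate

    # Unpooled standard error of the difference between two proportions.
    std_err = sqrt(ctrl_rate * (1 - ctrl_rate) / ctrl_visitors
                   + chal_rate * (1 - chal_rate) / chal_visitors)

    std_normal = NormalDist()

    # One-sided: probability mass of the approximation above a zero difference.
    confidence = std_normal.cdf(diff / std_err)

    # Two-sided: split the remaining probability across both tails.
    z = std_normal.inv_cdf(1 - (1 - level) / 2)

    # Dividing by the observed control rate treats that rate as fixed,
    # which is a simplification.
    lower = (diff - z * std_err) / ctrl_rate
    upper = (diff + z * std_err) / ctrl_rate

    return {
        "lift": diff / ctrl_rate,
        "confidence": confidence,
        "interval": (lower, upper),
    }
```

With made-up counts of 1,600 conversions from 100,000 control visitors and 1,700 from 100,000 challenger visitors (a 6.25 percent observed lift), this sketch returns a one-sided confidence above 95 percent while the two-sided 95 percent interval still dips below zero. The two numbers answer different questions, which is why quoting them side by side confuses people.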
Using the same data set, our simulation method looks at more than one million replications and finds that 95 percent of them fall between zero and 12.9 percent lift. This is exactly what one would expect at 95 percent confidence! If you’re exactly 95 percent confident in a change, then your interval should end exactly on zero. That means there’s a 2.5 percent chance you’re below zero (an unexpected loss) and a 2.5 percent chance you’re above 12.9 percent (a bigger-than-expected win), but you’re 95 percent confident that the result indicates a true difference between zero and 12.9 percent.
The best analysis is one that provides the clearest interpretation of events, both historical and anticipated. In this case, we want to outline the most accurate and reasonable expectations for a winning variation after it has been implemented. Modern statistical methods have evolved to give us ever-increasing predictive power with the data we collect through testing, but these methods must be applied carefully and appropriately to create a reliable, practical model capable of informing business decisions.
This post was written by optimization analyst Dave Rose. This year, Dave has focused on translating what confidence means and has worked with the team to get the most accurate interpretation of results.