Does it Matter How We Measure the Significance of Test Results?

July 9, 2014

David DeFranza

Imagine two colleagues at a coffee shop during a mid-morning break. Both order large coffees and move to the condiment counter to add milk and sugar. “I like it when they do this before pouring the coffee,” one comments to the other. “It just tastes better.” Incredulous, the other challenges the first to a taste test to determine whether this preference bears any significance.

Back at the office, eight cups of coffee are prepared: four with sugar added before the coffee is poured and four with sugar added after. The cups are presented in random order, and the coffee connoisseur tastes each one individually, then announces whether sugar was added before or after the coffee. A certain number of correct guesses is expected from random chance alone. But if the taste tester performs well (say, by guessing all eight cups correctly), his astonished officemates will declare his palate exceptional and his performance significantly better than random chance.

This hypothetical experiment offers an important illustration for anyone analyzing A/B test results. Office antics aside, this coffee tasting is an example of a one-tailed test of significance. In this case, one outcome, the ability to tell the coffees apart, was considered unlikely, and it is the only outcome of interest. Because of this, the significance of the result was measured in one direction only: how much better the taste tester’s guesses were than pure chance alone.
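To make the one-tailed idea concrete, here is a minimal sketch of the calculation behind the coffee tasting. It assumes each guess is an independent coin flip (a 50% chance of being right), which is a simplification not stated in the story; if the taster knows there are exactly four cups of each kind, the exact calculation differs. The function name and parameters are illustrative only.

```python
from math import comb

def one_tailed_p_value(correct: int, cups: int = 8, p_guess: float = 0.5) -> float:
    """Probability of at least `correct` right answers out of `cups` by guessing alone.

    Assumes independent guesses, each right with probability `p_guess`.
    Only the "better than chance" direction is counted, hence one-tailed.
    """
    return sum(
        comb(cups, k) * p_guess**k * (1 - p_guess) ** (cups - k)
        for k in range(correct, cups + 1)
    )

for correct in (8, 7, 6):
    print(f"{correct} of 8 correct: p = {one_tailed_p_value(correct):.4f}")
# 8 of 8 correct: p = 0.0039
# 7 of 8 correct: p = 0.0352
# 6 of 8 correct: p = 0.1445
```

Under these assumptions, a perfect score would happen by luck fewer than four times in a thousand, while six of eight is not unusual at all.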

If we think of this example in terms of testing, it does make sense. Typically, there is a control—the “A” of A/B testing—and a variation—the “B.” Ultimately, we’re interested in how much better the variation is than the control. When the measured improvement is extreme enough, the platform announces our test has reached a significant lift and we can call it a win.

But there is a problem with the method in this application. When running a test, there is always the possibility a variation will underperform the control. To account for both wins and losses, a different significance test is needed: the two-tailed test. A two-tailed test measures not only the degree of improvement over random chance (if it exists) but any difference from random chance, in either direction. This provides a more accurate and detailed picture of the performance of an A/B test.
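The difference between the two approaches can be sketched with a pooled two-proportion z-test, one common way to compare conversion rates. This is an illustration under assumptions, not the method any particular testing platform uses; the function names and the visitor and conversion counts are made up for the example.

```python
from math import erfc, sqrt

def normal_upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))

def z_test_p_values(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, one_tailed_p, two_tailed_p) for variation B versus control A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    one_tailed = normal_upper_tail(z)           # asks only: is B better than A?
    two_tailed = 2 * normal_upper_tail(abs(z))  # asks: is B different from A?
    return z, one_tailed, two_tailed

# Hypothetical data: 10,000 visitors per arm.
for label, conv_b in (("B wins", 560), ("B loses", 440)):
    z, p_one, p_two = z_test_p_values(500, 10_000, conv_b, 10_000)
    print(f"{label}: z = {z:.2f}, one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```

With these made-up counts, the winning variation clears a 0.05 threshold on the one-tailed test but not on the two-tailed test, and the losing variation is flagged only by the two-tailed test. The same data can therefore be reported as a win on one platform and as inconclusive on another.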

Currently, some testing platforms on the market use one-tailed significance and some use two-tailed significance. To ensure tests are analyzed accurately, it’s important to keep the following guidelines in mind:

1. Take Control of Analysis

All testing platforms provide built-in analysis tools. This is convenient and can be useful for quick checks on the status of a running test. But it’s the responsibility of testing managers to take control of analysis and data governance within their own business. Instead of relying on reports from the tools themselves, use the raw data to perform independent analysis, utilizing the statistical methods most relevant to your own business. This requires more work—and expertise—but accurate reporting is essential for developing trust in testing and making data-driven business decisions.

2. Go Beyond Significance

Testing to significance is important, but it is not the only measure necessary to ensure accuracy. Tests must run for an appropriate period to capture shifts in traffic and behavior, which often requires extending the test days or even weeks beyond the point of mathematical significance. Calculating confidence intervals in addition to significance provides a measure of the variability of the result, helping to place the impact of a win in more accurate context.
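As a rough illustration of the confidence interval idea, the sketch below computes a 95% interval for the difference in conversion rate between a variation and a control using the normal approximation. The approximation, the function name, and the sample counts are all assumptions chosen for the example; other interval methods exist and may suit small samples better.

```python
from math import sqrt

def difference_confidence_interval(conv_a: int, n_a: int,
                                   conv_b: int, n_b: int,
                                   z_crit: float = 1.96):
    """Approximate confidence interval for (rate_b - rate_a).

    z_crit = 1.96 corresponds to a two-sided 95% interval
    under the normal approximation.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    diff = rate_b - rate_a
    # Unpooled standard error: each arm contributes its own variance.
    std_err = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    margin = z_crit * std_err
    return diff - margin, diff + margin

# Hypothetical data: 500 of 10,000 convert on A, 560 of 10,000 on B.
low, high = difference_confidence_interval(500, 10_000, 560, 10_000)
print(f"Difference in conversion rate: {low:+.4f} to {high:+.4f}")
```

For these made-up counts the interval spans zero, so the observed lift of 0.6 percentage points is compatible with anything from a slight loss to a gain of about 1.2 points. Reporting that range gives decision makers more context than a single "significant" label.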

3. Embrace Change

If our coffee taste tester guessed eight of eight cups correctly on his first try, his colleagues would be amazed. But if the experiment is repeated a week later, he may only guess six of eight correctly. A week after that, he may only guess three of eight. This trend, in which an extreme result is slowly eroded over subsequent trials, is known as regression to the mean. This slow decrease in lift can happen with winning A/B tests, too. But when it comes to website testing, this effect is more likely a sign that the market, business environment, or customer has changed. The only way to address these changes is to keep testing, keep iterating, and keep learning.
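Regression to the mean can be seen in a small simulation. The sketch below assumes a large group of tasters with no real ability, each guessing eight cups at random, then retests only those who scored seven or more the first time. The group size, threshold, and seed are arbitrary choices for illustration.

```python
import random

def simulate_retest(n_tasters: int = 10_000, cups: int = 8,
                    threshold: int = 7, seed: int = 0):
    """Return (mean first score, mean retest score) for the top first-round scorers.

    Every taster guesses at random, so any high first score is pure luck.
    """
    rng = random.Random(seed)

    def score() -> int:
        # Number of correct guesses when each guess is a coin flip.
        return sum(rng.random() < 0.5 for _ in range(cups))

    first_scores, retest_scores = [], []
    for _ in range(n_tasters):
        first = score()
        if first >= threshold:
            first_scores.append(first)
            retest_scores.append(score())

    return (sum(first_scores) / len(first_scores),
            sum(retest_scores) / len(retest_scores))

first_mean, retest_mean = simulate_retest()
print(f"Top scorers, first round: {first_mean:.2f} of 8")
print(f"Same tasters, retest:     {retest_mean:.2f} of 8")
```

The selected tasters average at least seven correct in the first round by construction, yet on the retest their average falls back toward four, the score expected from guessing. A winning test chosen because of an unusually high early lift can behave the same way.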

4. Test to Learn

Everyone wants tests to win, and at the end of the day, a program’s performance may be evaluated by the win rate. But the ultimate goal should always be to learn. Whether a test wins or loses, it can provide benefit through new insight into customer behavior, added depth to the knowledge of user preferences, or the avoidance of a risky strategy that would not have succeeded. It’s important to ensure any statistical methods used support this drive to learn.

On the surface, A/B testing appears to be simple. It’s just a basic experiment, comparing one thing to another. But in practice, these simple experiments expand rapidly into complex tests. Generating hypotheses, implementing a rigorous experimental design, and applying the appropriate statistical methods to analysis are critical to uncovering insights and, ultimately, winning lifts.