How To Reach Desired Confidence & Power Levels with Your Experimentation Programs

August 16, 2017

Taylor Wilson

This is the second of a three-part blog series. If you missed the first post, “Using Fixed Time Horizon to Generate Credible Optimization Results,” read it here. The final blog, “How to Understand Minimum Detectable Lift,” will be published next week.

Enterprise businesses continue adopting the concept of data-informed decision-making. They are shifting their organizational cultures to those where more and more decisions are based on facts and insights. At the same time, optimization continues growing in popularity among top brands and is essential for business success in today’s digital world.

Optimization is a data-driven process of continually striving to enhance customers’ experience online and offline, fostering such loyalty that they return time and time again. Optimization removes the guesswork. Coupled with expert insights, data shows what works and what doesn’t. It helps identify customer roadblocks and opportunities for revenue growth.

As mentioned in the first post, not all optimization tests are created equal. In order to achieve statistically reliable results, which in turn help define actionable strategies, tests must be properly designed, conducted and analyzed.

Understanding Confidence
From a frequentist perspective, it is considered best practice to end a test once it has reached the pre-determined sample size, which takes into account desired confidence, desired power, Minimum Detectable Lift (MDL) and data on the specific metric being tested (either response rates for binary metrics or mean and standard deviation for continuous metrics).
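As a rough illustration of how those inputs combine for a binary metric, the sketch below uses the standard normal-approximation formula for a two-sided, two-proportion z-test. The function name, the 5 percent baseline response rate, and the 10 percent relative MDL are assumptions chosen for illustration, not values from this series, and a given testing tool may use a different formula.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_mdl, confidence=0.95, power=0.80):
    """Approximate visitors needed in EACH variation (two-sided z-test, binary metric)."""
    p1 = baseline_rate                        # control response rate
    p2 = baseline_rate * (1 + relative_mdl)   # response rate if the MDL is achieved
    alpha = 1 - confidence                    # tolerated false positive rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: 5% baseline conversion, 10% relative lift to detect.
n = sample_size_per_variation(baseline_rate=0.05, relative_mdl=0.10)
print(f"Visitors needed per variation: {n:,}")
```

With these assumed inputs the requirement comes out at roughly 31,000 visitors per variation, which shows why small lifts on low baseline rates call for long-running tests.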

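Statistical confidence facilitates accurate inference for your population of customers and/or prospects. The confidence level is the complement of the false positive rate you are willing to accept: at 95 percent confidence, a test in which the variation and the baseline (or control element) truly perform the same will declare a difference only about 5 percent of the time. Confidence is often described loosely as the likelihood that an observed difference in conversion is not caused by chance. Strictly speaking, it describes how often the testing procedure errs over many tests; it does not prove that any single result is real.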

The confidence level you target in the test design phase reflects your risk tolerance and controls for Type I errors, or false positives. The higher the confidence level, the less likely you are to see unexpected results when implementing a recommended change. Even at 95 percent confidence, a test in which the variation is truly no better than the control still has a 5 percent chance of reporting a significant difference, which is a false positive. At 80 percent confidence, that chance rises to 20 percent.
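One way to see the false positive rate in action is to simulate A/A tests, where both variations share the same true conversion rate, and count how often a significance test still declares a difference. The sketch below is illustrative only: the 10 percent conversion rate, 1,000 visitors per variation, 2,000 simulated tests, and the pooled two-proportion z-test are all assumptions, not details from this series.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both arms
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_p_values(rate, visitors_per_arm, n_tests, seed=0):
    """Run n_tests simulated A/A tests and return one p-value per test."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        # Both arms draw from the SAME true rate, so any "winner" is a false positive.
        conv_a = sum(rng.random() < rate for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < rate for _ in range(visitors_per_arm))
        p_values.append(
            two_proportion_p_value(conv_a, visitors_per_arm, conv_b, visitors_per_arm)
        )
    return p_values


p_values = simulate_aa_p_values(rate=0.10, visitors_per_arm=1000, n_tests=2000)
false_positive_rates = {}
for confidence in (0.95, 0.80):
    alpha = 1 - confidence
    false_positive_rates[confidence] = sum(p < alpha for p in p_values) / len(p_values)
    print(f"{confidence:.0%} confidence: {false_positive_rates[confidence]:.1%} false positives")
```

The simulated rates should land near 5 percent and 20 percent respectively, with some variation from run to run if the seed is changed.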

There is a tradeoff of speed for certainty. When designing a test, setting a high target for your desired confidence level means you are seeking a low false positive error rate. In other words, you want to run a precise test, and precise tests require more visitors. On the other end of the spectrum, lower confidence levels allow you to test more quickly but will result in more false positives. The confidence level you choose should be guided by the amount of traffic the page receives, as well as the business implications surrounding the risk threshold for that particular test.
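To put rough numbers on that tradeoff, the sketch below estimates test duration at several confidence levels using the normal-approximation sample size formula for a two-sided, two-proportion z-test. Every input is hypothetical: a 5 percent baseline rate, a 10 percent relative lift to detect, 80 percent power, and 4,000 daily visitors split evenly across two variations.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visitors_per_variation(p1, p2, confidence, power=0.80):
    """Normal-approximation sample size per variation for a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


DAILY_VISITORS = 4000            # hypothetical page traffic, both variations combined
BASELINE, CHALLENGER = 0.05, 0.055  # hypothetical rates: 10% relative lift

durations = {}
for confidence in (0.80, 0.90, 0.95, 0.99):
    n = visitors_per_variation(BASELINE, CHALLENGER, confidence)
    durations[confidence] = ceil(2 * n / DAILY_VISITORS)  # two variations share traffic
    print(f"{confidence:.0%} confidence: {n:,} per variation, about {durations[confidence]} days")
```

The estimated duration grows with each step up in confidence, so a low-traffic page may justify a lower confidence level when the cost of a false positive is small.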

Understanding Power
Optimization professionals discuss power less frequently as a design variable for tests, but it is just as important as confidence. Setting an appropriate power level gives your test the requisite chance to reach a desired confidence level.

Power controls Type II errors, otherwise known as false negatives. Put more simply, power is the probability that you will find a significant difference between a challenger and control, should that difference actually exist. As power increases, so does the probability of correctly rejecting the null hypothesis when a real difference is present.
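For a binary metric, power can be approximated directly from the sample size, the baseline rate, and the true lift. The sketch below uses the normal approximation for a two-sided, two-proportion z-test; the function name and the example inputs (5 percent baseline, 10 percent true relative lift) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(baseline_rate, relative_lift, visitors_per_arm, confidence=0.95):
    """Approximate chance of detecting a true lift of the given size."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_bar = (p1 + p2) / 2
    # Standard error if there were no difference (sets the significance threshold).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / visitors_per_arm)
    # Standard error given the true rates (spread of the observed difference).
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / visitors_per_arm)
    # Probability the observed difference clears the threshold in the true direction.
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)


# Hypothetical scenario: 5% baseline, true 10% relative lift, 95% confidence.
for visitors in (5000, 15000, 31234, 60000):
    power = approximate_power(0.05, 0.10, visitors)
    print(f"{visitors:>6,} visitors per arm -> power of about {power:.0%}")
```

Under these assumptions, power climbs steadily with sample size and reaches about 80 percent at roughly 31,000 visitors per arm.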

For standard A/B tests, the industry standard power level is 80 percent. This is lower than the recommended confidence level of 95 percent because, for most programs, failing to find a significant difference carries less risk than advocating for a change when no improvement actually exists.

Like increasing confidence, increasing power increases test precision, and the test will require more visitors to reach that level of precision. If your test is underpowered, you can technically end it more quickly, but you may fail to detect differences that actually exist. If your test is appropriately powered, you will be able to detect most differences that exist at your desired level of confidence.
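The cost of an underpowered test can be seen by simulating many tests in which the challenger really is better and counting how often the lift is detected. The sketch below is a simplified illustration under assumed values: a 10 percent control rate, a 12 percent challenger rate, 500 simulated tests per scenario, a pooled two-proportion z-test at 95 percent confidence, and 3,841 visitors per variation as the approximate sample size for 80 percent power under those rates.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, conv_b, n):
    """Two-sided p-value, pooled two-proportion z-test, equal arm sizes."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def detection_rate(control_rate, challenger_rate, visitors_per_arm,
                   n_tests=500, alpha=0.05, seed=1):
    """Share of simulated tests that reach significance when a real lift exists."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(n_tests):
        conv_a = sum(rng.random() < control_rate for _ in range(visitors_per_arm))
        conv_b = sum(rng.random() < challenger_rate for _ in range(visitors_per_arm))
        if p_value(conv_a, conv_b, visitors_per_arm) < alpha:
            detected += 1
    return detected / n_tests


# Hypothetical scenario: the challenger truly converts at 12% versus 10% for control.
underpowered = detection_rate(0.10, 0.12, visitors_per_arm=1000)
well_powered = detection_rate(0.10, 0.12, visitors_per_arm=3841)
print(f"1,000 visitors per arm: lift detected in {underpowered:.0%} of tests")
print(f"3,841 visitors per arm: lift detected in {well_powered:.0%} of tests")
```

In this setup the shorter test misses the real improvement most of the time, while the appropriately sized test detects it in roughly four out of five runs.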

As with most projects, weighing the available options against the desired business goals helps testing teams find the right balance between confidence, power, and test duration.

Stay tuned for our final post in this series to gain a deeper understanding of how to set a Minimum Detectable Lift (MDL) to achieve optimum experimentation results.

Taylor Wilson, Senior Optimization Analyst
Taylor has fluency in all major testing tools and extensive experience in data analysis, data visualization, and testing ideology. He believes that effective communication of data is as important as the analysis. For over 4 years at Brooks Bell, Taylor has led the analytics efforts for optimization across all major verticals from Finance to Retail including brands like Barnes & Noble, Toys”R”Us, Nickelodeon, and OppenheimerFunds. Previously, he was involved in real estate and telecommunications, with a focus on lean process through data. Taylor holds a bachelor’s degree in engineering from NC State.