This is the first of a three-part, weekly blog series, which shares optimization best practices for test sample size, desired significance, power, and Minimum Detectable Lift (MDL).
Technology and evolving consumer behaviors are transforming everything from the way people evaluate products and services to the way they pay for the things they buy. As optimization experts, we at Brooks Bell know testing is the most effective and reliable way for merchants to determine which marketing strategy – from a specific promotion or targeted message to a unique website element such as a Call to Action (CTA) – will produce the highest Return on Investment (ROI).
After conducting thousands of tests for enterprise brands, our client results – based on statistical methods and empirical data – are consistent: testing improves conversion rates, increases sales, and boosts annual revenue by enhancing customers’ digital experiences.
Optimization Tests Are Not All Equal
A few different test methods exist today, including Fixed Time Horizon, Sequential Testing, and Multi-Armed Bandit (auto-optimizing). Each has its place, but the majority of digital optimization programs rely on Fixed Time Horizon testing. It is arguably the easiest to understand, and it is a common design type we use at Brooks Bell.
Fixed Time Horizon is a hypothesis test that relies on traditional, proven statistical methods to set the right sample size. It provides reliable data once a preset sample size of visitors is reached, which allows analysts to make data-based decisions and strategic recommendations.
Fixed Time Horizon is unique because the sample size is predetermined before the test launches. In addition, someone manually stops the test once it reaches the sample size needed to achieve the desired significance, power, and Minimum Detectable Lift (MDL).
Creating the Right Sample Size
The concept of sampling – collecting measurements from a subset of a larger population to draw conclusions about consumer behavior – comes from applied statistics. Statistically, the more measurements you collect, the more reliable the results.
Developing a large enough sample size – the number of observations from a group that allows statistical inferences about the whole population (your customers and targeted prospects) – is necessary for valid results. Additionally, by presetting the sample size, experimentation teams will better understand the amount of time and resources required to develop and execute the tests. If the sample doesn’t accurately represent your audience, tests lose their value and often end with incorrect conclusions.
However, predetermining sample size can be difficult. Find out how Brooks Bell analysts determine the sample size in six simple steps.
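To make the idea concrete, here is a minimal sketch of how a required sample size can be estimated for a standard two-proportion z-test, given a baseline conversion rate, a relative MDL, a significance level, and power. This is an illustration of the underlying statistics, not Brooks Bell’s six-step process, and the function name and defaults are our own assumptions:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, mdl, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant for a two-proportion z-test.

    baseline_rate: control conversion rate (e.g. 0.03 for 3%)
    mdl: minimum detectable lift, relative (e.g. 0.10 for a 10% lift)
    alpha: significance level (two-sided); power: 1 - beta
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mdl)        # smallest rate we want to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 at alpha=0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 0.84 at power=0.8
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

Note how quickly the numbers grow: detecting a small lift on a 3 percent baseline requires tens of thousands of visitors per variant, which is why presetting sample size also sets realistic expectations for test duration.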
Identifying a Test Winner with Confidence
The Fixed Time Horizon approach also gives testers control: because the sample size is predetermined, they know exactly when to manually stop the test.
Some tools distribute automatic alerts when they identify a winner. This continual monitoring leads to a higher-than-expected false positive rate. In other words, certain tools declare winners too early because they are not configured to end tests only once the desired significance is reached. As a result, these tools cannot reliably recommend the optimal user experience in every scenario.
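The inflation from continual monitoring is easy to demonstrate. The sketch below (our own illustration, not any particular tool’s algorithm) simulates A/A tests – two identical experiences with no real difference – and declares a “winner” the first time a repeated significance check happens to pass:

```python
import random
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=500, n_per_arm=5000, peek_every=500,
                                rate=0.05, alpha=0.05, seed=42):
    """Run simulated A/A tests (both arms convert at the same true rate)
    and count how often repeated checking declares a significant 'winner'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(n_sims):
        seen = conv_a = conv_b = 0
        while seen < n_per_arm:
            # collect another batch of visitors for each arm, then peek
            for _ in range(peek_every):
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            seen += peek_every
            pooled = (conv_a + conv_b) / (2 * seen)
            se = (2 * pooled * (1 - pooled) / seen) ** 0.5
            if se > 0 and abs(conv_a - conv_b) / seen / se > z_crit:
                false_positives += 1   # a "winner" that doesn't exist
                break
    return false_positives / n_sims
```

Even though each individual check uses a 5 percent significance level, peeking ten times per test pushes the overall false positive rate well above 5 percent, which is exactly the problem with stopping on the first automatic alert.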
For example, even weeks after a test goes live, a tool can calculate and report a confidence level, defined as a percentage of certainty about the results. But realistically, the sample size was probably too small to draw that conclusion accurately. The lower the percentage, the less testing teams can trust the results, and the higher the margin of error.
A confidence level above 95 percent is best practice. Ideally, experimentation teams want a high confidence level and a low margin of error – the amount of random sampling error in a survey or test result, which indicates how likely (not how certain) it is that the result would hold if the test included the whole population.
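As a rough illustration (using the standard normal approximation, not any specific vendor’s calculation), the margin of error around an observed conversion rate can be sketched as:

```python
from statistics import NormalDist

def margin_of_error(conversions, visitors, confidence=0.95):
    """Half-width of the normal-approximation confidence interval
    around an observed conversion rate."""
    p = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 at 95%
    return z * (p * (1 - p) / visitors) ** 0.5
```

For example, 300 conversions from 10,000 visitors gives a 3 percent rate with a margin of error of roughly ±0.33 percentage points at 95 percent confidence; quadrupling the traffic halves that margin, which is why larger samples produce tighter, more trustworthy results.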
Testers also tend to check the data too frequently. As a result, they insert biases into their conclusions, erroneously declaring an optimal user experience.
Do not assume these problems only happen in low-traffic conversion funnels. Even high-traffic pages will produce fluctuating results during the test cycle. The data will shift in the first 24 hours, the first few days, and potentially weeks into a test. The test must run its course to collect enough data to be statistically valid. Otherwise, it is a waste of time and money.
Businesses must also set measurable testing goals. If the goal is to increase online conversions, you need to capture a baseline and create a Key Performance Indicator (KPI). Testing programs are critical for any organization with a digital marketing strategy, and resources should include technical analysts who can accurately interpret statistics and data. To determine whether testing teams are reviewing the right statistics, read: Testing Differences in Revenue? You’re Probably Not Using the Correct Statistics.
Conversion is also difficult to optimize because it touches every aspect of the user experience – from landing pages and category pages to every customer touchpoint. It varies based on many factors, including retail category, shoppers’ preferred device and platform, and geographic location. According to a Q4 2016 benchmark report by Monetate, the worldwide average conversion rate was 2.95 percent, compared to three percent for the U.S. While these numbers may seem small, any incremental improvement can create a huge ROI by increasing sales and revenue. Based on historical data and expected growth trends, these percentages will continue to climb.
Ultimately, any business conducting optimization tests needs to set the right sample size before launch so it collects enough data for technical analysts to make reliable predictions and recommend changes with confidence for the greatest ROI.
Stay tuned for next week’s blog, “How to Reach Desired Significance & Power with Your Experimentation Programs.”
Taylor Wilson, Senior Optimization Analyst
Taylor has fluency in all major testing tools and extensive experience in data analysis, data visualization, and testing ideology. He believes that effective communication of data is as important as the analysis. For over 4 years at Brooks Bell, Taylor has led the analytics efforts for optimization across all major verticals from Finance to Retail including brands like Barnes & Noble, Toys”R”Us, Nickelodeon, and OppenheimerFunds. Previously, he was involved in real estate and telecommunications, with a focus on lean process through data. Taylor holds a bachelor’s degree in engineering from NC State.