[Calculator form: Calculate, Baseline Success Rate (%), Desired Sensitivity (%), Experiment Groups, % Traffic in Baseline (%), KPIs]
This is a calculator for planning simple A/B tests (or A/B/C tests, or...you get the idea). If you want to analyze a test you have already run, click on the ‘Analysis’ link at the top. When planning a test, the most important question is: how long do we need to run it, or more specifically, how large a sample size do we need?
Or, if we have already decided on a sample size, what kind of effect will we be able to detect?
Often, planning a test is about finding the right balance between achieving a meaningful sensitivity with a reasonable sample size.
Any kind of experiment is trying to glean information from observations. If we want more information, we need more observations. We can actually quantify how many observations we need to get a certain amount of information, or how much information we can glean from a particular number of observations. We can think of the amount of information we can get from a particular experiment as the sensitivity of that experiment.
If you have experience with statistics or A/B testing, you may be used to hearing this described as the alternative hypothesis, but I think sensitivity is a much more intuitive description!
Let's say we're doing a test with email subject lines. We have two candidate subject lines and we want to find out which one leads to more opens. Maybe we have been using one of these for a while, so we know it usually has about a 10% open rate. Or maybe they are two completely new subject lines, but we know that most of our emails usually have about a 10% open rate. One way or another, we need to speculate about what the open rate for one of the subject lines will be.
Enter 10% as the baseline success rate in the calculator. Note that you enter ‘10’, not ‘0.1’, as people who are used to Excel might be tempted to do.
Then we need to think about how sensitive we want the experiment to be. Of course, we want it to be as sensitive as possible! But as a starting point, let's say we want a 1% sensitivity. That would correspond to a situation where the second subject line has an open rate that is 1% higher (in relative terms) than the first subject line. Since 10.1% is 1% higher than 10% (the baseline success rate), if the second subject line has an open rate higher than 10.1%, we want to be able to detect it.
Enter 1% as the desired sensitivity in the calculator. Note that you enter ‘1’, not ‘0.01’. Hit the calculate button.
We see that we need a sample size of about 2.9 million. That's a lot! That number is spread evenly across both groups, so we need about 1.45 million people in each experiment group. Now try increasing the desired sensitivity to 10%. How large a sample size do we need then?
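If you want to sanity-check numbers like these outside the calculator, a standard two-proportion power calculation gets you in the same ballpark. Here is a rough sketch using Python's statsmodels package; it assumes the conventional 5% significance level and 80% power, which may not be exactly what this calculator uses, so expect the numbers to be close but not identical.

```python
# Rough sample-size check using Cohen's effect size for two proportions.
# Assumes a 5% significance level and 80% power (the calculator's own
# settings may differ slightly).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # 10% baseline open rate

for sensitivity in (0.01, 0.10):  # 1% and 10% relative sensitivity
    target = baseline * (1 + sensitivity)             # e.g. 10.1% for a 1% lift
    effect = proportion_effectsize(target, baseline)  # Cohen's h
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, ratio=1.0,
        alternative="two-sided",
    )
    print(f"{sensitivity:.0%} sensitivity: ~{n_per_group:,.0f} per group, "
          f"~{2 * n_per_group:,.0f} total")
```

The 1% case lands around 1.4 million per group, in line with the numbers above, and the 10% case needs roughly a hundred times fewer people, because the required sample size scales with the inverse square of the sensitivity.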
Suppose we decide we can only use a sample size of 100,000, spread evenly across both groups. In the drop-down menu, select ‘Sensitivity’ as the thing to calculate. Enter 10% as the baseline success rate and 100,000 as the sample size. Hit the calculate button. We see that the minimum detectable lift is about 5.5% and the minimum detectable drop is about 5.2%. Here's what that means: if the new subject line is in fact at least 5.5% better than the baseline, the experiment will be sensitive enough to detect it. Or, if the new subject line is at least 5.2% worse than the baseline, the experiment will also be able to detect it. If both subject lines have about the same performance, it is unlikely we will be able to tell which one is better.
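The same normal approximation can be run in reverse to get the minimum detectable effect for a fixed sample size. Again, this is a rough sketch under assumed settings (5% significance, 80% power), so the percentages come out near, but not exactly on, the 5.5% and 5.2% quoted above.

```python
# Rough minimum-detectable-effect check, inverting the same normal
# approximation by hand. Assumes a 5% significance level and 80% power.
import numpy as np
from scipy.stats import norm

alpha, power = 0.05, 0.80
baseline = 0.10
n_per_group = 50_000  # 100,000 total, split evenly across two groups

# Smallest Cohen's h the test can detect: h = (z_crit + z_power) * sqrt(2 / n)
h = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * np.sqrt(2 / n_per_group)

# Undo the arcsine transform to express h as success rates above and below baseline.
phi = np.arcsin(np.sqrt(baseline))
rate_up = np.sin(phi + h / 2) ** 2
rate_down = np.sin(phi - h / 2) ** 2

print(f"minimum detectable lift: {rate_up / baseline - 1:+.1%}")   # about +5.4%
print(f"minimum detectable drop: {rate_down / baseline - 1:+.1%}") # about -5.3%
```

Either way you compute it, the takeaway matches the calculator: with 100,000 users and a 10% baseline, you can only count on detecting relative differences of roughly 5% or more.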
Of course, we do not know how much better or worse one option is than the other (if we did, we wouldn't need to do the experiment), but we can size the test to be able to detect a difference that would be meaningful for our use case.
The main point of the planning process is to decide on a sample size. Once we have run the test and achieved the desired sample size, we stop the test. At that point, it does not matter how or why we selected a particular sample size: the analysis conclusions will still be valid. If the observed success rates turn out very different from those used to plan the test, the test might not be as sensitive as we hoped, but the conclusions are still valid.
As for why you can't set the percentage of traffic in each group individually: it's because the UI for that is hard, not for any technical reason. Maybe someday I will think of a good UI for it. You would have to specify the percentage of traffic in each individual experiment group and make sure they add up to 100%, and it's just really annoying. Three groups probably isn't that bad, but ten would be obnoxious.