Anyone seen Confidence Intervals Change Drastically?

  • 5 November 2018
  • 4 replies
  • 10 views

Hi there,

Has anyone seen drastic changes to their variants’ performance over time?

In one test with 4 variants (~3k impressions each for test and control), the top-performing challenger was performing ~30% better at 97% confidence after 1 day. After reaching 10k impressions per treatment, the tool now reports only 17% confidence that it’s 1% stronger.

While I’m generally more inclined to trust the numbers with more impressions behind them, I’m bewildered by what’s going on:

  • My GA numbers have flatlined to almost 0, but conversions are still coming in steadily on Unbounce.
  • This is the second test where a strong performer does an about-face as it gains more traffic.

The 2016 Unbounce fall lookbook also shows a lot of winning results with 2k-5k visitors, so while I lean towards more traffic, I’m curious how much traffic folks are allowing in their tests.

My main KPI is a purchase (i.e., a full subscription sign-up). Thanks!


4 replies


Hi @alviny,

There are quite a few questions and assumptions in your post so I’ll try to get to each one:

  • “GA numbers have flatlined” - If you are still pushing traffic to your landing page(s) and your GA data is not registering it, I would have to assume there is something wrong with your GA setup. Maybe someone in your organization changed something after testing began. (This happens a lot in large organizations where many people have access to GTM.)

  • “this is the second test…” - This actually happens quite often, and that’s why it’s important to have your A/B testing statistics down to a T. You need to calculate the required sample size before you actually begin testing.

A few general pointers to point you in the right direction:

  1. Based on past data, calculate your required sample size up front and be patient. No peeking at the results.

  2. If you are running more than 1 variation, you’ll need to account for that with a multiple-comparison correction (see the sketch after this list).
    (e.g. with 4 variations tested at 95% confidence each, you’re looking at an almost 20% chance of at least one false positive, so the per-variation confidence level needs to rise to roughly 99%.)

  3. Segmentation of post-test results is just as important. You can’t rely on overall page performance. This is where your GA data comes into play, but keep in mind that the GA data might be sampled.

  4. You should not stop a test early just because you think you’ve reached significance (see point 1 above).

  5. Try not to test more than 1 variation unless you know what you are doing. Build your testing process on a strong foundation. Don’t test for the sake of testing; make sure you have a really strong hypothesis.

  6. Run your tests for at least a full business cycle, or at least a week (it will depend on your business).

  7. There is no universal number you have to look for. Do your own calculations and don’t rely heavily on other sources - simply because you don’t have the full picture of what others are testing or trying to achieve.
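
To put rough numbers on point 2, here’s a minimal sketch in Python (the 4-variation, 95%-confidence setup mirrors the example above; Bonferroni is just one common way to correct, so treat this as an illustration rather than the only option):

```python
# Quick check of the multiple-comparison numbers from point 2.

def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

def bonferroni_confidence(alpha: float, comparisons: int) -> float:
    """Per-comparison confidence needed to keep the overall error rate at alpha."""
    return 1 - alpha / comparisons

alpha = 0.05      # i.e. the usual 95% confidence per comparison
variations = 4    # four challengers tested at once

print(f"Family-wise false-positive rate: {family_wise_error_rate(alpha, variations):.1%}")
# -> 18.5%, i.e. "almost 20%"
print(f"Per-variation confidence needed: {bonferroni_confidence(alpha, variations):.2%}")
# -> 98.75%, i.e. roughly the "99%" mentioned above
```

The Šidák version (adjusted alpha = 1 − (1 − alpha)^(1/comparisons)) lands in practically the same place; either way, every extra challenger raises the confidence bar and therefore the traffic each variation needs.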

Best of luck,
Hristian

Hey Hristian, thanks for your thoughtful response. It’s a great starter pack of tips & tricks. I discovered the evanmiller.org suite of tools and have also started using it in my calculations.

Any rules of thumb for how you set expected lifts in your sample size calcs? My assumption is that they should be proportional to the scale of the changes being tested. But one question: in what scenarios wouldn’t folks just always use an estimated 15% lift (or any % lift) as the norm, which for me translates to roughly 30k per variation (more or less)? (Ignoring for a moment that it’s dependent on traffic.)
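
For a rough sense of how much the assumed lift drives the numbers, here’s a small sketch using Python and statsmodels. The 2.5% baseline purchase rate, 95% confidence, and 80% power are only illustrative assumptions (they aren’t stated in the thread); with those inputs a 15% lift comes out near the 30k-per-variation figure above:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.025   # assumed baseline purchase rate (illustrative only)
analysis = NormalIndPower()

for lift in (0.05, 0.10, 0.15, 0.20):
    # Cohen's h for baseline vs. baseline uplifted by the assumed relative lift
    effect = proportion_effectsize(baseline * (1 + lift), baseline)
    n = analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8,
                             ratio=1.0, alternative='two-sided')
    print(f"{lift:.0%} lift -> ~{n:,.0f} visitors per variation")
```

Roughly speaking, halving the lift you expect to detect quadruples the visitors each variation needs, which is why the expected-lift input matters so much in these calculators.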


Hey @alviny,

There are no rules when it comes to setting the expected lift.

In most cases you do it based on experience.

After a while, you start to get a feel for how a particular change might affect your conversion rates.

As far as the required traffic goes… there is no way around that if you want to run a scientifically sound CRO program.

It’s also why web properties with less traffic can better utilize their time by concentrating on other aspects of optimization rather than testing.

Best,
Hristian

If you have an hour, you will learn a lot from the presentation below. If you are short on time, the section between 8:15 and 28:48 is most relevant to this topic.
