Confidence formula


What formula do you guys use to calculate the confidence?

I noticed a drop in the confidence value after a conversion. This is quite surprising as I would expect more visitors to yield a higher confidence – apparently not?



Hi Jerome! We use a chi-square test. It’s sensitive to both the overall sample size and the difference in conversions, so if the conversion rates of two page variants are already close, and a conversion brings them even closer, the confidence that one is clearly better than the other will decrease. Does that make sense?
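To make that concrete, here’s a minimal sketch in Python using scipy’s chi-square test – made-up numbers, and not our production code, just an illustration of the effect:

```
from scipy.stats import chi2_contingency

def confidence(champ_conv, champ_n, chall_conv, chall_n):
    # 2x2 table: rows are variants, columns are [conversions, non-conversions]
    table = [[champ_conv, champ_n - champ_conv],
             [chall_conv, chall_n - chall_conv]]
    _, p_value, _, _ = chi2_contingency(table)
    return 1 - p_value  # "confidence" that the variants really differ

# Hypothetical data: champion at 6.5% (65/1000), challenger at 3% (3/100).
print(confidence(65, 1000, 3, 100))  # before the challenger's next conversion
# One more conversion moves the challenger to ~4%, closer to the champion,
# so the confidence goes *down* even though there are more visitors.
print(confidence(65, 1000, 4, 101))
```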

Hmm yes that makes sense. I had thought it could be something like this and I’ll look into the chi-square test to understand it a bit better.

However I’m not sure the chi-square is such a good instrument. Here is some sample data for some recent new variants:

```
Vis  Con  Rate    Del     Conf
21   1    4.76%   -27%    24%
4    0    0.00%   -100%   40%

Vis  = Visitors
Con  = Conversions
Rate = Conversion rate
Del  = Delta (vs. the champion)
Conf = Confidence
```
 
So obviously the rates here aren’t very telling, especially for the second variant. It’s just to demonstrate that the confidence feels like a random made-up number (obviously it’s not – but it does feel like it).

So, one of the things we need to do a better job of is providing some guidelines on how to use the test results. That’s definitely on our radar.

You’re generally shooting for a 90 to 95% confidence rating. In your example, you need more visitors before you’ll be confident (statistically speaking) that one of your variants is clearly better than the other.

Hi Carl – better guidelines on how to use the test results, specifically the confidence %, would be helpful.

Coming from a project where I was using Google Site Optimizer, I was expecting a “winner” to be declared. During my very first Unbounce test, I had a variant at 98% and was patiently waiting for it to reach 100% so it could be declared the “winner” – before I looked at the manual and realized that’s not how Unbounce’s “confidence %” is supposed to work.

Yup, I totally agree that I need more visitors in that example. My point was rather that the conversion rate weighs far too heavily compared to the visitor count.

I think putting a higher weight on the visitor count would be a good idea (if possible to do with the chi-square test).

Also, as I see it, the conversion rate here is actually a range. In the example with 4 visitors the rate should really be something like 0.00% to 12.50%. With so few visitors there isn’t even the accuracy to justify two decimal places, though I understand it would be hard to format this differently.
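Just to illustrate the range idea, here’s one standard way such a range could be computed – a Wilson score interval, sketched in Python. (The method is my guess, and the 12.50% above was only a ballpark; a proper interval on 0/4 is actually much wider.)

```
from math import sqrt

def wilson_interval(conversions, visitors, z=1.96):  # z = 1.96 gives a ~95% interval
    p = conversions / visitors
    denom = 1 + z**2 / visitors
    center = (p + z**2 / (2 * visitors)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / visitors + z**2 / (4 * visitors**2))
    return max(0.0, center - half), min(1.0, center + half)

# 0 conversions out of 4 visitors: roughly (0.0, 0.49), i.e. anywhere from
# 0% to 49% -- which is why a single-point rate of 0.00% says so little.
print(wilson_interval(0, 4))
```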

As for how this range (versus a single point) influences the confidence: you can see that the range for the 4-visitor example actually includes the champion’s rate (somewhere between 6% and 7% at the time). So the delta should really be 0%, at least as far as the chi-square test is concerned. I’m sure this would yield a much lower confidence, possibly even 0%.

What do you think?

EDIT: I do see that some common sense helps a lot and it’s kind of obvious that 4 visitors are simply not telling. However, displaying a confidence of 40% does seem kind of misleading.

Hey guys, this is all great feedback!

Jerome, further to your comments about weighting different aspects of the numbers, what you’re actually talking about is coming up with new statistical methods. We worked with a local university to ensure we were using a technique that was valid, statistically speaking. We’ve discussed ways in which we can present this information better, and definitely hope to have some improvements there soon, but we’re not likely to go about inventing new methods. 😉

Incidentally, we chose not to use the same approach as Google. Google’s tool provides a “confidence interval” for the test metric, a range of values that represents a prediction of the “real” value. However, that approach typically requires larger sample sizes. We chose an approach that will let customers who aren’t driving larger volumes still achieve valid results.

Matthew, really appreciate your comments as well, especially with regard to comparisons to Google’s tool. We’re hoping to provide something similar (a clear winner/loser indication) relatively soon.

Thanks guys!

Idea: Maybe have a setting that doesn’t show the confidence score until you have enough visitors. What’s the minimum?

There’s a rough guideline for the chi-square test that every cell in the input table should have an expected count of at least five (I’m paraphrasing; the guidelines vary and are a little more specific). However, there’s also the guideline that the sample size must be “sufficient”. That includes things like letting your experiment run for at least a week to smooth out typical periodic fluctuation, and it of course depends on the nature of your audience and traffic sources.
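For example (illustrative numbers – the champion figures here are hypothetical), scipy will show you the expected counts that the rule of thumb refers to:

```
import numpy as np
from scipy.stats import chi2_contingency

# Rows: [conversions, non-conversions]. The champion numbers are hypothetical;
# the challenger row is the 21-visitor example from earlier in the thread.
table = np.array([[65, 935],   # champion: 1000 visitors, 65 conversions
                  [1, 20]])    # challenger: 21 visitors, 1 conversion
_, p, _, expected = chi2_contingency(table)
print(expected)               # expected counts under "no real difference"
print((expected >= 5).all())  # False here: the challenger's cells are too small
```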

Anyway, we’re definitely going to improve the presentation of the test results soon. Likely we’ll de-emphasize the numeric confidence score, and place more emphasis on an indicator that displays whether or not you’ve potentially achieved a significant result.

Since the confidence score takes into account the sample size and the conversion rate of the other variants, what happens if I discard a variant mid-test? Right now I have 94% confidence that one of the variants converts at a lower rate than the champion. There are still a few more variants that haven’t reached a 90+ confidence level. If I discard a variant, will that change the scores of the others? Is it bad practice to take a variant out of an A/B test mid-test?

Though I can’t answer the actual question, I can suggest that you simply set its traffic to 0% – then it is effectively disabled but still in there for statistics.

That would have been a good idea. I just demoted it, and it didn’t appear to change the confidence score.

Carl, can you write out an example of how you are using the chi-square formula?

Hey guys,

Each challenger variant is an independent test against the champion, so the only thing discarding a challenger variant will do is change the share of future traffic reaching the remaining challengers. It won’t change the confidence of the remaining challengers directly. Think about it this way: if you’re testing three pages and you discard one of them, does that give you *more* information about the other two?

Once you reach sufficient confidence (95%+) on a particular variant, you’re good to either promote it or discard it as appropriate. Remember too what the test is actually telling you. With a 95% confidence, there’s a 1 in 20 chance of being wrong about your hypothesis, whereas with 99% confidence there’s only a 1 in 100 chance.

I’ll see what I can do about providing some sample calculations. We’re using a stats library that can compute an exact confidence value, which isn’t something you’d necessarily want to do by hand. Typically, when computing a chi-square test by hand, you produce a test statistic and then look up the corresponding confidence level in a pre-computed table. That tells you whether or not you’ve reached a particular pre-computed confidence level, but it won’t give you the exact confidence itself.
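In the meantime, here’s a rough sketch of that calculation in Python – illustrative numbers, and a simplification of whatever the library actually does: build the 2×2 table, compute the chi-square statistic by hand, then ask the chi-square distribution for an exact p-value instead of consulting a printed table.

```
from scipy.stats import chi2

def chi_square_confidence(champ_conv, champ_n, chall_conv, chall_n):
    observed = [champ_conv, champ_n - champ_conv,
                chall_conv, chall_n - chall_conv]
    # Expected counts assume both variants share one pooled conversion rate.
    pooled = (champ_conv + chall_conv) / (champ_n + chall_n)
    expected = [champ_n * pooled, champ_n * (1 - pooled),
                chall_n * pooled, chall_n * (1 - pooled)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = chi2.sf(stat, df=1)  # df = 1 for a 2x2 table
    return 1 - p_value             # one plausible mapping from p-value to "confidence"

print(chi_square_confidence(65, 1000, 3, 100))  # hypothetical champion vs. challenger
```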

Good questions!
