Significant Results are still wrong 5% of the time

A few weeks ago I posted my take on the History of Marketing. It was apparently one of my more popular posts. In addition to the social media sharing, I received an email asking a question about this paragraph:

First, there is a difference between proof and Proof. Significant results are still wrong 5% of the time (and likely not important 50% of the time). When you are running Big-Data sized tests and trying to backward-infer results, this 5% gets really really important.

Here was the question:

“Significant results are still wrong 5%”  – is that really the same as there being a 5% chance of wrongly rejecting the null hypothesis? I’d like more on that please and, especially, the subsequent point about 50% irrelevance which I think is even more important for people to understand.

Rather than answer it via email, I thought I would expand upon it here for general consumption.

  1. Yes. When I say “Significant Results” are wrong 5% of the time, I mean that the very definition of “significance” (at the usual 95% level) builds in a 5% error rate. When there is no real difference, you will still wrongly reject the null hypothesis 5% of the time. Always.
  2. The 50% comment: What I meant here is that even if your result is “proven” to be different from what you are testing it against, it says nothing about the magnitude. What often happens in practice is that your A/B test gives results like this:

A- 10% C/R
B- 11% C/R
The significance test says the difference is significant at better than 95% confidence.
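
For anyone who wants to see the mechanics, here is a minimal sketch of that kind of test: a pooled two-proportion z-test plus a 95% interval for the difference. The example above gives no traffic numbers, so the 10,000 visitors per page is purely an assumption, and the function names are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both pages share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal beyond |z|
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

def difference_interval_95(conv_a, n_a, conv_b, n_b):
    """Return an approximate 95% interval for (rate B - rate A), unpooled."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - 1.96 * se, diff + 1.96 * se

# Assumed traffic: 10,000 visitors per page (not from the example above)
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
low, high = difference_interval_95(1000, 10_000, 1100, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
print(f"95% interval for the difference: {low:.4f} to {high:.4f}")
```

Under those assumed numbers the test clears the 5% threshold, but the interval for the difference is wide. "Significant" only tells you the interval excludes zero, not where inside it the real difference sits.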

The analyst who ran the test takes the result to leadership and says the new landing page will be 10% better (and it’s significant!).

The new page is implemented, but a few weeks later they find C/R hasn’t gone up by 10%. In fact it’s gone down. How is that possible? (This happens ALL THE TIME by the way!)

Here’s what happened:
The ‘real’ result from B wasn’t a 10% improvement, it was a 1% improvement. But the randomness produced a 10% result. The 95% error bars around the result verified that the difference was more than a 0% improvement, but no one asked, “What is the chance this is only a 1% improvement?” Turns out those odds are a lot higher than 5%.
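
A rough simulation makes this concrete. The sketch below assumes a true C/R of 10.0% for A and 10.1% for B (a 1% relative improvement) and 10,000 visitors per page per test. Those numbers are assumptions, as is the shortcut of drawing conversion counts from a normal approximation to the binomial to keep it fast. It runs many such tests and looks only at the ones where B "wins" significantly.

```python
import math
import random

def simulate_significant_lifts(p_a=0.10, p_b=0.101, n=10_000,
                               trials=20_000, seed=1):
    """Return (share of tests where B wins significantly,
               average observed relative lift among those wins)."""
    rng = random.Random(seed)

    def observed_rate(p):
        # Normal approximation to a Binomial(n, p) count, divided by n
        return rng.gauss(n * p, math.sqrt(n * p * (1 - p))) / n

    winning_lifts = []
    for _ in range(trials):
        rate_a = observed_rate(p_a)
        rate_b = observed_rate(p_b)
        pooled = (rate_a + rate_b) / 2          # equal sample sizes
        se = math.sqrt(pooled * (1 - pooled) * 2 / n)
        z = (rate_b - rate_a) / se
        if z > 1.96:                            # B significantly better
            winning_lifts.append((rate_b - rate_a) / rate_a)

    if not winning_lifts:
        return 0.0, 0.0
    return len(winning_lifts) / trials, sum(winning_lifts) / len(winning_lifts)

share, mean_lift = simulate_significant_lifts()
print(f"Tests where B is significant: {share:.1%}")
print(f"Average observed lift in those tests: {mean_lift:.1%}")
print("True lift: 1.0%")
```

Only a small share of these tests come out significant, but the ones that do report a lift many times larger than the true 1%. With this sample size, a result can only reach significance by overshooting.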

But the good news is that it is at least a positive impact. So why didn’t the results after implementation show a 1% improvement? Two reasons:

  1. There is more randomness after the fact. Your C/R is going to fluctuate all the time, and it’s very unlikely you will be able to notice if your ‘real’ C/R shifts from 10.00% to 10.10% after a week (or a month).
  2. Things change. Just because your new landing page was better than the old landing page last week doesn’t mean it will be better next week. We know this intuitively, which is why we run A/B tests simultaneously rather than sequentially any time we can, but somehow we still assume that the results of those simultaneous tests will always turn into impact in a future sequential time period.
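
To put a number on the first reason, here is a back-of-the-envelope sketch using the standard sample-size formula for comparing two proportions. It takes a 1% relative improvement on a 10% baseline (10.0% to 10.1%) and assumes the conventional 5% significance level and 80% power. Those conventions and the function name are my assumptions for illustration.

```python
import math
from statistics import NormalDist

def visitors_needed_per_period(p_before, p_after, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH period to detect the shift."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p_before * (1 - p_before) + p_after * (1 - p_after)
    n = (z_alpha + z_power) ** 2 * variance / (p_after - p_before) ** 2
    return math.ceil(n)

n_needed = visitors_needed_per_period(0.100, 0.101)
print(f"Visitors needed before and after: about {n_needed:,} each")
```

That works out to well over a million visitors in each period, and it still assumes nothing else changed in between, which is exactly what the second reason says you cannot count on.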

I’ll have another post at some point (likely more than one) that digs into this issue further. It’s a big issue that becomes more important as Big Data analytic techniques become more common, although it’s still a problem with traditional techniques the way they are often used.

If you have more questions on any of this, feel free to comment below or email me. Even better, sign up for emails when I update the ‘book’ and just reply to the welcome email; it goes directly to my personal account.

