Companies gather and analyze data to fine-tune their operations, whether it’s to help them figure out which webpage design works best for customers or what features to include in their product or service to boost sales. Marketers, in particular, use data analytics to answer questions like this: To put people in a shopping mood, is it better to make the webpage banner blue or yellow? Or do these colors not matter? Getting the answer right could mean the difference between higher sales and losing to the competition.
But new Wharton research shows that 57% of marketers are incorrectly crunching the data and potentially getting the wrong answer, which could be costing companies a lot of money. “We expected business experimenters [to make this error], but I was nevertheless surprised that so many of them do so,” said Wharton marketing professor Christophe Van den Bulte, who coauthored the study. Wharton marketing professor Ron Berman, another of the study’s authors, agreed: “This was a pretty common phenomenon that we observed.”
Their paper, “p-Hacking and False Discovery in A/B Testing,” which has been frequently downloaded and widely shared on social media, looked at the A/B testing practices of marketers who used the online platform Optimizely before the platform added safeguards against potential mistakes. In A/B testing, two or more versions of a webpage are tested to see which one resonates more with users. For example, half of a company’s customers would see webpage version A and the other half version B. “Imagine one version says something about the brand of your product and the other version says something about the technical abilities of your product,” Berman said. “You want to determine which one makes consumers respond better, to buy more of your products.”
Berman and Van den Bulte, along with Leonid Pekelis, a data scientist at Opendoor, and Facebook research scientist Aisling Scott, analyzed more than 2,100 experiments run in 2014 from nearly 1,000 accounts, comprising a total of 76,215 daily observations. This level of granularity is unique, the researchers write, allowing them to essentially “look over the shoulder” of marketers and draw “stronger” conclusions about their behavior. What they found was that marketers were making a statistical error called “p-hacking.” Berman said p-hacking is like “peeking.” It is the practice of checking the experiment before it is over and stopping it when one sees the desired results.
The problem is that if marketers don’t run the experiment all the way through, they won’t know if the initial results will change. For example, if the experiment is supposed to run for four to five weeks, 57% of marketers look at initial results daily and stop the test when it reaches 90% significance. “Experimenters cheat themselves, their bosses or their clients,” said Van den Bulte. Berman added: “What people shouldn’t do, and this is what many of them were doing, is wait until the first time [the experiment] hits this 90% threshold and stop. The reason it is a mistake is because if you waited a bit longer, you might go below 90%, and below 70% and fluctuate again because it is a random process.”
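The effect of peeking can be illustrated with a small simulation. The sketch below is not the researchers’ analysis; it is a minimal illustration under assumed settings (a 28-day test, 100 visitors per version per day, a 10% conversion rate, and a pooled two-proportion z-test), all of which are hypothetical. Both versions are identical, so every “significant” result is a false alarm. It compares how often a test crosses the 90% threshold on at least one daily check with how often it is above 90% on the final day.

```python
import math
import random


def confidence_level(conv_a, n_a, conv_b, n_b):
    """Two-sided confidence (1 minus the p-value) from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    # For a standard normal, 1 - two-sided p-value equals erf(z / sqrt(2)).
    return math.erf(z / math.sqrt(2))


def simulate(n_tests=400, days=28, visitors_per_day=100,
             rate=0.10, threshold=0.90, seed=1):
    """Simulate A/A tests (no true difference) that are checked once per day.

    Returns (share crossing the threshold on any day,
             share above the threshold on the last day).
    All default settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    crossed_any_day = 0
    significant_at_end = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        crossed = False
        conf = 0.0
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
            n += visitors_per_day
            conf = confidence_level(conv_a, n, conv_b, n)
            crossed = crossed or conf >= threshold
        crossed_any_day += crossed
        significant_at_end += conf >= threshold
    return crossed_any_day / n_tests, significant_at_end / n_tests


if __name__ == "__main__":
    peeking, fixed = simulate()
    print(f"Stop at first 90% crossing: {peeking:.0%} false alarms")
    print(f"Wait until the planned end: {fixed:.0%} false alarms")
```

Under these assumptions, a marketer who stops at the first crossing declares a winner far more often than one who waits for the planned end date, even though neither version is better.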
While 70% might still sound high, that confidence level actually means there is no meaningful difference whether one uses webpage version A or B. “The great majority of commercial A/B tests … involve tweaks and changes that have no effect whatsoever,” Van den Bulte said. That’s because it is hard to come up with good ideas that will actually make a significant enough impact, Berman added. “If someone tells you, ‘Let’s design a website and test 10 different colors on the website,’ actually seven out of these 10 will make no difference … unless one of them is very ugly or something [else is wrong]. So it is pretty hard to find variations or changes in the websites that are actually expected to yield a big gain. Most of them will do nothing.”
But there is a definite downside when marketers engage in p-hacking: It leads to more wrong results. “If no one p-hacked, about 30% of the tests claiming to be statistically significant findings would actually be false discoveries,” Van den Bulte said. With p-hacking, that number increases to 42%. “P-hacking boosts the probability that an effect declared significant is actually a null effect … [and] doing so greatly harms the diagnosticity of commercial A/B tests,” according to the paper.
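The link between null effects and false discoveries can be made concrete with a standard textbook relationship: among results declared significant, the share that are false depends on how many tested ideas truly do nothing, how often a null effect is wrongly flagged, and how often a real effect is detected. The sketch below is not the paper’s estimation method, and the inputs (70% null effects, 50% power, and the two false-alarm rates) are hypothetical values chosen only for illustration; they are not meant to reproduce the study’s figures.

```python
def false_discovery_rate(share_null, false_alarm_rate, power):
    """Share of 'significant' results that are actually null effects.

    share_null:        fraction of tested ideas with no real effect
    false_alarm_rate:  chance a null effect is declared significant
    power:             chance a real effect is declared significant
    """
    false_positives = share_null * false_alarm_rate
    true_positives = (1 - share_null) * power
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Hypothetical inputs: peeking is modeled simply as a higher false-alarm rate.
    without_peeking = false_discovery_rate(0.70, 0.10, 0.50)
    with_peeking = false_discovery_rate(0.70, 0.20, 0.50)
    print(f"No peeking:   {without_peeking:.1%} of discoveries are false")
    print(f"With peeking: {with_peeking:.1%} of discoveries are false")
```

The point of the sketch is directional: when most tested changes have no effect, anything that raises the false-alarm rate, as peeking does, raises the share of declared winners that are not winners at all.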
Costs of p-Hacking
P-hacking has real — and potentially costly — consequences for companies. Berman said firms can incur the cost of commission and also of omission in p-hacking. In the cost of commission, the company commits to the wrong strategy because it relied on incorrect results from the A/B test. For example, a company decides to test whether it would drum up higher sales by offering two-day free shipping versus 10 days. If the p-hacked result pointed to two days, “you are going to change all of your shipping processes and procedures to allow for two-day free shipping,” Berman said. “It is going to be very, very costly, and in the end you’re not going to make extra revenue.”
In the cost of omission, the company incorrectly thinks it has the optimal result and stops looking for a better option. “Because you have now incorrectly thought that A is better than B (although A is not better than B), you are going to basically ignore the version C that you could have tested,” Berman noted. The researchers estimate this cost in terms of “lift,” or relative gain. For example, if webpage version A leads to 50% of visitors buying a product and version B results in 55% becoming buyers, the lift of version B over A is 10%. The lift that is lost in the cost of omission is 2%, the paper said.
“It sounds little, 2%. But in our data in total, the average experiment gets an 11% lift,” Berman said. “This is the average value. So you could have gotten an extra 2% over this 11% if you just ran another experiment. And what we also find is that 76% of our experiments have a lift gain of less than 2%, which means that 2% is a pretty high improvement that you are missing out on because of this p-hacking.”
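The lift arithmetic used in the 50% versus 55% example can be written out in a few lines. This is a minimal sketch; the function name `relative_lift` is a hypothetical label, not something taken from the paper or from any testing platform.

```python
def relative_lift(baseline_rate, variant_rate):
    """Relative gain of the variant's conversion rate over the baseline's."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (variant_rate - baseline_rate) / baseline_rate


if __name__ == "__main__":
    # Version A converts 50% of visitors, version B converts 55%.
    print(f"Lift of B over A: {relative_lift(0.50, 0.55):.0%}")
```

Lift is a relative measure: a gain of five percentage points on a 50% baseline is a 10% lift, not a 5% lift.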
Berman first saw the practice of p-hacking in academia, where researchers were under pressure to show statistically significant results and so would game their experiments. He said his Wharton colleagues Uri Simonsohn and Joe Simmons, along with U.C. Berkeley’s Leif Nelson, became well known for developing a method to catch p-hackers in academic research. But when Berman looked at p-hackers in the business world, he was perplexed. Business experimenters have every motivation to get the correct result, because a wrong one could cost the company a lot of money. Still, the majority of marketers were p-hacking. Why?
The authors believe there are two main reasons why marketers p-hack. One is poor statistical skills. “Many experimenters do not have the background or experience to validly interpret the statistical results provided by a platform,” the paper said. The second reason is that marketers have incentives to produce significant results. For example, if an ad agency is asked by a client to test the effectiveness of two campaigns, it is under pressure to show that one of them has significantly positive results. In another instance, an employee in charge of running the A/B test might feel pressure to report good results to a boss, namely that one version is clearly more beneficial than the other.
There is also a difference among sectors. The authors discovered that those in the media industry were more likely to p-hack, while those in tech were not. “We suspected this second motive [of wanting to produce the desired results] is more pronounced for media businesses and advertising agencies that stand to gain commercially, at least in the short run, from running a campaign or rolling out a new idea even if it does not really boost business performance,” Van den Bulte said. “That does not mean they do it knowingly. But our findings do suggest that one should be extra cautious about the validity of A/B tests run by third parties.”
The authors offer some possible solutions to address p-hacking. One is to make statistically significant results harder to achieve on the platform. Another is for the platform itself to protect against p-hacking, which is what Optimizely did. Van den Bulte said companies should instill a “just don’t do it” culture among managers and analysts, but conceded that this alone is not enough.
A more drastic option is to shift from “null hypothesis” testing, where the baseline assumption is that there is no difference among the choices or groups being observed, toward decision-theoretic tools, where marketers don’t merely look at, say, the engagement levels of A and B, but find out which choice optimizes the firm’s goals, like increasing revenue. “That makes sense since the purpose of commercial A/B testing is not to know whether the outcomes in A are really different from B, but to decide whether to roll out A or B,” Van den Bulte said.
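One way to picture a decision-oriented analysis is to ask how much revenue the firm would expect to give up by rolling out each version, rather than whether the two versions differ. The sketch below is only one of many possible approaches and is not necessarily what the authors have in mind. It assumes a uniform Beta(1, 1) prior on each conversion rate, a hypothetical value of $30 per conversion, and made-up test counts.

```python
import random


def expected_loss(chosen, other, value_per_conversion, draws=20000, seed=0):
    """Expected revenue per visitor given up by rolling out `chosen`.

    chosen, other: (conversions, visitors) observed for each version.
    Conversion rates are drawn from Beta posteriors under an assumed
    uniform Beta(1, 1) prior; a loss occurs only in draws where the
    other version has the higher rate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        p_chosen = rng.betavariate(1 + chosen[0], 1 + chosen[1] - chosen[0])
        p_other = rng.betavariate(1 + other[0], 1 + other[1] - other[0])
        total += max(p_other - p_chosen, 0.0) * value_per_conversion
    return total / draws


if __name__ == "__main__":
    # Hypothetical test results: (conversions, visitors).
    version_a = (100, 1000)
    version_b = (120, 1000)
    value = 30.0  # assumed revenue per conversion, in dollars

    loss_a = expected_loss(version_a, version_b, value)
    loss_b = expected_loss(version_b, version_a, value)
    choice = "A" if loss_a < loss_b else "B"
    print(f"Expected loss if A is rolled out: ${loss_a:.3f} per visitor")
    print(f"Expected loss if B is rolled out: ${loss_b:.3f} per visitor")
    print(f"Roll out version {choice}")
```

Framed this way, the output is a rollout decision with a dollar figure attached to the risk of being wrong, rather than a yes-or-no verdict on statistical significance.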
Finally, Berman recommends that marketers do a follow-up after the experiment. “You should probably continue to run with a small control group afterwards,” he said. If the A/B testing shows that version A is better, it would be useful to have a small group still getting version B “to make sure this difference is actually maintained over time … and was not just a fluke of the test.” Also, people and their responses change. “You want to see that your version is consistently better than the other one,” he said.