
A/B Testing: Clustered vs Clean Experiment Design

We’re currently designing an A/B test for a client’s product page. It’s the first experiment on the page, and we’ve recommended a number of changes.

The client pointed out that our recommended experiment contains, in effect, several variables. (Or as I called it, a “cluster” of variables.) How then will we know which variable had the greatest impact on the results?

A very good question. And the fact is, we won’t know which variable had the greatest impact.

For a first test, the goal is usually to achieve the greatest possible lift in performance. In most cases, you're not going to achieve that by making one isolated change (for example, the wording of the call to action, or the color of a button). Usually, there'll be a combination of elements under review.
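To make that concrete, here's a rough sketch (in Python, with made-up numbers) of how a clustered test is typically read: one aggregate comparison of conversion rates between the control and the redesigned page. The function and the figures are illustrative assumptions, not output from any particular testing tool.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rate of a control page (A) against a
    redesigned page (B). Returns the relative lift and a two-sided
    p-value. With a clustered redesign, this measures the combined
    effect of all the changes at once; it can't attribute the lift
    to any single element."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return (p_b - p_a) / p_a, p_value

# Hypothetical numbers: 4% conversion on the control, 5% on the redesign.
lift, p = two_proportion_z_test(conv_a=400, n_a=10_000, conv_b=500, n_b=10_000)
print(f"lift: {lift:.1%}, p-value: {p:.4f}")
```

Whatever the numbers, the result is a single lift figure for the whole redesign. That's the trade-off the client was asking about.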

How do you decide what to change? Well, it’s a mixture of art and science. In reviewing a page, you consider established best practices and what you’ve learned from past experience, then hypothesize as to how the page could be made to perform better. Just a few things you might consider:

  • Is there a simple, obvious call to action?
  • Why should the user do as you ask? What’s the payoff?
  • Does the page invoke urgency? Why should the user act now?
  • Are there any unnecessary distractions on the page?
  • Is there anything on the page that might undermine its trustworthiness or make the user hesitate?
  • Does the page communicate effectively with all different personality types (for example, Humanistic, Competitive, Methodical, and Spontaneous personalities)?
  • Are there any particular persuasion tactics that could be employed on the page? (For example, Social Proof, Liking, Authority, Reciprocity, the Contrast Principle… For more online persuasion ideas, see this post for a Persuasion Checklist.)

As you can imagine, you can usually spot a whole raft of issues. So on a first test, the redesigns are usually quite dramatic. You’ll have a large cluster of variables to test.

And yes, that means you won’t know which changes had the strongest impact. It’s even possible that some of your changes had a negative impact. From a scientific viewpoint, these experiments aren’t very “clean”.
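To see why a "clean" design is rarely practical on a first test, consider what isolating every change would cost. The sketch below (again in Python, with assumed traffic figures) counts the variants a full-factorial design would need for a hypothetical cluster of five changes.

```python
from itertools import product

# Hypothetical cluster of changes under review on the page.
changes = ["headline", "cta_wording", "button_color", "social_proof", "urgency"]

# A "clean" full-factorial design tests every on/off combination,
# so each change's individual effect can be isolated.
variants = list(product([0, 1], repeat=len(changes)))
print(f"{len(changes)} changes -> {len(variants)} variants")  # 5 changes -> 32 variants

# At a rough (assumed) 10,000 visitors per variant for a readable result,
# that's 320,000 visitors -- versus 20,000 for a simple
# control-vs-redesign test of the whole cluster.
print(f"traffic needed: ~{len(variants) * 10_000:,} visitors")
```

That traffic bill is why, in practice, you test the whole cluster first and isolate the promising pieces later.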

But that’s what follow-up tests are for. You can’t expect to get all your answers from one test.

As I wrote years ago (in discussing when it's advisable to end a test early), I think we should ask ourselves why we run these experiments. Almost always, it's for marketing. The goal is to improve performance, not to advance scientific knowledge. "Cleanliness" comes a distant second to enhancing the bottom line.
