A/B Testing – Knowing what the User Wants

A/B Testing?

Have you ever wondered how social games like FarmVille and Mafia Wars become so viral and engaging? Or how Google, Amazon, eBay and Zynga manage to present content to you at the right time, with the right keywords, to grab your attention?

There is a user-testing process known as A/B testing that is used by many large companies that run their business online. It started as an email-marketing technique: present variations of the same content (changing a single variable at a time) to different cohorts of users, then measure how the response changes with each variation.

For instance, if 10 users were sent emails about a flower shop in downtown San Jose, 5 of them might see the picture of the flower shop at the top of the email and the other 5 would see the same picture at the bottom. Parametrized URLs embedded in the email would then record which group read the mail and followed the links more often, showing which layout worked better. That information would in turn be used to create email templates with a higher chance of being followed by the user. This stochastic analysis allows data analysts to build models from the different variants of the component being tested, and to use those models to add attractive features to the marketing tools and eventually to the products being marketed. Here is a demo that shows what is actually tested: ABTestDemo.
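As a rough illustration of the parametrized URLs mentioned above, here is a minimal Python sketch. The landing-page address, the `uid` and `variant` parameter names, and both function names are assumptions made up for this example; they are not taken from any real campaign or tool.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical landing page for the flower shop email.
BASE_URL = "https://example.com/flower-shop"


def tracked_url(user_id: str, variant: str) -> str:
    """Return the landing-page URL tagged with the recipient and the variant."""
    return f"{BASE_URL}?{urlencode({'uid': user_id, 'variant': variant})}"


def click_rates(sent: Dict[str, int], clicked_urls: List[str]) -> Dict[str, float]:
    """Compute the click-through rate per variant from the URLs that were followed.

    `sent` maps each variant name to the number of emails sent with it.
    """
    clicks = {variant: 0 for variant in sent}
    for url in clicked_urls:
        # Read the variant back out of the query string of the clicked link.
        variant = parse_qs(urlparse(url).query)["variant"][0]
        clicks[variant] += 1
    return {variant: clicks[variant] / sent[variant] for variant in sent}
```

Each email would embed `tracked_url(user_id, "top")` or `tracked_url(user_id, "bottom")`, and the web server log of followed links would be passed to `click_rates` to compare the two layouts.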

The Science

To make it work, users should not know that they are being tested, as knowing may affect their natural reaction. Thus, only one of the two options is presented to a given user, and that user’s reaction is captured to measure the effectiveness of that option for that user. Collecting responses across a larger user base allows analysts to determine how that cohort generally feels about the feature. A/B tests are very similar to psychometric tests, except that the goal is not to understand a particular user better but to establish behavioral trends in a large group of users.

Having said that, it is still a huge task to identify what should be and can be tested, how the variations should be parametrized, which variations should be used for which cohort, and how to interpret the results collected for each variation. Since both the scale and the magnitude of A/B testing can easily become intractable, A/B testing and the conclusions drawn from its results are essentially part of a stochastic process. That means the conclusions are not definite but probabilistic. In other words, a test like the one shown above in ABTestDemo can tell us that, say, 8 out of 10 users prefer Option B over Option A, but it may not be able to tell us why 80% of users prefer one option over the other. The users do not make the choice consciously; rather, they are subconsciously driven towards or away from the conversion goal (where conversion may mean different things for different applications or websites). Therefore, the improvement gained from A/B testing depends on the testing team coming up with really compelling options for the users. If ‘Option A’ and ‘Option B’ of a test end up evoking the same or similar emotions, the test cannot yield any worthwhile results. So choosing the right tests, and the right options within each test, becomes critical to the success of A/B testing.
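To make the point about probabilistic conclusions a little more concrete, here is a small Python sketch of a two-proportion z-test, one common way of judging whether the gap between two variants is likely to be more than chance. This technique is not described in the text above; it is brought in here purely as an illustration, and the function name and the counts in the usage note are invented.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / conv_b are the numbers of converting users, n_a / n_b the
    numbers of users shown each option. Uses the pooled-variance normal
    approximation, which is only reasonable for fairly large samples.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Nobody or everybody converted in both groups: no measurable difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal at |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(200, 1000, 250, 1000)` compares a made-up 20% conversion rate for Option A with 25% for Option B. A small p-value suggests the difference is unlikely to be chance alone, but it still says nothing about why users prefer one option.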

The Technology

I came across A/B testing while working as the SDK architect at Playfirst Inc. I was developing the SDK and was given the additional responsibility of adding A/B testing support to the games by means of the SDK. What I had to start with was a hashing algorithm for creating the basic percentage-based buckets to which the test variants were allocated.

The architecture was very simple. A service written in PHP was invoked at deployment time to load the A/B variations and the distribution rules, that is, the percentage of users allocated to one bucket or the other for each test. Once created, these rules were persisted in Memcache. When a user logged in to the game, a hash was calculated from their unique id and some other input parameters, and the hash was used to place the user in one of the two A/B buckets according to the percentage range it fell in. The resulting allocation map, covering all the tests the user would undergo, was then stored against the user’s id in Memcache, so that the valid tests were not recalculated on subsequent logins. The cached maps were, however, invalidated every time the tests, the distribution or the rules changed on the server side.

Once calculated, the user’s bucket was sent as an additional parameter on all subsequent requests from client to server, so that the server would return the metadata applicable to that bucket. The client also had the logic to present the features being tested differently to users and to measure their responses.
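The bucket allocation described above can be sketched roughly as follows. This is not the original PHP service: it is a minimal Python illustration in which SHA-256 stands in for the unspecified hashing algorithm, a plain dictionary stands in for Memcache, and all the names (`bucket_for`, `AllocationCache`, `update_tests`) are hypothetical.

```python
import hashlib
from typing import Dict, Tuple


def bucket_for(user_id: str, test_name: str, percent_a: int) -> str:
    """Deterministically place a user in bucket "A" or "B" for one test.

    percent_a is the share of users (0-100) that should land in bucket A.
    SHA-256 is an assumption; any stable, well-mixed hash would do.
    """
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    slot = int(digest, 16) % 100  # a stable number in the range 0..99
    return "A" if slot < percent_a else "B"


class AllocationCache:
    """In-memory stand-in for the Memcache layer described above."""

    def __init__(self, tests: Dict[str, int]) -> None:
        self.tests = tests      # test name -> percentage of users in bucket A
        self.version = 1        # bumped whenever the tests or rules change
        self._store: Dict[str, Tuple[int, Dict[str, str]]] = {}

    def update_tests(self, tests: Dict[str, int]) -> None:
        """Replace the rules; bumping the version invalidates every cached map."""
        self.tests = tests
        self.version += 1

    def allocations(self, user_id: str) -> Dict[str, str]:
        """Return the user's test -> bucket map, recomputing only when stale."""
        cached = self._store.get(user_id)
        if cached is not None and cached[0] == self.version:
            return cached[1]
        fresh = {name: bucket_for(user_id, name, pct) for name, pct in self.tests.items()}
        self._store[user_id] = (self.version, fresh)
        return fresh
```

Because the hash depends only on the user id and the test name, the same user always lands in the same bucket for a given test, which is what allows the map to be cached between logins. Including the test name in the hash input also keeps a user’s bucket in one test independent of their bucket in another.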

Conclusion

I am a big fan of this method of testing and collecting data because it favors an unbiased, natural selection of the stronger candidate. I think it is the most natural form of evolution for a product and is therefore more organic, since no abrupt, undesirable or hard-to-deal-with changes are forced on the users. Quickly changing products cause more nervousness than confidence in users. Even though it can be a somewhat slow and cumbersome process, it promises much deeper insight into user behavior and into how the product is generally perceived. The method is also tricky to implement, because identifying the areas to test and defining the tests takes an enormously wide perspective and can become a source of disagreement among the team members designing and implementing it. Also, I haven’t come across any frameworks or cookie-cutter templates for implementing A/B testing. I guess the reason is that each product has different needs and is implemented differently, so it will take a while before patterns are identified and frameworks are created.
