Paradoxical Averages

One day, the owner of a Thai restaurant had to decide whether to continue making his own paste or switch to pre-made paste from the supermarket for the green curry dishes on the menu. He devised an experiment to help him decide. He cooked two dishes, one with a paste that he made from scratch (dish A), and the other with pre-made paste (dish B).


During the experiment, he served dish A to 10 random people who had ordered green curry in his restaurant at the time. Since he could not serve another green curry dish to the same people, the next 10 orders, from other people, were served with dish B. He requested feedback from the 20 people in the form of ratings on a scale of 1 to 5, where 1 means bad and 5 means excellent. He collected the ratings and laid them out as shown below.

Dish A – 1, 2, 1, 1, 2, 4, 5, 4, 4, 5 (average 2.9)
Dish B – 1, 1, 1, 4, 4, 4, 4, 4, 4, 4 (average 3.1)

He had a quick glance at the ratings and calculated the averages. Dish A scored an average rating of 2.9 and dish B scored 3.1. He was surprised that dish B had the higher average rating, considering that it was cooked from pre-made paste. He no longer saw the point in making his curry paste from scratch and began using pre-made paste for all his curry dishes.

A few months later, fewer and fewer people were coming to his restaurant for his curry and the business started making losses. He did not know why. Now, hindsight is a wonderful thing. Let us go back and try to figure out what went wrong. At a high level, there are two things that we can learn from his predicament: he did not know his customers well enough, and he accepted the averages at face value.

The first lesson is that you should look at data only once you understand how customers interact with your products or services. In the case of the Thai restaurant, two types of people frequented his establishment, which he did not know. There were aficionados who swoon over the delicate yet bold flavours of Thai food (type A), and others who cannot tell coriander from mint but simply love everything Thai (type B). When he conducted his experiment, he did not know that the makeup of the groups tasting dish A and dish B was different. Type A people were harsh critics and generally rated his curry dishes lower. Looking back at the ratings, 5 type A and 5 type B people tasted dish A. Dish B was tasted by fewer type A people (the 3 who gave a rating of 1) and by 7 type B people. This understanding is an important input into the statistical analysis of the data, which we will explain next.

Dish A, Type A – 1, 2, 1, 1, 2 (average 1.4)
Dish A, Type B – 4, 5, 4, 4, 5 (average 4.4)
Dish B, Type A – 1, 1, 1 (average 1.0)
Dish B, Type B – 4, 4, 4, 4, 4, 4, 4 (average 4.0)

To explain the second lesson, let us re-analyse the ratings with the additional context of customer types. We segmented the data by whether a rating came from a type A or a type B customer, then calculated the averages; the results are shown above. They show that type A and type B customers both preferred dish A over dish B: 1.4 against 1.0 for type A, and 4.4 against 4.0 for type B. How, then, could dish B get the higher average rating when we look at the ratings without the customer types? This paradoxical situation is known as Simpson’s paradox or the Yule-Simpson effect. The restaurant owner took the initial averages, which favour dish B, at face value and ran with them. In reality, both customer types preferred dish A.
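The two views of the same ratings can be reproduced in a few lines of Python. This is only an illustrative sketch: the dictionary layout and the function names `overall_average` and `segment_average` are assumptions made for this example, while the ratings themselves are the ones listed above.

```python
from statistics import mean

# Ratings from the experiment, keyed by (dish, customer type).
ratings = {
    ("A", "type A"): [1, 2, 1, 1, 2],
    ("A", "type B"): [4, 5, 4, 4, 5],
    ("B", "type A"): [1, 1, 1],
    ("B", "type B"): [4, 4, 4, 4, 4, 4, 4],
}


def overall_average(dish):
    """Average rating for a dish, ignoring customer type."""
    pooled = [r for (d, _), rs in ratings.items() if d == dish for r in rs]
    return mean(pooled)


def segment_average(dish, customer_type):
    """Average rating for a dish within one customer type."""
    return mean(ratings[(dish, customer_type)])


for dish in ("A", "B"):
    print(f"Dish {dish} overall: {overall_average(dish):.1f}")
    for customer_type in ("type A", "type B"):
        avg = segment_average(dish, customer_type)
        print(f"  Dish {dish}, {customer_type}: {avg:.1f}")
```

Pooled, dish B comes out ahead; within each customer type, dish A does. Both statements are true of the same 20 ratings, which is what makes the situation paradoxical.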

In short, the restaurant owner overlooked the possibility that other variables might influence the outcome of this experiment beyond how good dish A was compared to dish B. In this case, the tendency of the different customer types to rate differently confounded the results for him. Simply put, the customers never actually preferred dish B over dish A. The false conclusion that dish A was inferior was just the result of a larger share of lenient customers (type B) assessing dish B than dish A (7 of 10 against 5 of 10), and the rookie mistake of not taking that into account during analysis.
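One way to see the effect of the customer mix is to note that each overall average is a weighted average of the per-type averages, with the weights given by how many customers of each type tasted the dish. The sketch below first reproduces the observed averages from those weights, then re-weights both dishes to the same mix. The common mix used here (all 8 type A and 12 type B tasters pooled) is an assumption chosen for illustration, not something the restaurant owner measured, and the helper name `weighted_average` is likewise hypothetical.

```python
# Per-type average ratings and taster counts for each dish, from the experiment.
segment_means = {
    "A": {"type A": 1.4, "type B": 4.4},
    "B": {"type A": 1.0, "type B": 4.0},
}
segment_counts = {
    "A": {"type A": 5, "type B": 5},
    "B": {"type A": 3, "type B": 7},
}


def weighted_average(means, weights):
    """Average of per-type means, weighted by the number of customers per type."""
    total = sum(weights.values())
    return sum(means[k] * weights[k] for k in means) / total


# Each dish weighted by its own customer mix: reproduces the observed averages.
observed = {
    dish: weighted_average(segment_means[dish], segment_counts[dish])
    for dish in segment_means
}

# Assumption for illustration: give both dishes the same mix of customers,
# here the pooled mix of all 20 tasters (8 type A, 12 type B).
common_mix = {"type A": 8, "type B": 12}
standardized = {
    dish: weighted_average(segment_means[dish], common_mix)
    for dish in segment_means
}

for dish in ("A", "B"):
    print(f"Dish {dish}: observed {observed[dish]:.1f}, "
          f"same-mix {standardized[dish]:.1f}")
```

Under this assumed common mix, dish A averages about 3.2 and dish B about 2.8, so the ordering agrees with the per-type comparison. Any common mix would give the same ordering, because dish A is ahead within both customer types.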