What is data leakage and why does it matter?

Data leakage occurs when the data used to train a machine learning algorithm includes unexpected information about the thing you are trying to predict, allowing the model to make unrealistically good predictions. In other words, because the training data already contains the answer, hidden in some variable, the model’s results may not be useful on new data.

For example, let’s say we’re trying to predict which customers made a purchase, and the dataset used to make the predictions contains a customer ID (which we assume to be random), but the customer ID starts with a 1 if the customer made a purchase. We can now build a model with impeccable predictions (just use the rule “first digit = 1”). However, this rule is useless. It doesn’t help to determine whether a new customer will make a purchase based on the factors that actually matter. Of course, this example is a bit ridiculous, since using a customer ID as a predictor in a model would be naive, but here are some more concrete examples of where this happens.
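Before moving on, the customer ID scenario can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions: the data is synthetic, the 50% purchase rate is arbitrary, and the helper names (make_leaky_customers, make_new_customers, rule_accuracy) are hypothetical. It compares the "first digit = 1" rule on leaky historical data with the same rule on new customers whose IDs are assigned before anyone knows whether they will buy.

```python
import random

def make_leaky_customers(n, rng):
    """Synthetic historical data: the ID's first digit encodes the label."""
    rows = []
    for _ in range(n):
        purchased = rng.random() < 0.5  # assumed purchase rate
        first = "1" if purchased else rng.choice("23456789")
        rest = "".join(rng.choice("0123456789") for _ in range(5))
        rows.append((first + rest, purchased))
    return rows

def make_new_customers(n, rng):
    """Synthetic new customers: the ID is unrelated to the outcome."""
    rows = []
    for _ in range(n):
        purchased = rng.random() < 0.5
        first = rng.choice("123456789")
        rest = "".join(rng.choice("0123456789") for _ in range(5))
        rows.append((first + rest, purchased))
    return rows

def rule_accuracy(rows):
    """Accuracy of the rule 'predict a purchase if the ID starts with 1'."""
    correct = sum((cid[0] == "1") == purchased for cid, purchased in rows)
    return correct / len(rows)

rng = random.Random(0)
train_acc = rule_accuracy(make_leaky_customers(10_000, rng))  # exactly 1.0
new_acc = rule_accuracy(make_new_customers(10_000, rng))      # roughly 0.5

print(f"accuracy on leaky data:    {train_acc:.3f}")
print(f"accuracy on new customers: {new_acc:.3f}")
```

The rule is perfect on the leaky data and about as good as a coin flip on the new customers, which is the gap that leakage hides during training.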

In a recent episode of Data Skeptic, they discussed an interesting example of how Amazon wanted to cross-sell jewelry. They pulled together a dataset of everyone in their database who had made a purchase, and from there, attempted to determine who had purchased jewelry as a function of their other purchases. This led the model to determine that anyone who had purchased nothing else would purchase jewelry.

Why? Because for a customer to be included in their dataset, they had to have purchased something. And, if they hadn’t purchased any non-jewelry item, it is tautologically true that they must have purchased jewelry. This is, again, a useless model. What we would have preferred to have is a model that predicted a customer’s likelihood of purchasing jewelry in the future based on their past purchases.
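The tautology can be reproduced with synthetic data. The sketch below is an illustration only: the population, the 10% and 40% purchase probabilities, and the field names are all assumptions, not figures from the episode. It filters a population down to "people who purchased something" and then measures the jewelry rate among those with no non-jewelry purchase.

```python
import random

rng = random.Random(0)

# Assumed population: each person independently buys jewelry and/or other items.
population = [
    {"jewelry": rng.random() < 0.1, "other": rng.random() < 0.4}
    for _ in range(10_000)
]

# The dataset only contains people who purchased something.
dataset = [p for p in population if p["jewelry"] or p["other"]]

def jewelry_rate(rows):
    """Fraction of rows in which the person bought jewelry."""
    return sum(r["jewelry"] for r in rows) / len(rows)

no_other_in_dataset = [p for p in dataset if not p["other"]]
no_other_in_population = [p for p in population if not p["other"]]

# Inside the filtered dataset, "bought nothing else" implies "bought jewelry".
rate_in_dataset = jewelry_rate(no_other_in_dataset)
# In the full population, the same group buys jewelry at about the base rate.
rate_in_population = jewelry_rate(no_other_in_population)

print(f"jewelry rate, filtered dataset: {rate_in_dataset:.3f}")
print(f"jewelry rate, full population:  {rate_in_population:.3f}")
```

The rate in the filtered dataset is exactly 1.0 by construction, while the rate in the full population stays near the assumed 10%. The "signal" comes entirely from how the rows were selected.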

Another example: suppose we found in the Amazon data that users who had purchased gloves were very likely to purchase jewelry. If, after a bit of digging, we found that gloves were the only product recommended to shoppers buying jewelry, and jewelry the only product recommended to shoppers buying gloves, then we would have stumbled on a pattern that the website itself dictates rather than one inherent in the users’ behavior.

Finally, we may observe that revenue goes significantly up or drastically down during the last few days of each month and then conclude that customers are more or less likely to purchase on those days. However, we might dig into the advertising data and find that a certain advertising channel has been ramped up or turned off completely for the last few days of the month, either because the rest of a monthly budget needed to be used up or because the monthly budget had run out. We have not observed that a particular user is more or less likely to purchase on these specific days. We have observed something inherent in the data generation process that hides the customer’s underlying behavior.

All of this underscores the need to be aware of how the process that generated the data impacts the data itself. Without this, we may build useless and meaningless models. Is the data telling us something that we already know?

The Solution

To deal with these challenges, you’ll need to collect your data carefully and frame your predictive problem accurately. This can be difficult, because we often only learn the flaws of our setup once we go to use our model and data! Be aware of any instance where your website is pushing users to behave in a certain way. That way, if you happen to stumble across a pattern, you will know whether it was designed or whether it is actually an element of the customer’s behavior. And most importantly, when we discover any relationship in our data, it is important to ask ‘why’ the relationship might be happening and whether it makes sense logically.
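One way to prompt that ‘why’ question is to screen for any single variable that predicts the target almost perfectly on its own. The sketch below is a rough heuristic, not an established method: the function names, the 0.95 threshold, and the toy rows are all assumptions. It scores each feature by how well "predict the most common outcome for this feature value" works on the data at hand, and flags features that score suspiciously high. Because it scores on the same rows it learns from, a column of unique values (such as a full customer ID) will always be flagged, which is usually a useful warning in itself.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(rows, feature, target):
    """Accuracy of predicting the majority target class within each feature value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[target]] += 1
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(rows)

def flag_suspicious_features(rows, target, threshold=0.95):
    """Return features that, alone, predict the target at or above the threshold."""
    features = [name for name in rows[0] if name != target]
    return [
        f for f in features
        if single_feature_accuracy(rows, f, target) >= threshold
    ]

# Toy data (made up): the first digit of the ID leaks the outcome, region does not.
rows = [
    {"id_first_digit": "1", "region": "north", "purchased": True},
    {"id_first_digit": "1", "region": "south", "purchased": True},
    {"id_first_digit": "7", "region": "north", "purchased": False},
    {"id_first_digit": "3", "region": "south", "purchased": False},
    {"id_first_digit": "1", "region": "north", "purchased": True},
    {"id_first_digit": "5", "region": "south", "purchased": False},
]

suspicious = flag_suspicious_features(rows, "purchased")
print(suspicious)  # ['id_first_digit']
```

A flagged feature is not proof of leakage; it is a reason to go back to how the data was generated and ask whether that variable could be known before the outcome occurs.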