What is data leakage and why does it matter?

Data leakage occurs when the data you use to train a machine learning algorithm includes unexpected information about what you are trying to predict, allowing the model to make unrealistically good predictions. In other words, because the data used to make the prediction already contains the answer, hidden in some variable, the results of the model may not actually be useful.

For example, let’s say we’re trying to predict which customers made a purchase, and the dataset contains a customer ID (which we assume to be random), but the customer ID starts with a 1 if the customer made a purchase. We can now build a model with flawless predictions (just use the rule “first digit = 1”). However, this rule is useless. It doesn’t help determine whether a new customer will make a purchase based on the factors that actually matter. Of course, this example is a bit ridiculous, since using a customer ID as a predictor in a model would be naive, but here are some more realistic examples of where this happens.
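The customer ID example can be made concrete with a small simulation. This is only a sketch: the data is synthetic, and the names `make_customers`, `first_digit_rule`, and `accuracy`, the 30% purchase rate, and the six-digit ID format are all assumptions made for illustration. It scores the "first digit = 1" rule on historical data where the ID leaks the outcome, and then on new customers whose IDs were assigned independently of any purchase.

```python
import random

def make_customers(n, leak, seed):
    """Build synthetic customers (hypothetical data, for illustration only)."""
    rng = random.Random(seed)
    customers = []
    for _ in range(n):
        purchased = rng.random() < 0.3  # assumed 30% purchase rate
        if leak:
            # Leak: purchasers get an ID starting with 1, everyone else 2-9.
            first_digit = "1" if purchased else str(rng.randint(2, 9))
        else:
            # No leak: the first digit is unrelated to the outcome.
            first_digit = str(rng.randint(1, 9))
        rest = "".join(str(rng.randint(0, 9)) for _ in range(5))
        customers.append({"customer_id": first_digit + rest, "purchased": purchased})
    return customers

def first_digit_rule(customer):
    """Predict a purchase from the first digit of the ID alone."""
    return customer["customer_id"].startswith("1")

def accuracy(customers, rule):
    """Share of customers for whom the rule matches the true outcome."""
    hits = sum(rule(c) == c["purchased"] for c in customers)
    return hits / len(customers)

historical = make_customers(1000, leak=True, seed=0)
new_customers = make_customers(1000, leak=False, seed=1)

leaky_accuracy = accuracy(historical, first_digit_rule)
new_accuracy = accuracy(new_customers, first_digit_rule)

print(f"historical data (leaky ID): {leaky_accuracy:.2f}")
print(f"new customers (no leak):    {new_accuracy:.2f}")
```

On the leaky data the rule is perfect by construction. On the new customers it cannot do better than chance, because the first digit no longer carries any information about the purchase.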

A recent episode of the Data Skeptic podcast discussed an interesting example of how Amazon wanted to cross-sell jewelry. They pulled together a dataset of everyone in their database who had made a purchase and, from there, attempted to predict who had purchased jewelry as a function of their other purchases. This led the model to determine that anyone who had purchased nothing else would purchase jewelry.

Why? Because for a customer to be included in their dataset, they had to have purchased something. And, if they hadn’t purchased any non-jewelry item, it is tautologically true that they must have purchased jewelry. This is, again, a useless model. What we would have preferred to have is a model that predicted a customer’s likelihood of purchasing jewelry in the future based on their past purchases.
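The tautology is easy to reproduce on a toy population. The sketch below uses made-up customers and category names (none of this is Amazon's actual data or method) and compares the jewelry rate among customers with no non-jewelry purchases, first inside the filtered dataset and then in the full population.

```python
# Hypothetical population: each customer has a set of purchased categories.
population = [
    {"id": 1, "bought": set()},
    {"id": 2, "bought": {"books"}},
    {"id": 3, "bought": {"jewelry"}},
    {"id": 4, "bought": {"books", "jewelry"}},
    {"id": 5, "bought": {"gloves"}},
    {"id": 6, "bought": set()},
]

def jewelry_rate_without_other_purchases(customers):
    """Jewelry purchase rate among customers with no non-jewelry purchase."""
    no_other = [c for c in customers if not (c["bought"] - {"jewelry"})]
    return sum("jewelry" in c["bought"] for c in no_other) / len(no_other)

# Selection step: only customers with at least one purchase enter the dataset.
dataset = [c for c in population if c["bought"]]

rate_in_dataset = jewelry_rate_without_other_purchases(dataset)
rate_in_population = jewelry_rate_without_other_purchases(population)

print(f"inside the filtered dataset: {rate_in_dataset:.2f}")
print(f"in the full population:      {rate_in_population:.2f}")
```

Inside the filtered dataset the rate is 100%, purely because of how rows were selected. In the full population, most people with no other purchases have bought nothing at all.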

As another example, suppose we found in the Amazon data that users who had purchased gloves were very likely to purchase jewelry. If, after a bit of digging, we found that gloves were the only product recommended to shoppers buying jewelry, and jewelry the only product recommended to shoppers buying gloves, we would have stumbled on a pattern that the website itself dictates rather than one inherent in the users’ behavior.

Finally, we may observe that revenue goes significantly up or drastically down during the last few days of each month and conclude that customers are more or less likely to purchase on those days. However, we might dig into the advertising data and find that a certain advertising channel has been ramped up or turned off completely for the last few days of the month, either to use up the rest of a monthly budget or because the monthly budget has run out. We have not observed that a particular user is more or less likely to purchase on these specific days. We have observed something inherent in the data generation process that hides the customer’s underlying behavior.

All of this underscores the need to be aware of how the process that generated the data impacts the data itself. Without this, we may build useless and meaningless models. Is the data telling us something that we already know?

The Solution

To deal with these challenges, you’ll need to collect your data precisely and frame your predictive problem accurately. This can be difficult, because we often only learn the flaws of our setup once we go to use our model and data! Be aware of any instance where your website is pushing users to behave in a certain way, so that if you happen to stumble across a pattern, you will know whether it was designed or whether it is actually an element of the customer’s behavior. And most importantly, when we discover any relationship in our data, it is important to ask ‘why’ the relationship might be happening and whether it makes sense logically.
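One way to prompt the ‘why’ question is to screen for any single feature that predicts the target almost perfectly on its own. The following is a minimal sketch, not a complete leakage test: the function names, the 0.95 threshold, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(rows, feature, label):
    """Accuracy of predicting the label with the majority label per feature value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[label]] += 1
    # For each feature value, the majority label gets that many rows right.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(rows)

def flag_suspicious_features(rows, features, label, threshold=0.95):
    """Return the features that predict the label almost perfectly by themselves."""
    return [f for f in features
            if single_feature_accuracy(rows, f, label) >= threshold]

# Hypothetical rows: the first digit of the ID leaks the outcome, region does not.
rows = [
    {"id_first_digit": "1", "region": "east", "purchased": True},
    {"id_first_digit": "1", "region": "west", "purchased": True},
    {"id_first_digit": "7", "region": "east", "purchased": False},
    {"id_first_digit": "3", "region": "west", "purchased": False},
    {"id_first_digit": "1", "region": "east", "purchased": True},
    {"id_first_digit": "9", "region": "west", "purchased": False},
]

flagged = flag_suspicious_features(rows, ["id_first_digit", "region"], "purchased")
print(flagged)
```

A flagged feature is not proof of leakage, only a cue to investigate how that column was generated. Note that a column with a unique value per row will always be flagged by this in-sample check, so it is best applied to features with repeated values or scored on held-out data.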
