
Sarah Glasmacher

We almost deployed a model trained on missing data đŸ«Ł



MAR 17



Last week, my team and I were deep into a data audit, checking for import mistakes and ensuring data completeness. The forecasting model had already been trained - so of course the data should be clean and complete, right? After all, someone had already worked intensively with it... Spoiler alert: our data wasn't nearly as complete as we thought. 😅

The expectation: We simply deploy the model

We had a plan: audit the data, fix some time-shifting issues, and then move on to checking the model so we could deploy it. That was the job: deploy a model a data scientist had already trained. Simple, right? Wrong. We didn't anticipate just how much data would be missing. And how do you evaluate model performance when you're either skipping those entries or, worse, imputing values? Suddenly, our evaluation data points felt questionable.
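One way to keep the evaluation honest is to track which points were imputed and score the model only on genuinely observed values. A minimal sketch with made-up numbers (the mask and the metric are just illustrations, not our actual pipeline):

```python
import numpy as np

# Hypothetical forecast evaluation: four target values, one of which was
# imputed rather than observed in the source data.
y_true = np.array([10.0, 12.0, 11.0, 13.0])
y_pred = np.array([10.5, 11.0, 11.5, 12.0])
observed = np.array([True, True, False, True])  # False = value was imputed

# Naive MAE over everything quietly scores the model against fabricated data.
mae_all = np.abs(y_true - y_pred).mean()

# Restricting the metric to observed points keeps the comparison honest.
mae_observed = np.abs(y_true[observed] - y_pred[observed]).mean()
```

The two numbers can differ substantially once larger chunks are imputed - which is exactly when you most need to know.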

Don't just interpolate the missing data - check the source!

Here’s another problem with missing data: it doesn’t just impact model performance - it can undermine the entire evaluation process. If you’re interpolating or making up values, you’re essentially comparing your model against fabricated data. And that’s a problem. In our case, the missing data was actually available in the source system. But something went wrong during the import process. It’s frustrating because it’s such an easy fix - just clean up the imports - but it’s also a reminder of how small mistakes can snowball.

Build a system to catch errors in production

Don't just fix the data - log every mistake you find and build a system to catch these import errors in case they happen again. If you don't have checks for missing or incorrect data during deployment, things can go south quickly. In our case, it was just a few missing values here and there, but imagine if larger chunks of data were missing and you didn't notice because you were interpolating over them. Yikes... Developing a pipeline that checks for these issues in production is crucial.
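Such a check can be very small. A sketch in pandas, assuming a time series on a regular grid (the function name, column, and frequency are made up for illustration):

```python
import pandas as pd

def find_missing_timestamps(df: pd.DataFrame, freq: str = "h") -> pd.DatetimeIndex:
    """Return timestamps expected on a regular grid but absent from df.

    Assumes df has a DatetimeIndex and freq is the expected sampling frequency.
    """
    expected = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    return expected.difference(df.index)

# Example: an hourly series with one silently dropped timestamp.
idx = pd.date_range("2025-03-01", periods=24, freq="h").delete(5)
df = pd.DataFrame({"load": range(23)}, index=idx)

missing = find_missing_timestamps(df)
assert len(missing) == 1  # the 05:00 row never made it through the import
```

Run after every import, a check like this turns a silent gap into a loud failure instead of a value you interpolate over months later.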

A simple SQL statement can be enough

You don’t need a massive framework to ensure data quality. Sometimes, a few simple SQL queries or a quick check in pandas can catch a lot of problems. For example, we noticed that every week, a specific timestamp was missing. Turns out, it was likely due to a syntax error in the import or export statement. By checking the timeline for consistency, we were able to spot the issue easily. No need for a huge data quality framework - just a few straightforward checks can make a world of difference.
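To sketch the idea, here is a gap check as a single window-function query, run here against an in-memory SQLite database - the `measurements` table and its `ts` column are hypothetical stand-ins for whatever your import lands in:

```python
import sqlite3

# Toy table standing in for the imported data: hourly timestamps,
# with the 02:00 row missing.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurements (ts TEXT)")
con.executemany(
    "INSERT INTO measurements VALUES (?)",
    [("2025-03-01 00:00",), ("2025-03-01 01:00",), ("2025-03-01 03:00",)],
)

# LAG pairs each timestamp with its predecessor; any jump longer than
# one hour (plus a small tolerance) is a gap in the timeline.
gaps = con.execute("""
    SELECT prev_ts, ts
    FROM (
        SELECT ts, LAG(ts) OVER (ORDER BY ts) AS prev_ts
        FROM measurements
    )
    WHERE prev_ts IS NOT NULL
      AND (julianday(ts) - julianday(prev_ts)) * 24 > 1.001
""").fetchall()
# gaps -> [('2025-03-01 01:00', '2025-03-01 03:00')]
```

One query, no framework - exactly the kind of check that would have flagged our weekly missing timestamp on day one.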

Conclusion: Your data is fine? Check again. And again.

This is something that rarely comes up or gets practiced in university or online courses. There, the data is usually perfect, or the missing data simply has to be imputed. In reality, there often is a source system, and you could have high-quality data if only you bothered to check. I'm so glad I was paranoid enough to audit all the data when our team took over the project.


What’s your experience with data completeness? Do you have tricks for dealing with this issue? (And if you have a data engineer team who takes care of this for you, please thank them on my behalf!)
