
Sarah Glasmacher

What I learned while building ML data pipelines for energy demand forecasting 📉



JUL 20



I’m currently deep in a project at work that involves forecasting energy demand using machine learning. One of the most challenging aspects is that we are working with several data sources:

  • Internal data from production facilities, coming from legacy systems that aren’t yet integrated into our ML platform
  • Weather data: historical records stored in an on-site SQL warehouse, now updated daily via API
  • Previous model results, from an external provider that we need to load for comparison during the transition phase

We’re building this new model on Databricks, but I believe the lessons here are applicable to most use cases that have distinct data sources.


Lesson 1: Protect your raw data like it’s sacred

Create a raw/bronze data layer and never touch it again. Don’t delete, overwrite, or “clean up” this layer. Just dump everything in and keep it safe.

Why? Because:

  • Some source systems are expensive or slow to re-query
  • Others require coordination with people (who don’t want to resend you data for the third time)
  • You will mess something up and need the original data again

Think of bronze as your safety net. Back it up. Limit write access. Only admins or very senior engineers should be able to change or delete it. (Apparently not everyone understands "keep this data intact" 🤷🏻‍♀️😭)
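As a toy illustration outside Databricks (on Databricks itself this would typically be an append-only Delta table), an append-only bronze ingest might look like the sketch below. `ingest_to_bronze` and the folder layout are invented names for illustration, not our actual pipeline:

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

def ingest_to_bronze(records, source_name, bronze_root="bronze"):
    """Append-only ingest: every batch lands in its own new file.

    Nothing in bronze is ever overwritten; re-running the ingest
    simply adds another file next to the old ones.
    """
    batch_dir = Path(bronze_root) / source_name
    batch_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    # uuid suffix guarantees uniqueness even for same-second batches
    path = batch_dir / f"batch_{stamp}_{uuid.uuid4().hex[:8]}.json"
    # Open in "x" mode: fail loudly rather than overwrite anything.
    with open(path, "x", encoding="utf-8") as f:
        json.dump(records, f)
    return path
```

The `"x"` open mode is the whole point: if anything ever tries to write over an existing bronze file, it raises an error instead of silently destroying your safety net.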


Lesson 2: Save intermediate steps — always

Save. Everything. (Yes, this is a theme here.)

Build your pipeline in stages:

  • First, make sure your timestamps are consistent
  • Then look at null values
  • Then explore imputation strategies

And save each stage. You can branch out and try multiple approaches, for example for imputing null values. But don’t delete old versions until you’re 100% confident and the project is shipped.

Storage is cheap. Your time isn’t. Especially during development, saving intermediate results will save you hours (if not weeks) later.

It’s like playing a video game: save frequently, so you can reload if something breaks.
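A minimal sketch of that staged approach (the function names and toy stages are hypothetical, not our actual pipeline): each stage is a plain function, and its output is persisted before the next stage runs.

```python
import json
from pathlib import Path

def run_staged(rows, stages, out_dir="silver"):
    """Run named pipeline stages, saving the output of each one.

    `stages` is a list of (name, function) pairs. Every stage's result
    goes to its own numbered file, so any step can be reloaded later
    without re-running the steps before it.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    data = rows
    for i, (name, fn) in enumerate(stages, start=1):
        data = fn(data)
        (out / f"{i:02d}_{name}.json").write_text(json.dumps(data))
    return data

# Toy stages mirroring the list above (for illustration only):
def fix_timestamps(rows):
    # normalise timestamps; here just trimming stray whitespace
    return [{**r, "ts": r["ts"].strip()} for r in rows]

def flag_nulls(rows):
    # mark rows with a missing demand value instead of dropping them
    return [{**r, "is_null": r.get("demand") is None} for r in rows]
```

Trying a second imputation strategy then just means adding another `(name, function)` pair that writes to a different file, while the outputs of the old approach stay on disk until you're sure you don't need them.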


Lesson 3: Visualize early, visualize often

If you’re not sure what’s wrong — try plotting it.

I come from more of a data science and math background than data engineering, so sometimes I found it hard to craft the “right” SQL queries to check things. But every time I plotted something, I discovered a problem or insight I would’ve missed otherwise.

Some quick ideas:

  • A bar chart showing how many values are missing per day
  • A grid view of your time series data (is the series complete, or are timestamps missing?)
  • Even basic line plots can reveal outliers or shifts in scale

At some point, you can’t scroll through all your tables manually. Visualization forces you to engage with your data. Bonus: you can use those same plots to explain issues to stakeholders and justify why a pipeline is still in progress.
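The missing-values-per-day bar chart from the first bullet only needs a tiny aggregation. Here’s a sketch (field names are made up, and the plotting part is left as commented-out matplotlib since that dependency is optional):

```python
from collections import Counter

def missing_per_day(rows, value_key="demand", ts_key="ts"):
    """Count missing values per calendar day.

    Assumes timestamps look like '2024-07-20T13:00' so the first
    10 characters are the date.
    """
    counts = Counter()
    for r in rows:
        if r.get(value_key) is None:
            counts[r[ts_key][:10]] += 1
    return dict(counts)

# To actually plot it (assumes matplotlib is installed):
# import matplotlib.pyplot as plt
# counts = missing_per_day(rows)
# plt.bar(counts.keys(), counts.values())
# plt.title("Missing demand values per day")
# plt.show()
```

Days that don’t appear in the result have no missing values at all, which is itself worth double-checking against what you expect from the source system.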


There’s still a long way to go on this project (and only a few weeks left until the first deadline), but these lessons have already helped us move faster and avoid costly mistakes. Hopefully, they help you too.

I’ll be back soon with more lessons from the trenches of energy forecasting and ML pipelines. Until then: save often and visualize everything. 😬


Blog posts I've shared since my last newsletter:

Using Taskfile with uv and pyproject.toml to manage your Python Machine Learning projects

