

I got side-tracked in my side-project, send help. A January Local-RAG project update 🫣

SARAH GLASMACHER

JAN 21



yeah... so... I'm not sure if this was supposed to happen, but it sure did. So let's talk about it and what I'm going to do about it. Earlier this month I decided to do a monthly coding project to get myself to build more. Spend less time reading and procrastinating, spend more time coding. The first project? A local RAG "from scratch" project (read more about it here on my website).

Right now, I'm stuck in the messy middle. It's where excitement fizzles, side-quests appear, and the actual coding? Well, it's kind of stalled. But let's break this down: What have I done so far? What will I tackle next? And how do I keep moving forward with the time left in this project?

What I've Done So Far

In my last newsletter, I shared a few early wins:

  • Created a script to process local markdown files, extract content, generate embeddings using OpenAI's text-embedding-3-large, and store results in a Pandas DataFrame (a rough sketch of what that looks like follows below this list).
  • Set up a local Postgres database with pgvector in Docker for storing and querying embeddings (not connected to Python yet).
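For reference, a minimal sketch of what that script can look like, assuming the current OpenAI Python SDK - the notes/ folder, the column names, and embedding whole files without any chunking are placeholder choices for illustration, not necessarily what my script does:

```python
# Minimal sketch: embed local markdown files and collect results in a DataFrame.
# Assumes the openai and pandas packages; "notes/" and the columns are placeholders.
from pathlib import Path

import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    """Return the text-embedding-3-large vector for one chunk of text."""
    response = client.embeddings.create(model="text-embedding-3-large", input=text)
    return response.data[0].embedding

rows = []
for md_file in Path("notes").glob("**/*.md"):
    content = md_file.read_text(encoding="utf-8")
    rows.append({"file": md_file.name, "content": content, "embedding": embed(content)})

df = pd.DataFrame(rows)
print(df.head())
```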

That's honestly it. Since then? Progress has slowed to a crawl because I fell down the rabbit hole of embedding models. I wanted to replace OpenAI's embeddings with something else, but wow - that topic is much too complex for a short sprint. I now have 1,500 words of notes on multilingual benchmarks and metrics. Useful? Yes. Immediately actionable? Not so much.

Research Wins (a.k.a. My Procrastination Disguised as Progress)

While I haven't written much code, I've dived deep into research:

  • Embedding Models & Multilingual Data: I explored benchmarks for multilingual embeddings since I'm working with both German and English text data. This is especially relevant for my day job, where we're building a vector store for German text. (apparently bge-m3 and stella-en-1.5B are popular options?)
  • Blog Progress: I wrote and published a blog post about setting up a Postgres + pgvector database locally using Docker. It's not directly tied to this project, but it's a tutorial I know I'll need in two months when I've forgotten how I did it. These setup tutorials are lifesavers - without them I'd never be able to answer "How did you install this?" with anything better than "No idea, followed 5 tutorials, mixed results, at some point it worked." 🤷🏻‍♀️

What's Next: The 5-Day, 30-Minute Plan

I'm committing to 30 minutes of coding per day for the next five days. It's time to turn research into action. Here's the plan (because without a to-do list, I won't get anything done), with some rough code sketches after the list:

  1. Connect Python to Postgres: Choose a library (Psycopg2? SQLAlchemy?) and set up a basic connection.
  2. Run a Query: Start small - maybe just list tables in the database to confirm the connection.
  3. Insert Data: Use Python to insert anything into the database. I have an example insert in my DBeaver already that I can use.
  4. Work with Notebooks: Figure out how to run .py interactive cells in VS Code to get away from .ipynb files, because AI extensions & GitHub hate them. (Do they just appear when you add the right symbols? We'll see!)
  5. Move to .py Files: Transition the notebook code into said standalone Python scripts.
  6. Insert Embeddings: Add the embeddings I've generated from OpenAI to the database.
  7. Query & Compare: Write a function that computes cosine similarity via pgvector, then test it with a query.
  8. Integrate Everything: Combine the embedding and similarity search functions into a pipeline that takes a query string, computes its embedding, and finds the closest match in the database.
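To make steps 1-3 concrete, here's a rough sketch with psycopg2 - the connection details and the documents table are placeholders that would need to match my actual Docker setup:

```python
# Rough sketch for steps 1-3 using psycopg2; credentials and table are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="ragdb",       # placeholder - match the Docker container's settings
    user="postgres",
    password="postgres",
)

with conn, conn.cursor() as cur:
    # Step 2: list tables to confirm the connection works.
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public';"
    )
    print(cur.fetchall())

    # Step 3: insert anything at all, e.g. into a hypothetical documents table.
    cur.execute(
        "INSERT INTO documents (file, content) VALUES (%s, %s);",
        ("hello.md", "hello world"),
    )
# Leaving the `with conn` block commits the transaction automatically.
```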
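And to answer my own question from step 4: yes, they really do just appear when you add the right symbols. With the VS Code Python/Jupyter extensions installed, a `# %%` comment in a plain .py file marks a cell boundary and gets a "Run Cell" button:

```python
# %% [markdown]
# A "# %%" comment turns the lines below it into a runnable cell in VS Code;
# "# %% [markdown]" marks a markdown cell. No .ipynb file needed.

# %%
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# %%
df.describe()
```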
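For step 6, a sketch assuming the pgvector Python helper package and a table with a vector(3072) column (text-embedding-3-large vectors have 3,072 dimensions); conn and df come from the sketches above:

```python
# Step 6 sketch: write the OpenAI embeddings into a pgvector column.
# Assumes a table created along these lines:
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE documents (
#       id serial PRIMARY KEY,
#       file text,
#       content text,
#       embedding vector(3072)  -- text-embedding-3-large dimension
#   );
import numpy as np
from pgvector.psycopg2 import register_vector  # pip install pgvector

register_vector(conn)  # lets psycopg2 adapt numpy arrays to vector columns

with conn, conn.cursor() as cur:
    for row in df.itertuples():  # df from the embedding sketch above
        cur.execute(
            "INSERT INTO documents (file, content, embedding) VALUES (%s, %s, %s);",
            (row.file, row.content, np.array(row.embedding)),
        )
```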
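Steps 7 and 8 then collapse into one small function: embed the query string, let pgvector order rows by cosine distance (its <=> operator, where cosine similarity is 1 minus that distance), and return the closest matches. Again only a sketch built on the placeholder pieces above:

```python
# Steps 7-8 sketch: query embedding + cosine similarity search in one pipeline.
import numpy as np

def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Return (file, cosine_similarity) for the top_k closest documents."""
    query_vec = np.array(embed(query))  # embed() from the embedding sketch
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT file, 1 - (embedding <=> %s) AS cosine_similarity "
            "FROM documents ORDER BY embedding <=> %s LIMIT %s;",
            (query_vec, query_vec, top_k),
        )
        return cur.fetchall()

print(search("How did I set up pgvector in Docker?"))
```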

Building in Public: Humbling and (Occasionally) Embarrassing

Building in public isn't glamorous. Right now, I'm not even sure I can call myself a "build in public" person because, well, I'm not consistently building anything. But that's what this project is supposed to fix - it's an attempt to build a coding habit. I wanted to be smart and strategic, but I overcomplicated things. Now, I'm scaling back to the basics: show up, code a little, and share the journey, messy bits included.

Looking Ahead: February and Beyond

When I started this project, I gave myself permission to continue into the next month if needed. And honestly? I'm not ready to abandon it. There's so much more to explore with this local RAG project. Plus, I'm not sick of it yet, which is a miracle considering my novelty-craving brain. Maybe my perfectionist side is kicking in - I don't want to leave this project in an embarrassing state, especially since I'm building it publicly.


So here we are. If you've ever gotten stuck in a side-project, let me know how you got unstuck. And if you have tips for making time/finding energy to code after dinner, please let me know - today I napped instead... 🥴

Until next time, keep building (or procrastinating productively).
