Retrieval-Augmented Generation: An Introduction

Everybody knows about ChatGPT. A few know about Llama. Fewer know about Mamba. But there is a new word doing the rounds- RAG and Enterprise LLM. Well, those are two- technically three- words. But you get what I am saying- there is a new baby in town. The thing is, this baby isn't new; it's been around a while, written about in a few niche research papers. But after the recent barrage of LLMs, it gained new traction and new dreamy possibilities.

But what is it…exactly?

Well, instead of your usual run-of-the-mill Large Language Models (try saying that five times fast), where you train the model on loads and loads and loads of data and hope that it “knows” about the topics you care about- and gives you the correct information without making shit up when it doesn't- you give it something of a cheat code. It's like allowing books in the exam hall. You let the LLM “look stuff up”.

Any copycat worth his salt will tell you- cheating is an underestimated skill. A master of the craft does not begin his work on the day of the exam. No. He prepares. He plans. He observes. His work begins on the first day of the semester, when the geeks take out their sharpened pencils. The copycat starts by scouting the best note-takers. The master singles out his targets.

He prepares his cheat sheets the night before- choosing the best study material and compressing it onto tiny bits of paper.

The morning comes and the master turns into a human Swiss Army knife- sticking cheat sheets every which way and indexing their locations in memory.

This is an exercise in extreme data curation and compression. It is a matter of retrieving the best answers with great efficiency.

RAG is pretty much the same.

The way it works is- you get a corpus of text (PDFs, books, research papers on some topic; technical jargon: data curation), which you chew up into pieces (technical jargon: data chunking) and feed to something called an embedding model, which turns the text (technical jargon: tokens) into numbers (technical jargon: embedding vectors) and puts them into a vector database. Tokens turned embedding vectors are stored in such a way that tokens with a stronger connection end up closer as numbers too. This is our cheat sheet. Let's call it- lookup-as-numbers.

If it helps, here’s a mantra for you- CHUNK..EMBED..STORE.
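If a sketch helps more than a mantra, here is a toy version in Python. Everything below is made up for illustration- the hash-bucket embed() is a deliberately dumb stand-in for a real embedding model (something like sentence-transformers), and the two-line pigeon corpus stands in for your curated PDFs. The shape of the pipeline is the point.

```python
# Toy CHUNK..EMBED..STORE pipeline. embed() imitates an embedding
# model badly on purpose; swap in a real one for anything serious.
import hashlib
import numpy as np

# A made-up stand-in for your curated corpus of PDFs, books and papers.
corpus = (
    "Pigeons perch on ledges above busy streets. "
    "Freshly dry-cleaned clothes reflect more light, "
    "which may be why direct hits feel so personal."
)

def chunk(text, size=12):
    """CHUNK: split the corpus into windows of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text, dim=256):
    """EMBED: hash each word into a bucket, count occurrences, and
    normalise- a crude imitation of an embedding vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# STORE: our "vector database" is just a list of (chunk, vector) pairs.
lookup_as_numbers = [(c, embed(c)) for c in chunk(corpus)]
```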

Now let's say you have a question- something like- “Why do pigeons only poop on nice clothes?”. RAG will go ahead and do the same steps as before. Well, almost. It will- CHUNK and EMBED. (You can do STORE too, but we can talk about caching later.) Now you have your question as numbers. Let's call this- question-as-numbers.
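In the toy sketch from above, this step is one line- the question is short enough to be its own single chunk, so it goes straight through embed():

```python
# Same EMBED step, applied to the question instead of the corpus.
question = "Why do pigeons only poop on nice clothes?"
question_as_numbers = embed(question)  # one chunk, one vector
```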

Now we want to look up the text most relevant to the question we just asked. How do we do that? Simple. We compare question-as-numbers against lookup-as-numbers and take the text chunks closest to the question. This is called retrieval.
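Continuing the toy sketch- since the vectors were normalised when stored, a plain dot product is cosine similarity, and retrieval is just a sort. (Real vector databases use approximate-nearest-neighbour indexes to avoid scoring every chunk; a sort is fine for a toy.)

```python
# RETRIEVE: score every stored chunk against the question vector
# and keep the closest few.
def retrieve(question_vec, store, top_k=2):
    scored = [(float(np.dot(question_vec, vec)), text) for text, vec in store]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:top_k]]

best_chunks = retrieve(question_as_numbers, lookup_as_numbers)
```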

Now what?

Well, now we have the right passages to answer the question. The master copycat uses his imagination to expand and embellish a passage into a grammatically correct answer that pleases the examiner.

We will do the same. But we will use a Large Language Model. We will get an LLM and let it do what it does best- generate text. But this time, it will not use its own vast sea of knowledge. It will use our little water bucket- the information from the text chunks we just retrieved. And it will give us an answer. This is the Generation bit of RAG.
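In code, the Generation bit is mostly prompt plumbing- stuff the retrieved chunks in front of the question and hand the whole thing to whatever LLM you have. The call_llm() below is a hypothetical placeholder, not any real API:

```python
# GENERATE: the LLM answers from our water bucket (the retrieved
# chunks), not from its sea of training data.
prompt = (
    "Answer the question using ONLY the context below.\n\n"
    "Context:\n" + "\n---\n".join(best_chunks) + "\n\n"
    f"Question: {question}\nAnswer:"
)
# answer = call_llm(prompt)  # hypothetical- swap in your chat API or local model
print(prompt)
```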

And that's all, folks! This is RAG. 🙂