What Is Retrieval-Augmented Generation and How to Make AI Work for You, with Guil Hernandez

Guil Hernandez (00:00):
AI models, despite their training and their capabilities, they don't understand things like we do. You seem to get a human-like response and they're eerily sort of human, but really, when you're presenting a question to the AI, it has no idea what text you're feeding it. What happens first is that text needs to be converted into something that the AI can understand, and that format is what's called an embedding.

Alex Booker (00:24):
That was Guil Hernandez, developer and teacher here at Scrimba. I wanted to talk with Guil because he's knowledgeable about how to make foundation models aware of external knowledge bases through RAG, embeddings, vectors and vector databases. If those terms sound foreign to you, they won't by the end of this episode.

(00:50):
You're listening to part three of our four-part rapid response series on how to become an AI engineer. We're deliberately closing out 2023 with this series about a topic that's going to be very relevant to all developers in 2024.

(01:05):
In the previous episode, Tom from Scrimba taught us about foundation models, which are the basis for adding AI capabilities, like text and image generation, to our own applications. Today, we're getting an eloquent introduction about how to make a model aware of your own data source so that that data can be considered for the AI outputs.

(01:27):
For example, using the techniques you'll learn about from Guil in this episode, you could connect a model to your customer support conversations, so that the model has the knowledge necessary to answer unique questions about your business. Guil, welcome to the show.

Guil Hernandez (01:42):
Hey, Alex, pleasure to be here. Thanks for having me back on, and yeah, excited to chat about AI and what we're up to at Scrimba with AI courses.

Alex Booker (01:50):
Last week I spoke with Tom about foundation models, and we ended on this interesting idea about something called RAG, which I believe stands for?

Guil Hernandez (02:00):
Retrieval-augmented generation.

Alex Booker (02:03):
Yeah, so I was going to look it up, but you know it, and embeddings as well. And when I was asking Tom, he mentioned, "Well, you should speak to Guil because he has a whole course all about this." What was the motivation behind the course? What made you focus on embeddings?

Guil Hernandez (02:18):
We did a lot of research in developing this AI engineering path, and we wanted to teach the skills that existing developers might need to know in order to become AI engineers. And I kind of got on the project a little bit late, once the curriculum had already been laid out and there had been some groundwork already done by some of the teachers. So I kind of jumped on board and said, "All right. Well, this sounds really interesting. I don't know a whole lot about it yet, but it'll be exciting to learn and work with vector embeddings and vector databases, and not just maybe build new cool stuff with it, but also teach a course on it."

Alex Booker (02:56):
Yeah. I don't know all the specifics, but what Tom was alluding to, and I think I understood, is that these collective technologies enable you to make a foundation model aware of your own data, or data that was not part of the model originally.

Guil Hernandez (03:14):
Yeah, exactly. So, a lot of these models often need access to data they weren't trained on, external data, domain-specific data. And these models also have a cutoff date. For example, GPT-4 has a cutoff date of, what, April 2023? With this RAG process, retrieval-augmented generation, it's more of a framework and there's a lot of moving parts, but the whole idea is to give a model access to an external knowledge base, and from there, create conversational responses by feeding the data that's been stored and retrieved in a vector database to another model, a generative model, to enhance the response. Yeah, a lot of moving parts, but we can get into more of the nitty-gritty later.

Alex Booker (04:02):
Well, I guess that's how foundation models work. Right? They're trained on terabytes and terabytes of data. Companies like OpenAI, and other companies like them, probably scrape the internet and find all kinds of data sources on which to train the models, but there might be data a model wasn't trained on, either because it wasn't around at the time or, frankly, it's too niche. It's not something that would've surfaced in that scraping process.

(04:27):
It could also be that you have your own data. Here's the thing: foundation models and the outputs they produce can be so impressive, they have such a broad range of knowledge, but they don't know anything specific about your company, potentially, or your users.

(04:42):
I guess one very obvious example that we talk about a lot is if you want to build a support agent type chatbot, if you ask ChatGPT about a business that was incorporated last week, "What time do you open?", it couldn't possibly know, but you might still want to leverage the foundation model to produce a cool feature. And then the other aspect maybe, is you have your own users and they have data, that if you could make the foundation model aware of, it could produce more tailored and interesting sort of outputs.

Guil Hernandez (05:09):
Yeah, so that's exactly right. So models, even though they're highly trained, don't have all the details of a product, a company, or, beyond that, more private information, like a contract or legal or medical documents that you want to ask specific questions about, or maybe you want to have a user upload one and have conversations about it.

(05:29):
So RAG and the tools and steps involved allow us to ask AIs about data they haven't been trained on. And it even goes beyond that, like you said, with these big platforms that use AI-powered search, like Spotify, Netflix and Amazon. They use a little bit of this RAG process and embeddings to not only help you search for information using free text, but also match shows, songs, products, all based on your preferences.

Alex Booker (05:56):
Let's talk about that a little bit because it's actually, in my opinion, a more interesting application than this... This knowledge-base example is valid, don't get me wrong, but we've all seen this with our favorite platforms, like YouTube and Spotify. They seem to recommend videos and music that we really like. It's almost like you have a personalized curator putting things in front of you that you're likely to enjoy based on things you've enjoyed previously.

(06:22):
My understanding is that even though enjoying something feels like a very emotional thing, you can actually quantify it with data as well. For example, on a YouTube video, did you watch it to the end? That's probably a really good sign that you enjoyed it. Right?

Guil Hernandez (06:35):
Yeah. Anytime you interact with an AI-powered app or a platform, like Spotify, Netflix, there's a good chance that this embedding process and embeddings are at work behind the scenes. In fact, in my course I kind of opened up with a story about how Spotify used this process back in 2014 to improve its music recommendation system. They converted millions of songs, user data, artists, into vectors and used that to help offer personalized recommendations, and that enhanced the platform's engagement, performance and scalability. So yeah, this process is pretty much being used everywhere these days.
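
To make the intuition concrete, here's a tiny JavaScript sketch of the idea behind embedding-based recommendations: each item becomes a vector, and vectors that point in similar directions are treated as similar. The three-dimensional "song" vectors here are made up for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```js
// Cosine similarity: 1 means the vectors point the same way, 0 means unrelated
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] ** 2;
    normB += b[i] ** 2;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for a user's taste and two candidate songs
const userTaste = [0.9, 0.1, 0.3];
const songs = {
  "indie-folk track": [0.85, 0.15, 0.25],
  "death-metal track": [0.05, 0.95, 0.1],
};

for (const [name, vector] of Object.entries(songs)) {
  console.log(name, cosineSimilarity(userTaste, vector).toFixed(2));
}
// The indie-folk track scores much closer to 1, so it would be recommended first.
```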

Alex Booker (07:13):
You mentioned that happened in 2014, which was before OpenAI even existed, potentially, let alone models like GPT.

Guil Hernandez (07:20):
They created their own embedding system. If you've ever been on Spotify, especially back then, you were kind of blown away by how quickly tracks would play. Well, what they did was pre-download many of the tracks that were likely to be played one after the other, or tracks that matched your preferences, using this sort of embedding data that we'll talk about, so that if you ever shuffled through songs and artists, they would not only play right away, but they would also fit your taste, and that would keep you on the platform. They also did something very similar for the podcast search feature. So yeah, it's used in fascinating ways.

Alex Booker (07:56):
'Cause it's like pre-loading, essentially, but when you have 50 million songs to choose from, you can't pre-load everything, you might as well just download the entire library at that point. But they would use, essentially, these techniques to best predict what you're going to play next based on what you've done in the past and what other people have done in the past?

Guil Hernandez (08:12):
Yep. They were predicting it. Songs were sort of clustered together, and then that cluster would pre-load, and then you'd most likely choose a song from that cluster, and boom, it's already available to play. So that's how they were able to get it to perform so well.

Alex Booker (08:23):
That's a very interesting application of these things. And even though they didn't use a model like GPT because it didn't exist at the time, am I understanding you right that potentially we can enable features like that in our own applications, leveraging something like GPT?

Guil Hernandez (08:37):
Yeah, absolutely. That's the core idea with using OpenAI's embedding model in this whole RAG process. Because embeddings, as we're going to talk about, aren't just for text. You can use them for images, video, audio, so many, many industries are using this to extract information and gain more insight to improve their product and experience.

Alex Booker (09:00):
Well, maybe we should talk a bit about the technical nitty-gritty then. You have mentioned a few, at least a couple of technical terms, like RAG, which I couldn't remember what the acronym stood for, as well as embeddings. What is the relationship between RAG and embeddings at a high-level before we get into the details?

Guil Hernandez (09:17):
It's a bit technical, there's a lot of moving parts to this, but the idea is that you are using an embedding model, a vector database, and a generative model. So you're taking regular text and you're creating these embeddings with the embedding model, which you're storing in a vector database. And then you're using information from that vector database to feed to a generative model, to create a conversational, human-like response and enhance the text that you're getting back from the database, which is usually highly relevant, up-to-date text.
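
In code, that pipeline might look something like this minimal JavaScript sketch using OpenAI's Node SDK and the supabase-js client. The documents table and the match_documents Postgres function are assumptions borrowed from the common Supabase pgvector pattern, not something specified in the episode:

```js
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_API_KEY);

async function answerQuestion(question) {
  // 1. Embed the user's question with the embedding model
  const { data } = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: question,
  });
  const queryEmbedding = data[0].embedding;

  // 2. Retrieve the most similar chunks from the vector database
  //    ("match_documents" is a Postgres function you'd define with pgvector)
  const { data: matches } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_threshold: 0.5,
    match_count: 3,
  });
  const context = matches.map((m) => m.content).join("\n---\n");

  // 3. Feed the retrieved context plus the question to a generative model
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content;
}
```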

Alex Booker (09:53):
This is starting to make an enormous amount of sense. In just a couple of sentences, you've explained a bunch of answers to my questions. Because when I thought about the problem we set out at the beginning, I got the idea that maybe you would do more training on the model, you would train it with your data, but what you're describing is actually connecting the model to an external knowledge base. And if I understand well, retrieval-augmented generation, that would suggest that you're retrieving data from the knowledge base to augment the generated text, the output?

Guil Hernandez (10:26):
That is exactly right. You're basically enhancing content retrieved from a database with a generative model.

Alex Booker (10:34):
Probably a foundation model can't just understand the text in your database. Right? You probably need a step to translate that into a format that can be understood and used as part of the generation process?

Guil Hernandez (10:46):
Yes, 100%. That is where this concept of embeddings comes into play.

Alex Booker (10:52):
Yeah. Maybe a good place to start is, what is an embedding and what does it enable?

Guil Hernandez (10:56):
We know that AI models, despite their training and their capabilities, don't understand things like we do. They have no idea what words and phrases mean, or even what the, quote, unquote, "real world" is.

Alex Booker (11:09):
They seem like they know.

Guil Hernandez (11:10):
Yeah, it's pretty fascinating, but as I mentioned, there's a lot going on with something called embeddings and this whole RAG process, and knowledge retrieval, which is the focus of my course. One example that I give in the course is: think of it like asking an AI tool, like a chatbot, a question. You seem to get a human-like response, as you mentioned. They're eerily sort of human, but really, when you're presenting a question to the AI, it has no idea what text you're feeding it.

(11:37):
It doesn't understand the text, but what happens first is that text needs to be converted into something that the AI can understand, and that format is what's called an embedding. The embedding itself is a vector, and that vector embedding preserves the original meaning of the text, captured through the model's training. It captures all the different semantic or contextual aspects of that text, which sounds pretty intense. That is the sort of magic, capturing all the semantic meaning and relationships between words, that makes AI then recognize words and context in different scenarios. So I like to think of embeddings, and I say this in my course, as the language that AI can understand.
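
As a quick illustration, here's roughly what creating one of these embeddings looks like with OpenAI's Node SDK; the text-embedding-ada-002 model returns a vector of 1,536 floating-point numbers. The sample input is just an example.

```js
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Turn a piece of text into a vector the AI can work with
const response = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: "What time does the bakery open on Sundays?",
});

const embedding = response.data[0].embedding;
console.log(embedding.length);      // 1536 dimensions for this model
console.log(embedding.slice(0, 5)); // e.g. [0.012, -0.034, ...]
```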

Alex Booker (12:23):
When we think about embeddings and the vectors within them, that makes me start to think about vector databases potentially. What's the link between embeddings and vector databases?

Guil Hernandez (12:35):
We've talked about how our questions and queries for the AI get converted into embeddings, and remember, the AI is going to be pulling in answers from an external knowledge base, or let's say domain-specific documents. Those are all text-based, which means all of those documents also need to be transformed into embeddings, and we need a place to store all of these massive embeddings and their corresponding text. That's where vector databases come in.

(13:02):
Many of you might've worked with traditional relational databases like MySQL or Postgres. Now, the challenge is that a lot of those cannot handle the size, complexity and scale of all that data, so vector databases are specialized, and they have the capacity to store, retrieve and manage these high-dimensional embeddings.

(13:22):
And there's lots of options out there. You may have heard of Pinecone, Chroma, there's Weaviate, but my course uses Supabase, which has a really, really cool and useful Postgres extension called pgvector that makes it pretty simple to store embeddings and perform what's called a vector similarity search. So it takes on a lot of the heavy lifting for you there.
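
A hedged sketch of what that can look like with the supabase-js client. The documents table (with content and embedding vector(1536) columns) and the match_documents function follow the naming in Supabase's pgvector guides; they're assumptions here rather than something taken from the episode.

```js
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_API_KEY);

// Store a chunk of text alongside its embedding in a "documents" table
// (assumed to have `content text` and `embedding vector(1536)` columns,
// with the pgvector extension enabled)
async function storeChunk(content, embedding) {
  const { error } = await supabase.from("documents").insert({ content, embedding });
  if (error) throw error;
}

// Vector similarity search via a Postgres function, here called
// "match_documents", that compares a query embedding against stored ones
async function findNearestChunks(queryEmbedding) {
  const { data, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding,
    match_threshold: 0.5, // how similar a result must be to count as a match
    match_count: 3,       // how many of the nearest chunks to return
  });
  if (error) throw error;
  return data; // e.g. [{ content: "...", similarity: 0.87 }, ...]
}
```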

Alex Booker (13:43):
So when we talk about making a foundation model aware of external data, we have that external data in some format. It could be a support ticket, it could be a document, it could also be an actual database, like a MySQL database or some kind of document database. You can't just plug that directly into the foundation model because it's like trying to speak a language it doesn't understand. Hence, you have this step, it sounds like, where you need to do a bit of work, I assume, to translate that external knowledge in its current format into vectors, which then get represented in the vector database?

Guil Hernandez (14:21):
Yeah, that's exactly right. That's the idea. And you can update the text in that vector database as much as you want with up-to-date and relevant information, so the AI chatbot has the information it needs. But you wouldn't want to create one embedding from a massive PDF or multiple documents. So what engineers do first, is break the text into smaller chunks, it's called chunking. By splitting these large documents into smaller chunks, what you do is you create embeddings for more specific content segments, and that is super, super useful for the AI to then create a better, more relevant response. And there's lots of tools out there for that. I use LangChain in my course to split text, and then feed it to the embedding model and then store it in the vector database.
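
Here's a minimal sketch of that chunking step with LangChain's recursive character text splitter; the import path, file name and chunk sizes are illustrative and vary by LangChain version.

```js
import fs from "node:fs/promises";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Split a long document into overlapping chunks before embedding it
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,   // max characters per chunk (keep the embedding model's token limit in mind)
  chunkOverlap: 50, // a little overlap preserves context across chunk boundaries
});

const text = await fs.readFile("handbook.txt", "utf8"); // hypothetical source document
const chunks = await splitter.createDocuments([text]);

// Each chunk's text is on .pageContent, ready to be embedded and stored
console.log(chunks.length, chunks[0].pageContent);
```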

Alex Booker (15:05):
One idea I had was... for every Scrimba Podcast episode, we put a lot of effort into producing a very high-quality transcript. It's not auto-generated, believe it or not, a person actually creates the transcript. And we also produce things like episode descriptions and that kind of thing.

(15:22):
That could be our external data. Imagine making a foundation model aware of everything that's ever happened on The Scrimba Podcast; then we could do some really cool things. You could build a chatbot where you can ask it to be your career coach or something, and it can draw on the knowledge and even attribute the advice to a particular guest, an episode, and probably even a timestamp. How cool would that be if it could even find the timestamp and then play the clip and say, "Oh yeah, in episode 40, Gergely Orosz, previously at Uber, he said this about resumes," and then played it or something?

Guil Hernandez (15:53):
Yeah, 100%. I've seen cases where they have an AI listening to a call. There's a student and a teacher, they're going over learning JavaScript, for example, and then all of the transcription from the call gets embedded and stored. And then the student can go back and have conversations and ask questions based on those calls, and the AI then reminds them, "Hey, by the way, remember to do this and that." And a lot of it has to do with the whole RAG process, so what you just mentioned is certainly doable.

Alex Booker (16:24):
Say I've got a text file for every transcript of The Scrimba Podcast, or maybe I represent it as a JSON document, potentially, with some metadata and the transcript. If I understood you well, I can't just do one full step to turn that into something that's represented in a vector database, because you'd have to do that chunking process to translate it in a way. Can you help me understand what I would actually have to do?

Guil Hernandez (16:49):
Yeah, absolutely. And actually, with OpenAI you can upload files now and then use those and chunk them. A lot of that has to do with this new Assistants API tool, which we'll get to in a moment. But you can feed these documents to a text splitter, like LangChain, for example. You can configure all of these settings for it, like, "How big do I want these chunks to be?", keeping token limits in mind, and how much context do you want to keep between text chunks? You can have text overlap. I know I'm getting too technical here, but yeah, the whole idea is that you would first feed all of these documents to a text splitter and then store them in a vector database, and with this database, depending on which one you use, you can create different indexes for your content to speed up the retrieval process.
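
Putting those pieces together for the transcript idea above, an ingestion script might look roughly like this. The JSON shape (episode, guest, transcript) and the documents table with a metadata column are hypothetical, just to show how chunking, embedding and storage fit together.

```js
import fs from "node:fs/promises";
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_API_KEY);

async function ingestTranscript(path) {
  // Hypothetical transcript file shape: { episode, guest, transcript }
  const { episode, guest, transcript } = JSON.parse(await fs.readFile(path, "utf8"));

  // 1. Chunk the transcript
  const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 50 });
  const chunks = await splitter.splitText(transcript);

  // 2. Embed each chunk and 3. store it with its metadata
  for (const chunk of chunks) {
    const { data } = await openai.embeddings.create({
      model: "text-embedding-ada-002",
      input: chunk,
    });
    await supabase.from("documents").insert({
      content: chunk,
      embedding: data[0].embedding,
      metadata: { episode, guest }, // e.g. a jsonb column, useful for attribution later
    });
  }
}
```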

Alex Booker (17:33):
Yeah, you've mentioned the Assistants API. Maybe now is a good time to talk about what it is and what it enables.

Guil Hernandez (17:39):
Absolutely. So we talked about RAG and the whole retrieval-augmented generation process. There's a whole lot more to it, and you can learn all about it in my Scrimba course. But this new Assistants API that came out maybe a month ago at OpenAI's DevDay, what it does is let developers build customized assistants that are tailored to specific use cases and knowledge sources, so they can do something with external knowledge, and it can do many other things, too. It's super powerful.

(18:11):
And it also has something really useful and relevant to this podcast, which is knowledge retrieval. It can pull in external knowledge from outside its model. I think currently you can upload 20 documents to it and then it can do the whole RAG process. So behind the scenes it chunks the text, it stores it in its own vector database and does a lot of the RAG for you behind the scenes, if you will.

(18:33):
And it can also call functions, so you can define custom functions for your assistant to run and then it can do something with them. And that's how your assistant can act like an AI agent, which is another relevant topic. It's pretty amazing. It's basically the whole RAG process in one API.
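
For reference, here's a rough sketch of that flow as the beta looked around the time of this episode (late 2023), with an uploaded file, the built-in retrieval tool, and one custom function. The play_clip function is hypothetical, and OpenAI has kept evolving this API, so treat the exact shapes as illustrative.

```js
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Upload a knowledge file for the assistant to retrieve from
const file = await openai.files.create({
  file: fs.createReadStream("podcast-transcripts.pdf"), // hypothetical file
  purpose: "assistants",
});

// Create an assistant with retrieval (built-in RAG) and a custom function
const assistant = await openai.beta.assistants.create({
  name: "Podcast Helper",
  model: "gpt-4-1106-preview",
  instructions: "Answer questions using the attached podcast transcripts.",
  tools: [
    { type: "retrieval" },
    {
      type: "function",
      function: {
        name: "play_clip", // hypothetical function your own app would implement
        description: "Play a podcast clip at a given timestamp",
        parameters: {
          type: "object",
          properties: {
            episode: { type: "number" },
            timestamp: { type: "string" },
          },
          required: ["episode", "timestamp"],
        },
      },
    },
  ],
  file_ids: [file.id],
});

// Conversations happen in threads: add a user message, then run the assistant
const thread = await openai.beta.threads.create();
await openai.beta.threads.messages.create(thread.id, {
  role: "user",
  content: "What did the guest in episode 40 say about resumes?",
});
const run = await openai.beta.threads.runs.create(thread.id, {
  assistant_id: assistant.id,
});
```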

Alex Booker (18:52):
It feels like things evolve quite quickly in this space. You might've been learning about a bunch of tech before that announcement, and then that announcement kind of renders some of the things you were learning, or maybe even teaching in your case, to be not as relevant as they were even a day before.

Guil Hernandez (19:07):
Yeah, Alex, it was the week we were planning to release the AI path that the Assistants API got announced. And yeah, it had us wondering if it was going to render the whole RAG process, and everything we'd worked on related to embeddings and vector databases, obsolete, but that's not really the case. I don't see RAG itself going away.

(19:28):
In fact, our CEO here had a chat with AI engineers at big companies and he asked them kind of the same question, "Is RAG going away? Are you going to move away from RAG because of the Assistants API?" And at the moment they're like, "No. No, we're going to continue to use embeddings and vector databases, because RAG itself gives you more control, more flexibility than what you might get right now with the Assistants API," and it's a little bit pricier too when it comes to token costs.

Alex Booker (19:55):
Is the Assistants API a type of RAG, a way of doing RAG, or is it a separate concept?

Guil Hernandez (20:00):
It is basically the RAG concept baked into the API. It does the same thing. It enhances the API's response with up-to-date information from an external knowledge source, and it's a little bit different too than fine-tuning, if you've heard about that. But yeah, it's basically RAG in one API call, and a whole lot more with function calling and code interpreter.

Alex Booker (20:22):
It's interesting you mentioned fine-tuning, because when I first heard the term fine-tuning, instinctively, or intuitively, I suppose, I thought that might be describing what RAG does in the sense that you're tuning the model to your data or something like that, but Tom was kind enough to teach me that isn't really the case.

Guil Hernandez (20:39):
The way I think of fine-tuning is exactly that. Let's say you are teaching an already-educated model, or a robot, and it's learned a lot of stuff, but we want it to be very specific and good at something. So for example, "Hey, I want this robot to write exactly like Alex Booker," or maybe understand medical or legal text. So you give it a bunch of examples, maybe stuff Alex wrote or medical or legal documents, to learn from, and then it uses those for specific applications. Whereas RAG, as we've talked about, is just more about enhancing the AI's response with up-to-date, relevant external information.

Alex Booker (21:14):
I have to be honest, the distinction still isn't super clear to me. And to be clear, Tom explained it to me last time. What you're saying sounds similar, but the distinction between fine-tuning and RAG, it hasn't quite dropped into place. Could you possibly explain it a different way?

Guil Hernandez (21:29):
There's certainly a lot of similarities, but I think with fine-tuning, you're not using this vector database. There's no intermediate tool that is feeding the generative model context to then form replies from. What it's doing is taking a bunch of example data, so let's say that you're building a support bot for the Scrimba Bootcamp and you want it to respond in a certain way. So you feed this model a lot of examples of conversations and questions and answers from your support team, and from that, it gets an idea of exactly how to respond to users.

(22:06):
And it's pulling all that information from a JSONL file, so it's evaluating all of that, versus the whole RAG process, where there's a lot going on behind the scenes with embeddings and vector search. RAG is a lot more complex, I think, because, as I mentioned, there's a lot of moving parts; it integrates a generative model and a retrieval system. But I would say that RAG excels in tasks where having the most current, up-to-date information is crucial, where fine-tuning is more about, "All right, I want you to act a certain way. Here are some examples of how you might act that way, and then here's a prompt, go do it."
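
By contrast, the fine-tuning workflow is mostly about preparing those example conversations. A minimal sketch, assuming a hypothetical support-examples.jsonl file in OpenAI's chat fine-tuning format:

```js
// Each line of the JSONL file is one example conversation, e.g.:
// {"messages":[{"role":"system","content":"You are Scrimba's support bot."},
//              {"role":"user","content":"Can I pause my bootcamp subscription?"},
//              {"role":"assistant","content":"Yes - here's how..."}]}
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Upload the training examples
const file = await openai.files.create({
  file: fs.createReadStream("support-examples.jsonl"), // hypothetical file
  purpose: "fine-tune",
});

// Start a fine-tuning job on top of a base model
const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: "gpt-3.5-turbo",
});

console.log(job.id, job.status); // use the tuned model once the job finishes
```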

Alex Booker (22:44):
I've got to tell you, Guil, you're kind of blowing my mind in a way, because this is a very complex subject. I'm starting to realize that it is also the sort of thing that is a little bit visual. Yeah, it's very, very useful to get a kind of overview of what the landscape is, what's possible, and the different tools out there. I think what we've got today from this conversation, apart from an overview of the tooling, is this relationship between foundation models and embeddings and vectors and vector databases, and what this technique called RAG encompasses exactly.

Guil Hernandez (23:18):
Yeah. No, it can be highly visual, and that was one of the more challenging parts of my course. So yeah, I highly recommend the course because it does take the time to walk you through that sequence, from the problems embeddings solve, what they are, how to create one, how to chunk text, how to store it in a vector database, retrieve it, the whole thing. And it does use a lot of visuals to help you grasp that, 'cause it is highly technical and there can be a lot of moving parts to it.

Alex Booker (23:45):
We'll be sure to link the course in the description so people can check it out. But for now, Guil, I'll say thank you so much for your time and for telling us a bit more about this exciting idea called embeddings.

Guil Hernandez (23:55):
Yeah, anytime. Alex, it's been delightful. Thanks so much.

Jan Arsenovic (23:58):
That was The Scrimba Podcast, episode 142. If you're just discovering this show, that means there are a lot of great episodes for you in our backlog. However, if you're discovering it on YouTube, I have to tell you that over the past, almost three years at this point, we mainly existed as an audio-only podcast, which means you won't find that many of our old episodes on YouTube, but you will find them wherever you listen to podcasts. So, Apple Podcasts, Spotify, Overcast, Pocket Casts, Google Podcasts, you name it, we're probably there.

(24:31):
If you like this show, and if you want to make sure we get to do more of it, the best way to support us is to share it on social media. As long as your Twitter or LinkedIn posts contain the words Scrimba and podcast, we will find them, and you might get a shout-out on the show, just like these people right now. On LinkedIn Holly Rees shared her Spotify Wrapped and wrote, "Do you think my Spotify Wrapped might be a little telling of someone transitioning into tech? Only discovered Hot Girls Code very recently, but they're quickly becoming one of my favorite podcasts. And of course, CodeNewbie taking the top spot. Saron Yitbarek is genuinely one of the best hosts I've ever listened to."

(25:11):
By the way, she was also a guest here on The Scrimba Podcast, and I will link to that episode in the show notes. "The Ladybug Podcast, The Scrimba Podcast, and A Question of Code are also great." Well, we are in pretty good company here. If you're looking for more coding podcasts, check out the ones in Holly's list. And on Twitter, Al shared his Spotify Wrapped and wrote, "Well, it looks like The Scrimba Podcast made it to the first place this year. It inspired me during last year until I found my first job as a developer, and it keeps on inspiring me today to continue learning and growing. Great job, guys."

(25:49):
Al, congratulations on landing your first dev job. Was there a particular piece of advice you've heard on the show that helped you, either with learning or with interviewing? If so, I would like to hear that story. You can find me on Twitter. If you're feeling super supportive, you can also leave us a rating or a review in your podcast app of choice. The show is hosted by Alex Booker. I've been Jan, the producer. You'll find both of our Twitter handles in the show notes. Thanks for listening or watching. Keep coding and we'll see you next time.
