Why Selecting the Right LLM is Harder and Easier Than You Think | EP.5

In episode 5 of Hidden Layers, Ron sits down with Reed Coke, a Director of Engineering and Principal Machine Learning Engineer at KUNGFU.AI. Reed and Ron discuss what a Large Language Model (LLM) is, why and how they are so powerful and why it's critical you pick the right one for your company. They dive into how to measure performance of LLMs, customizing them to fit your specific needs and much more.

Ron Green: Welcome to Hidden Layers, where we explore the tech and the people behind artificial intelligence. I'm Ron Green, and I'm excited to be joined today by Reed Coke. Reed is Director of Engineering at KUNGFU.AI and an expert in natural language processing, with a background in both industry and research. His areas of expertise include conversational AI and linguistics. Reed is deeply passionate about the ways in which language, communication, and AI education influence our world. He holds a master's degree in computer science from the University of Michigan at Ann Arbor. Welcome, Reed!

Reed Coke: Thanks, Ron.

Ron Green: Okay, so we're going to jump into some deep waters in a little bit, but let's start at the beginning. Let's describe what large language models are and how they work fundamentally.

Reed Coke: So the way that I think about it, and I should add, I've been doing NLP for over 10 years, and that makes me, if anything, just really crotchety about stuff. I think about it as mostly a really good autocomplete. Everyone's used autocomplete before, they have it on their phone, you have an idea of, there's some amount of a window of the last couple of things you typed, and then it's going to add the next most likely word, and the next most likely word, and you just do that over and over. But it turns out that if you make that window big enough, and you have enough patterns to draw from, by reading terabytes from the internet, you can get pretty specific, and that's basically what it's doing. There's one really important difference though, and something that matters to me as a linguist, which is that usually these autocomplete systems before large language models, they were trying to come up with the correct next word, and that's not the goal of a large language model. A large language model is trying to construct an overall response that a human would find useful. And specifically, to be a little bit more technical about it, that one of their annotators would have marked as useful during the RLHF process.

Ron Green: Right, right.

Reed Coke: During the final process after you have a raw model. So just to add to that, these LLMs, despite their amazing capability, have a really simple training process. They're just trying to correctly predict the next word, given some input context, and if you train them on enough data, and you have enough learnable parameters, we're finding they can do some pretty amazing things.
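To make the "really good autocomplete" idea concrete, here is a minimal sketch of next-word prediction using simple word-pair counts. It is a toy, not how an LLM is built: real models use neural networks with far larger context windows, and the corpus and function names below are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def autocomplete(counts, start, max_words=5):
    """Repeatedly append the most likely next word, one word at a time."""
    out = [start]
    for _ in range(max_words):
        followers = counts.get(out[-1])
        if not followers:  # the last word was never followed by anything
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# Tiny made-up corpus; an LLM reads terabytes instead.
corpus = "the cat sat on the mat the cat sat on the rug"
model = train_bigrams(corpus)
print(autocomplete(model, "the", max_words=4))
```

The window here is a single word. Widening the window and learning from vastly more text is what lets the same predict-and-append loop become specific and useful.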

Ron Green: So everybody is familiar with ChatGPT. What are some other ways that large language models are being used? You're actually building systems that go into production. Can you talk a little bit about that and your experiences?

Reed Coke: Yeah, I can. I have to be a little vague, but I'm happy to work around that. I think when people think about ChatGPT, or at least when my mom thinks about ChatGPT, who asks me about it often, it's really kind of a better chatbot, you might say. It's very much about this dialogue and back and forth. And what we're finding professionally is that that's actually probably not the best way to be using these types of systems. They're incredible in the ways that they can understand things, and it actually makes them very good at categorizing a diverse set of inputs or expanding: here's 10 examples of something, give me another five, which has a surprising number of applications, actually. It's not so much, oh, there used to be customer service, let's have ChatGPT do it instead. It's much more of, hey, there are people. Let's give them some tools to do what they're doing faster, or give them a starting point that they can edit from, one that's good enough that it's not actually more work to edit it than it is to just make it in the first place.

Ron Green: Right, right. And so one of the things that I think is really interesting is most people are using large language models interactively. And you have experience in production using them in a way that is sort of slightly mediated. Can you maybe talk about that a little bit?

Reed Coke: Yeah, absolutely. So when you look at a lot of technical diagrams of these systems, they're built up from those previous ideas in transformers of kind of an encoder and a decoder. In short, the encoder is focused on understanding the input and then the decoder is focused on creating output, whether that is more language or an action, knowing what action to take or something like that. A lot of the current applications are really leveraging its ability to encode stuff, to understand, to create kind of more like contextually relevant behaviors or something. But there's a gap that is very interesting to me, which is that it's very good at encoding. I don't actually think it's impressive, but I think the fact that it knows why I said what I said and has some idea of what I was fishing for is deeply, deeply impressive. And that's also where we're seeing most of the traction in industry.

Ron Green: Okay, fantastic. And when you're using these large language models in production settings, are you constrained or are you finding them fairly flexible across different domains?

Reed Coke: I think that they're pretty flexible, but it is a little bit of a difficult thing to think about just because the range of what it can do is so large. So it's often hard to say what is the constraint of the model versus oh, just spend another week trying to find a better prompt, that actually does do that.

Ron Green: Right. That's very hard to assess, what they can and can't do, but in general, they're pretty impressively flexible as long as you're playing to their strengths and you're willing to put the time into the prompt engineering. I think we've both been surprised by how much time we've spent on that in production engagements this year and how effective it can be.

Reed Coke: Yeah, absolutely. And on that time spent, I would be remiss not to highlight this: a lot of that time is coming up with better prompts. An equal measure of that time is figuring out if the new prompt is actually better or not, which is to say measuring.
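One way to picture the "measuring" half of prompt work is a small harness that scores each prompt against the same labeled examples. This is a sketch under stated assumptions: `stub_model` is a hypothetical stand-in for a real LLM call, and the prompts, labels, and examples are invented for illustration.

```python
def evaluate_prompt(prompt_template, examples, call_model):
    """Return the fraction of examples the model labels correctly with this prompt."""
    correct = 0
    for text, expected in examples:
        answer = call_model(prompt_template.format(text=text))
        if answer.strip().lower() == expected:
            correct += 1
    return correct / len(examples)

def stub_model(prompt):
    """Hypothetical stand-in for an LLM API call (assumption, not a real client).

    It only answers with a bare label when the prompt asks for one word,
    mimicking how output format can depend on the prompt wording.
    """
    label = "billing" if "refund" in prompt.lower() else "technical"
    if "one word" in prompt.lower():
        return label
    return f"I think this is about {label}."

# Invented labeled examples; a real set would come from your own use case.
examples = [
    ("I want a refund", "billing"),
    ("The app crashes on login", "technical"),
    ("Refund my last order", "billing"),
]

prompt_a = "Categorize this message: {text}"
prompt_b = "Categorize this message as billing or technical. Reply with one word.\nMessage: {text}"

for name, prompt in [("A", prompt_a), ("B", prompt_b)]:
    print(name, evaluate_prompt(prompt, examples, stub_model))
```

Keeping the examples fixed while only the prompt changes is what makes the two scores comparable.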

Ron Green: Right. Okay, perfect segue. I want to get into sort of the heart of the conversation around selecting a large language model and how our listeners can think about it and go about making the right choices. And I know there are lots of different elements that go into that. So just at a high level, could you maybe kind of talk us through the things that you think are important when choosing an LLM?

Reed Coke: Yeah, I would say it's both very complicated and maybe deceptively simple. I brought some notes for the complicated part. Fantastic. Very, very detailed notes. Ha ha ha. There's, you know, if I were to like write it all down and make a pros and cons list and kind of take that approach, there would be kind of these six big pillars that I would be thinking about, which just quickly, it's cost, latency, stability, privacy, security, and performance. The simple version of that is most LLMs you pick are going to do well if you're doing the right thing with them, and any LLM you pick is going to do very poorly if you're not strategic about what problem you're tackling.

Ron Green: Okay, okay. And so digging into some of those, like what are some of the cost considerations that you see?

Reed Coke: Yeah, so cost is actually, to me, as a consultant, maybe the most interesting one, because there are a lot of little pitfalls in it. Different APIs out there, different systems have very different costs. GPT-4 is one of the more expensive ones, because it's one of the more powerful ones, conceptually. There's also Claude. Claude specializes in creative writing, they say, and as a result, it can take these very, very long prompts, which makes it more expensive.

Ron Green: Right, right. And there are a lot of other smaller models that have been distilled down to just the necessary parts. They might take more work on your end, bringing them to life and ensuring quality, but they're also a lot cheaper.

Reed Coke: Or you can host one that's not an API, and then you're paying for the compute, you're not paying for access. So then that's a whole different set of things. But one of the big pitfalls that has been really interesting and has, I think, surprised a lot of folks as we've gotten into generative AI more, is that it's very, very important to think about how the LLM is actually going to get surfaced to its users, because it turns out, if you have a button in your app that someone can press to see something funny, or see something personalized or something, they're going to press it a lot. It's not a slot machine, but it's a little bit of a slot machine. And so if every time a user presses that button, it's a call to the API, which is cost for you, that's going to be a really expensive way of bringing the capability forward compared to something more comfortable.

Ron Green: You sort of have an unbounded potential cost liability there.
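The button example can be turned into rough arithmetic. The sketch below estimates monthly API spend when every press is a paid call, and shows how a per-user daily cap bounds it. All numbers, including the price per 1,000 tokens, are made-up assumptions rather than any provider's real pricing.

```python
def monthly_cost(users, presses_per_user_per_day, tokens_per_call,
                 price_per_1k_tokens, days=30, daily_cap=None):
    """Estimate monthly spend when every button press is one paid API call.

    daily_cap, if given, limits how many paid calls each user can trigger per day.
    """
    presses = presses_per_user_per_day
    if daily_cap is not None:
        presses = min(presses, daily_cap)
    calls = users * presses * days
    return calls * tokens_per_call / 1000 * price_per_1k_tokens

# Hypothetical figures for illustration only.
uncapped = monthly_cost(users=10_000, presses_per_user_per_day=20,
                        tokens_per_call=500, price_per_1k_tokens=0.01)
capped = monthly_cost(users=10_000, presses_per_user_per_day=20,
                      tokens_per_call=500, price_per_1k_tokens=0.01,
                      daily_cap=3)
print(f"uncapped: ${uncapped:,.0f}  capped: ${capped:,.0f}")
```

Without a cap, cost scales with however often users feel like pressing the button. A cap, a cache, or pre-generated results turn that into a number you can budget for.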

Reed Coke: Yeah, exactly.

Ron Green: So I want to transition a little bit into specific metrics around LLM selection. When you're evaluating a large language model, are there metrics that you're looking at when you're choosing which ones go into production, like latency, context window, domains?

Reed Coke: I, you know, people, I think, joke that the consultant's answer is always, "it depends." But I think that's why I'm here, to explain that. So it does depend. Right. You might have a system where you need it to be immediate, and that's going to disqualify a whole bunch of them. You might have a system where you only have, you know, a total budget that you can hit every month or something. That's going to disqualify a lot of them. So I find it to be a little bit less useful to compare them in the abstract versus, here's how we're going to use it, what's best for this? And that sort of takes you from this world of maybe what you could call an intrinsic evaluation of how good the model is to an extrinsic one. It's not, how good is the model? It's, what does using this model mean for my users' outcomes versus what does using a different model mean for my users' outcomes? And then you care about all the typical things you would care about. You start to care about engagement and satisfaction or, you know, what is the phrase I'm thinking of? It's extremely common and it has slipped off my tongue. You have detractors, promoters? NPS is what I'm thinking of.

Ron Green: Oh, net promoter. Right.

Reed Coke: Like, you get into that world as opposed to the sort of F1 score and perplexity and all these scientific things that at the end of the day are probably divorced from what you really care about. Well, it's one of the most challenging parts about selecting and comparing performance between the models, because in most other modeling scenarios we can use a single metric or maybe just a handful. With language models, it's just so much more complicated once you begin to connect them into the world, because their usability to a human is the ultimate litmus test. And that's just kind of hard to pin down with a single metric. And this isn't a new problem either. Especially for language models, BERT was the big exciting language model before the time of LLMs. And you had the same thing. You could look at these abstract scores of how well BERT recreates the text it read. Or you could plug it into a question answering system and see how many questions it answers right. And hypothetically, if you had a better version of BERT, it would get more questions right. And I think ultimately, that's at least a more tangible way to assess what it means for you in a space with so many different options.
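As an illustration of an extrinsic comparison, the NPS measure mentioned above is conventionally computed from 0 to 10 survey ratings as the percentage of promoters (9 or 10) minus the percentage of detractors (0 through 6). The sketch below applies that to two sets of ratings; the ratings and the "model A / model B" framing are invented for illustration.

```python
def net_promoter_score(ratings):
    """NPS = percent promoters (9-10) minus percent detractors (0-6)."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

# Hypothetical survey responses from users of a feature backed by each model.
ratings_model_a = [10, 9, 8, 7, 6, 3, 10, 9, 9, 2]
ratings_model_b = [8, 7, 6, 5, 9, 4, 7, 6, 10, 3]

print("model A NPS:", net_promoter_score(ratings_model_a))
print("model B NPS:", net_promoter_score(ratings_model_b))
```

The point of a measure like this is that it scores the outcome for users, not the model in isolation, so two models can be compared in the setting where they will actually be used.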

Ron Green: Right. So if you're looking at very specialized domains like healthcare or the legal domain, what special criteria or what additional considerations need to be made when selecting a large language model?

Reed Coke: So in the older world of language models, small language models I guess you could call them now, there was a question of vocabulary. The model just might not know what acetaminophen means. It might not have a concept of that. And that's not so much the case anymore, I think, which is just a function of the "large" part of large language models. But it still goes back to, I guess, what I was talking about earlier, which is that it's a question of usefulness. These language models are trained to generate things that an annotator called useful during training. And if you have a system that was specifically trained in the medical domain, that idea of usefulness is probably very different from just this kind of general purpose system. At least in the case of ChatGPT, it was like a science experiment originally.

Ron Green: Right. Right. And there are increasingly open source language models available that have been trained on domain-specific datasets. So they may have been trained on popular datasets like the Pile and other things like that, but then they were fine-tuned on a corpus that was domain specific.

Ron Green: Do you feel that that is sort of an important task in getting a language model to be really successful in a domain?

Reed Coke: I think so. I think that you can still find success with a more general purpose one with very careful prompting and very careful thought about how to deploy it. But if you want to offload some of that work, I think you're going to have a much easier time picking something specialized. Or putting it a little bit differently. I have a friend and 60 to 70 percent of his opinions are parroting Reddit and he is hard to be around. So in that sense you might want something that is trained a little bit more specifically too.

Ron Green: I love it. OK, let's transition and talk a little bit about the ethical and legal considerations that go into this. Because again, in healthcare, as an example, model hallucinations can be a real problem. And then you have other issues around licensing and the availability of these open source models to be used for commercial purposes. And it's kind of all mixed together. I'd love to just hear your thoughts on that.

Reed Coke: Yeah, it's one of those things that feels to me like the best laid plans will still go awry here. You can be as careful as possible and there's probably some little weird nook in the space of things that it can do that is bad. And so you end up needing to spend a lot of time on things like red teaming or on other types of ways to kind of put guardrails around the experience that the model can create unless you want to give up a car for a buck.

Ron Green: All right, so I was gonna bring that up, so that's a perfect, that's fantastic because you bring up red teaming really quickly. Would you describe what that is for the listeners?

Reed Coke: Yeah, I don't think that I'm going to give the best definition in the world, but it's, you know, conceptually, it is this idea of let's try to break this and let's go as far down this path as we can to see if we can cause it to do something bad and then, you know, it's almost like bug testing in like a video game sense, right? Of like bug testers in video games don't just play the game, they're running at the corner where the rock meets the wall for 10 minutes trying every which way to get in between things they shouldn't be able to, you know, get through a door.

Ron Green: Right.

Reed Coke: For the boundary conditions or the edge cases of interactions.

Ron Green: Yeah, yeah. OK, so you bring up this story that happened just recently where somebody was able to interact with the language model. This is what I understand, correct me if I'm wrong. Interact with a large language model online and negotiated the purchase of a car for a dollar, I believe.

Reed Coke: Yeah. I assume that that didn't actually happen. I mean, in the sense that I assume they don't currently have the car.

Ron Green: Right, right, right.

Reed Coke: But yeah, I mean, the picture of the dialogue is there. Right. You can, for me at least, I can look at this and see exactly how this happened.

Ron Green: Yeah, and I think that is one of the challenges that we should maybe call out for our listeners, which is that we have seen probably the most success with deploying language models into production when there is a little bit of disintermediation between the user and the system itself. Could you maybe talk about that a little bit?

Reed Coke: Yeah. As someone who used to work on chatbots, you might be surprised to find that I don't believe in chatbots, which is to say, to maybe a surprising degree. I really think that just sort of the open text box of input is a siren song. It sounds really nice, but then when you actually get into it, it makes a lot of assumptions about what your users want to do, can do, won't do. I've seen a lot of applications where people wanted to put an open text box in front of people and let them do anything, and the very first thing they do is say, what do I do? And then on the flip side, people will say, oh, I know what I want to do. I want to get a car for a dollar.

Reed Coke: So I really like tap back responses, actually. I'm very happy to be able to kind of quickly get a like, oh, did you want to say thank you? Did you want to say see you soon? And just pick from one of those. And that still leverages the encoding part, the understanding, in order to surface relevant suggestions, but now you're not leaning on the LLM for that same kind of totally open decoding.
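A rough sketch of the tap back pattern: the system only ever shows replies from a fixed, pre-approved list, and the "understanding" step is used solely to rank that list. Here a simple keyword overlap stands in for that step, which is an assumption for the sake of a runnable example; in practice the ranking might come from an LLM or an embedding model. The reply list and trigger words are invented.

```python
import re

# Pre-approved replies, each with hypothetical trigger words for incoming messages.
CANNED_REPLIES = {
    "Thank you!": {"thanks", "thank", "helped", "appreciate"},
    "See you soon!": {"see", "meet", "tomorrow", "later"},
    "Good job!": {"finished", "completed", "workout", "done"},
}

def keyword_score(message, keywords):
    """Stand-in for model-based relevance: count trigger words in the message."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    return len(words & keywords)

def suggest_replies(message, replies=CANNED_REPLIES, k=2):
    """Return the k most relevant canned replies; never free-form text."""
    ranked = sorted(replies,
                    key=lambda reply: keyword_score(message, replies[reply]),
                    reverse=True)
    return ranked[:k]

print(suggest_replies("I just completed my workout", k=1))
print(suggest_replies("Sell me a car for a dollar"))
```

Whatever the user or the other party types, the output space is limited to the approved list, so an adversarial message can at worst surface an irrelevant suggestion.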

Ron Green: Yeah, totally. Essentially having just a complete open possibility around the input.

Reed Coke: Or like I recently got an Apple watch, right? And it gives you suggestions when my friend completes his exercise in the afternoon. And you know, it'll say like, good job, way to go, or I need to call you later.

Ron Green: Right.

Reed Coke: I'm never gonna use that third one, but if a third of all my users are getting that suggestion, eventually someone's gonna click on that. Right. And you have an open text box, it's just, you know, there's infinity weird answers.

Ron Green: Right, yeah. Infinite.

Reed Coke: Even if someone's not trying to be bad.

Ron Green: Right, infinite corner cases.

Ron Green: Yeah. Okay. Um, okay. I'd love to, um, kind of as we wrap up here, be concrete. I understand that large language models are moving at a blistering pace. So, um, your recommendations now will probably be stale in six months. But if you had to pick, if you had to pick an open source, large language model for a project or a closed source commercial project, uh, like large language model rather for a project, which would you go with right now? Just at this point in time.

Reed Coke: Yeah. Sticking to kind of this gut feeling that I have about the prompt you send in, that all of them are extremely capable if you find the right way to use them, I'm actually very interested in the environments that are getting built up around some of these models. So on the open source side, Llama has been really interesting. Llama, Llama 2, all those things have been very interesting. And this sort of open Llama framework around it has been very cool for stuff like retrieval-augmented generation or all sorts of other things. And I'm sure much more to come. But I found that to be really easy to work with. So I've really enjoyed the Llama models because of how they connect into this whole ecosystem.

Ron Green: I was gonna say the ecosystem within Llama is so strong already.

Reed Coke: And then kind of a same-spirit answer, AWS Bedrock is doing a lot of really impressive stuff, and Claude is one of the models that commonly gets used in there, but it really seems like that's also set up for a lot of really exciting development around the LLMs themselves. Because again, I'm interested in the encoding more than the decoding.

Ron Green: Right. Right. Okay, so let's wrap up, just a fun little last question. Personally, if you could have AI automate something for you in your life, what would you want?

Reed Coke: I like to cook, I like to eat, I don't mind grocery shopping, I really hate deciding what I'm gonna eat. And specifically sitting down on Sunday and trying to plan out a whole week's worth of meals. I would be thrilled to have basically an AI coach to figure out based on what I've got in the fridge, based on what I typically want, based on what my health goals are to build out sort of a meal plan, just send me the grocery list. I don't even want to approve it really. Just send it to me, I'll go buy everything, I'll cook it all, that's all fine. I would be thrilled.

Ron Green: Okay, I think your answer is probably the closest to coming to fruition. I think something like that is imminently achievable. It's probably just getting your inventory correct.

Reed Coke: It's not that different from a lot of services that are already out there. So you might say, why aren't you just using HelloFresh or something like that? I don't have to answer that question.

Ron Green: I love it. Reed, this was fantastic. Thank you so much for coming on board. I really appreciate it and I hope everybody learned a lot today. Yeah, I hope so too.

Reed Coke: Thanks for having me.