Small Language Models and the Future of Production AI with Karun Thankachan
When general-purpose LLMs are overkill, small language models trained on specific tasks may be the smarter bet
This conversation with Karun Thankachan is a practical tour through small language models in production, starting from the limitations of general-purpose LLMs and repeatedly returning to a single constraint. Cost-effective reasoning for specific tasks is a different engineering problem than general-purpose reasoning, and good engineers choose their tools accordingly.
Thankachan is a Senior Scientist at Walmart, where he works on language model systems for retail AI applications. He has a background in machine learning research from Carnegie Mellon University and has spent time at Amazon before his current role, building production AI systems at scale.
In our conversation, we also talked about ReasonLite, an open-source library that brings chain-of-thought distillation, program-aided reasoning, self-consistency, and trace-budget control under one unified interface, making SLM training feel more like hyperparameter tuning than a collection of disconnected scripts.
He also covers SLM-Fusion, a multi-model orchestration framework that handles routing, merging, and serving across multiple specialized SLMs, including an OpenAI-compatible FastAPI gateway that abstracts the entire reasoning layer as a microservice. Finally, the conversation turns to where the industry is headed and why RAG and context engineering are winning over fine-tuning right now, and what to watch as diffusion models become more mainstream.
You can watch the full conversation below or read on for the complete Q&A transcript.
1. Your career spans both academia and industry—and perhaps things in between—from advanced research at CMU to applying AI at Walmart. Can you share with us how this journey led you to eventually focus on SLMs?
Karun Thankachan: Yeah. So I started my career as a software development engineer back in India. There, I had the opportunity to work under a director who was starting the data science and machine learning team there.
I had the opportunity to work on a bit of data analytics, big data science, and eventually machine learning. That sort of sparked my curiosity in machine learning, leading me to do my master’s in machine learning from Carnegie Mellon.
And that’s where I got a bit more interested in the research side of things. I had the opportunity to work under a few professors there. Professor John Stanford, in particular—I was able to publish a pretty good paper at AAAI, and got into the weeds of deep learning. Eventually, that led me to a role at Amazon, and now I’m a senior scientist at Walmart.
It was, however, at Walmart, with the ChatGPT/LLM wave, that I got involved a little bit more in the field of language modeling and NLP.
Our current director had a vision for agents that could solve specific business problems, and we started trying to develop toward that mission. During that time, I realized that a lot of building agents is a little bit more software engineering than machine learning. It was a lot more about designing evals that could give you feedback on how your LLMs are performing, building guardrails that could make sure that your LLMs are behaving the way that you want them to behave, and optimizing for cost and latency. A lot more engineering focus than, let’s say, machine learning focus.
And I missed a little bit of that machine learning flair. And that’s when I started investigating on my own a little bit more how I could be involved in the machine learning side of things within the AI wave.
And I stumbled upon small language models—language models that you could actually fine-tune and optimize to the specific task that you want. It felt a bit more in the domain of machine learning. It felt more like you were understanding how a model was working and helping the model learn patterns in your data, which was sort of what got me interested in the field. That’s what got me interested in SLMs.
And it’s sort of my opinion that right now, we are in a race to see who can define this new customer experience that would be based on LLMs. And that’s why we are relying a lot more on foundational models, and we’re hitting them via APIs, plugging and playing them into our experience to figure out what kind of new, reliable experience we can provide to the consumer. And whoever provides that new, better experience to the consumer will take over a huge amount of the market.
But afterwards, once this new experience is well-defined, then we will go back to that age of cost optimization. And that’s where SLMs are going to come back into the picture, because they are able to reason more cost-effectively on specific tasks, as opposed to LLMs, which are more general-purpose reasoning. So that’s sort of why I still keep very invested in this domain of SLMs, and I’m hoping that converts in the next five or six years. Yeah.
2. Let’s talk about ReasonLite, which was introduced as a way to tackle the fragmented, unclear evaluation practices and high token costs that hinder reasoning in small models. What gaps did you see in existing SLM distillation toolchains that made you create ReasonLite?
Karun Thankachan: Sure. So maybe to take a step back and explain what ReasonLite is: LLM models are models that have a tremendous reasoning capacity—general-purpose reasoning capacity. And since they are models with billions of parameters, they’re also able to, in very layman’s terms, remember a lot more details to be able to reason out a solution for a question that you ask.
For instance, if you ask an LLM how do you bake a cake, an LLM might be able to remember all the steps that are required to bake a cake—preheat the oven, mix flour and sugar, add X, bake for X amount of minutes. So it’s able to remember a great amount of detail.
An SLM, in comparison, doesn’t have as many parameters, so it won’t be able to remember as much. It’ll be able to maybe remember things like flour, eggs, cake.
And how these language models work—they’re essentially what we call autoregressive models. So what it has generated thus far will influence what it will generate next. So if you can’t remember a lot of the prior steps to baking a cake, like preheat the oven, flour, egg, cake mix, stuff like that, you might not continue generating the correct answer. An SLM might say flour, eggs, brownie instead of flour, eggs, cake, because it didn’t have the correct prior before it.
But even though LLMs have billions of parameters and have a lot of memory, it may be overkill when you want reasoning for a specific task that you have in mind. It’s really great for general-purpose reasoning. But for specific reasoning for a particular task that you have in mind, an SLM might be able to get you there. And the only thing that you have to do is update its parameters, which are currently now for subpar general-purpose reasoning. Update it to become good for specific-task reasoning.
So how do you teach these SLMs how to build out that reasoning? There are a lot of techniques in the market, like CoT distillation. To provide an example, here you ask the LLM, “Hey, I’m going to ask you a question. Let’s say a person went to a store where they’re selling apples for $3. He bought four apples. He gave the shopkeeper $20. How much does he get back in return? Tell me the answer, but show me the steps also.”
So the LLM will write out: the cost for an apple is $3, four into $3 is $12, he gave 20, so 20 minus 12, eight back. Your answer is 8. It will show you each of the steps and then give you the answer, the 8. So like we mentioned earlier, these are autoregressive models. So what it generated earlier will help it generate more accurate answers in the future. So if I’m able to get the SLM to output responses in this similar manner, or at least think in this similar manner, then it’s more likely to get at the final answer.
So what we do is we ask the LLM to generate its entire chain of thought, and we feed this chain of thought into SLMs and ask it to generate a similar chain of thought. We don’t do this for general purpose. Rather, we do it for our specific tasks. If you try to do it for general purpose, again, the parameters will get updated in every which way, and it won’t be able to generate good answers again. But if you do it for a specific task, the parameters will get updated to reflect that specific task. It will start to be able to solve that specific task.
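The data-preparation side of chain-of-thought distillation can be sketched in a few lines. This is an illustrative sketch only, not ReasonLite's actual API; the function name and prompt format are assumptions:

```python
def build_distillation_example(question: str, teacher_trace: str, answer: str) -> dict:
    """Package a teacher LLM's reasoning trace as one SLM fine-tuning pair.

    The student is trained to reproduce the full trace, not just the final
    answer, so its autoregressive context contains the steps that make the
    final answer likely.
    """
    return {
        "prompt": f"Question: {question}\nLet's think step by step.",
        "completion": f"{teacher_trace}\nFinal answer: {answer}",
    }

example = build_distillation_example(
    question="Apples cost $3 each. A customer buys 4 and pays $20. What is the change?",
    teacher_trace="Cost = 4 * 3 = 12. Change = 20 - 12 = 8.",
    answer="8",
)
print(example["completion"])
```

Thousands of such pairs, all drawn from the one target task, are what nudge the SLM's parameters toward task-specific reasoning.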
Similar to CoT distillation, there are other techniques, like contrastive rational training, which is essentially you tell it, “This is the answer that I want, like four into three is equal to $12. That’s what I want. Four into three is equal to $11. That’s not what I want.” So you push it toward what you want and push it away from what you don’t want. So there are a lot of these techniques out there for helping train SLMs to perform or provide reasoning on a specific task.
But when I was building out SLMs to reason on specific tasks, I realized a lot of these techniques were written in notebooks. They were written in scripts. And what I wanted was something similar to what ML practitioners are familiar with today, which is a kind of hyperparameter tuning, where you have all these knobs and you can turn them on and off. You can adjust the parameters, and you can figure out what set or what combination of techniques helps the model learn the pattern the best so that it can generalize in the future.
So I wanted all these techniques under one roof, like CoT distillation, self-consistency, program-aided distillation, contrastive rational training, curriculum scheduling. I wanted all these techniques in one package, which I could control similarly to hyperparameter tuning. And that’s why I developed ReasonLite. Everything was split out in files, and I wanted to bring it into one package. And with this now, hopefully, practitioners can call the package, tune it just like they would with HP tuning, and that sort of, I feel, solves a pain point in current SLM training.
3. ReasonLite integrates program-aided distillation, using external symbolic tools or code to verify intermediate reasoning steps. How does this approach work in a real-world training pipeline? Can you give an example of using a tool, say a calculator or knowledge base, during distillation, and how it improves a student model’s reasoning accuracy without overly complicating the production workflow?
Karun Thankachan: So maybe to explain what program-aided distillation is, let’s take our previous example of a person going to a store, buying four apples worth $3 each. They pay the shopkeeper $20. How much do they get back in change?
If you give that question to an LLM and ask it to give you an answer with a sort of chain of thought, then what happens is it does the calculation. It says that, hey, four into three is $12, and 20 minus 12 is $8.
So here, the LLM doesn’t actually have a calculator doing the calculation. What it’s doing is it’s looking at four into three, and it’s saying that 12 is probably the most likely answer. But at times, an LLM could generate a chain of thought that says that four into three is $11, just because it’s not actually doing the calculation. It’s just predicting what’s the most probable answer.
Same thing with 20 minus 12. It might not give you $8. It might give you $9, because it’s just predicting what’s the most likely number. It’s not doing the actual calculation.
So this is a little bit harmful when you are trying to do chain-of-thought distillation at scale. Just to refresh everyone’s memory, what is chain-of-thought distillation? You ask the LLM to show the steps that got it to the final answer, like four into three is 12, 20 minus 12 is 8. Those steps—show it. So that’s the chain of thought.
Along with the answer, that’s the entire chain of thought. That chain of thought, you feed it to the SLM. And then the SLM tries to generate that same chain of thought using its much smaller parameters. But it’s only being trained on chain of thought for a specific task, so it will be able to capture that limited amount of chain of thought.
Now the problem is, if these chains of thought that we are generating from the LLM itself are incorrect—like four into three is 11, or 20 minus 12 is 9—then the SLM obviously can’t be expected to learn the correct answer. And you can’t sit and verify all your chains of thought, especially when you’re training at scale.
So how do you make sure that the intermediate steps, especially the math-oriented ones, are correct? You ask the LLM to generate them as something like Python code instead: c is equal to four into three, and answer is equal to 20 minus c. So your answer is 20 minus four into three, which is eight.
So instead of the LLM actually doing the calculation, it just writes the code with the inputs that you provided. The code is taken and run in something like Python or a calculator. The answers are then attached back to the chain of thought, and then you feed it into the SLM.
So this way, with this kind of external program that’s embedded into your distillation—i.e., program-aided distillation—you can make your chain of thought a little bit more accurate, and you can get your SLM to learn only on the correct answers instead of any incorrect answers.
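A minimal sketch of that execute-then-attach step, assuming the teacher emits plain Python assignments (the `verify_trace_with_code` helper and the trace format are hypothetical, not ReasonLite's implementation):

```python
def verify_trace_with_code(code: str):
    """Execute the teacher's emitted code so the numeric result is computed,
    not predicted, before the trace is fed to the student SLM."""
    scope = {}
    exec(code, {}, scope)   # in a real pipeline, run this in a sandbox
    return scope["answer"]

# Teacher output for: 4 apples at $3 each, paid with $20 — what's the change?
teacher_code = "cost = 4 * 3\nanswer = 20 - cost\n"
result = verify_trace_with_code(teacher_code)
print(result)   # the executed, trusted value attached back to the trace
```

Because the arithmetic is done by the interpreter rather than by next-token prediction, the "4 × 3 = 11" class of errors never reaches the student's training data.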
4. One feature of ReasonLite is a trace-budget controller to constrain the token usage of chain-of-thought traces during training and inference. In a production deployment, why is controlling the length of reasoning traces important for cost and latency?
Karun Thankachan: When you’re actually serving answers to users, you run into actual engineering concerns. One is obviously latency. When a user types in a question, you want to give them an answer in a fairly short amount of time so that the user doesn’t drop off on the site. And you maybe don’t want to provide a very verbose answer unless the user is explicitly asking for it. If they’re asking for something simple, you want to give them an accurate answer, a comprehensive answer, but it doesn’t need to necessarily be verbose.
So during that time, if your model is trained to think in this chain-of-thought manner, where it’s trying to explain breakdown steps and then get the answer—which is generally good practice—but if it’s trained on fairly long chains of thought, that might kick up your latency and increase your cost, because each token costs a little something to produce, even from the SLM.
So you might want to have some kind of guardrails around it, so that the latency doesn’t increase to an unbearable amount, and the cost doesn’t become very cumbersome or essentially very expensive. The compute doesn’t run up, essentially.
So it’s to prevent that that you have these trace budget controllers. And how it sort of works is, you can enforce it in different manners. You can enforce it by saying that, hey, for any particular inference call, you shouldn’t take more than X amount of tokens. If you’re starting to hit your X amount of tokens, cut your chain of thought short with whatever you have and provide an answer. It might not be an accurate answer, but it helps you make sure that your token cost won’t grow beyond a particular point, and your latency also won’t exceed a particular value.
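A hedged sketch of that enforcement idea: cap the number of generated tokens and flag when the chain of thought was cut short. The `generate_with_budget` name and the step callback are illustrative, not ReasonLite's actual interface:

```python
def generate_with_budget(step, max_tokens: int):
    """Decode token by token until the model stops or the budget is hit."""
    tokens = []
    while len(tokens) < max_tokens:
        tok = step(tokens)
        if tok is None:        # model finished on its own
            return tokens, False
        tokens.append(tok)
    # Budget exhausted: return the partial trace and flag it, so telemetry
    # can later reveal an over-verbose model that needs retraining.
    return tokens, True

# Fake decoding step: yields a verbose 11-token trace one token at a time.
verbose = iter(["Cost", "=", "4", "*", "3", "=", "12", ",", "change", "=", "8"])
trace, truncated = generate_with_budget(lambda _ctx: next(verbose, None), max_tokens=5)
print(trace, truncated)
```

The `truncated` flag is the important part: logged alongside each request, it is what lets you connect budget hits to user dissatisfaction later.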
Now, an obvious question that people might have is, hey, if I limit my token usage and my chain of thought isn’t allowed to grow, won’t I get bad answers? Which is a very reasonable question. And typically, yes: if your model is trained to produce very verbose chains of thought, you will hit that token limit again and again with the trace-budget controller, and your model won’t be allowed to express all its thoughts and therefore give a good answer.
So typically what we do is we have eval metrics that track things like whether the model is being useful to the user or not, like the thumbs-up, thumbs-down feature in ChatGPT. So if you get a lot of thumbs-downs, and you see that for all those requests the trace-budget controller was cutting off the chain of thought, then you can understand from your evals and from your logs that your model is actually being too verbose.
We need to retrain the model so that the chain of thought is shorter, it’s less verbose, and it’s able to get to the answer quicker. So that’s why this kind of trace-budget controlling is important in production settings, and how it helps you limit token usage.
5. Techniques like chain-of-thought prompting and self-consistent decoding—generating multiple reasoning paths and aggregating answers—can significantly improve reasoning accuracy. However, they also increase compute cost and latency by running the model multiple times. How do you balance these trade-offs for production systems?
Karun Thankachan: So, maybe I can take a step back. What do we mean by self-consistency? Essentially, like we mentioned, LLMs are not actually doing the specific calculations or specific reasoning. They’re not actually understanding, or they’re not able to derive meaning. They are autoregressive models. So based on whatever they’ve seen, whatever they’re generating right now, they’re going to generate the most likely answer next. So sometimes it may be generating incorrect things.
But if you ask the model to generate an answer to a specific question 10 times, then the majority of the time, it might actually give you the right answer. So what you can do, or what is a decent practice, is: with your LLM, you generate maybe not just one chain of thought. You ask it to generate 20 chains of thought. Then you check the final answer for these chains of thought, and if the final answer across those chains of thought is similar in a majority of cases, that is your final answer.
For instance, let’s go back to our apples example. Four into three, 12. Twenty minus 12, eight. Let’s say 12 chains of thought generate eight, the other five generate nine, and the remaining three generate something like 10. So the majority vote is eight. Now you know that this is probably the right answer. Let’s pick up all these chains of thought and use that to train our SLM.
So self-consistency is a fancy way of saying majority voting. You’re essentially asking the same model to generate the response to a question again and again, and taking the answer it produces in the majority of cases as the final one.
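The voting itself is simple enough to sketch. This is an illustrative toy, with a stand-in sampler in place of real chain-of-thought generation:

```python
from collections import Counter

def self_consistent_answer(sample_answer, n_samples: int = 20):
    """Sample several chains of thought and keep the majority final answer."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    return answer, count / n_samples

# 12 traces end in 8, five in 9, three in 10 — the apples example above.
samples = iter([8] * 12 + [9] * 5 + [10] * 3)
answer, agreement = self_consistent_answer(lambda: next(samples), n_samples=20)
print(answer, agreement)   # 8 0.6
```

The agreement ratio is also a useful training-data filter: questions where the teacher can't reach a clear majority are often better dropped than distilled.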
Now, if you try to use self-consistency during inference, you’re essentially asking the SLM to answer the same question multiple times—let’s say 10 times. You’re picking the majority answer, and then you’re giving that majority answer to the user. The problem is, if you do it at inference time, instead of answering one question once, you’re answering it 10 times. So the cost becomes 10x. The latency becomes 10x as well during inference.
So typically, you don’t want to use self-consistency at inference time, so that you can control the latency. You only typically use self-consistency during your training loops, so that you can figure out what chains of thought to actually feed into your SLM. So the simple rule of thumb is: self-consistency is better for training time, where latency isn’t the concern—and cost also, to a certain extent, because you’re doing a lot of these things in batch, and you can append other techniques on top. You can actually manage the cost and manage latency. So use it at training time. It’s not something you need to use at inference time.
6. In ReasonLite, you emphasize not just final answers but also intermediate reasoning quality—providing targeted reasoning probes and symbolic verifiers to assess a model’s thought process. In a practical setting, how do you evaluate whether a distilled small model is truly reasoning well versus just guessing the right answer?
Karun Thankachan: Got it. So essentially, when you are training an SLM, like we discussed thus far, what it generates is based on what it has learned and on what it is currently generating in its chain of thought. So when you are evaluating an SLM, it might not be enough to evaluate whether it’s generating the final answer correctly or not.
Again, going back to our apples example: four into three, 12; 20 minus 12, eight. Eight is the final answer. Does that mean you evaluate it only on eight? You could. But let’s say it did four into three, 11; 20 minus 11, eight. It still got to the final answer, but it’s because it’s not really doing the correct things. It’s just, again, making guesses about what’s the most probable answer. And it somehow stumbled on the correct answer.
So your model might actually deviate from the behavior that you desire, but your answer is still correct. We don’t want those kinds of things to spread into production. So we want to not just evaluate the final answer. We want to also make sure that we evaluate the steps in between.
So how do we actually do that? We can check what we call stepwise behavior. There are a few things that you can inject into the model—symbolic verifiers or reasoning probes. These are, I think, two things that we have implemented in ReasonLite.
So maybe to give an example of how these things function: a reasoning probe is trying to figure out if the SLM is able to do a specific substep well. For instance, we have mathematical reasoning probes that help you test if your SLM is learning math very well. Take 17 plus 8 is equal to 25. When you do that addition, there is this behavior called carry behavior, where seven plus eight is 15, so you need to put the five and carry over the one. Then one plus one is two—25. That’s how you do the math. So this carry behavior is something you want to see specifically if your SLM is learning. Within ReasonLite, there are functions that help you test specifically for carry behavior within your SLM. So that’s a way of evaluating if your SLM is behaving well on substeps.
Similarly, step verifiers are another way of evaluating if your SLM is behaving the way you want it to. For instance, the apples example again: four into three is equal to 12. That is a substep. You want to verify if that substep is accurate or not. So you take the output of the substep. The step verifier takes whatever it is doing, runs that code, generates an actual answer, and matches it up. So it’s able to see, at the substep level, if it’s giving you an answer correctly.
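A toy step verifier might look like the following. The regex and the "a op b = c" trace format are assumptions for illustration, not ReasonLite's actual implementation:

```python
import re

# Match simple arithmetic substeps of the form "a op b = c" in a trace.
STEP = re.compile(r"(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def verify_steps(trace: str) -> list:
    """Recompute every parsed substep; return one pass/fail flag per step."""
    results = []
    for a, op, b, claimed in STEP.findall(trace):
        results.append(OPS[op](int(a), int(b)) == int(claimed))
    return results

good = verify_steps("Cost = 4 * 3 = 12. Change = 20 - 12 = 8.")
bad = verify_steps("Cost = 4 * 3 = 11. Change = 20 - 11 = 9.")
print(good, bad)   # [True, True] [False, True]
```

Note the second trace: each later step is consistent with its (wrong) input, which is exactly why stepwise checks catch failures that final-answer evaluation misses.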
So these kinds of reasoning probes and step verifiers are things that you can maybe add on to the final-answer evaluation, and they’ll give you a little bit more information about how your model is actually behaving.
7. Let’s shift into talking about SLM-Fusion and multi-model orchestration. Modern AI deployments often involve multiple models, from small domain-specific SLMs to large general LLMs. But traditional serving frameworks usually assume a single static model, which leads to inefficiencies under dynamic workloads. SLM-Fusion, from what I understand, is an open-source library proposed to bridge this very gap by unifying model merging, routing, and serving in one system. Can you explain the impetus behind SLM-Fusion and also talk to us about how it works in a real scenario?
Karun Thankachan: Got it. So SLM-Fusion—just to give a little bit of context—this paper was written a while ago, before multi-agent architectures became popular. I would say maybe it’s a little bit outdated at this point.
But SLM-Fusion, the idea essentially is: typically, with LLMs, with their general-purpose reasoning capability, you can, with enough RAG and context engineering, get them to answer questions within multiple domains, even with a single LLM, as long as you have your RAG engine built well. You have a retrieval layer that gets you specific context related to this new domain, and you do context engineering well enough that only the relevant info stays inside the context of the LLM.
But if you’re working with SLMs, since you are fine-tuning your SLMs on a specific task, that fine-tuned SLM is only able to reason very well for that specific task at hand. You can’t just use one SLM and be hopeful that it will be able to pick up or be able to reason in a new domain as well, because you are training it—its parameters are limited. You’re only able to reason for one specific task.
So in this case, what you typically do is train multiple SLMs, and then you figure out a way, based on the questions that are coming in, how to route the question to the appropriate SLM. So that’s essentially the idea behind SLM-Fusion. The key thing is: how do you route it to the correct SLM? How do you evaluate when the routing was inappropriate? And how do you not base the routing on just hard-coded rules, but learn the routing from user behaviors?
Sort of like: hey, it routed to SLM A, the user didn’t particularly like that response, but based on whatever rules we have, that was the SLM to route it to. So now, how do you reconcile the fact that that was the SLM to route it to versus the user behavior? Was it one question and then a follow-up question that switched it to another domain?
So those kinds of things—how do you actually evaluate that, how do you learn it from telemetry, and try and update this routing over time—that was the core of SLM-Fusion. Now with multi-agent architectures, it’s becoming a little bit easier, but if you’re working with SLMs, some kind of routing is good. There have been multiple papers at ICLR, ICML, and AAAI around this routing concept as well, so the thinking here has moved on quite a bit since the original work.
8. In production, when would we prefer merging models over just choosing one? Can you perhaps discuss an example use case where merging two specialized models could yield better results than using a single model alone?
Karun Thankachan: Sure. So again, it comes to how different the reasoning is that these two different SLM models would have to learn.
For instance, let’s say within a retail scenario, you have a reasoning model that—let’s say it’s an anomaly detection model—that sort of needs to decide, looking at sales of an item, why the sales dropped anomalously. So if sales of an item drop anomalously, there could be multiple reasons that drove it. So if I were building an SLM model, if I saw sales dropping, then the next thing I would have the SLM model do is be able to generate multiple hypotheses and then figure out what is the appropriate one to chase down and try to answer.
Within that same context, if, let’s say, I wanted to fix the anomaly, I would want an SLM model where I would give it some context on, “I want you to go and hit this system, change the value to something like this, and hit the system, change the value to something like this.” Here, the SLM model doesn’t need to have that kind of broad thinking in terms of hypothesis generation. It needs a little bit more specific tool understanding, a little bit more integrated with API calls.
So the reasoning between both these models would be very different. One would be a bit more broad—generate hypotheses, figure out which one is the right one, and then tackle it, I mean run those hypotheses, figure out what is an appropriate answer. And this one is a little bit more tool-oriented, a bit more in-depth, a bit more specific. It can’t afford inaccuracies because it’s interacting with the tool.
So the reasoning would be very different. In these cases, rather than trying to build one SLM that could maybe do both, it might be a better idea to separate it out, and it might be a good idea to bring these two into a routed format where you generate the hypothesis, you tell the user that, “Hey, I evaluated X hypotheses. These two seem to be the most likely root reasons.” Then the user sort of tries to understand, okay, maybe I also think that, okay, out of all the ones I’ve evaluated, this is probably the reason. Let’s try and fix this. Let’s adjust these metrics or adjust these settings here, and then you route it to the second SLM.
And that SLM sort of makes all the necessary tool calls, all the necessary adjustments, and it has more in-depth, specific reasoning built into it. So that might be a good scenario for routing.
9. One of the core features of SLM-Fusion is an adaptive routing layer that can be rule-based, learned, domain-specific, or cost-aware in deciding which model or ensemble handles a request. How do these routing policies work under the hood? For instance, what would a cost-aware router consider—latency SLA, API throughput costs, query complexity, etc.?
Karun Thankachan: Sure thing. So within the router, we have a few ways you can decide what SLM to route it to. The simplest way, and the easiest way to get started, is just rule-based routing. You see certain domain keywords, and you can route it to a specific domain SLM. The slightly more advanced manner is getting an embedding out of the user query and figuring out which sort of base embedding it matches the most. So each SLM would have a domain-encapsulated embedding associated with it. So it’s everything related to that domain in an embedding. So if the user query matches this domain-specific embedding, route it to this SLM.
Now, the wrinkle with this sort of embedding-based matching is that if the user asks a question that is maybe multi-domain, you might route it to the wrong SLM, or it might be the case that you need to split that question—route the first portion of the question to this SLM, get a response, route the second portion of the question to the next SLM, get a response.
So instead of this embedding being static, what SLM-Fusion does is provide you the opportunity to adjust those embeddings based on how well you have done on the questions users have asked in the past. So using your logs, you can pull in your logs. The ones that you didn’t do well on, those ones you can narrow down on. You can figure out how to update your embeddings for those specific ones that you didn’t do well on.
And for a particular question, if you feel like it’s a multi-domain question, within the router itself you have a tinier SLM that can split multi-domain questions into separate questions. So with these sorts of knobs, you are not just hard-coding how to route it; you are able to learn over time how the routing should evolve. And you are also able to address multi-domain questions by using the routing module to split them into different questions and orchestrate them in a manner where you can still use your SLMs, without needing to condense everything into one model or get SLMs to interact across domains. So that’s the core of how this sort of routing works.
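The embedding-matching idea can be sketched with a toy encoder. Here `embed` is just bag-of-words counts standing in for a real sentence encoder, and the domain names and keyword lists are invented for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One "domain-encapsulated embedding" per specialized SLM.
DOMAIN_EMBEDDINGS = {
    "pricing_slm": embed("price cost discount sale change refund"),
    "inventory_slm": embed("stock shelf warehouse inventory restock supply"),
}

def route(query: str) -> str:
    """Pick the SLM whose domain embedding best matches the query."""
    q = embed(query)
    return max(DOMAIN_EMBEDDINGS, key=lambda name: cosine(q, DOMAIN_EMBEDDINGS[name]))

print(route("Why did the discount price change?"))
print(route("Is this item still in stock at the warehouse?"))
```

In the learned version described above, the domain embeddings themselves would be updated from logged failures rather than staying fixed.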
10. Let’s talk a little bit about telemetry-driven feedback loops. What signals, according to you, are most valuable for such a loop in a production setting? And how do you feed this feedback back into the system?
Karun Thankachan: Got it. So it really depends, but the most critical one, I would say, is how the user is responding to the queries. So just like ChatGPT’s thumbs-up, thumbs-down—some kind of user satisfaction score. That would be the best way to assess any sort of generative system, because the responses being generated are evaluated by the user. And if it’s not a helpful one, there’s no point in any of these generative systems.
So being able to track user satisfaction scores and attach them to your logs—your chain of thought, your final answer, your user satisfaction scores—that sort of logging system is what we call telemetry. And once your logs are all stored and generated, being able to search through your logs and figure out which ones you didn’t do well on, and having enough logging to figure out which SLM it routed to, why it routed to that SLM, why it tried to split the question into separate portions—having all of those logs in one place is what is going to help you build that feedback loop and improve your routing over time.
Apart from user satisfaction, you could also use things like token usage. Is the compute cost actually building up? Is maybe a question that was designed for one domain being unnecessarily split into multiple questions and maybe just sent to the same SLM again and again? I’ve seen that happen also. So checking if your token cost for any of the responses you are giving is spiking. Similarly, if your latency is spiking.
So these three, I think, would be the top metrics to attach to your telemetry, or have tracked along with your telemetry, with timestamps and request IDs, so that you can map it properly. And then you can improve your routing layer over time.
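As a rough illustration of the telemetry described above, here is a sketch with an assumed dict-based log schema (field names such as `request_id`, `satisfied`, and `tokens` are made up for the example). It attaches the three signals, user satisfaction, token usage, and latency, to each request ID and timestamp, then scans the logs for requests worth feeding back into the routing layer.

```python
# One telemetry record per request: the signals discussed above, keyed by
# request ID and timestamp so they can be joined back to the full trace.
logs = [
    {"request_id": "r1", "ts": 1700000001, "slm": "billing-slm",
     "chain_of_thought": "...", "answer": "...", "satisfied": True,
     "tokens": 310, "latency_ms": 420},
    {"request_id": "r2", "ts": 1700000002, "slm": "billing-slm",
     "chain_of_thought": "...", "answer": "...", "satisfied": False,
     "tokens": 2900, "latency_ms": 3100},
    {"request_id": "r3", "ts": 1700000003, "slm": "shipping-slm",
     "chain_of_thought": "...", "answer": "...", "satisfied": True,
     "tokens": 280, "latency_ms": 390},
]

def feedback_candidates(logs, token_budget=1000, latency_slo_ms=1000):
    """Pull out requests worth re-examining: unhappy users, token-cost
    spikes, or latency over the SLO -- the inputs to the routing feedback loop."""
    flagged = []
    for rec in logs:
        reasons = []
        if not rec["satisfied"]:
            reasons.append("low satisfaction")
        if rec["tokens"] > token_budget:
            reasons.append("token spike")
        if rec["latency_ms"] > latency_slo_ms:
            reasons.append("latency spike")
        if reasons:
            flagged.append((rec["request_id"], rec["slm"], reasons))
    return flagged

print(feedback_candidates(logs))
# only r2 is flagged, on all three signals
```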
11. Thank you for that. So now, quantization is a common way to reduce inference cost, but mixing models of different precision—or even merging quantized weights—can be tricky. So what did you build in SLM-Fusion, again, to use it as a case study, to handle quantized models effectively?
Karun Thankachan: Got it. So, I guess, just to explain why quantization is tricky: quantization is nothing but using lower-precision number formats. With quantization, you can represent weights with 32 bits, 16 bits, 8 bits, or 4 bits. The lower you go, the smaller your models become, the faster the multiplications become, and therefore the faster your SLMs become. But as you quantize further, you lose a little bit of information, so they won’t be as accurate.
So how does it help you use different SLMs that might be in different quantization modes? It gets a little bit tricky here, but these calculations take place in large multi-dimensional arrays called tensors. And the way the calculation within an SLM works is, you align these tensors, or align these channels, pad them as necessary to get them into the same quantized format, and then carry the calculation forward.
So it’s a little bit more on the math side, but the key thing is aligning the tensors so that you’re not assuming all the models are at the same level of quantization. You identify whatever quantization each one is at, then work through packages that we already have. It’s not something new that SLM-Fusion is providing; most of the popular deep learning packages already provide this. But you align per tensor, per channel, so that the calculations actually flow through.
And in terms of, apart from just the quantization, building adapters into your models is another way to perhaps mitigate this. Adapters are still, I would say, a little bit unproven in terms of the value they add for the number of parameters they introduce. But in some very few scenarios, where the domains are similar enough, but you need a slight change in parameters so that it adapts—not to a completely new domain, but maybe a complementary domain—in those cases, I think adapters work. But for quantized models, if they’re in different quantized states, having adapters can help you maybe bridge that gap as well.
So, a little bit on the math and technical side: alignment of tensors. A little bit less mathy, but more on the modular side: adapters to help bridge the gap. So those are the two things that I think SLM-Fusion had that help you work with different quantized SLMs.
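A toy, pure-Python sketch of the per-tensor alignment idea, not SLM-Fusion’s implementation: each “tensor” is quantized symmetrically onto a signed integer grid with its own scale, and two tensors at different bit widths are aligned by dequantizing both to a common float format before the calculation flows through.

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization: map floats onto a signed integer
    grid of the given width, storing one scale factor per tensor."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8, 7 for int4
    scale = max(abs(v) for v in values) / qmax
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def align_and_add(a, b):
    """Align two tensors that may be at different quantization levels by
    dequantizing both to float before combining them."""
    return [x + y for x, y in zip(dequantize(*a), dequantize(*b))]

w8 = quantize([0.5, -1.0, 0.25], bits=8)   # one model's weights at int8
w4 = quantize([0.1, 0.2, -0.4], bits=4)    # another model's weights at int4
print(align_and_add(w8, w4))               # combined in a common float format
```

Production frameworks do this per tensor or per channel with packed integer kernels, but the principle is the same: never assume two models share a quantization level; recover a common format first.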
12. All right. Now, SLM-Fusion also introduced a FastAPI-based Fusion Gateway that is even OpenAI-compatible for inference requests. So how do you see a system like this being deployed in a production microservice architecture? Could it sit alongside existing serving frameworks, perhaps?
Karun Thankachan: Yep. Yeah, definitely. So the FastAPI backend is essentially there to support that same thing. The idea being that, within microservice architectures—again, maybe taking a step back—the core idea is that anything that has to do with one specific function is split out, modularized, and kept separate. So your reasoning engine, if it is like this multi-SLM model, you can keep it separate from everything else. You can update it as required without impacting any of the other microservices in that environment.
And with the FastAPI backend, the key idea is that you can hit it just like you would any other kind of service that you can abstract away. So what we typically call, I guess, reasoning as a service—RaaS, if you want to call it a new domain. So whenever you need a little bit of, “Hey, I think I need a little bit of human reasoning at this particular stage to make a decision on what to do next,” then just hit the API endpoint like you would in any kind of microservice architecture.
It abstracts away all the reasoning. It will do the routing within, it will pick the SLM, it’ll generate an answer, and it will send you back a response that follows the API contract. And that response isn’t just raw SLM output; it’s filled in so that the contract is always maintained with whatever service is calling the reasoning-as-a-service microservice.
So yeah, that way, you can just abstract the whole thing away, and you can put it in any kind of production environment, with the typical guardrails that you have, like trace-budget controllers and latency limits. It will actually stick to the SLAs that you typically expect in a multi-microservice architecture.
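A minimal sketch of the OpenAI-compatible contract idea: the function below builds a response in the shape of an OpenAI chat-completions response, and `route_and_answer` is a hypothetical placeholder for the routing and generation layer. In the real gateway a FastAPI route would return this dict as JSON; the sketch keeps it framework-free so the contract itself is visible.

```python
import time
import uuid

def route_and_answer(messages):
    """Placeholder for the fusion layer: pick an SLM, run it, return text.
    In the actual gateway, routing and generation happen here."""
    return "slm-billing", "Your refund was issued yesterday."

def chat_completion(body):
    """Build an OpenAI-compatible chat-completions response so callers can
    hit the gateway with any OpenAI client. The contract is always filled
    in here, never left to raw SLM output."""
    slm, text = route_and_answer(body["messages"])
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": slm,                      # reports which SLM actually answered
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }

resp = chat_completion({"messages": [{"role": "user", "content": "Where is my refund?"}]})
print(resp["model"], resp["choices"][0]["message"]["content"])
```

Because the response shape matches the OpenAI schema, existing clients and tooling can point at the gateway without code changes, which is what makes the reasoning layer swappable as a microservice.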
13. Finally, Karun: any emerging trends, perhaps in governance or tool integration, that you believe will significantly impact how we deploy language models in production?
Karun Thankachan: I think, right now, diffusion models are becoming a little bit more commonplace, and that might be a trend worth checking out.

Apart from that, the main thing to focus on is that, within LLMs, maybe six months ago there was a split: is investing in parameter-efficient fine-tuning (LoRA, QLoRA) along with alignment techniques like DPO and PPO a good investment of time, versus just focusing on RAG and prompt engineering? It looks like the industry is shifting a lot toward RAG and context engineering. One reason is that it’s cheaper; for the fine-tuning approaches, you need specific hardware, and you need to hire people who know how to do it. But it also seems like you can actually get fairly accurate answers and fairly good reasoning from your LLMs if you set up a good RAG pipeline and bolster it with good retrieval: a way to improve or rerank the retrieved documents and select the best ones on top of that.

So don’t just have a simple RAG pipeline. Improve the accuracy of your retrieved documents with a reranking model, and also focus a lot more on context engineering. Don’t bloat your context with a lot of information. Look into context compression, and look into eliminating things from your context if they are irrelevant; just having irrelevant things in the context increases hallucinations. A lot of investment in good engineering, combined with good retrieval, seems to be giving a lot more accurate answers, a lot less hallucination, and a lot better reasoning as well.

So that seems to be where the industry is focused right now. It would be interesting to see whether it switches back to fine-tuning, depending on how this diffusion trend plays out and how the SLM-versus-LLM cost trend plays out. I think those are some trends to keep an eye on to see where we need to switch next.
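As a toy illustration of the rerank-then-compress advice above (with simple term overlap standing in for a real reranking model), retrieved documents are scored against the query, reranked, and low-relevance ones dropped before they can bloat the context:

```python
def rerank(query, docs, keep=2, min_score=0.1):
    """Rerank retrieved docs by (toy) relevance to the query, then compress
    the context by keeping only the top scorers -- irrelevant documents are
    dropped entirely rather than passed on to the model."""
    q_terms = set(query.lower().split())
    scored = []
    for doc in docs:
        d_terms = set(doc.lower().split())
        score = len(q_terms & d_terms) / len(q_terms | d_terms)  # Jaccard overlap
        scored.append((score, doc))
    scored.sort(reverse=True)
    return [doc for score, doc in scored[:keep] if score >= min_score]

docs = [
    "return policy allows refunds within 30 days",
    "our company was founded in 1962",
    "refunds within 30 days are processed to the original payment method",
]
context = rerank("how do refunds work within the return policy", docs)
print(context)  # the irrelevant founding-date doc is filtered out
```

A production pipeline would use a cross-encoder reranker and learned context compression, but the shape is the same: retrieve broadly, rerank, and send the model only what is relevant.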


