Maxime Labonne: Edge AI and the Future of Localized Intelligence with Private, Offline LLMs

Today’s guest: Maxime Labonne

Maxime Labonne is an AI researcher, educator, and Head of Post-Training at Liquid AI, where he focuses on advancing the field of local, on-device artificial intelligence.

The following is a conversation between Alp Uguray and Maxime Labonne.

Summary

In this episode of the Masters of Automation podcast, host Alp Uguray interviews Maxime Labonne, discussing the challenges and innovations in running large language models (LLMs) on edge devices. They explore the importance of post-training techniques for enhancing small models, the future of local AI models, and the integration of AI into everyday applications. The conversation also touches on the role of context in AI performance, architectural considerations, and the dual paths of AI development. Maxime shares his journey from cybersecurity to AI, the use of AI in spam detection, and the potential of agent-to-agent communication. The episode concludes with insights on the future of AI in gaming and the importance of community in AI development.

Maxime Labonne is an AI researcher, educator, and Head of Post-Training at Liquid AI, where he focuses on advancing the field of local, on-device artificial intelligence. With a background spanning cybersecurity, optimization, and practical machine learning, Maxime is recognized for his innovative work in building efficient, specialized AI models for real-world applications. He is also passionate about education, having developed hands-on PyTorch courses to empower the next generation of AI practitioners. Through both his technical and educational efforts, Maxime is committed to making cutting-edge AI more accessible, customizable, and impactful across industries.

Takeaways

  • Running LLMs on edge devices presents challenges like latency and model quality.

  • Post-training techniques are crucial for enhancing small models' performance.

  • Local AI models can provide privacy and customization for users.

  • Agentic workflows can enhance AI's functionality in applications.

  • Context windows are vital for AI reasoning and performance.

  • Model architecture significantly impacts AI capabilities and efficiency.

  • There are two paths in AI development: AGI and interpretable models.

  • Maxime transitioned from cybersecurity to AI due to the open community.

  • AI can be effectively used in cybersecurity for spam detection.

  • Agent-to-agent communication in AI is still in its infancy.

My north star is simple: Make the model run on your phone. It’s cleaner for privacy, greener for the planet, and keeps AI from being locked away in somebody else’s data‑center.
— Maxime Labonne
There’s always a gap between the marketing and the benchmarks—reality is usually much uglier than the claims about giant context windows.
— Maxime Labonne
Despite the hot takes, RAG is still not dead; people keep using it because relevance wins. It will always outperform dumping a million tokens of noise into the context window.
— Maxime Labonne
The other end game is the exact opposite: models running locally with high interpretability. Owning the model is powerful.
— Maxime Labonne

Chapters

00:00 Introduction to Liquid AI and LLMs

02:52 Challenges of Running LLMs on Edge Devices

05:04 Post-Training and Model Specialization

07:59 The Future of Local AI Models

10:42 Context Windows and Model Limitations

13:20 The Role of Architecture in AI Models

16:05 Cybersecurity and AI Integration

18:49 The Dual Paths of AI Development

25:19 AI Agents and Communication Protocols

26:31 The Future of Agent-to-Agent Communication

28:09 Enhancing Device Communication

29:34 Specialized Models for Specific Tasks

31:45 AI in Creative Industries

35:13 Challenges in Game Development

37:50 Getting Started in AI and LLMs

43:39 The Influence of Community in AI

46:46 The Future of AI Efficiency


Transcript

Alp Uguray (00:01.486)

Hey everyone, welcome to Masters of Automation podcast. Today I have the pleasure of hosting Maxime Labonne. Maxime, welcome. Great to have you here.

Maxime Labonne (00:12.363)

Thank you. Thanks a lot. Hi, everyone. And thanks for the invitation.

Alp Uguray (00:16.046)

Yeah, of course. It's going to be a fun conversation. To get started on large language models and post-training, one question that I've been thinking hard about is running LLMs on edge devices. So can you tell us a bit about how Liquid AI works? And also, what is the most important

aspect of post-training that makes it so much more powerful and different?

Maxime Labonne (00:54.529)

Yeah, it's a good question. So running LLMs on edge devices has always been something that people try to do. I don't know if you remember all these videos about a Raspberry Pi being able to run powerful language models. But of course, in practice, when you try to really do it, you are confronted with a few challenges. The first one is inference speed. Of course, these

edge devices are not very powerful, so they will be very slow. Related to that, latency is also a big problem. If, when you enter your prompt, you have to wait for like 10 seconds to get the first token, users will not be very happy with the user experience in general. And finally, another problem that is very recurrent is that the quality of the models is very low. So in this space, I think the architecture

that we've designed at Liquid for LFM2 addresses the inference problems quite well, because it's really optimized for these use cases. And finally, post-training is mostly here to recover and improve the quality as much as possible, to squeeze all the knowledge and all the reasoning capabilities that we can into these very, very small models, because they're just like 350 million parameters,

Alp Uguray (02:05.006)

Mm-hmm.

Maxime Labonne (02:22.743)

700 million parameters, and 1.2 billion parameters. Those are very, very small models that can run pretty much everywhere, microwave included.

Alp Uguray (02:35.48)

So do you envision a future where even our microwave will be able to run LLMs and we'll be able to talk to them?

Maxime Labonne (02:45.143)

Yeah, hopefully not microwaves. But I do think that there's a future, and I see it mostly, for example, on the phone, where instead of running GPT-4o and very large language models to do very simple tasks, we can have offline, local, private, and customizable models running directly on the phone. And you can really touch their weights,

change them if you want. And that, I think, is really, really exciting. And that's the kind of future that I would like to see happening in real life.

Alp Uguray (03:25.452)

And from the perspective of training the models and preparing them to run on a machine in that state: I know this has been discussed a lot, but why are post-training, red teaming, and looking into models so important?

Maxime Labonne (03:51.681)

I think that the main thing is that usually these small models are an afterthought. For example, you have Qwen, and it really is a family of models. But you already know that the favorite kid is actually the biggest model. And then you have smaller versions of the big model, basically. And we took the opposite approach.

We try to treat these small models as first-class citizens. And that's why we're also able to squeeze so much performance out of them. So to me, this is the right approach. You need to really craft all your post-training data and efforts to target these specific small models, because they're not that smart. You cannot trust them

not to hallucinate, you cannot trust them to have a lot of pre-existing knowledge. So you need to adapt your techniques, and I can go into more detail with a concrete example if you want. But yeah, basically, you need to really follow this adaptation strategy to make sure that you're not training them the way that you would train a frontier model, because that wouldn't work.

Alp Uguray (05:10.21)

What would be one example that will make them more focused for that specific task?

Maxime Labonne (05:19.767)

So an example that we made is LFM 1B Math. This is a model that is very small, a 1B model, and we trained it specifically for math reasoning. So it's quite funny, because the model is great at math, but it's bad at everything else, right? It's only focused on math, and that's it. And the challenge was, OK, how good can it be with 1B parameters?

And the second challenge was, you know, these reasoning models tend to be extremely verbose. Especially for math, they tend to take like 32K tokens to answer a single question. And we thought, well, that's impossible. Strictly speaking, if you deploy it on an edge device, 32K tokens takes very, very long to generate. So let's try to compress this into reasoning traces that would be

max 4K tokens, right? To do that, we did a very large-scale supervised fine-tuning run, and we really adapted and selected the prompts from open-source data sets to make sure that this was a very good blend for this model, one that could really leverage its existing knowledge and add some more. In the end, we trained the model on 100 billion tokens. So it's like mid-training

in terms of volume. This is really, really big for supervised fine-tuning. And then we had this reinforcement learning stage with GRPO. And here, we really looked at the success rate of these prompts. If we ask our supervised fine-tuned model to solve a problem, what's the accuracy if we ask it to generate 10 answers? Is it

10%, is it 90%? We tried to select the samples where the success rate was not too high and not too low either, because those are the samples where the model can really learn to do a better job at answering. And this allowed us to compress the reasoning traces from 32K to 4K tokens while minimizing the performance degradation.
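
For readers who want the success-rate idea in concrete form, here is a minimal sketch of that kind of prompt filtering. The `generate` and `grade_for` callables are hypothetical stand-ins for an inference endpoint and a math answer checker, and the 20-80% band and 10 samples are illustrative choices, not Liquid AI's actual pipeline.

```python
from typing import Callable, Iterable

def success_rate(generate: Callable[[str, int], list[str]],
                 grade: Callable[[str], bool],
                 prompt: str, n: int = 10) -> float:
    """Sample n answers for one prompt and return the fraction graded correct."""
    answers = generate(prompt, n)
    return sum(grade(a) for a in answers) / n

def select_rl_prompts(generate: Callable[[str, int], list[str]],
                      grade_for: Callable[[dict], Callable[[str], bool]],
                      dataset: Iterable[dict],
                      low: float = 0.2, high: float = 0.8) -> list[dict]:
    """Keep prompts the SFT model solves sometimes but not always: near-0%
    prompts give almost no reward signal for GRPO, and near-100% prompts
    teach nothing new. The thresholds here are illustrative."""
    kept = []
    for example in dataset:
        rate = success_rate(generate, grade_for(example), example["prompt"])
        if low <= rate <= high:
            kept.append({**example, "success_rate": rate})
    return kept
```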

Alp Uguray (07:46.254)

So in a way, it's like specializing the model to say more by saying less, capturing the value of what it's trying to say in a few words, a few tokens.

Maxime Labonne (07:59.691)

Yeah, they tend to rant quite a lot when reasoning. If you look at the reasoning traces of some models, you can see them double-checking everything, or doing calculations very manually and showing every possible combination. And that's exactly what you try to remove and adapt in your model, so the traces are a lot more concise and a lot more focused, which in the end helps the model produce better answers.

Alp Uguray (08:30.35)

It is very interesting. I'm imagining a future where we have devices with the models fully hosted on the device, where I don't need to make a call to the cloud or somewhere else, so I could ask my questions locally. To what extent does that reach diminishing returns when it comes to intelligence? Are there certain questions I might ask that exceed what a model

running on a certain machine can do, where the output is not what I expected as a user? Like I wanted to make a phone call, for example, or I wanted to analyze all my docs in Google Drive and get one analysis. What is the spread today between the things that work and the things that you see working maybe two to three years down the line, in a way that also

makes it a competing case with the frontier models, which throw in compute and run everything in the cloud?

Maxime Labonne (09:40.503)

Yeah, this is a very good point. I think that right now, the situation is that if you try to use a local model on your phone, you're going to see a lot of chatbots. And chatbots are nice. Like, for example, if you're on a plane and, yeah, you're offline, this is the best solution that you can have at this moment. But I don't think that chatbots are enough. We can be a lot more creative in what we do.

Something that small models can do very well, for example, is translation. If you take a model and fine-tune it on a translation task, it can get really, really good at this, very competitive with even frontier models. And that would be a way to do it. And I think there are a lot of different tasks that you could try to do with this. And in your question, you talked about agentic AI, if I understood correctly.

And I think that for this specifically, you need to have the framework and the tooling to do it properly, which exists currently for bigger models, for cloud models. You have LangChain and other popular frameworks that allow you to create these agentic workflows. But currently with edge models, I don't think that there are

good solutions that are super focused on this. So at Liquid, we try to create our own solution to deploy edge models easily on the phone. It's called Leap, and we have an Android SDK, and we also have an iOS SDK. And the goal is to provide all the tools to app developers so they don't have to focus so much on machine learning-specific stuff.

And they can focus on what they want to build in the app itself, because this is what provides value. So I think this is currently a work in progress, basically. But at the end of the year, I can definitely see very, very small agentic models that are super specialized to do these function calls and be able to really crawl your Google Docs.

Maxime Labonne (12:01.503)

and connect it with your Google Calendar, for example. That is something that I think a lot of people would be interested in.

Alp Uguray (12:10.082)

And one thing that I see is that it takes some configuration work to create an agent, to design those agentic workflows, and then to handle the maintenance and observability around all of that. And one rhetoric is that, when the models reach maybe an unlimited context window, or trillions of tokens of context at some point,

and can do a lot of complex reasoning, is that actually realistic? At that point, would we still have to have this agentic scaffolding, making sure things work and tying it all together? How much of it is a dream that is getting sold versus the reality of actually building them?

Maxime Labonne (13:03.135)

Yeah, this is very connected to the question of whether RAG is dead, which is something that we hear about every two months, it feels like. I think, yeah, context windows are great. And I really enjoy a model like Gemini 2.5 Pro, for example, which is really, really good with this huge context window. But it has some limits. And we know that having a long context window doesn't mean that it's very effective.

Alp Uguray (13:09.134)

Yeah, it comes up a lot.

Maxime Labonne (13:32.289)

There are benchmarks related to this, such as LongBench v2 or RULER. And you see that often there is a claim and there's the reality. And the reality is often quite ugly compared to the claims. And in general, this is the problem: having a lot of information in your context window is one thing, but being able to reason over the entirety of the context window requires

very specific training and quite powerful reasoning capabilities. And we don't have that right now. Gemini 2.5 Pro is probably the model that is the best performing in this category. But if you look at even Claude 4, the models are not very good. They can only handle, I believe, 200K or something like this. So it is quite limited. They're very good within this context window.

But what I want to say is that it's probably not enough to just rely on this. And it means that we probably still need RAG, or some kind of retrieval and then injection into the context, to help the model and maybe do a bit of the heavy lifting for it, so we pre-process the data. And then it can more easily process it to answer the question.

Alp Uguray (14:52.77)

And it's like giving them a specific point in the large data set to focus on, and then doubling down on reasoning over that part of the data, right? To that end, it's...

Maxime Labonne (15:08.341)

Yeah, it will always outperform just dumping all the data in the context window. We know that; we have quite some experience with it. Despite all the claims that RAG is dead, RAG is still not dead, and people still use it. And the reason why is very simple to understand: if you give all the relevant information to the model, instead of the opposite, providing a lot of irrelevant stuff,

it helps the model. So I don't think that will change in the coming year, at least.
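
To make "retrieval and then injection" concrete, here is a minimal sketch using TF-IDF cosine similarity from scikit-learn. A production system would typically use dense embeddings and a vector store, and the prompt template is just an illustration, but the principle is the same: pre-select relevant passages instead of dumping everything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Rank documents against the query and keep only the top k."""
    vectors = TfidfVectorizer().fit_transform(docs + [query])
    scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query: str, docs: list[str], k: int = 3) -> str:
    """Inject only the retrieved passages into the model's context."""
    context = "\n\n".join(retrieve(query, docs, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```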

Alp Uguray (15:43.97)

And in terms of the context window, what would really influence getting a model to go from 200K to 2 million tokens of context? Is that just the model having more parameters and more compute power behind it? What is the parameter that will drive the next leap?

Maxime Labonne (16:07.639)

It's a very interesting topic. So there are different ways of approaching it. For example, you mentioned the number of parameters. It does play a role: it's easier to have long, and at least effective, context windows with bigger models, and it's a lot more challenging with smaller models. So there's an effect here. In terms of compute, it's true, you also need a lot of compute.

But I think this is also related to the architecture that you use and the operators that you have inside of this architecture. For example, the vanilla attention mechanism, the way that it was introduced in the original Transformer paper, is terrible in terms of scaling. So yeah, if you want to have a 1 million-token context window with the vanilla attention mechanism, it's going to cost you

really a ton of money and a ton of compute, so much that it's too expensive; it doesn't make any sense. Since then, we've patched the attention mechanism quite a few times. And now we're even improving on the architecture of the models, so processing long context windows is not as expensive as it was. For example, with LFM2, what we did is introduce specific convolution layers

inside of the architecture. So all three LFM2 models have 10 convolution layers and only six attention layers. And that really, really helps us in terms of scaling, because while a traditional transformer will use more and more VRAM with longer context, with the LFM2 architecture

it will still increase, but just not as fast. And this is very, very noticeable. And we did it because we wanted to use it on edge devices, where you don't have all the VRAM in the world; you're very constrained. So that's a good example of how to use it. So I would say that the number of parameters is very important, the compute is very important, but more importantly, the architecture is very important. And having some operators

Maxime Labonne (18:28.183)

like recurrence, or convolution, or some local attention, is very, very important if you want to really scale your context window.
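
For intuition, here is a schematic PyTorch sketch of that kind of hybrid stack. The block internals and the layer ordering are assumptions for illustration (the real LFM2 blocks are gated and more elaborate); the point is that a depthwise short convolution carries only O(kernel) state per channel, while each attention layer's KV cache grows linearly with context.

```python
import torch
import torch.nn as nn

class ShortConvBlock(nn.Module):
    """Depthwise causal convolution with a small fixed kernel: its state is
    O(kernel_size) per channel regardless of sequence length, unlike the
    KV cache of attention, which grows linearly with context."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Left-pad by kernel_size - 1 so position t never sees the future.
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, seq, dim)
        y = self.conv(x.transpose(1, 2))[..., : x.size(1)]  # trim right overhang
        return x + y.transpose(1, 2)  # residual connection

class HybridStack(nn.Module):
    """A few attention layers among many cheap conv layers, echoing the
    10-conv / 6-attention split mentioned for LFM2. Ordering and internals
    here are illustrative; attention is left unmasked for brevity."""
    def __init__(self, dim: int = 256, layout: str = "CCACCACCACCACCAA"):
        super().__init__()
        self.layout = layout  # 10 x "C" (conv) and 6 x "A" (attention)
        self.blocks = nn.ModuleList(
            ShortConvBlock(dim) if kind == "C"
            else nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            for kind in layout
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for kind, block in zip(self.layout, self.blocks):
            if kind == "C":
                x = block(x)
            else:
                out, _ = block(x, x, x, need_weights=False)
                x = x + out
        return x

# Usage: HybridStack()(torch.randn(1, 128, 256)) returns shape (1, 128, 256).
```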

Alp Uguray (18:38.7)

And that makes it more specialized for the device that it runs on, the use case, and the data that it sees as part of the context window. That makes a lot of sense. From the perspective of the future you see: one topic that I'm thinking about deeply is that some LLM

providers are tying themselves to social media, right? For example, xAI is integrated with Twitter, and now they're integrated with Kalshi, with prediction markets as well, feeding that data in to make the models smarter. OpenAI has this hardware angle coming in as well, working together with Jony Ive. What's the end game that you see?

We want all the models to be smarter and to help us be better. But usability, and how we consume them, changes a lot, from the user interface to the hardware and the human-computer interaction design. Based on your research and the work that you do, how do you see that evolving?

Maxime Labonne (20:07.371)

I think there are two different end games, right? There's one that is AGI, and that is the most popular and mainstream one. This is the one that OpenAI is going after. And I believe that the hardware will just be cloud-based, and it will still require an API. And then the model behind the API, they don't want you to know what's there. They will do some routing and just say: trust us.

No worries, we will give you the best answer, but you don't have to know what's behind it. And the other end game is the exact opposite. It's having models running locally and with very high interpretability because you know exactly what's running there. So obviously, this is more like my approach. And I think this is overall more connected to the open source community, being able to have

ownership of the model is something that is very powerful. And I don't think that these two end games and approaches are necessarily incompatible. On the contrary, I see them as very complementary because you will not be able to run AGI on a phone, at least not in the near term. So you will still need to have some cloud-based API. And on the other hand, there are a lot of applications that you just cannot have with a cloud model.

For example, if you have a car, you cannot assume that you have online connectivity at all times. So this is one of those examples where edge AI is extremely important. It's required; there's no other solution. So I think that we will see the development of these two approaches going forward. And it doesn't mean that one is better than the other. I think they're just complementary.

Alp Uguray (22:01.838)

And that would be huge. I imagine just hiking, then falling and having an injury, and trying to ask, hey, how can I get better, and it says: no connectivity, retry again. I think that would be the point of failure.

Maxime Labonne (22:20.823)

That would be great marketing material for AGI. I can see the video.

Alp Uguray (22:25.898)

Yeah, I can see that. We could generate it now as well with the products out there. In terms of your background, what drove you to this field, from your time in school through your previous research experiences?

Maxime Labonne (22:50.229)

Yeah, so I come from cybersecurity. This is what I studied at uni. And then I decided that I was willing to explore something a bit outside of my comfort zone. And AI looked very exciting. So that was in 2017, when I started my PhD. I did it in applied AI,

or more like machine learning applied to cybersecurity. And that was a great way for me to combine my experience in cybersecurity and also discover AI step by step, starting with really basic neural networks and then going forward implementing more complex solutions. 2017 was also the year that the Transformer paper was released. So that was an interesting time in NLP.

At the beginning, I really did not make the connection between my own research, which was more focused on network-based intrusion detection, so finding attacks in a computer network, and what happened in NLP. But near the end of my PhD, it became very obvious that it was actually the same problem, and that the protocols I was looking at were just another form of communication. And you could totally

represent them and learn them with a transformer architecture. So this is what drove me to abandon cybersecurity and really embrace AI, first with computer networks a bit more, that was my work at Airbus, and then outside of it, really full-blown LLMs and NLP. And this is why I wanted to leave Airbus, and I joined JP Morgan afterwards

to really focus on LLMs, because that was my favorite thing at the time. And just after that, ChatGPT was released, and suddenly everybody was very excited about LLMs.

Alp Uguray (24:57.934)

Cybersecurity is really interesting, I think, and you touched on how the problems in both fields can be very similar at times. For example, do you see LLMs also being

a pathway and an interface for a new infrastructure that maintains the cybersecurity all around us, just because of the reasoning and the compute behind them?

Maxime Labonne (25:33.301)

It's a super interesting idea, actually. Yeah. And in some way, I see exactly that. I would even say that it's not just about cybersecurity; it's about computer networks. Something I worked on at Airbus was this BERT model that was specially fine-tuned. I would say that back then it was called transfer learning; now we would just call it training, you know.

I just trained a BERT model from a checkpoint to do network protocol understanding. And having a model that can do network protocol understanding allows you to build a lot of applications on top of it. So you can do routing based on your understanding of what happens in the network. You can try to detect

attacks, you can try to classify the network flows, you can do a lot of different applications. It's really like a neural way of doing computer networks. Back then, I think it was not fast enough, basically, to do it really well. But I was really excited about it. I'm not too sure, actually, if this is now something that people do.

But yeah, that was the first BERT model fully dedicated to network protocol understanding. And that was a really great time.

Alp Uguray (27:09.41)

Yeah, because as the infrastructure gets better, I could see the spam that relies on this technology increasing over time. And there will be bad players as well as good players using the models, like cloning someone's voice and then calling their grandmother to get their money. I guess from something as basic as that

up to large scale. That's where...

Maxime Labonne (27:41.663)

Talking about spam, there was a super interesting project at JP Morgan. We wrote a paper there called Spam-T5. It was a T5 model that was specifically fine-tuned to detect spam emails. And the idea behind it is that JP Morgan has a lot of super wealthy clients, including Bill Gates, I think. And these super wealthy clients get a lot of attacks all the time. And this is really bad, but it means

they could see a lot of these attacks. And yeah, some of them are super difficult to detect. Even as a human, when you read it, you're like, OK, everything sounds good. But actually, no, there's just one thing that has changed, and that is really the attack. So yeah, that was a super fun project to do. I hope they still use it. But yeah, this T5 fine-tuned for spam detection was great.
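
Here is a minimal sketch of the text-to-text framing behind that kind of work, assuming the public t5-small checkpoint from Hugging Face. The prompt prefix, toy data, and hyperparameters are illustrative, not the actual Spam-T5 setup.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Toy examples; the real work used large labeled email corpora.
emails = [("Win a FREE cruise, click now!", "spam"),
          ("Can we move our meeting to 3pm?", "ham")]

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for _ in range(3):  # a few epochs over the toy set
    for text, label in emails:
        # Text-to-text framing: the model literally generates "spam" or "ham".
        batch = tok("classify email: " + text, return_tensors="pt",
                    truncation=True, max_length=256)
        targets = tok(label, return_tensors="pt").input_ids
        loss = model(**batch, labels=targets).loss
        loss.backward()
        optim.step()
        optim.zero_grad()

# Inference: decode the generated label.
model.eval()
query = tok("classify email: URGENT: verify your account", return_tensors="pt")
print(tok.decode(model.generate(**query, max_new_tokens=3)[0],
                 skip_special_tokens=True))
```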


Alp Uguray (28:40.428)

Yeah, I can imagine. It's everyone trying new ways to get attention or get them to click once. And then, in a way, it's AI agents speaking to AI agents: you have an AI agent spamming, and then another AI agent trying to detect if it's an AI. There's this phenomenon of A2A connection. And tying into that, Google released their A2A framework

to enable that agentic communication between two AI agents. And of course, OpenAI has their own Agents SDK, and other platforms have their own SDKs to communicate. But at the end of the day, what really matters in that communication protocol? Because everyone has their own framework.

But at the end of the day, what is the unique niche there that will actually make one special? To that end, there's a project at MIT called NANDA, where they are trying to build a decentralized network of AI agents, so you could bring your own agent and then connect it to another agent.

So from that perspective, where is it going?

Maxime Labonne (30:12.023)

I really don't know. Agent-to-agent communication hasn't been a big focus for me. And every time I tried it, it was kind of underwhelming, to be honest. I'm sorry, people are hyping it up on LinkedIn and Twitter. But honestly, I find it quite underwhelming. I'm not saying that it doesn't have legs. I think that we're just not there yet. And the main problem might not even

Alp Uguray (30:18.542)

Mm-hmm.

Maxime Labonne (30:40.267)

be the models themselves, although they're not the best in terms of function calling in general. But the tooling and the engineering that go behind it are super important for this kind of stuff. And we're just very early, right? So I wouldn't necessarily discard it. I would just say: I'm waiting. I'm waiting to be surprised. I'm waiting for a killer application of it.

But right now I don't see use cases that really benefit from it.

Alp Uguray (31:15.724)

And I thought the same thing. It's just, I think, an object that speaks to another object; there are no really complicated protocols going on in there. From the perspective of Liquid AI now: once the model works well on the device, and it could be any device,

what is next? Do you think the models on different devices will speak to each other over some protocol? Like the microwave example: my MacBook talking to my microwave to cook the food. How will models hosted on devices

scale up the intelligence of the things around us, and make it actually work this time?

Maxime Labonne (32:20.277)

Yeah, that's super interesting. Actually, I haven't really thought about that, but I'm writing it down. I think that the microwave talking to the MacBook is a great idea. I would like to see it happening in real life. Yeah, it's a good point. You could have some agent-to-agent communication with a kind of IoT framework, but I'm not sure I see the point in doing this, unless you have super specific niche cases. And yes,

Alp Uguray (32:27.938)

Ha ha ha!

Maxime Labonne (32:49.579)

the MacBook will tell the microwave to maybe do something, but honestly, my microwave is very dumb. It's not a smart microwave, so I'm not sure this would help me in real life. What we want to focus on first is really enhancing the capabilities of what these edge devices can do. So we're interested in text, but also in other modalities.

We're interested in vision. We're interested in audio. We're interested in scaling it up, too, so it can run on laptops with very good quality. Yeah, there are a lot of things that can be done. And something else that we are exploring is providing models that are already fine-tuned for specific applications. For example, you want to do function calling. If you just want to do function calling,

it's better to use a model that is just trained on this, right? It doesn't need to have chat capabilities. It doesn't need to be able to translate languages. It just needs to be very, very good at following your instructions and doing function calling. And we think that providing these checkpoints that are very specialized in one task is another way to help developers make LLM-powered applications that are truly useful,

because they reach a certain level of quality that you really need before relying too much on these models.

Alp Uguray (34:26.382)

And you would know the specific use case there and why to leverage it. Do I then have to have more of an agentic scaffolding, an agentic framework, where maybe one model speaks to another model to go and do the function calling and bring back the results? How should I think about that full framework?

Maxime Labonne (34:53.557)

Yeah, you could definitely have this kind of agentic framework, where you have a centralized model that will just route the queries to other models that are more specialized: a model for data extraction, a model for function calling, a model for RAG, for example. My problem with that is that in practice, the bitter lesson tells us that just having one bigger model is probably going to outperform this super complex framework. So usually,

it's more recommended to have one big model. But if you have one application that is very narrow, and all you want to do is RAG, for example, then a tiny RAG model is probably what you need, because it will probably outperform the bigger model for this precise task. It's just not going to outperform the bigger model on all the other tasks.
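
As a sketch of that routing idea: the functions below are stand-ins for the specialized checkpoints and the routing model, and all names here are hypothetical, not any particular product's API.

```python
from typing import Callable

# Stand-ins for small checkpoints, each fine-tuned for exactly one job.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "extraction": lambda q: f"[extraction model] {q}",
    "function_calling": lambda q: f"[function-calling model] {q}",
    "rag": lambda q: f"[RAG model] {q}",
}

def route(query: str, classify: Callable[[str], str]) -> str:
    """`classify` plays the centralized router that maps a query to a task
    name; anything it cannot place falls back to one big generalist model,
    which the bitter-lesson argument says may beat the whole ensemble."""
    handler = SPECIALISTS.get(classify(query))
    if handler is None:
        return f"[generalist model] {query}"
    return handler(query)

# Toy router standing in for a small classification model.
toy_classify = lambda q: "rag" if "document" in q.lower() else "function_calling"
print(route("Find the termination clause in this document", toy_classify))
```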

Alp Uguray (35:47.276)

Yes, there's a trade-off, things to win and lose by leveraging each of them. So when it comes to different industries right now, and this is beyond your work at Liquid or JP Morgan, we talked before about gaming as one area as well. Where do you see AI having influence, beyond the enterprise stuff that is going on?

How will it influence the world?

Maxime Labonne (36:21.375)

Yeah. So the basic examples are like your phone, consumer electronics in general; everything can embed AI. It can be a PlayStation, it can be an iPhone, it can be really everything. But yeah, as you mentioned, there are also more creative industries. And for example, having models

for game design is something I'm really excited about. Due to my past making video games, I can see the potential of it and how you could leverage the models, not in a chatbot way, but truly in a game master way. So the model would not just write dialogues for NPCs. I'm sorry, but I do not want to read those dialogues. If it's translated by AI,

thank you very much, I don't need to play a game to read that. What I would be interested in is having the model really tune the internal logic of the game to create challenges, or even review what the player does. For example, if the player inputs some kind of description of what they want to do, for example, they say, I'm going to use my sword and I'm going to

slash the dragon, blah, blah, blah. The model can then review this action, assign it a score, and say, OK, you have a good answer, this is what's going to happen. And this is what you have if you play tabletop RPG games with your game master. You have these kinds of exchanges where you kind of negotiate: can I do that? And the game master will see if it's logical, if it's consistent, and then approve or disapprove. And this is what the models

can allow you to do when used that way. I think, yeah, it's a lot more creative, and it brings a lot more value than having just another chatbot or synthetic dialogues that nobody wants to read.
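
A sketch of the game-master pattern described above, with a hypothetical `llm` completion callable and an illustrative JSON schema; a real game would validate the model's output and feed the outcome back into its own state.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are the game master. Given the world state and the
player's declared action, reply with JSON only:
{{"plausible": true or false, "score": 0-10, "outcome": "one sentence"}}

World state: {state}
Player action: {action}"""

def adjudicate(llm: Callable[[str], str], state: str, action: str) -> dict:
    """Ask the model to score the declared action; the game engine, not the
    model, decides what to apply. `llm` is any text-completion function
    (a local model endpoint would fit the edge use case discussed here)."""
    reply = llm(JUDGE_PROMPT.format(state=state, action=action))
    verdict = json.loads(reply)  # production code would validate this schema
    if verdict["plausible"] and verdict["score"] >= 5:
        return {"accepted": True, "narration": verdict["outcome"]}
    return {"accepted": False,
            "narration": "The game master rules that this cannot happen."}

# Usage: adjudicate(my_model, "Cave, sleeping dragon", "I slash the dragon")
```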

Alp Uguray (38:27.074)

Yes, like the game is evolving with you and, based on how you behave in the game, giving you new challenges. And then, even if you have a sword and you're trying to slay a dragon, the next thing you know the dragon has a shield, even if that wasn't part of the game before, and it suddenly adapts to create more difficulty.

Maxime Labonne (38:49.045)

No, absolutely. I think this is a good example. Indeed, the game can evolve. And I think as a user, as a player, you don't even need to know that it's LLM-powered at all. It's just the game, and you just play the game, having fun. It doesn't have to be tech first. It has to be gameplay first. The game has to be fun before the tech is interesting. Otherwise, I probably don't want to play it.

Alp Uguray (39:16.236)

Yeah, otherwise nobody uses it in the first place. The game is an interesting one, because I feel like there's a storyline that the directors write, and then the AI makes it more personalized based on who you are, how you play, and the experience you want to get. Then everyone would have a different version of the game,

unique to them, and that will make it very special, of course. What's the limiting factor there today? Is it more on the 3D graphical design and generation side, or is it more about having a strong, high-reasoning model to actually compute those variations that could happen, whether you slay the dragon or don't slay the dragon?

What are some things that are limiting the industry today?

Maxime Labonne (40:19.639)

I think there are three elements, pretty much the ones that I listed previously about edge AI. The first is the quality of the model. As you say, you need to have a certain level of quality for it to be useful. Otherwise, if it's unreliable, even 10% or 5% unreliable, it's probably not a very good experience for the player. Then you have

inference performance, in terms of throughput and in terms of latency: something that is reactive and fast enough to allow this real-time interaction with the game and with the model. And finally, there's tooling and engineering. That takes time. That requires a lot of iterations to converge to an optimal solution. I think we're getting there, but we're still very early. There's a lot of tooling needed in terms of how to run the model on

a lot of different platforms and ensure that it's just going to work. When you install a game on Steam, you don't have to compile llama.cpp to make it work. It just works for you, right? So you need to be able to deliver the same kind of experience to the players. And then there's tooling for the developers themselves, because video game developers in general are not machine learning people, and they cannot


Maxime Labonne (41:46.645)

The complexity of developing a game, to me, is already much higher than AI. They cannot do machine learning on top of it. So they need to be helped and guided, and to have higher-level abstractions, to be able to develop as efficiently as possible, so that AI doesn't become the main hurdle of video game development. It has enough complexity as it is.

Alp Uguray (42:12.512)

From that perspective, when I attended your lecture at MIT, I was thinking about what it would take for someone to learn the basics of fine-tuning and training a model. Maybe not training, since that needs a lot of money, but at least fine-tuning a model for a specific task, and then creating an agentic application for gaming, music, or whatever it may be

that's their passion, where they also want to bring value to the world by solving a problem. Today, most entry-level tasks are, I think, pretty automatable, right? I could spin off an agent, or use it in tandem with myself, to get to a point. What is the best way for a student who's graduating today to

pick which tasks to work on and how to get started? Is it just building, learning, understanding the nature of the models? How do you see that evolving for someone pretty young?

Maxime Labonne (43:20.223)

Yeah, I have a good recommendation. It's a bit of shameless self-promotion, but I made the LLM course on GitHub. It's a very, very popular course about large language models; it has over 57, 58K stars. And it gives you an overview of pretty much everything that you need to know, either in terms of LLM science or LLM engineering.

And if you want to dive deeper into LLM engineering, I also wrote a book, the LLM Engineer's Handbook, on that precise topic. There are a lot of resources linked in this GitHub repo. It's not just about my work; a lot of it is about curating high-quality educational resources to give a path to people who want to get into this field, so they can already map

what's there and what they need to learn. What I would recommend is having a high-level overview of the different categories. For example, you want to know what quantization is, because it's very important. You want to know what fine-tuning is. You want to know what inference optimization is about. But then you can specialize in something that resonates with you, that is useful to you.

And here you can start a project on this. You can try re-implementing stuff from scratch. This is a great way to learn when you're a bit more comfortable with these concepts. And yeah, really specialize and create your own niche, because this is what we want the most in the industry: people who are specialized enough. So you specialize, for example, in data generation. You specialize in post-training. You specialize

in evaluation, in inference optimization, in infrastructure. I think this is a great way to really map the entire field and then find what interests you the most.

Alp Uguray (45:24.971)

And in terms of post-training, what drove you to post-training?

Maxime Labonne (45:31.249)

Post-training is really connected to my early work. I talked about this BERT model that I trained; that was already post-training. I didn't know it back then, I didn't have the name, but it was already about that. To me, post-training is really nice because you can really interact with the model. And that's something that is really important to me. You can take a checkpoint and kind of make it yours. So

there are a lot of nice things that you can do with post-training since the release of Llama. This is where I really started doing fine-tuning, as we would call it right now. Before that, I did it with GPT-2 quite a lot. It wasn't that good, honestly; you could really quickly reach its limits, but it was really interesting already. One of my first projects with generative AI was in, like,

2019, I think, and it was a GPT-2 model fine-tuned to write scientific articles. And I had the dream of having this GPT-2 model help me write my own articles. It was half consistent; it was mostly hallucinated. So it was really, really funny to read. You could prompt it with

very important questions and get surreal answers, but it was not really useful in the end. So when the models got to a state where the quality was high enough for them to be truly useful, yeah, this is what excited me the most.

Alp Uguray (47:13.026)

You have that interactivity with something that's actually talking, that's there, that holds a state and gives responses. And yeah, I feel like sometimes we don't see the progression, like from GPT-2 to GPT-3, or when another slightly better model comes out, and we completely forget what it was like before. But actually looking back two, three years ago,

they were hallucinating 50% of the time and making stuff up. And now we're in a much better world, with, to your point, all the specialties that people have, and that specialization feeding into investment in model development, refinement, and release as well, which is huge. I know a lot of people ask about that:

well, how can I be relevant today and do things?

Maxime Labonne (48:18.379)

Yeah, an example of this: I had a project where we tried to automatically generate unit tests for Python and Java code, that kind of stuff. And we worked really hard fine-tuning a model. We worked really hard benchmarking against baselines, vendor solutions and such. And then ChatGPT dropped. It was much better than anything else.

No comparison possible, right? And that was the original ChatGPT, which is now really, really bad compared to everything else you can find online. So it shows that the speed of progress has been quite unreal. And even now, the models keep getting better and better, more and more efficient. And yeah, that's what really excites me in this field.

Alp Uguray (49:13.41)

It's never stagnant; it keeps going and keeps evolving. So for the last part of our conversation, I want to ask you about a few of the things that made you choose this field. For example, what is one book, or one person that you met, that

made you say: I want to take this path, and I need to change my direction toward where I am today?

Maxime Labonne (49:51.511)

It's difficult to really find one person in particular. To me, coming from cybersecurity, the thing that really surprised me with AI, and that was in 2017, was how the AI community was super open and welcoming, with a ton of great educational resources online, which is really not what you would find in cybersecurity.

People were not super welcoming, to be fair. And knowledge was also quite gated. So I would say the entire AI community, to me, was what truly drove me in this direction, because I could learn very fast. If you're motivated, you have everything that you need online to be able to do whatever you want to do. And that's really, really amazing. I think that we

forget how incredible Wikipedia is, how incredible the internet is in general. And even beyond that, a community like the AI community is really fantastic if you have this need to learn, because everything is available. And to me, that's what really motivated me to delve deeper and deeper. And yeah, this is where I am now.

Alp Uguray (51:15.342)

That's a very good point, because I remember three, four years ago, the main conversation was that you could watch anything on YouTube and learn it. Now you can watch anything on YouTube and you also have an AI assisting you to do things on top of it. There's so much accessibility that there's no excuse anymore. And yeah, maybe the only thing that is missing in the puzzle is...

Maxime Labonne (51:32.055)

True.

Alp Uguray (51:44.226)

Since the barrier to entry is really low now, everyone is welcome, everyone's helping each other out, the community is growing exponentially, and AI can do things. So perhaps the biggest thing is the passion of the person, and how creative they want to be in tackling a problem that's dear to them.

Maxime Labonne (52:14.847)

That's a very good point. I think that if you're passionate about it, you will have this drive to learn more and explore all the resources that are available, which is why it's probably important to try to find your own niche, something that you own, something that is really yours and that you feel comfortable with, which is probably not what everyone can do, right? Maybe

you realize that you don't like AI that much, and that's completely fine too. But I think it's a good point to reiterate here: it's something that you need to find for yourself, I think.

Alp Uguray (52:55.862)

And the last question: let's say we reach the point of AGI, and it runs on a cloud machine, wherever it could be, but it's very smart, and it could solve only one problem, because maybe we have only one data center left. What would be the number one problem that you would ask it, so it will go ahead and solve it,

in today's world?

Maxime Labonne (53:28.599)

How can I make it run on a phone?

Alp Uguray (53:31.662)

That will make the job easier. Yeah, absolutely.

Maxime Labonne (53:39.383)

I think in terms of efficiency, I joke about it, but it's really important. It's really important for environmental reasons. It's very important for overall efficiency reasons. And also for the privacy and governance issues with AI. We don't really want to live in a world where

this AI would indeed be just in one data center, gated by a company. So yeah, that would be my answer.

Alp Uguray (54:16.686)

I would definitely agree. I feel like if that problem is solved, maybe thousands of other problems are solved with it. Thank you very much once again. It was great to have you, and it was a great chat.

Maxime Labonne (54:27.031)

Hopefully.

Maxime Labonne (54:36.353)

Thanks a lot. Yeah, really enjoyed it. Thank you for the invitation.

Alp Uguray (54:39.724)

Yeah, likewise. I'll stop the recording.

Founder, Alp Uguray

Alp Uguray is a technologist and advisor, a 5x UiPath Most Valuable Professional (MVP) Award recipient, and a globally recognized expert on intelligent automation, AI (artificial intelligence), RPA, process mining, and enterprise digital transformation.

https://themasters.ai
Next

Stephen Wolfram: Computation, AGI, Language & the Future of Reasoning.