Why Voice Will Be the Fundamental Interface for Tech ft ElevenLabs’ Mati Staniszewski

Mati Staniszewski, co-founder and CEO of ElevenLabs, explains how staying laser-focused on audio innovation has allowed his company to thrive despite the push into multimodality from foundation models. From a high school friendship in Poland to building one...

Published: Jul 1, 2025
Uploaded: Jun 11, 2026
File type: YouTube
Source: youtube.com

Full transcript

AI-generated transcript with timestamped sections.

Title: Why Voice Will Be the Fundamental Interface for Tech ft ElevenLabs’ Mati Staniszewski
Platform: youtube
Creator: Sequoia Capital
Source: v=EWXTZZzL1vg
Transcript source: deepinfra

Transcript:

Late 2021, the inspiration came from... Piotr was about to watch a movie with his girlfriend. She didn't speak English, so they turned it on in Polish. And that kind of brought us back to something we grew up with, where every movie you watch in Polish, every foreign movie you watch in Polish, has all the voices, whether it's a male voice or a female voice, still narrated by one single character. Like a monotonous narration. It's a horrible experience. And it still happens today. And it was like, wow, we think this will change.

Today we're talking with Mati Staniszewski from ElevenLabs about how they've carved out a defensible position in AI audio, even as the big foundation model labs expand into voice as part of their push to multimodality. We dig into the technical differences between building voice AI versus text. It turns out they're surprisingly different in terms of the data and the architectures.

Mati walks us through how ElevenLabs has stayed competitive by focusing narrowly on audio, including some of the specific engineering hurdles they've had to overcome, and what enterprise customers actually care about beyond the benchmarks. We also explore the future of voice as an interface, the challenges of building AI agents that can handle real conversations, and AI's potential to break language barriers. Mati shares his thoughts on building a company in Europe and why he thinks we might hit human-level voice interactions sooner than expected. We hope you enjoy the show.

Mati, welcome to the show.

Thank you for having me.

All right, first question. There was a school of thought a few years ago, when ElevenLabs really started ripping, that you guys were going to be roadkill for the foundation models. And yet here you are, still doing pretty well. What happened? How were you able to stave off the multimodal, big foundation model labs and carve out this really interesting position for yourselves?

It's been an exciting last few years, and it's definitely true that we still need to stay on our toes to keep winning the fight against the foundation models.

But I think the usual and definitely true advice is staying focused, and staying focused in our case on audio, both as a company and, of course, in the research and the product. We ultimately stayed focused on audio, which really helped. But probably the biggest thing is that, through the years, we've been able to build some of the best research models and outcompete the big labs. And here, credit to my co-founder, Piotr, who I think is a genius, who has been able both to do some of the first innovations in the space and then to assemble the rockstar team that we have today at the company, which is continually pushing what's possible in audio.

When we started, there was very little research done in audio. Most people focused on LLMs, some people focused on image, where it's a lot easier to see the results and frequently more exciting for people doing research to work in those fields. So there was a lot less focus put onto audio. And the set of innovations that happened in the years prior, the diffusion models, the transformer models, weren't really applied to that domain in an efficient way. And we've been able to bring that in, in those first years, where for the first time the text-to-speech models were able to understand the context of the text

and deliver that audio experience with much better tonality and emotion. So that was the starting point that really differentiated our work from other work, which was the true research innovation. But then, fast following after that first piece, was building all the product around it to be able to actually use that research. We've seen so many times that it's not only the model that matters, it also matters how you deliver that experience to the user. And in our case, that means whether it's narrating and creating audiobooks, whether it's voiceovers, whether it's turning movies into other languages, whether it's adding text to speech in agents or building the entire conversational experience.

That layer keeps helping us to win against the foundation models and hyperscalers.

Okay, there's a lot here, and we're going to come back and dig in on a bunch of aspects of that. But you mentioned your co-founder, Piotr. I believe you guys met in high school in Poland. Is that right? Can you tell us the origin story of how you two got to know each other, and then maybe the origin story of how this business came together?

We started in an IB class in Poland, in Warsaw, and took all the same classes.

So, kind of everything, and we hit it off pretty quickly in some of the mathematics classes. We both love mathematics. So we started sitting together, spending a lot of time together, and that kind of morphed into time together outside of school as well. And then over the years, we kind of did it all, from living together to studying together, working together, traveling together, and now, 15 years in, we are still best friends. The time is on our side, which is helpful.

Has building a company together strengthened the relationship?

There are ups and downs for sure, but I think it did.

I think it did.

Good. It's battle-tested it.

Definitely battle-tested it. When the company started taking off, it was hard to know how long this stretch of intense work would last. Initially, it was like, okay, this is the next four weeks. We just need to push, trust each other that we'll do well on different aspects, and just continue pushing. And then there's another four weeks, another four weeks. And then we realized, actually, this is going to be the next 10 years, and there was just no real time for anything else.

We were just doing ElevenLabs and nothing else. And over time, and I think this happened organically, but looking back at it, it definitely helped, we now try to still stay in close touch about what's happening in our personal lives, where we are in the world, and spend some time together, still speaking about work, but outside of the work context. And I think this was very healthy. I've known Piotr for so long and I've seen him evolve personally through those years, and I can still stay in close touch with that as well.

It's important to make sure that your co-founder and your executives and your team are able to bring their best selves to work and aren't just completely ignoring everything that's happening on the personal front.

Exactly. Yeah. And then to your second question, about where the inspiration for ElevenLabs came from, maybe the longer story. There are two parts. First, for years, when he was at Google and I was at Palantir, we would do hack weekend projects, trying to explore new technology for fun. And that was everything, starting from building recommendation algorithms.

So we tried to build this model where you would be presented with a few different things, and if you selected one of those, the next set of things you were presented with got closer, optimized closer, to your previous selection. We deployed it, a lot of fun. Then we did the same with crypto. We tried to understand the risk in crypto and build a kind of risk analyzer for crypto. It didn't fully work, but it was a good attempt, during one of the first crypto hypes, to try to provide the analytics around it.

And then we created a project in audio. So we created a project which analyzed how we speak and gave you tips on how to...

When was this?

Early 2021.

Okay.

Early 2021. That was kind of the first opening: this is what's possible across the audio space, this is the state of the art, these are the models that do diarization and understanding of speech, this is what speech generation looks like. And then in late 2021, the inspiration, more of the aha moment, came from Poland, where we're from. In this case, Piotr was about to watch a movie with his girlfriend.

She didn't speak English, so they turned it on in Polish. And that kind of brought us back to something we grew up with, where every movie you watch in Polish, every foreign movie you watch in Polish, has all the voices, whether it's a male voice or a female voice, still narrated by one single character, like a monotonous narration. It's a horrible experience. And it still happens today. And it was like, wow, we think this will change. We think that technology, and what will happen with some of the innovations, will allow us to enjoy that content in the original delivery, in the original incredible voice. Let's make it happen and change it.

Of course, it has expanded since then. It's not only dubbing; we realized the same problem exists across most content, which is not accessible in audio, or only in English, and across how dynamic interactions will evolve and, of course, how audio will transcend the language barrier too.

Was there any particular paper or capability that you saw that made you think, okay, now is the time for this to change?

Well, "Attention Is All You Need" is definitely one, which was so crisp and clear in terms of what's possible. But maybe to give a different angle to the answer.

I think the interesting piece was less the paper. There was this incredible open source repo. That was slightly later, as we started discovering whether it was even possible. It was Tortoise TTS, effectively, an open source model that was created around that time. It provided incredible results in replicating a voice and generating speech. It wasn't very stable, but it gave a glimpse of, wow, this is incredible. And that was already as we were deeper into the company, maybe a first year in, so in 2022. But that was another element of, okay, this is possible, some great ideas there.

And then, of course, we've spent most of our time on what other things we can innovate on, starting from scratch, bringing transformers and diffusion into the audio space. And that kind of yielded just another level of human quality, where you could actually feel like it's a human voice.

Yeah, let's talk a bit about how you've actually built what you've built as far as the product goes. What aspects of what works in text port directly over to audio, and what's completely different, a different skill set, different techniques? I'm curious how similar the two are and where some of the real differences are.

The first thing is that there are those three components that come into the model. There's the compute, there's the data, there's the model architecture. The model architecture shares some ideas, but it's very different. The data is also quite different, both in terms of what's accessible and how you need that data to be prepared to train the models. And on compute, the models are smaller, so you don't need as much compute, which, given that a lot of the innovations need to happen on the model side or the data side, means you can still outcompete the foundation models.

So you don't have the big compute disadvantage.

Exactly, yeah. But the data was, I think, the first piece which is different. In text, you can reliably take the text that exists and it will work.

In audio, first of all, there's much less high-quality audio that would actually get you the result you need. Second, it frequently doesn't come with a transcription, or with a highly accurate text of what was spoken. That was lacking in the space, and it's where you need to spend a lot of time. And then there's a third component, something that will be coming across in the current generation of models, which is not only what was said, so the transcript of the audio, but also how it was said.

What emotions did you use? Who said it? What are some of the non-verbal elements? That kind of data almost doesn't exist, especially at high quality. And that's where you need to spend a lot of time, with an additional set of manual labelers to do that work. And that's very different from text, where you just need to spend a lot more cycles. And then at the model level, in the first generation of text-to-speech models you effectively have this step of understanding the context and bringing that into the emotion.

But of course, you need to predict the next sound rather than the next text token. And that depends on what came before, but it can also depend on what comes after. An easy example is "What a wonderful day." Let's say it's a passage of a book. Then you think, okay, this is a positive emotion, I should read it in a positive way. But if you have "'What a wonderful day,' I said sarcastically," then suddenly it changes the entire meaning, and you need to adjust that in the audio delivery as well, put the punchline in a different spot.
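The "what a wonderful day" example can be sketched as a toy function. This is only an illustration under stated assumptions: the name `pick_delivery` and the cue-word list are hypothetical, and a real text-to-speech model learns such cues from data rather than from a hand-written rule. The point it shows is that the delivery of a quote can depend on text that comes after it.

```python
import re

# Hypothetical cue words; a real model would learn these from data, not a list.
SARCASM_CUES = {"sarcastically", "dryly", "bitterly"}


def pick_delivery(quote: str, following_text: str) -> str:
    """Choose a delivery style for `quote`, looking at the text that follows it."""
    following_words = set(re.findall(r"[a-z]+", following_text.lower()))
    if following_words & SARCASM_CUES:
        # Later context overrides the surface reading of the quote.
        return "sarcastic"
    if "wonderful" in quote.lower() or quote.rstrip().endswith("!"):
        return "positive"
    return "neutral"


print(pick_delivery("What a wonderful day", ""))
print(pick_delivery("What a wonderful day", "I said sarcastically."))
```

Read on its own the quote gets a positive delivery; with the trailing "I said sarcastically" the same words get a sarcastic one.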

So that was definitely different, where that contextual understanding was a tricky thing. That's the text-to-speech element, but then you also have the voice element. The other innovation that we spent a lot of time working on is how you can create and represent voices in a way that is more accurate to the original. And we found this encoding and decoding approach, which was slightly different from the rest of the space. We weren't hard-coding or predicting any specific features. So we weren't trying to optimize for whether the voice is male or female, or what the age of the voice is.

Instead, we effectively let the model decide what the characteristics should be, and then found a way to bring that into the speech. So now, of course, when you have the text-to-speech model, it will take the context of the text as one input and the voice as a second input. And based on the voice delivery, whether it's more calm or dynamic, both of those will merge together and give the end output, which was, of course, a very different type of work from the text models.

Amazing. What sort of people have you needed to hire to be able to build this? I imagine it's a different skill set than most AI companies.

It kind of changed over time, but I think the first difference, and this is probably less a skill set difference and more an approach difference, is that we started fully remote. We wanted to hire the best researchers wherever they are. We knew where they are. There are probably like 50 of them that we admire, based at least on the open source work, or the papers that they release, or the companies that they worked in.

So the top of the funnel is pretty limited, because so many fewer people worked on this research. So we decided, let's attract them and get them into the company wherever they are. And that really helped. The second thing was that, both because we want to make it exciting for people to work here and because we think this is the best way to run a lot of the research, we try to keep the researchers extremely close to deployment, to actually seeing the results of their work.

So the cycle from being able to research something to bringing it in front of all the people is super short.

Yeah.

And you get that immediate feedback on how it is working. And then, separate from research, we have research engineers who focus less on the innovation of entirely new model architectures and more on taking existing models, improving them, changing them, deploying them at scale. And here, I've frequently seen other companies call the equivalent of our research engineers researchers, given that the work would be as complex in those companies.

But that really helped us to create a new innovation, bring that innovation, extend it and deploy it. And then the layer around the research that we've created is probably very different, where we effectively now have a group of voice coaches, and data labelers who are trained by the voice coaches to understand the audio data, how to label it, how to label the emotions. Their labels then get reviewed by the voice coaches, whether they're good or bad, because most of the traditional labeling companies didn't really support audio labeling in that same way.
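The labeling workflow described here (what was said, who said it, how it was said, non-verbal elements, then a review pass by a voice coach) can be sketched as a small record type. This is a hypothetical schema for illustration only: the class name `AudioLabel`, its field names, and `training_ready` are assumptions, not ElevenLabs' actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AudioLabel:
    """Hypothetical label record for one audio segment."""
    transcript: str                                      # what was said
    speaker_id: str                                      # who said it
    emotion: str                                         # how it was said
    nonverbal: List[str] = field(default_factory=list)   # e.g. laughs, sighs
    coach_approved: Optional[bool] = None                # None until reviewed


def training_ready(labels: List[AudioLabel]) -> List[AudioLabel]:
    """Keep only labels that a reviewer has explicitly approved."""
    return [label for label in labels if label.coach_approved is True]


labels = [
    AudioLabel("What a wonderful day", "spk_01", "sarcastic", ["sigh"], coach_approved=True),
    AudioLabel("Hello there", "spk_02", "calm"),                    # not yet reviewed
    AudioLabel("Stop!", "spk_01", "angry", coach_approved=False),   # rejected in review
]
print(len(training_ready(labels)))
```

Keeping the review state as a three-valued field (approved, rejected, not yet reviewed) makes it explicit that unreviewed labels are held back rather than silently treated as good.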

But I think the biggest difference is that you needed to be excited about some part of the audio work to really be able to create and dedicate yourself at the level we want. And, especially as a small company at the time, you would need to be willing to embrace that independence, that high ownership that it takes, where you are effectively working on a specific research theme yourself. Of course, there's some interaction, some guidance from others, but a lot of the heavy lifting in creating that work is individual, which takes a different mindset.

And I think we've been able to do that. Now we have a team of almost 15 researchers and research engineers, and they are incredible.

What have been some of the major step-function changes in the quality of the product, or the applicability of the product, over the last few years? I remember it was early 2023-ish when you guys started to explode, or maybe late 2023, I forget. And it seemed like some of it was on the heels of the Harry Potter Balenciaga video that went viral, where it was an ElevenLabs voice that was doing it.

It seems like you've had these moments in the consumer world where something goes viral and it traces back to you. But beyond that, from a product standpoint, what have been the major inflection points that have opened up new markets or spurred more developer enthusiasm?

What you mentioned is probably one of the key things we are trying to do. Even now, we see this as one of the key things to really get adoption out there: have the prosumer deployment, actually bringing new technology to everyone out there when we create it, showing the world that it's possible, and then supplementing that top down, bringing it to the specific companies we work with.

And the reason for this is twofold. One is that these groups of people are just so much more eager and quick to adopt and create with that technology. And the second is that, frequently, when we create a lot of both the product and the research work, we have, of course, some predictions about what might be created, but there are just so many more uses that we wouldn't expect, like the example you gave, which wouldn't have come to our mind as something that people might be creating and trying to do. And that's why, even now, when we create new models, we try to bring them to the entirety of the user base, learn from them and build on that.

And it goes in those waves where we have a new model release, we bring it broad, then the prosumer adoption is there, and then the enterprise adoption follows, with the additional product and additional reliability that need to happen. And then once again we have a new step-function release, and the cycle repeats. So we tried to really embrace it. Through the history, the very first one was when we had our beta model. So you're right, we released it publicly in early 2023; in late 2022, we were iterating in the beta with a subset of users.

And we had a lot of book authors in that subset. And we had literally a small text box in our product where you could input the text and get the speech out. It was tweet length, effectively. And we had one of those book authors copy-paste his entire book into this box and download it. At the time, most of the platforms banned AI content. He managed to upload it. They thought it was human. And he started getting great reviews on that platform and then came back to us with a set of his friends and other book authors saying, hey, we really need it.

This is incredible. And that kind of triggered this first mini virality moment, with book authors being very, very keen. Then we had another similar moment around the same period, when ours was one of the first models that could laugh. We released a blog post about the first AI that can laugh. And people picked it up like, wow, this is incredible, and we got a lot of the early users. Then, of course, there's the theme that you mentioned, which was a lot of the creators, and I think there's a completely new trend that started around this time, where it shifted into no-face channels.

Effectively, you don't have the creator in the frame, and then you have the narration of that creator over something that's happening. And that started spreading like wildfire in the first six months of the work, where, of course, we were providing the narration and the speech and the voices for a lot of those use cases, and that was great to see. Then in late 2023, early 2024, we released our work in most other languages. That was one of the first moments where you could really create narration across the other most popular European languages, along with our dubbing product.

So that's kind of back to the original vision. We finally created a way for you to take audio and bring it to another language while still sounding the same. And that triggered this other small virality moment of people creating videos. There were the expected ones, which are just the traditional content, but also unexpected ones, where we had someone trying to dub singing videos, which we didn't know the model would work on. And it kind of didn't work, but it gave you a drunken-singing result.

So then it went viral a few times for that result too, which was fun to see. And then in early 2025, and we're still seeing it now, everybody is creating an agent. We started adding voice to a lot of those agents. And it became very easy for a lot of people to have the entire orchestration, speech to text, the LLM responses, text to speech, and to make it seamless. And we now have a few use cases which started getting a lot of traction, a lot of adoption, most recently.

We worked with Epic Games to recreate the voice of Darth Vader.

I saw that.

There are just so many players using it and trying to have a conversation with Darth Vader in Fortnite, which is just immense, immense scale. And of course, most of the users are trying to have a great conversation, use him as a companion in the game. Some people are trying to stretch whether he will say something that he shouldn't be able to say. So you see all those attempts as well. But luckily the product is holding up, and it's actually keeping it both relatively

performant and safe, keeping him on the rails. Thinking about some of the dubbing use cases, one of the viral ones was when we worked with Lex Fridman and he interviewed Prime Minister Narendra Modi. It was a conversation in which Lex spoke English and Narendra Modi spoke Hindi. And we turned the conversation into English, so you could actually listen to both of them speaking together. And then similarly, we turned both of them into Hindi, so you heard Lex speaking Hindi. US people were watching the English version.

So that's a nice way of tying it back to the beginning. But I think, especially as you think about the future, seeing agents pop up in new ways is going to be so frequent, from early developers building everything from Stripe integrations that can process refunds, through to the companion use cases, all the way through to the true enterprise. There are probably a few viral moments ahead.

Yeah, say more about what you're seeing in voice agents right now.

It seems like that's quickly become a pretty popular interaction pattern. What's working? What's not working? Where are your customers really having success? Where are some of your customers getting stuck?

Before I answer, maybe a question back to you. Do you see a lot more companies building agents among the companies that are coming through?

Yeah, we absolutely do. And I think most people have this long-term vision that it's sort of an agent-style avatar powered by an ElevenLabs voice, where it's this human-like agent that you're interacting with.

And I think most people start with simpler modalities and work their way up. So we see a lot of text-based agents proliferating throughout the enterprise stack. I imagine there are lots of consumer applications for that as well, but we tend to see a lot of the enterprise stuff.

It's definitely similar to what we are seeing, both with the new startups being created, where everybody is building an agent, and on the enterprise side, too, where it's so helpful for internal processes. And taking a step back, what we have thought and believed from the start is that voice will fundamentally be the interface for interacting with technology.

It's probably the modality we've known the longest; since humanity began, it was the first way that humans interacted. And it carries just so much more than text does. It carries the emotions, the intonation, the imperfections. We can understand each other. We can, based on the emotional cues, respond in very different ways. So that's where our start happened: we think voice will be that interface, and we build not just the text-to-speech element. Seeing our clients try to use text-to-speech to build the whole conversational application, we asked, can we provide them a solution that helps them abstract this away?

And we've seen it in the traditional domains. To name a few: in the healthcare space, we've seen people try to automate some of the work they otherwise cannot get to. With nurses, as an example, a company like Hippocratic will automate the calls that nurses need to make to patients to remind them about taking medicine, ask how they are feeling, and capture that information back, so that doctors can then process it in a much more efficient way. And voice became critical there, because a lot of those people cannot be reached otherwise,

and the voice call is just the easiest thing to do. Then, very traditional and probably the quickest-moving one, is customer support. So many companies, both from the call center side and traditional customer support, are trying to build voice internally, whether it's companies like Deutsche Telekom or all the way through to the new companies. Everybody is trying to find a way to deliver a better experience, and now voice is possible. And then what is probably one of the most exciting for me is education: could you be learning through having that voice delivery in a new way? I used to be a chess player, or at least an amateur chess player, and we work with

Chess.com, where you can... I don't know if you are a user of Chess.com.

I am, but I'm a very bad chess player.

Okay. So that's a great cue. One of the things we are trying to build is effectively a narration which guides you through the game, so you can learn how to play better. And there's a version of that where hopefully you will be able to work with some of the iconic chess players, where you can have the delivery from Magnus Carlsen or Garry Kasparov or Hikaru Nakamura guide you through the game, and get even better while you play it.

Which would be phenomenal, and I think this will be one of the common things we'll see, where everybody will have their personal tutor for the subject that they want, with a voice that they relate to and can get closer to. And that's on the enterprise side, but on the consumer side, too, we've seen completely new ways of augmenting the way you can deliver content, like the work with Time magazine, where you can read the article, you can listen to the article, but you can also speak to the article.

So it worked effectively during the Person of the Year release, where you could ask questions about how they became Person of the Year, or say "Tell me more about other People of the Year," and dive into that a little bit deeper. And then we as a company every so often try to build an agent that people can interact with, to see the art of the possible. Most recently, working with his family, we've created an agent for my favorite physicist, or one of the two, Richard Feynman, where you can...

He actually is my favorite too.

Okay, great. I mean, he is amazing, such an amazing way to deliver the knowledge in an educational, simple and humorous way. The way he speaks is amazing and the way he writes is amazing. So that was amazing, and I think this will change things, where maybe in the future you will have his Caltech lectures or one of his books, where you can listen to it in his voice and then dive into some of his background and understand that a bit better.

Like "Surely You're Joking, Mr. Feynman!" and dive into this.

I would love to hear a reading of that book in his voice. That'd be amazing.

Yeah, 100%.

For some of the enterprise applications, or maybe the consumer applications as well, it seems like there are a lot of situations where the interface might be the enabler, but it's not the bottleneck. The bottleneck is the underlying business logic or the underlying context that's required to actually have the right sort of conversation with your customer or whoever the user is.

How often do you run into that? What's your sense for where those bottlenecks are getting removed, and where they might still be a little bit sticky at the moment?

The benefit of us working so closely with a lot of companies, where we bring our engineers to work directly with them, is that it frequently results in us diving in and seeing some of the common bottlenecks. When you think about a conversational AI stack, you have the speech-to-text element of understanding what you say, you have the LLM piece of generating the response, and then text-to-speech to narrate it back.

And then you have the entire turn-taking model to deliver that experience in a good way. But really, that's just the enabler. Then, like you said, to be able to deliver the right response, you need the knowledge base, the business information about how you want to actually generate that response and what's relevant in a specific context. And then you need the functions and integrations to trigger the right set of actions. And in our case, we've built that stack so that companies can bring that knowledge base relatively easily, have access to RAG if they want to enable it, are able to do that on the fly if they need to, and then, of course, build the functions around it.
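The stack described above (speech-to-text, knowledge-base retrieval, an LLM that can trigger functions, then text-to-speech) can be sketched as a single turn handler. Everything below is a stand-in stub with hypothetical names (`speech_to_text`, `retrieve`, `generate_reply`, `text_to_speech`, `handle_turn`); no real speech, LLM or ElevenLabs API is called, retrieval is naive keyword overlap rather than real RAG, and turn-taking is left out entirely.

```python
from typing import Callable, Dict, List


def speech_to_text(audio: bytes) -> str:
    """Stub STT: pretend the audio bytes already contain the spoken words."""
    return audio.decode("utf-8")


def retrieve(query: str, knowledge_base: List[str]) -> List[str]:
    """Naive keyword-overlap retrieval standing in for RAG."""
    terms = set(query.lower().split())
    return [doc for doc in knowledge_base if terms & set(doc.lower().split())]


def generate_reply(query: str, context: List[str],
                   tools: Dict[str, Callable[[], str]]) -> str:
    """Stub LLM: trigger a tool if its name is mentioned, else answer from context."""
    for name, tool in tools.items():
        if name in query.lower():
            return tool()  # function/integration call, e.g. into a CRM
    return context[0] if context else "Sorry, I don't know."


def text_to_speech(text: str) -> bytes:
    """Stub TTS: return the reply text as bytes instead of real audio."""
    return text.encode("utf-8")


def handle_turn(audio: bytes, knowledge_base: List[str],
                tools: Dict[str, Callable[[], str]]) -> bytes:
    """One conversational turn: STT -> retrieval -> LLM (+ tools) -> TTS."""
    query = speech_to_text(audio)
    context = retrieve(query, knowledge_base)
    reply = generate_reply(query, context, tools)
    return text_to_speech(reply)


kb = ["Our store hours are 9am to 5pm."]
tools = {"refund": lambda: "Your refund has been started."}
print(handle_turn(b"I want a refund", kb, tools))
print(handle_turn(b"what are your hours", kb, tools))
```

The first call routes to the hypothetical `refund` tool; the second is answered from the knowledge base. Passing the knowledge base and tools in as arguments mirrors the point in the conversation that the voice layer is the enabler, while the business knowledge and integrations decide what actually gets said and done.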

And a very common theme we definitely come across is that the deeper into the enterprise you go, the more important integrations become, whether it's simple things like Twilio or SIP trunking to make the phone call, or whether it's connecting to their CRM system of choice, or working with the past or current providers where those companies are deployed, like Genesys. That's definitely a common theme, and it's probably what takes the most time: how do you have the entire suite of integrations that works reliably, so that a business can easily connect it to their logic?

In our case, of course, this is increasing, and every next company we work with already benefits from a lot of the integrations that were built. So that's probably the most frequent one, the integrations themselves. The knowledge base isn't as big of an issue, but that depends on the company. We've seen it all in terms of how well organized the knowledge is inside a company. If it's a company that has already been spending a lot of effort on digitizing and creating some version of a source of truth for where that information lies and how it lies, it's relatively easy to onboard them.

And then as we go to a more complex one, and I don't know if I can mention anyone, it can get pretty gnarly. And then we work with them on, okay, that's what we need to do as the first step, so that they can provide it to us in an easy, standard way. Well, and you mentioned Anthropic. One of the things that you plug into is the foundation models themselves. And I imagine there's a bit of a coopetition dynamic where sometimes you're competing with their voice functionality, and sometimes you're working with them to provide a solution for a customer.

How do you manage that? I imagine there are a bunch of founders listening who are in similar positions, where they work with foundation models but they also kind of compete with foundation models. I'm just curious, how do you manage that? I think the main thing that we've realized is that most of them are complementary to work like conversational AI. And we're trying to stay agnostic rather than using one provider. I think the main thing that's true, and that happened especially over the last year, now that I think about it, is that we are not trying to rely on only one.

We are trying to have many of them together by default. And that goes to two things. One: what if they develop into being closer competition, where maybe they won't be able to provide the service to us, or their service becomes too blurry? We, of course, are not sending any of the data back to them, but could that be a concern in the future? So that's one piece. The second piece is that when you develop a product like conversational AI, which allows you to deploy your voice AI agent, all our customers will have a different preference for which LLM to use. And frequently, you want this cascading mechanism: if one LLM isn't working at a given time, you fall through to a second or third layer of support that still performs pretty well.
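The cascading mechanism described here can be sketched as an ordered list of providers that is tried in turn until one succeeds. This is only an illustration under stated assumptions: `generate_with_fallback` and the two stub providers are hypothetical names, each provider is modeled as a plain function from prompt to reply, and any raised exception is treated as "this LLM isn't working right now".

```python
from typing import Callable, List, Sequence, Tuple

# A provider is modeled as a function from prompt to reply (an assumption).
LLMCall = Callable[[str], str]


def generate_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, LLMCall]]
) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, reply) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves us to the next layer
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    # Stub standing in for a provider that is down at the moment.
    raise TimeoutError("primary timed out")


def steady_secondary(prompt: str) -> str:
    # Stub standing in for the second layer of support.
    return f"secondary reply to: {prompt}"


used, reply = generate_with_fallback(
    "hello",
    [("primary", flaky_primary), ("secondary", steady_secondary)],
)
print(used, "->", reply)
```

Because the order of the list is just data, each customer's preferred LLM can go first while the remaining providers act as the second and third layers.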

And we've seen this work extremely successfully. So to a large extent, we treat them as partners, and we're happy to be partners with many of them. That'll be good competition too. Let me ask you about the product. What do your customers care the most about? One sort of meme over the last year or so has been that people who keep touting benchmarks are kind of missing the point. There are a lot of things beyond the benchmarks that customers really care about. What is it your customers really care about? That's very true on the benchmark side, especially in audio. Our customers care about three things. The first is quality: how expressive it is, both in English and in other languages.

And that's probably the top one. If you don't have quality, everything else doesn't matter. Of course, the threshold of quality will depend on the use case. It's a different threshold for narration, for delivery in the agentic space, and for dubbing. The second one is latency: if the latency isn't good enough, the interaction doesn't work. But that's where the interesting combination happens in between: what's the quality versus latency benchmark that you have? And then the third one, which is especially important at scale, is reliability. Can I deploy at scale, like in the Epic Games example, where millions of players are interacting with it and the system holds up? It's still performant and still works extremely well.

And time and time again, we've seen that being able to scale and reliably deliver that infrastructure is critical. Can I ask you, how far do you think we are from highly or fully reliable, human or superhuman quality, effectively zero latency voice interaction? And maybe the related question is, how does the nature of the engineering challenges you face change as we get closer to, and inevitably surpass, that sort of threshold? Ideally, we would love to prove that it's possible this year: to cross the Turing test of speaking with an agent, where you would just say, this is like speaking with another human.

I think it's a very ambitious goal, but I think it's possible. If not this year, then hopefully early in 2026. But I think we can do it. You probably have different groups of users too, where some people will be very attuned and it will be much harder to pass the Turing test for them. But for the majority of people, I hope we are able to get to that level this year. I think the biggest question, and that's where the timeline is a little bit more dependent, is whether it will be the model that we have today, which is a cascading model where you have speech-to-text, LLM, and text-to-speech.

So, three separate pieces that can be performant. Or do you have the omni model, where you train them together, truly duplex style, where that delivery is much better? And that's effectively what we are trying to assess. We are doing both. The one in production now is the cascading model. The one we will deploy soon will be a truly duplex model. And I think the main thing that you will see is the reliability versus expressivity trade-off. I think we can get latency pretty good on both sides, but there might also be some latency trade-off, where the true duplex model will always be quicker and a little bit more expressive but less reliable, and the cascaded model is definitely more reliable.

It can be extremely expressive, but maybe not as contextually responsive, and then latency will be a little bit harder. So that will be a huge engineering challenge. I think no company has been able to do it well, to fuse the modality of LLMs with audio well. So I hope we'll be the first one, which is the big internal goal. We've seen the OpenAI work and the Meta work, which are dabbling in there. I don't think they've passed the Turing test yet. So hopefully we'll be the first. Awesome. And then you mentioned earlier that you think of, and have thought of, voice as a new default interaction mode for a lot of technology.

Can you paint that picture a little bit more? Let's say we're five or 10 years down the road. How do you imagine the way people live with technology, the way people interact with technology, changes as a result of your model getting so good? I think there will be this beautiful part where technology will go into the background, so you can really focus on learning and on human interaction, and you will have it accessible through voice rather than through the screen.

I think the first piece will be education. I think there will be an entire change where all of us will have a guiding voice, whether we are learning mathematics and going through the notes, or trying to learn a new language and interacting with a native speaker who guides us through how to pronounce things. And I think this will be the first theme: in the next five to 10 years, it will be the default that you have voice agents to help you through that learning.

The second thing, which will be interesting, is how this affects the whole cultural exchange around the world. I think you will be able to go to another country and interact with another person while still carrying your own voice, your own emotion and intonation, and the person can understand you. There will be an interesting question of how that technology is delivered. Is it the headphone? Is it Neuralink? Is it another technology? But it will happen. And I think we hopefully can make it happen. If you've read The Hitchhiker's Guide to the Galaxy, there's this concept of the Babel fish.

I think the Babel fish will be there, and the technology will make it possible. So that'll be a second huge theme. And then generally, we've spoken about this personal tutor example, but I think there will be another set of assistants and agents that all of us have, that can be sent to perform tasks on our behalf. And to perform a lot of those tasks, you will need voice, whether it's booking a restaurant, or jumping into a specific meeting to take notes and summarize them in the style that you need, or calling customer support and having the customer support agent respond.

So that'll be an interesting theme of agent-to-agent interaction: how it's authenticated, and how you know whether it's real or not. So, generally: how we learn things will be so dependent on that, the universal translator piece will have voice at the forefront, and the general services around our lives will be so crucially voice driven. Very cool. And you mentioned authentication. I was going to ask you about that. One of the fears that always comes up is impersonation. Can you talk about how you've handled that to date, how it has evolved, and where you see it headed from here?

Yeah. A big piece for us from the start is that for all the content generated at ElevenLabs, you can trace it back to the specific account that generated it. So we have a pretty robust mechanism for tying the audio output to the account, and we can take action. So provenance is extremely important. And I think it will be increasingly important in the future, where you want to be able to understand what's AI content and what's not. Or maybe it will shift even a step deeper, where rather than authenticating AI, you'll also authenticate humans.

So you'll have on-device authentication that says, okay, this is Mati calling another person. The second thing is the wider set of moderation: is it a call trying to commit fraud or a scam? Or is this a voice that might not be authenticated? That's something we do as a company, and to what extent we do it and how we do it has evolved over time. So, moderating on the voice and on the text level. And then the third thing, stretching what we've started ourselves on the provenance component, is how we can train models and work with other companies to cover not only ElevenLabs, but also open source technology, which is, of course, prevalent in that space, and other commercial models.

And it's possible. Of course, as open source develops, it will always be a cat and mouse game as to whether you can actually catch it. But we've worked a lot with other companies and academia, like UC Berkeley, to actually deliver those models and be able to detect it. And that's kind of the guiding principle: especially now, the more we take the leading position in deploying new technology like conversational AI, and soon a new model, the more time we try to spend on understanding the safety mechanisms that we can bring in to make it as useful as possible for good actors while minimizing the bad actors.

So that's the usual trade-off there. Can we talk about Europe for a minute? Let's do it. Okay. So you're a remote company, but you're based in London. What have been the advantages of being based in Europe? What have been some of the disadvantages? That's a great question. The advantage for us was the talent, being able to attract some of the best talent. Frequently people say that there's a lack of drive among people in Europe. We haven't felt that at all.

We feel like these people are so passionate. We have, I think, such an incredible team. We try to run it with small teams, but everybody is just pushing all the time, so excited about what we can do. They're some of the most hardworking people I've had the pleasure to work with, and such a high caliber of people too. So talent was an extremely positive surprise for us, in how the team got constructed. And especially now as we continue hiring people, whether it's people across broader Europe or Central and Eastern Europe, that caliber is super high.

The second thing, which I think is true, is that there's this wider feeling that Europe is behind. And in many ways it's likely true: AI innovation is being led in the US, countries in Asia are closely following, and Europe is behind. But the energy of the people is to really change that. And I think it has shifted over the last few years. It was a little bit more cautious when we started the company. Now we feel the keenness, and people want to be at the forefront of that.

And I think getting that energy and that drive from people was a lot easier. So that's probably an advantage, where we can just move quicker. The companies are actually increasingly keen to adopt, which is helping. And as a company in Europe, really a global company but with a lot of people in Europe, it helps us deploy with those companies too. And maybe there's another and last flavor of that, which is Europe specific but also global specific. When we started the company, we didn't really think about any specific region, like we are a Polish company or a British company or a US company.

But one thing was true: we wanted it to be a global solution. Yeah. And not only from a deployment perspective, but also from the core of what we are trying to achieve, which is, how do you bring audio and make it accessible in all those different languages? So that was through the spine of the company from the start, from the core of the company. And that definitely helped us, where now we have a lot of people in all the different regions, they speak the language, and they can work with the clients.

And I think it likely helped that we were in Europe at the time, because we were able to bring in people and optimize for that local experience. On the other side, what was definitely harder is that in the US there's this incredible community. You have people with the drive, but you also have people who have been through this journey a few times. Yeah. You can learn from those people so much more easily. And there are just so many people who created companies, exited companies, or led a function at a different scale than most of the companies in Europe.

So it's almost taken for granted that you can learn from those people just by being around them and being able to ask the questions. That was much harder, I think, especially in the early days: to be able to ask those questions, or not even ask the questions, but know what questions to ask. Yeah. Of course, we've been lucky to partner with incredible investors over the years to help us through those questions. But that was harder, I think, in Europe.

And then the second is probably the flip side. While I'm positive there is enthusiasm now in Europe, I think it was lacking over the last years, and that did not contribute to us accelerating, which is something people are trying to figure out. There's the enthusiasm, but I think it's slowing it down. Yeah. But the first one is definitely the bigger disadvantage. Yeah. Should we do a quickfire round? Let's do it. Okay. What is your favorite AI application that you personally use? And it can't be ElevenLabs.

Or ElevenReader. It really changes over time, but Perplexity was, and I think is, one of my favorites. Really? For you, what does Perplexity give you that ChatGPT or Google doesn't? Yeah, ChatGPT is also amazing. I think for a long time it was being able to go deeper and understand the sources. That's why I hesitated a little bit on the "was" versus "is", where I think ChatGPT now has a lot more of that component.

So I tend to use both in many of those cases. For a long time, my favorite app was a non-AI application, though I think they are trying to build AI into it: Google Maps. I think it's incredible. It's such a powerful application. Let me pull up my screen. What other applications do I have? Well, sorry, while you're doing that: I will go to Google Maps and just browse. I'll just go to Google Maps and explore some location that I've never been to before. It's great. As a niche application, I like FYI. This is a will.i.am startup.

Oh, okay. Which is like a combination of... Well, it started as a communication app, but now it's more of a radio app. Okay. When curiosity is there, Claude is great too. I use Claude for very different things than ChatGPT. For any deeper coding elements or prototyping, I always use Claude, and I love it. Actually, no, I do have a more recent answer, which is Lovable. Okay. Do you use it at all for ElevenLabs, or do you just use it personally? No, that's true.

I think my life is ElevenLabs. So that's true for all those applications. Yes. All of these I use partly, big time, for ElevenLabs too. But yeah, Lovable I use for ElevenLabs. Every so often I will use Lovable for exploring new things too, which ultimately is tied to ElevenLabs, but it's great for prototyping. Very cool. And for pulling up a quick demo for a client. It's great. Very cool. So a little not related, I guess. Yeah. All right.

Who? What was your favorite one? My favorite one? You know, it's funny. Yesterday we had a team meeting, and everybody checked with ChatGPT to see how many queries they'd submitted in the last 30 days. I'd done about 300 in the last 30 days. I was like, oh, yeah, that's pretty good. Pretty good user. And Andrew similarly had done about 300 in the last 30 days. Some of the younger folks on our team were at a thousand plus. So I'm a big DAU of ChatGPT and I thought I was a power user, but apparently not compared to what some other people are doing.

I know it's a very generic answer, but it's unbelievable how much you can do in one app at this point. Do you use Claude as well? I use Claude a little bit, but not nearly as much. The other app that I use every single day, which I'm very contrarian on, is Quip, which is Bret Taylor's company from years ago that got sold to Salesforce. And I'm pretty sure that I'm the only DAU at this point, but I'm just hoping Salesforce doesn't shut it down, because my whole life is in Quip.

We use Palantir. I like Quip. Quip is good. It's really good. Yeah. No, they nailed the basics. Just nailed the basics. Great experience. All right. Who in the world of AI do you admire most? These are hard questions, not rapid fire, but I think I really like Demis Hassabis. Tell me more. I think he is always straight to the point. He can speak very deeply about the research, but he has also created so much incredible work himself over the years.

And he was, of course, leading a lot of the research work, and I like that combination, that he has been doing the research and is now leading it. Whether it was AlphaFold, which I think everybody agrees is a true new frontier for the world. While most people focus on one part of the AI work, he is trying to bring it to biology. I mean, Dario Amodei is of course trying to do that too, so it's going to be incredible what this evolves into. And then that he was creating games in the early days, and trying to find a way for AI to win across all those games, was incredible.

It's the versatility: he can lead the deployment of research, he is probably one of the best researchers himself, and he stays extremely humble and intellectually honest. I feel like if you were speaking with Demis, or Sir Demis, you would get an honest answer. And yeah, he's amazing. Very cool. All right, last one. Hot take on the future of AI. Some belief that you feel medium to strongly about, that you feel is underhyped or maybe contrarian. I feel like it's an answer that you would expect, maybe, to some extent.

But I do think the whole cross-lingual aspect is still totally underhyped. If you are able to go any place and speak that language, and people can truly speak with you, whether this is initially the delivery of content and then in the future the delivery of communication, I think this will change how we see the world. I think one of the biggest barriers in those conversations is that you cannot really understand the other person. Of course, it has a textual component to it, being able to translate it well, but then there's also the voice delivery.

I feel like this is completely underhyped. But do you think the device that enables that exists yet? No, I don't think so. Okay. It won't be the phone, won't be glasses, might be some other form factor? I think it will have many forms. I think people will have glasses. I think headphones will be one of the first, which will be the easiest. I mean, glasses for sure will be there too.

But I don't think everybody will wear the glasses. And then, is there some version of a non-invasive Neuralink that people can have while they travel? That'll be an interesting attachment to the body that actually works. Do you think it's underhyped, or do you think it's hyped enough, this use case? I would probably bundle that into the overall idea of ambient computing, where you are able to focus on human beings, technology fades into the background, and it's passively absorbing what's happening around you and using that context to help make you smarter, help you do things, help translate, whatever the case might be.

Yeah, I think that absolutely fits into my mental model of where the world is headed. But I do wonder what the form factor will be that enables that. I think it's pretty interesting. The enabling technologies that allow for the business logic and that sort of thing to work are starting to come into focus. What the form factor is, is still to be determined. Agree. But I absolutely agree with that. Yeah, maybe that's the reason it's not hyped enough. Yeah, people can't picture it. Yeah. Awesome. Mati, thanks so much.

Pat, thank you so much for having me. That was a great conversation. It's been a pleasure.
