Zapier’s Mike Knoop Launches ARC Prize to Jumpstart New Ideas for AGI

As impressive as LLMs are, the growing consensus is that language, scale and compute won’t get us to AGI. Although many AI benchmarks have quickly achieved human-level performance, there is one eval that has barely budged since it was created in 2019. Google researcher François Chollet wrote a paper that year defining intelligence as skill-acquisition efficiency: the ability to learn new skills as humans do, from a small number of examples. To make it testable he proposed a new benchmark, the Abstraction and Reasoning Corpus (ARC), designed to be easy for humans but hard for AI. Notably, it doesn’t rely on language.

Zapier co-founder Mike Knoop read Chollet’s paper as the LLM wave was rising. He worked quickly to integrate generative AI into Zapier’s product, but kept coming back to the lack of progress on the ARC benchmark. In June, Knoop and Chollet launched the ARC Prize, a public competition offering more than $1M to beat and open-source a solution to the ARC-AGI eval. In this episode Mike talks about the new ideas required to solve ARC, gives updates from the first two weeks of the competition, and explains why he’s excited for AGI systems that can innovate alongside humans.

Hosted by: Sonya Huang and Pat Grady, Sequoia Capital

Mentioned:

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: The 2022 paper that first caught Mike’s attention about the capabilities of LLMs

On the Measure of Intelligence: 2019 paper by Google researcher François Chollet that introduced the ARC benchmark, which remains unbeaten

Published: Jul 2, 2024
Uploaded: Jun 11, 2026

Full transcript


AI-generated transcript with timestamped sections.

0:00-1:32

[00:00] Right now, I think [00:02] what I see happening is there's sort of this mythical story of a very bad outcome once we get to, like, superintelligence, right? [00:08] It's a very theory-driven story. It's not grounded in empirical evidence. It's basically based on reasoning our way to this outcome. And I think the only way that we can really effectively and truly set good policy is to look at what the systems can and can't do, and then regulate and make decisions at that point about what they can or can't do. I think with anything else, you know, you're cutting off potentially really, really good futures way too early. [00:46] - Hi, and welcome to Training Data. We have with us today Mike Knoop, co-founder of Zapier. [00:51] Mike has recently stepped up to co-found and sponsor the ARC Prize, which is one of the most unique benchmarks in AI, one that measures a machine's ability to truly learn things intelligently [01:01] versus just parroting patterns in the training data. [01:04] We're excited to ask Mike for an update on how things are going with the ARC Prize two weeks in, [01:08] and to hear his views on why we need radically different approaches and benchmarks to achieve true general intelligence. [01:15] Mike, thanks for being here today. So, [01:20] we're excited to talk about the ARC-AGI initiative. [01:23] Before we get into that, [01:25] I'd love to spend a few minutes on your background at Zapier, because I think Zapier has emerged as

1:32-3:05

[01:32] probably one of the best examples of what [01:34] an existing [01:36] application company can do with the power of AI. [01:39] And the way that you guys were sort of early to that, the way that it's now kind of interwoven in the product, has been really interesting to watch. So [01:47] maybe can you just say a few words on what Zapier is and what your approach to AI has been at Zapier? [01:51] Yeah, Zapier is a workflow automation platform. We support 6,000 different integrations, things from Salesforce to Gmail to, [01:59] you know, basically any SaaS software that you can imagine, we connect with. [02:03] And I think the unique thing about Zapier is it's intended to be very easy to use for non-technical users. So the majority of Zapier customers, [02:09] users, would not self-identify as being a programmer or being technical, being an engineer, something like that. [02:15] I kind of think of myself as an engineer, and I find lots of interesting use cases for Zapier, but the large majority of our users [02:22] don't. And, um, [02:23] I think that's what is quite special about Zapier and why people tend to fall in love with it: the feeling of power and leverage that you get [02:31] as a non-technical user, being able to, [02:34] you know, have software do work for you, basically. And [02:36] I think that's, [02:37] in an interesting way, the exact same promise of AI. [02:40] Right? That's what people want of AI. They want software that's just going to do more work for them. [02:44] And so in many ways, I think the mission of Zapier and the mission and purpose of AI intersect. [02:50] And I've been, [02:52] I guess, call me AI-curious, all the way back to college. [02:55] And I gave a whole [02:57] all-hands at Zapier, I can't remember what year, but it was when the GPT-3 paper came out, and I showed that to the company. And so I've been tracking and following along the progress.

3:05-4:37

[03:05] But it really wasn't until, I think, January of 2022, when the chain-of-thought paper came out, [03:12] that [03:13] I saw that and it really surprised me, because I thought I'd priced in everything that AI language models could do up to that point. [03:19] And this idea of, you know, "let's think step by step," this chain-of-thought technique, using these language models as tools for reasoning instead of, you know, just a one-shot completion or chat engine, [03:30] um, [03:30] felt very special, and something that I think most people didn't expect they could do, even though the technology had been out there for over a year at that point. [03:38] And so that moment actually [03:40] caused me to give up my exec team role. I was running all of our product engineering at the time, and I went back to basically being an individual contributor at the company, as an AI researcher alongside my co-founder, Brian. [03:50] And I'm happy to talk more about that journey, but yeah, I think that's what caused Zapier to [03:54] be relatively early in terms of AI. [03:57] What are some of the things that you've put into the product that you're most proud of at Zapier [04:01] in terms of AI features? [04:05] You know, at this point, I think [04:08] Zapier is pretty... [04:09] I'd say there's probably two main places we've gotten a lot of value from AI. [04:14] The first is that over half the company, at an individual level, uses AI on a daily basis. And [04:19] I know this because we're actually measuring our own company's usage of Zapier's platform. So over half the company is building Zaps, building automations that use an AI step, like a ChatGPT step, in the middle, [04:29] either to do content generation or data extraction over unstructured text. [04:34] All sorts of really interesting use cases that we can talk about. 
In fact, one of the

4:37-6:08

[04:37] top internal use cases is probably getting us about, um, [04:41] I think, like, a 100x labor enhancement rate, which has been phenomenal. What is that? [04:47] Yeah, you want to talk about it? Yeah, yeah. I mean, 100x improvement. Yeah, I've got to talk about that. I think it's our personal high-water mark for what we've been able to achieve using AI internally from an operational perspective. [04:59] And [05:00] so Zapier has these things on our website called Zap templates. They're effectively recipes that help [05:06] users figure out what Zapier can do and help them get started. [05:09] And these templates, [05:11] in order to make them, they've all been historically handmade, because they require a bit of right brain and a bit of left brain. [05:17] They're very, [05:18] you know, creative. You have to inspire the end user, the customer, the would-be user: what could Zapier do for you? What's the outcome? What's the ROI you might get? [05:26] And then there's also a very technical way they have to be crafted and built as well. You have to map JSON fields from one integration app to another integration app to make sure it actually works. And together, that's what actually helps users get started and activate on the product. [05:39] And [05:39] we had a hole [05:41] of maybe a million of these things that we knew we wanted to build that we hadn't built yet, because they just take so much effort. The rate of production for our contractors was about 10 a day [05:50] up until last summer. [05:52] And we had a person, a member of, I think it was our partner marketing team actually, [05:57] with a background, by the way, in freelance writing, [05:59] who built a Zap, a system of actually several Zaps, using OpenAI in some middle steps here, [06:05] and built an end-to-end system so that whenever a new integration got launched

6:08-7:41

[06:08] on Zapier, it would automatically try to figure out the most interesting Zap templates that could be built, [06:13] write the inspirational use case behind it, because, you know, we have millions of these things already today, with lots of training data, and then also do the exacting field mapping as well. And [06:22] we moved [06:23] the human [06:24] in that workflow from the do loop to the review loop. So instead of having those contractors [06:29] actually generating them, thinking really hard about what these cases should be and building the field mappings, [06:34] now [06:35] they're [06:36] basically reviewing output from the system in a spreadsheet and saying, yes, this one is good, this one's bad, this one's good, this one's bad. [06:42] And the funny thing is, because the cost of generating these things is so low, we don't even try to fix the bad ones. We just throw them away and say, well, just generate another stack and throw them on top. [06:51] And so that rate of production is about a thousand a day now. So we've gone from, you know, 10 a day to a thousand a day per contractor. [06:57] We've been chipping away steadily at that hole of a million and keeping up with the launch of new integrations on Zapier. [07:04] I think one of the [07:05] main things that that showed me was, [07:08] um, [07:09] you know, probably the space you want to look at in businesses, if you're thinking about how to deploy AI, is, like, [07:14] top of funnel or bottom of funnel. You want to get something that's really close to an important conversion rate for your business. 
[07:19] And then, you know, I think if you can identify [07:22] any manual work that your organization does that has high volume, [07:26] um, [07:27] where a human is doing the work, I think those are opportunities to introspect and say, okay, is there an opportunity to get that human out of the do loop [07:35] and craft a system that can do the work, but then put the human in the review loop, which is still quite needed at the maturity level of the technology today? But,

7:41-9:13

[07:41] um, still phenomenal from an ROI perspective. [07:44] Are there any metrics you can share on the impact Zapier AI has had on the overall Zapier business? [07:50] The [07:51] biggest one today is we're just about to hit 10 million AI tasks per month. [07:57] I think we're at a run rate of about 10 million AI tasks per month. [08:00] And [08:01] I think what's... I think... [08:03] I would love to be shown wrong or corrected, if you know of examples. [08:08] But I think at this point, Zapier might be the biggest [08:11] automated AI platform in the world. In the sense that, you know, there's a lot of researchers, [08:15] entrepreneurs, builders who are trying to build these agentic AI systems where the AI is working without a human, you know, in the loop. [08:22] And, [08:23] um, yeah, I think at this point, with 10 million tasks a month, Zapier may be the biggest example of that in the world right now. [08:31] Really cool. [08:32] Can we talk about ARC-AGI? Let's do it. [08:34] Um, [08:36] maybe start with a recap of what ARC-AGI is. Why did you and François set out to establish this prize? [08:43] This was a follow-on to my AI curiosity. [08:48] You know, the reason I gave up my exec team role back in 2022 [08:51] was I kind of wanted to know for myself: are we on a path to AGI or not? It felt very important to know for Zapier's mission. [08:56] But also, just as a human, I was very curious and wanted to know, is this going to happen? You know, there's definitely some interesting scaling that's happening. Is that [09:03] sufficient to get to, I think what I naively had in my head is this, you know, superintelligent AGI? [09:10] And, [09:11] you know, surprisingly, what I learned is the answer is no.

9:13-10:44

[09:13] Um, [09:14] I... [09:17] I actually first got to know François Chollet, who is my co-founder on ARC Prize... [09:21] I first heard about him [09:23] and got exposed to his research back during COVID, actually. It was during 2020. He did, I think, another podcast where [09:28] he was explaining his 2019 paper, "On the Measure of Intelligence," where [09:33] he tried to formalize a definition of what AGI [09:36] actually is. I thought it was interesting at the time, but I kind of parked it, with lots of other stuff going on at Zapier and doing our own AI product building. [09:43] But as Zapier got more into building with AI, and I built my intuition of what language models could do and what the apparent limits were, [09:50] I started getting more into AI evals and trying to understand where the limits were. What could we expect from a product-building perspective? Where are our products going to tap out? [10:00] Where should we invest our engineering and research effort versus just wait for the technology to keep scaling and mature? [10:05] And, um, [10:06] you know, the thing I found was that most AI evals were saturating up to human-level performance, and it was accelerating. [10:11] And when I went back to look at the ARC eval from 2019, I [10:16] expected to see a similar trend. [10:18] And instead, what I found was basically the opposite. Not only had it not reached human performance yet, but progress was actually decelerating [10:24] over time. [10:25] And this was really supremely surprising to me. [10:30] And maybe it's worth defining, you know, [10:32] we use these terms like AGI, but what's the actual correct definition for it, right? I think there's a [10:37] kind of popular definition in the world today. Actually, there are probably two schools of thought. I think one school of thought that I see is that AGI is undefinable and we shouldn't even try.

10:44-12:16

[10:44] This is quite a popular perspective. And I think the other school of thought is, um, [10:49] you know, [10:50] that AGI is a system that can do the majority of economically useful work humans do. This was popularized by OpenAI and the Microsoft deal. This is actually in their deal together: [11:00] once this is achieved, OpenAI retains all the future IP, which is very interesting. I think Vinod Khosla actually might get credit for coining that definition. [11:07] But nonetheless, I think because of OpenAI's success, [11:10] that definition has sort of become accepted by a lot of people as a target and goal we should shoot for. [11:16] The challenge is, I think it's a fine goal, by the way, and I think current model architectures may be within spitting distance of it, [11:22] but I think it says way more about what the majority of humans do for work, if it's a true goal, than about what AGI actually is, though. [11:29] And François defines AGI as the efficiency of acquiring new skill. [11:36] That's it. [11:36] And [11:37] here's a quick [11:38] thought experiment I think you can use to try to kind of grok this, [11:42] which is: we've had AI systems now for many years, five-plus years, that can beat humans at games like Go, [11:48] chess, poker, Diplomacy even. [11:51] And, [11:52] um, [11:53] the fact remains that [11:55] you cannot take any one of the systems that was built to beat one of those games [11:58] and simply retrain it with new data, new experience, [12:02] to beat humans at another game. Instead, what researchers and builders and engineers have to do is [12:06] go back to zero. They have to tear it all down, [12:09] think of new algorithms, new architectures, new ideas, of course new training data as well, and [12:14] often new amounts of scale

12:16-13:46

[12:16] in order to beat that next game. [12:19] And yet this is in complete contrast to how you two both learn, right? I could sit you both down here and teach you a new card game in probably about an hour. I could probably show you a new board game and get you up to proficiency within a couple of hours. [12:31] And [12:31] that fact, [12:33] I think, is highly representative of what makes you generally intelligent. It's your ability to very quickly and efficiently [12:40] gain skill in order to accomplish some open-ended or novel task that you've never encountered before in your life. [12:46] And that's what's special about ARC. So ARC-AGI is an eval that tries to take that definition and actually measure it, [12:52] and it was designed specifically to resist the ability to memorize [12:56] the benchmark, which is very different from most other AI evals that are out there. [13:00] Every task is completely novel, and there's a private test set that no one's seen outside of a handful of people who have taken it to verify that, you know, all of the puzzles are solvable. [13:08] And that degree of novelty, and that degree of never having been seen before, is what makes ARC a really, really strong benchmark for trying to distinguish between, you know, this more narrow AI that can be beaten largely through memorization techniques, and AGI, which is, you know, a system that can very, very rapidly and efficiently acquire the skill at test time. [13:25] What is the definition of efficiency? [13:29] I imagine there's a compute component, a data component. What's the definition of "efficiently" in "efficiently acquire new skill"? Yeah, François's, uh, [13:36] I'll probably do a bad job trying to summarize his research. If you want to read more, by the way, his "On the Measure of Intelligence" paper is the source of truth for all of this stuff. I think it's really, really good. [13:45] Um, and

13:47-15:19

[13:47] I think one other... [13:48] Before I get to the answer, I think one other important thing to see is that ARC has been unbeaten since 2019, [13:53] and I think its endurance to date [13:55] is probably the strongest empirical evidence that the underlying concepts of the definition are correct, which is why I think it's worth paying attention to and why it's such a special eval [14:03] and special set of research. [14:05] So I think François would describe efficiency as the ability of a system to translate from core knowledge priors [14:12] to being able to attack [14:15] the search space, or task space, around them. And so a very weakly generalizable system is only going to be able to take on very near, adjacent tasks to the core data, the core knowledge, that that system was trained on. Whereas a highly generalizable system is going to have a much larger field of tasks and novelty that it's able to attack and [14:38] be able to effectively do with a small set of training data. [14:42] And that's what we hope to see with the eventual solution for ARC-AGI as well. [14:47] If someone's able to beat it, the goal is to get to 85% on the eval. [14:50] Today's state of the art is, I think, 39% as we record this. [14:54] And I think what's special is, [14:55] if someone can actually beat ARC at the 85%, [14:58] um, [14:59] that would mean that you've created a computer program that can, um, [15:03] be trained on this very small set of core knowledge priors, things like goal-directedness, objectness, symmetry, rotation. [15:09] These are things that emerge very early in childhood development. [15:13] And it would be able to use those core knowledge priors and recombine them and synthesize them into new programs in order to solve

15:19-16:51

[15:19] tasks with exacting accuracy [15:21] that that system has never seen before, never been exposed to in its training. And that would be a really, really important thing for, [15:27] particularly, the application layer, where the number one problem today is hallucination, accuracy and consistency. And that results in this low user trust, which limits deployment of real AI right now. [15:37] You have some peculiar rules for competing for the prize. I think there's a limit on how much compute you can use. [15:42] You can't use the internet. [15:44] I don't know if you can use GPT-4 and closed models. [15:47] Why set those? Why put those limits in place? [15:51] Yeah, so the two big ones are, you're right, the competition is hosted on Kaggle, [15:57] and Kaggle enforces no internet, [15:58] and you have limited compute. [16:00] So specifically, you get one P100 for 12 hours. And no internet means you can't use frontier closed models that are available through APIs, like Claude Sonnet, or Gemini, or GPT-4o. [16:13] Um, [16:14] maybe I'll take them in order. I think the compute one is maybe more interesting. The reason for the compute limit is to target efficiency, first and foremost. [16:22] Because if there wasn't any compute limit at all, [16:24] then [16:25] you could simply define AGI as just a system that can acquire skill, [16:30] with no degree of efficiency attached to it. And if that was true, [16:33] that would mean that this system could brute-force basically every possible program, think through every possible future [16:38] outcome here, generate every single possible ARC-type puzzle, [16:41] and use that in order to win the challenge. And we know that's not actually what happens in human general intelligence. [16:46] You can read more in François's paper about why, but the way that I think about it is,

16:51-18:25

[16:51] you know, you can think about it... [16:52] You can even introspect on yourself while you're taking the ARC puzzles [16:55] and see that when you're trying to solve one of these, [16:58] you're not [16:59] brute-forcing every possible transformation, trying to recognize the pattern and apply it to the test. [17:04] Instead, you're using your intuition, you're using your prior experience, [17:07] to try and identify maybe three, four, [17:10] five possibilities of what the pattern is, and then you check them, right, in your head. [17:15] And I think this shows that [17:17] the sort of efficiency humans have is not brute-forcing every possible, you know, solution and checking. There's actually a degree of efficiency. So [17:23] the compute limit, [17:25] um, [17:26] sort of forces researchers to reckon with that definition. [17:30] Now, I think it is worth acknowledging that we don't know exactly how much compute is necessary to beat ARC, and we're going to keep upping the compute bar over time, is what I expect. [17:39] For example, [17:39] we already upped it over 2x from prior versions of the competition. I think previously you got somewhere between two and five hours to run on the GPU, and we bumped that up to 12. [17:50] Interestingly, all of the state-of-the-art techniques are now actually maxing out that 12-hour runtime as well. [17:54] So I do expect we'll continue to increase it over time, but I think it is an important [17:58] tool in order to force the generality out of the full solution that we're looking for. [18:03] And the no-internet rule is for more of a practical reason. You know, we're trying to reduce cheating, reduce contamination, reduce overfitting, make sure no one is able to leak the private test set, [18:11] and [18:13] largely just increase confidence that when we reach the 85% grand prize mark, someone has actually [18:18] beaten ARC, 
[18:19] and be able to say, with some sense of authority and confidence, that that's a true statement.
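
A back-of-the-envelope sketch of the brute-force point in the compute-limit discussion above: if a solver enumerated every sequence of primitive transformations, the number of candidate programs would grow exponentially with program length. Only the 12-hour limit comes from the conversation. The primitive count (50) and the evaluation rate (one million candidates per second) are illustrative assumptions, not figures from the episode or the competition.

```python
def count_programs(num_primitives: int, max_length: int) -> int:
    """Count every sequence of 1..max_length primitives (repeats allowed)."""
    return sum(num_primitives ** length for length in range(1, max_length + 1))


def max_length_within_budget(num_primitives: int, budget: int) -> int:
    """Longest program length whose full enumeration fits in `budget` candidates."""
    if num_primitives < 2:
        raise ValueError("need at least two primitives for a meaningful search")
    length, total = 0, 0
    while total + num_primitives ** (length + 1) <= budget:
        total += num_primitives ** (length + 1)
        length += 1
    return length


if __name__ == "__main__":
    HOURS = 12                         # from the conversation: 12 hours of runtime
    CANDIDATES_PER_SECOND = 1_000_000  # assumption: hypothetical evaluation rate
    NUM_PRIMITIVES = 50                # assumption: hypothetical DSL size

    budget = HOURS * 3600 * CANDIDATES_PER_SECOND
    for max_length in (2, 4, 6, 8):
        print(max_length, count_programs(NUM_PRIMITIVES, max_length))
    print("deepest exhaustive search:", max_length_within_budget(NUM_PRIMITIVES, budget))
```

Under these assumed numbers, exhaustive enumeration runs out of budget after only a handful of composed steps, which is one way to see why a fixed compute limit pushes entrants toward something more selective than checking every program.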

18:25-19:56

[18:25] One of François's and my goals for ARC Prize is to establish [18:30] a public benchmark of progress towards AGI, or maybe the lack of progress towards AGI, [18:36] and have it be a trusted public tool that, you know, policymakers, students, [18:42] entrepreneurs, venture capitalists, [18:44] employees, everyone can look at to get a sense of [18:47] how close or far away we are from this important technology existing. [18:52] And then using that insight to help drive more AI researchers to work again on exploring new ideas, which is something that's, [18:58] unfortunately, [18:59] kind of fallen out of favor in the last several years as LLMs have taken off. [19:03] What have you seen, or maybe what do you expect to be true, [19:07] about the efforts that are successful, [19:09] or more successful, toward ARC-AGI [19:13] that makes them different from what we're seeing out of the frontier models and the big research labs? [19:17] Yeah. [19:18] So it gets into the details of how an LLM works, because that's kind of the bet most frontier AI research labs have been making the last several years: we're going to scale up language models, and more scale, more data is going to get us AGI. [19:29] Even though that's the dominant story, I actually don't think it's what most of the [19:33] labs actually believe internally. Most of them are working on new ideas. [19:36] So I think there's an interesting story there, but it is definitely in their interest to promote a very strong narrative of, like, scale is all you need. Don't compete with us. Yeah. You know, we're just going to steamroll you. Nothing to see here. Yeah. Yeah. [19:46] I think there's true competitive dynamics that have emerged in the market that [19:50] are, [19:51] unfortunately, I think, shaping [19:54] a lot of attention, investment,

19:56-21:39

[19:56] um, [19:57] effort away from exploring new ideas. And if it is true [20:01] that new ideas are needed, which I believe it is, and I think ARC-AGI and ARC Prize show that [20:06] at least some new idea is needed, [20:08] then, due to the competitive dynamics that emerged in the market over the last couple of years, we're kind of headed in the wrong direction, [20:13] right? There's, like, [20:15] um, [20:16] all the frontier research has basically gone closed source. The GPT-4 paper had no technical details shared in it. The Gemini paper had no technical details shared on its longer-context innovation, [20:25] things like that. [20:26] And yet this is in [20:28] direct contrast to the history of how we even got here today: [20:31] the chain of research that led from, you know, Ilya's sequence-to-sequence paper at Google, out to Jacobs University, back to Google, then to Alec Radford, and back to Ilya at OpenAI. [20:43] There's a six- or seven-year chain of research that [20:46] only happened because of open sharing, open progress, and open science. [20:50] And I think it's [20:52] a bit unfortunate that we don't really have that right now, again, somewhat just due to the market dynamics and commercial success of language models kind of [21:00] forcing a lot of that frontier research to close up. [21:03] So, yeah, one of the goals of ARC Prize is to help counterbalance a lot of those things. [21:07] You were asking about the difference between what it might look like. What you said resonates, because it seems like a lot of the [21:15] foundation model companies are going down very similar, [21:18] somewhat clearly defined paths. [21:21] And I'm sure that internally there's all sorts of work being done to find the next breakthrough in architecture. But in terms of what's working today, they're all on fairly similar paths. 
And I imagine that what works for the sake of ARC-AGI is going to be a little bit of a different shape. And I'm wondering if you're starting to see what shape that may take.

21:39-23:14

[21:39] Got it. And have a sense for what may be different about this [21:43] more general architecture than what we're seeing out of the foundation models. [21:46] Great. So I think, um, [21:49] you know, [21:50] a useful shortcut for how to think about language models is that they are [21:56] effectively doing very high-dimensional memorization, right? They're able to [21:59] train on and memorize tons and tons of examples and apply them in slightly adjacent contexts. [22:05] I don't want to dismiss language models too much, because I think that they are very special, simply very magical, and something that has lots of economic utility. Zapier is an existence proof of that fact alone. So, [22:14] um, you know, I don't want to throw them under the bus too much. I think there's some really good things that they have unlocked, [22:19] as the technology goes, [22:22] but there are limits to it. And, you know, the limits are, [22:26] um, [22:27] you know, not being able to effectively leverage its training data, to [22:31] compose it or combine it at test time, to go attack and accomplish novel tasks that it had never seen before [22:38] in its training data. And that's what ARC shows, right? This is a skill that these language models don't possess. [22:45] I think it's maybe [22:46] useful to look at the history of the high score so far and where we expect it to go. So when the eval was first introduced in 2019, 2020, there was a small Kaggle competition that ran to get a baseline, and it was 20%. [23:00] And from 20% to 30%, [23:01] the techniques that worked were effectively, [23:04] um, [23:05] researchers handcrafted a domain-specific language by looking at the puzzles that were part of the public test set. There's two test sets. There's a public set and then the

23:14-24:46

[23:14] private set that the scores are measured on. [23:15] And they looked at the public test set and they tried to infer and write down programs, in Python code or C# or whatever, for [23:21] the individual transformations that you do in your head to go from, you know, one puzzle to the next. [23:28] They called this a DSL, and then they wrote a brute-force search [23:31] to try and search through all possible permutations and combinations of those subprograms, to find the general pattern and then apply it in real time. [23:39] And that got to about 30%. [23:41] What's gotten from 30% up to close to 40% now is a slightly different technique. [23:46] This is Jack Cole, and his approach is effectively using [23:49] a code-based open source language model [23:52] and doing test-time fine-tuning. [23:55] He has some pre-training done on the CodeGen model, and then at test time, [23:58] taking the puzzles they get, the novel puzzles that have never been seen before, and [24:03] permuting variations of them, and then training this CodeGen-based model to write a program that fits the pattern, and then applying it at test time. And that's gotten to 40%. [24:14] I suspect that we probably have... [24:18] I bet we have the ideas in the air already to get to the 50% mark, [24:22] maybe a little beyond the 50% mark, without a lot of [24:26] new innovation. I bet just ensembling or combining [24:29] the existing sets of ideas that have already worked toward ARC probably gets you about halfway. [24:34] I think to get to the 85% mark or beyond, the ultimate solution probably looks more like the shape of, [24:43] at least to solve ARC, something that looks like a

24:46-26:17

[24:46] a deep-learning-guided DSL generator, where, instead of hand-coding and handcrafting the DSL [24:53] ahead of time by trying to infer from the public test set what those subprograms would be, you need something to generate that DSL dynamically, right? By looking at the puzzles in real time [25:02] and being able to learn from past puzzles [25:05] and apply that towards future puzzles. This is also another important thing humans do when they're going through the ARC set. [25:09] Sometimes the first or second puzzle is actually a little trickier because you're orienting yourself around: what am I doing? What task am I doing? [25:14] What does the possible solution space look like? And then as you get further into the task set, they tend to get a little easier because, you know, [25:21] the space of possible transformations is just finite. So you start kind of recognizing patterns there, [25:26] and then combining that with some sort of deep-learning-based program synthesis engine, [25:30] something that can [25:32] not brute-force all possible programs of how to combine those DSLs together, but something that has some sort of [25:38] deep learning approach to shape which [25:41] program traces you try to generate or test and try against the pattern. And then it kind of goes back to this [25:47] human introspection of how we take the puzzles, which is, we're not brute-forcing all possible [25:52] programs in our head. Instead, we're trying to identify just a handful of likely candidates and then testing those deterministically in our head and applying the one that works.
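
The handcrafted-DSL-plus-brute-force-search approach described above can be sketched as follows. This is only a minimal illustration, not the code of any competition entry: the four grid primitives, the names `DSL`, `run`, and `search`, and the depth limit are all assumptions chosen for the example. It enumerates every sequence of primitives up to a fixed length and returns the first one that reproduces all the demonstration pairs.

```python
from itertools import product

# Grids are tuples of tuples of ints so they can be compared with ==.

def identity(grid):
    return grid

def flip_h(grid):
    # Mirror each row left-to-right.
    return tuple(tuple(reversed(row)) for row in grid)

def flip_v(grid):
    # Reverse the order of the rows (top-to-bottom mirror).
    return tuple(reversed(grid))

def transpose(grid):
    # Swap rows and columns.
    return tuple(zip(*grid))

# Hypothetical toy DSL: a named set of primitive transformations.
DSL = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}

def run(program, grid):
    """Apply a program (a sequence of primitive names) to a grid."""
    for name in program:
        grid = DSL[name](grid)
    return grid

def search(demos, max_depth=3):
    """Brute-force search: try every primitive sequence up to max_depth
    and return the first one consistent with all (input, output) demos."""
    for depth in range(1, max_depth + 1):
        for program in product(DSL, repeat=depth):
            if all(run(program, x) == y for x, y in demos):
                return program
    return None

# One demonstration pair: the output is the input rotated 90 degrees clockwise.
demos = [(((1, 2), (3, 4)), ((3, 1), (4, 2)))]
program = search(demos)
print(program)
print(run(program, ((5, 6, 7), (8, 9, 0))))
```

The search cost grows exponentially with program length and DSL size, which is the motivation given in the conversation for replacing exhaustive enumeration with a learned model that proposes only a handful of likely candidates.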
[25:59] It's really interesting that code generation and program synthesis kind of underlie [26:03] all of the methods you just talked about, [26:05] and there's something very special there. Like, [26:08] program synthesis is very general and allows you to actually [26:11] get closer to that definition of generalized intelligence that you mentioned at the beginning. [26:15] It's very exacting.

26:17-27:49

[26:17] And I think this is [26:18] one of the reasons why the solution to ARC-AGI is going to be [26:23] useful, [26:24] um, [26:26] very quickly. So, you were talking about this before, there's a history of toy AI benchmarks [26:31] over the last 10 to 15 years that, [26:33] you know, kind of looked like ARC. There were games, there were puzzles, [26:36] and they really never amounted to much in terms of being beaten. They all got handily beaten as scale emerged, and [26:41] they really didn't add to our understanding of how to build useful AI systems. [26:47] And so one of the common questions I've been fielding the last couple of weeks is: what's different about ARC? You know, isn't that likely to just happen here again? [26:54] And [26:55] I think the reason why we're likely to see [26:58] something much more useful, assuming we get a really good solution to ARC from the first grand prize win, [27:05] is that the number one problem at the application layer, and we see this with Zapier too, with our new AI bots that we launched a couple of months ago, [27:13] um, [27:14] it's been surprising to me how that has gotten adopted by our users. Um, [27:18] you know, how I kind of describe it, there's a lot of concentric rings of use cases that you can use AI automation for. [27:25] And what we're seeing is people are restricting the use cases for the AI bots, where they're fully automated, totally hands off, [27:32] to the use cases where there's a very low [27:36] need for user trust, or where the sort of... [27:39] Let me say that a different way. [27:40] If, um, [27:42] if it goes wrong, it's not catastrophically bad. [27:44] So they deploy it for use cases like personal productivity or team-based workflow automation.

27:49-29:19

[27:49] Things where, you know, if it's wrong, or it's right only nine out of ten times, or it takes me, [27:54] you know, maybe a couple of days to really work with the system to do the prompt engineering to steer it towards getting maybe 95 to 99% reliability, [28:01] that's acceptable because the risk of being wrong is just quite low. In order to get [28:06] much higher up and expand the number of concentric rings to, you know, moderate-risk to high-risk deployment scenarios where we want these systems working autonomously, [28:13] we're going to need... [28:14] the main thing that is missing is user confidence and the exacting nature of what it can do and what it can't do. [28:20] This is what Zapier core, classic, [28:23] gives us, right? It's a deterministic engine to execute automation. So, you know, once you build it and set it up, it's going to do the exact same thing every single time. [28:29] But that's also what makes it fragile and hard to use. [28:32] And in contrast, you know, these AI-core, LLM-based systems that are totally autonomous have the opposite set of tradeoffs, right? They're much easier to use. [28:40] You just steer them, guide them, and fix them entirely through natural language, but [28:43] because the accuracy is still inexact, [28:46] confidence is low. And I think that's what [28:47] ARC gets us. A solution to ARC at 85% or 100% [28:51] means that you've written a computer program that can generalize from a [28:54] very simple kernel of priors to solve, with exacting accuracy, 100% reliability, these sight-unseen puzzles. And I think that, [29:02] that tool, [29:04] as a... [29:04] that will be a new tool in the programmer's toolkit, basically, in terms of building products and building systems that can achieve that same thing. [29:11] We're two weeks in, I think, from when you launched ARC Prize. What have you learned so far? What types of people are working and competing

29:20-30:50

[29:20] on this? Is it the pedigreed researchers at the big labs? Is it [29:24] scrappy hustler types? Like, who's competing? [29:26] How many teams are submitting solutions? Yeah, let's see. [29:30] The response, by the way, after launch was phenomenal. It was much bigger than we expected. I think we were trending on Twitter twice during the launch week, [29:37] the number one Kaggle competition in the world, [29:39] over a million social views, I think, over all the launch channels. So just very phenomenal. I'm really thankful for everyone who helped [29:45] promote and share ARC. Hopefully we can actually get a solution here in some short time. [29:51] Um, [29:53] I think... [29:55] The most interesting thing about the folks that are working on it, there's probably a historical answer here and then what I've seen over the last two weeks. [30:01] So the historical answer is most of the people that have worked on ARC are outsiders to the field. [30:06] This is not actually the first year that there's ever been a contest. There was a past [30:10] competition called ARCathon. It was a little smaller. It was hosted by, um, [30:13] Lab42, an AI lab in Switzerland. [30:16] Um, [30:17] and so last year there were actually 300 teams that worked on trying to beat ARC, and again, no one beat it. [30:24] And almost, [30:25] to my knowledge, all of those teams were effectively individuals or, you know, outsiders in some way. They're not, you know, people at big AI labs. They're [30:34] folks with backgrounds in, [30:36] you know, engineering, mechanical engineering, or video game programming, or physics, [30:40] um, folks that just kind of got curious and interested in the problem at hand. [30:44] And I actually think that's more likely than not where the breakthrough for ARC is going to come from. I think it's going to come from an outsider, somebody who...

30:51-32:22

[30:51] Like, [30:52] somebody who just thinks a little bit differently or has a different set of life experiences, so they're able to, you know, cross-pollinate a couple of really important ideas across fields. [30:59] Um, [31:00] that's one of the reasons why I put as much money as we did into ARC Prize. I felt like the progress was idea-rate-limited, actually. [31:07] And one of the best ways to increase the amount of ideas is to try and blow up awareness, which is what the launch kind of did. [31:13] Over the last two weeks, [31:15] um, [31:16] I've kind of seen two camps of people, at least on Twitter, emerge. I think there's one camp of people who are, uh, [31:22] you know, in it for the mission. They agree with the underlying concept. They think that we do need some new ideas. They're excited to try and figure out what those are. [31:30] And then there's a second group of people that are sort of like, I'm going to prove you wrong. LLMs are definitely enough. Scale is definitely what we need. [31:36] And I'm going to do my best to go beat this benchmark just using existing off-the-shelf technology [31:40] and prove you wrong. So I'm actually quite happy for both those camps to exist. One of those approaches is currently up in the leaderboards, right? So, yeah, we can break some news here. So this week, this Thursday, we're launching, or I guess when this comes out, we'll have launched just a couple of days in the past, [31:57] a brand new public task leaderboard. [32:00] So we talked about how ARC doesn't allow internet access and there are compute limits. [32:05] I know personally how unsatisfying that is, to not be able to use frontier models, though. I also want to know how good GPT-4o, how good Claude Sonnet, can do against [32:13] this benchmark.
[32:14] Um, and also because, um, you know, of the compute limits, it is also a bit of a barrier to entry, right? You have to use open source models, you have to do, um, just like...

32:22-33:55

[32:22] It's quite a bit of interesting work you have to do before you can even start testing and experimenting. [32:26] So we're launching a new public task leaderboard. It's going to be a secondary leaderboard. We're going to be committing about $150,000 for a reproducibility fund [32:35] towards this secondary leaderboard. [32:37] It won't be officially part of the competition this year. [32:39] We want to maintain that aspect of assurance on cheating, contamination, and overfitting with the private test set. And that's also the test set that has the most empirical evidence against it over the last four years. [32:52] The secondary leaderboard is going to allow folks to [32:56] basically submit scores against the 400-task public set. And we'll verify and reproduce the scores locally to sort of ensure good, like, fitting with the approach, [33:04] and we'll publish that. And you're right, I think the top score, or one of the top scores on that, is this guy, Ryan Greenblatt. [33:11] He came out a couple of days after the competition launched [33:14] with a pretty interesting, novel approach towards beating it, and he's using GPT-4o, [33:19] and, [33:20] but not just GPT-4o. I think the interesting thing is he created [33:24] an outer loop around 4o, where he is [33:27] using 4o to generate [33:30] programs, or sample from GPT-4o these programs, these reasoning traces, [33:35] to, um, [33:37] beat the task or identify the patterns, then testing these patterns [33:40] against the demonstrations and then finding the one that works and applying it. [33:45] And that approach seems to be getting in the low 40s, maybe 40%, 41%, somewhere in that range. [33:51] And [33:52] it's pretty interesting because I think it's...

33:55-35:40

[33:55] You know, someone might look at that, I think, and [33:58] at first blush say, well, isn't that evidence that scale's all you need? [34:01] And I do think there is something interesting there, right? It's just showing that, hey, the more training data these things have, [34:05] the more, um, [34:07] you know, programs they can spit out that might be kind of right. [34:10] Uh, [34:10] but it also shows, I think, [34:12] that new ideas are still needed. [34:14] Huh. [34:14] Like, this outer loop is novel. That might actually be frontier [34:19] LLM reasoning that Ryan published. [34:22] And similar to ARC Prize, we're going to open source all the code for all the reproducible solutions so folks can take these and apply them and try to reproduce them, sort of using open source models, for the closed private data set. [34:38] But yeah, I do think it's pretty interesting [34:42] how much innovation you get when you... [34:44] how much innovation we've gotten over the last few weeks just as a result of putting even just the awareness out to the public. [34:50] Are the folks at the big research labs... like, why are they not working towards this benchmark? Because it almost... [34:55] Like, when you explain the benchmark, it seems so clear that [34:59] obviously this is the thing you want to solve. You don't want to solve the memorizing-the-textbook [35:03] use case. [35:04] Why do you think the folks at the big research labs aren't trying to solve this benchmark, or are they? [35:09] So I am aware of a handful of big AI labs that have tried in the past, several years ago. So, you know, this was perhaps at smaller scale with weaker models and things like that. I would... [35:21] One of the things I would hope is that more do in the future, actually.
I'd love to see if we could make ARC-AGI an actual measure on some of the model cards that get reported against future models. I think that would be a really cool thing. We're willing to do it. So if anyone is listening to this and wants to reach out and make that happen, I'm more than happy to work with them and find some way to do that.
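
The sample-and-check outer loop described earlier in this answer (around [33:24]) can be sketched as follows. This is a hedged illustration only, not Ryan Greenblatt's actual code: `sample_candidates` is a hypothetical stand-in for calls to a language model, and the candidate pool, function names, and sample count are assumptions made for the example. The idea shown is simply to draw many candidate programs, run each against the demonstration pairs, and apply the first one that reproduces all of them.

```python
import random

def sample_candidates(n, rng):
    """Hypothetical stand-in for sampling n candidate programs from an LLM.
    Here it just draws from a fixed pool of Python source strings."""
    pool = [
        "def transform(grid):\n    return [row[::-1] for row in grid]",
        "def transform(grid):\n    return grid[::-1]",
        "def transform(grid):\n    return [[c * 2 for c in row] for row in grid]",
        "def transform(grid):\n    return grid[0]",   # wrong output shape
        "def transform(grid) return grid",             # syntax error
    ]
    draws = pool * (n // len(pool) + 1)
    rng.shuffle(draws)
    return draws[:n]

def compile_candidate(source):
    """Turn source text into a callable, or None if it does not compile.
    A real system would sandbox this step; exec on untrusted code is unsafe."""
    namespace = {}
    try:
        exec(source, namespace)
    except Exception:
        return None
    fn = namespace.get("transform")
    return fn if callable(fn) else None

def passes(fn, demos):
    """Deterministic check: does fn reproduce every demonstration pair?"""
    try:
        return all(fn(x) == y for x, y in demos)
    except Exception:
        return False

def solve(demos, test_input, n_samples=50, seed=0):
    """Outer loop: sample candidates, keep the first that fits the demos,
    and apply it to the test input. Returns None if nothing fits."""
    rng = random.Random(seed)
    for source in sample_candidates(n_samples, rng):
        fn = compile_candidate(source)
        if fn is not None and passes(fn, demos):
            return fn(test_input)
    return None

# One demonstration pair: each row is mirrored left-to-right.
demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(solve(demos, [[5, 6, 7]]))
```

In this sketch the model only proposes; correctness comes from the deterministic check against the demonstrations, which is why drawing more samples can raise the score without changing the underlying model.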

35:40-37:12

[35:40] Um, [35:42] if I had to guess... [35:44] Well, let me tell you, [35:46] let me not guess, and let me say what I have more confidence in. [35:52] Once I got exposed to François' work again and was thinking deeply about ARC-AGI, I started surveying a lot of my friends and researchers in SF, in the Bay Area, [36:02] about whether they had heard of François and whether they had heard of ARC. [36:05] And François has pretty good name recognition because he's really, really big on Twitter, been big on Twitter for many years. [36:11] Probably 9 out of 10 people I talked to knew who François Chollet was. [36:15] Maybe 1 in 10, 2 in 10, had heard about the ARC-AGI eval. [36:19] And probably half of those [36:21] were confused because there's like five other AI evals called ARC. [36:25] And I had to, you know, disambiguate with them. [36:30] So it had really low awareness. [36:32] This is one of the first things I asked François about when I met him for the first time in person this year. [36:37] Why do you think that is? Why do you think you have such high awareness, but ARC has such low awareness? And [36:43] his answer effectively was [36:46] that it's hard. [36:47] You know, the way that benchmarks [36:50] gain popularity and notoriety is we make progress towards them, right? Researchers are working against it. Somebody has an idea, they have a breakthrough, they publish that in a paper, that paper gets picked up and cited by others, that generates awareness and attention. Other researchers say, ooh, interesting, okay, something might be possible now on this really hard benchmark. [37:08] And so you get the snowball effect, right, of attention. And

37:13-38:47

[37:13] because ARC has sort of endured with very low rates of progress, in fact, decelerating progress, [37:18] over the last, you know, four years, [37:20] I think anyone [37:22] in a lab looking at that would just say, well, maybe the time's not right for you. Maybe we don't have the idea set in the world. Maybe we don't have the scale we need yet in order to beat this thing. [37:30] And it looks like a toy, and, you know, I don't fully understand why, I don't get the necessary importance or how it's qualitatively different. I just haven't spent that much time. I've got a million other benchmarks I could use. [37:40] I think that's somewhat of the dynamic that has existed in the past, and [37:46] it is, again, one of the reasons why we launched ARC Prize, right? I think [37:49] there are lots of market [37:52] tools you can use to shape markets and shape innovation. And I think prizes do have a narrow, [37:57] um, [37:58] there's a narrow spot where prizes can be outrageously effective. And it's where the idea is small and it's idea-rate-limited. One person or a small team can make that breakthrough, and [38:08] it's very quickly and easily inspectable, reproducible, and you can build on top of it very rapidly. [38:14] And all those boxes got checked. [38:18] Yeah, that's one of the reasons why I decided [38:20] to go do ARC Prize. And you've mentioned [38:23] curiosity around AI or AGI dating back to college, that was sparked a few years ago in the context of Zapier and [38:30] has kind of been nurtured ever since. [38:32] Um, [38:34] beyond the curiosity, I'm curious why this is important. [38:37] Meaning, if you could paint the picture of [38:39] what life looks like [38:41] for the world [38:43] post-AGI, where we've defined it as the ability to efficiently acquire new skills.

38:48-40:20

[38:48] Um, [38:49] what do you think that version of the future looks like? [38:51] Why is this an important thing to [38:54] solve? [38:55] The thing that I [38:56] feel like I have a unique insight into at this point, having spent a lot of time thinking about ARC and this AGI definition, [39:03] is, [39:04] I suspect the advent of AGI is going to look very different than most people expect, especially the group who are in the camp that AGI is undefinable because it's so mythical and, [39:16] you know, scary or big or awesome that we can't even hope to ever define it. It's just going to be this magical, special thing. [39:22] And, [39:24] you know, something that I believe quite deeply is that definitions are really important, [39:29] because definitions [39:31] allow us to create benchmarks. And benchmarks allow us as a society to measure progress [39:37] and set goals towards things that we care about [39:41] and want to happen. [39:42] And [39:43] this idea of efficiently acquiring skill, you know, we've talked about it a handful of times today, but one of the direct, [39:48] near-term things that you get from that is you get systems that can do exacting-accuracy generalization from a small set [39:55] of core priors and apply it towards novel solutions. [39:58] That is, again, the number one problem that [40:01] rate-limits AI adoption for more real-world use cases today. [40:05] And so that's what you're gonna see. You're gonna see basically the application layer of AI [40:10] get amazingly good at accuracy, consistency, low hallucination rates, [40:14] which is going to allow us to use it [40:16] in a much more unfettered way, in a much more trusted way,

40:20-41:50

[40:20] because... [40:22] because of the underlying way in which it's built. [40:26] So I think that's... [40:30] The reason why I think that's important [40:32] is, [40:33] you know, [40:34] I think there's a lot of... [40:36] We don't know what that set of capabilities is going to build on top of into the future, right? There's lots of unknowns, I think, of how AGI evolves beyond [40:45] the actual inception moment of a system that can efficiently acquire skill. [40:49] But I think it's going to be [40:51] a much more gradual and incremental rollout, where there's a lot of contact with reality as we build and engineer these systems, [40:58] which is going to give us as a society a lot of time to update based on what those capabilities are, what it can do, what it can't do, [41:05] and make decisions at that point about [41:08] how we want to deploy this technology. Where might we as a society say, "We don't want to deploy it for this set of use cases"? [41:15] Yeah, I think that's one of the [41:16] reasons why, um, [41:18] I've been such a proponent, I think, of open source AGI progress with ARC Prize. [41:24] It's like, [41:24] right now, I think [41:26] what I see happening is there's this mythical story of a very bad outcome once we get to, like, superintelligence, right? [41:32] It's a very theory-driven story, not grounded in empirical evidence. It's basically based on reasoning our way to this outcome. [41:39] And I think the only way that we can really effectively and truly set good policy is, you have to look at what the systems can and can't do and then regulate or decide and make decisions at that point about what it can or can't do. I think anything else is sort of like

41:50-43:22

[41:50] you're cutting off potential really, really good futures way too early. And that's sort of what's happening, I think, with a lot of this early AI regulation, where, [41:57] um, [41:58] I'll try to [42:00] paint the good side of this picture. It's like, hey, you know, maybe the risk of this bad outcome is so high in the future that we should pause here. [42:06] I think the risk of that is you've trimmed off every possible good path of the future way too early. [42:13] And the reason it's way too early is because we still need new ideas. [42:16] We need new ideas from researchers. We need new ideas from students. We need new ideas from young people. We need new ideas from labs. [42:23] Um, otherwise there's a chance that we never actually reach the [42:28] high degree of useful AGI that we actually want. [42:30] Um, [42:32] and so that's kind of my nuanced take, I think, on probably what the advent of AGI looks like. I think it's much less likely to be a moment in time and much more likely to be a [42:41] stair step of technology that we build [42:44] on top of past technology, and [42:46] that creates a lot of moments to update beliefs based on [42:49] what it can and can't do. [42:51] Do you have any predictions on when we'll cross 85% on ARC Prize? [42:56] You know, before the competition started, there was a... [43:01] The first data scientist we hired at Zapier gave me this idea a long time ago, and it stuck with me. He said, [43:06] um, the longer it goes, the longer it goes. [43:09] And so it's this idea that the longer something takes, the more you should update your prior that it's going to take longer. So coming into this year, my expectation was, hey, at least three or four years, probably, before we get to the grand prize mark, based on past years,

43:23-44:57

[43:23] based on the past track record. [43:24] Having seen what we've seen over the last two or three weeks, though, [43:27] I think it is quite likely we get to 50% during this competition period. [43:33] I'd be very surprised, surprised in a good way, if we actually get to the 85% grand prize in this competition period. [43:40] But I think it is, [43:41] uh, [43:42] not [43:43] unlikely that we [43:44] crest the 50% mark before, um, [43:47] the end or middle of November, which is when the contest period ends for 2024. [43:51] And is there a good why now? Because people have been trying at this for five years now. And, you know, you're galvanizing interest around it. [43:57] And now, you know, a lot more researchers around the world are interested in AI and solving hard problems. [44:02] But is there a good why now in terms of enabling techniques, technologies, etc., that's different now [44:07] than five years ago when François first defined the benchmark? [44:11] If it is true that, [44:13] um, [44:15] deep learning is an important part of the solution, right? A deep-learning-guided program synthesis engine, or a DSL that is generated on the fly [44:24] through a deep learning technique. [44:26] If that's true, the world has a lot of experience now in building and engineering and scaling such systems over the last three or four years. And there's a lot more compute online, [44:35] which brings the cost down into a territory where some of these things may have just been out of practical cost [44:40] before. [44:42] For example, actually, Ryan Greenblatt's solution right now is [44:46] maxing out the cost limits we're going to have on the public leaderboard. It costs $10,000 to generate [44:51] the 8,000 reasoning sample traces from GPT-4o that he then deterministically checks.

44:57-46:30

[44:57] And so that would have been a technique that would [45:00] not have been possible, you know, three or four years ago in any way in this regard. So I think, [45:04] if it is true that there's a minimum amount of scale that's necessary to beat ARC, I think, hey, we've gotten more of that in the last three or four years than we had when the first competition ran. [45:12] And then I think the other why now is just [45:14] largely due to awareness. If it truly is, [45:18] um... [45:20] Actually, let me answer the opposite. [45:22] Like, I think the risk that it isn't... The reason we launched ARC Prize [45:25] is that it is actually not right now. [45:28] Right? [45:29] It's like, actually, it's not going to happen, is the problem. [45:32] It's not a why now. It's not like, "Oh, the ideas are in the world, we just need to get people to work on it." The risk is that it's actually not why now. [45:39] And "not why now" is, I think, a much more interesting story, right? Around this LLM-driven, [45:43] you know, focused attention on LLM solutions only, the closed research due to the competitive dynamics, [45:50] all of these things have shifted and shaped [45:52] attention away from new ideas [45:54] and towards scale, towards LLMs, towards, you know, um, [45:58] yeah, application-layer AI. [46:01] And I think that we... [46:03] I think we need some shaping, reshaping back towards the new idea set. So hopefully [46:07] the why now is because ARC now has lots of attention. [46:11] To be seen. [46:12] Do you think LLMs will be part of the solution? [46:16] I'm curious, because it seems like in the big research labs right now, [46:19] all of the frontier research is around "let's merge LLMs, kind of, [46:23] with the insights that you get from inference-time compute and the Q* and [46:27] AlphaGo stuff." I'm curious what you think of that kind of direction of research.

46:31-48:01

[46:31] I... [46:32] There's some pretty interesting research I've come across with transformers, [46:35] that [46:37] transformers are capable, as an architecture, of [46:40] representing [46:42] very deep deductive reasoning chains with 100% accuracy. [46:45] I think this is interesting. And the challenge with them is actually that we just don't have the learning algorithm. Backpropagation is an ineffective learning algorithm [46:51] for teaching a transformer architecture a set of network weights that can do deductive reasoning with 100% accuracy. [46:57] And so I think it's possible that the systems, or at least the core concepts that underlie language models, [47:03] have sufficient [47:04] capability [47:06] to do this type of reasoning, [47:08] and we have not yet discovered the [47:11] algorithm that can train the model in the right way. We haven't quite discovered the right outer loop around the transformer [47:16] that is going to do the program synthesis engine or the DSL generator. [47:20] Um, [47:21] I feel more confident in saying that [47:24] deep learning is almost certainly going to be a part of the grand prize in some way. I bet it won't... [47:29] I'm pretty confident it won't be just, you know, a pure deterministic program that solves it. [47:34] Um...
[47:35] I think transformers are [47:37] effectively the technology that [47:40] has the highest degree of awareness and research literature. There's a lot of hardware now that's going towards accelerating transformer-based models. I think actually, just earlier, there was an ASIC that got announced recently that's trying to accelerate the transformer architecture. So to the degree that some degree of scale or compute is necessary to beat ARC, I think those are things that I would say I'm bullish on for the transformer architecture, though [47:59] I would point out that [48:01] the

48:01-49:34

[48:01] search space of alternative architectures is quite rich. [48:04] We've had maybe nine or ten now, [48:07] mainline architectures from transformers to LSTMs, CNNs, RNNs, xLSTMs, state-space models. [48:13] Um, [48:14] this would suggest that the search space of those architectures is quite rich, actually, and they all have slightly different, varying properties. [48:20] Um, [48:21] so, uh, [48:23] I think it's certainly possible that someone comes up with some innovation there. Um, [48:27] I'm less confident or bullish that LLMs in their exact form are [48:32] going to be part of the 85% solution, though. I would think it's probably a subcomponent of the architecture instead of the [48:37] entire application system itself. [48:41] When somebody does [48:42] ultimately hit the 85% [48:45] threshold, [48:47] what do you hope they do with the solution? Like, what would you like to see out of that person, [48:52] other than submit it to the leaderboard? [48:54] Ha, ha, ha, ha. [48:56] So this is one thing we didn't talk about a ton, but one of ARC Prize's goals is to accelerate open progress towards AGI. So we're going to be requiring that, in order to claim the prize money, you do have to [49:08] publicly share reproducible code and put it into [49:11] the public domain. And this goes for both the public leaderboard and the official competition leaderboard as well. [49:16] Um, [49:17] you know, this is [49:18] in the spirit of trying to re-accelerate [49:21] open progress again, so that we have research in small ways out in public that other researchers can build on top of, [49:28] and hopefully stair-step toward [49:30] actually building AGI and not getting stuck in the plateau that we're in today.

49:35-51:09

[49:35] Um, so I think that's probably my first pick. I've actually seen a handful of people online that have said, hey, if you've got a solution to ARC-AGI, [49:40] I'll give you this, you know, million-dollar offer. My company. Yeah, exactly. Which, you know, on one hand, I'm like, OK, [49:48] that's kind of interesting. [49:50] But on the other hand, I'm kind of like, [49:52] you know, I think that's great awareness, and it shows the importance of solving this. I think more people are starting to [50:01] become aware of the lack of progress, the lack of frontier progress. I think ARC is kind of becoming a lightning rod for, um, [50:07] folks that want an actual measure of this, I think, growing sentiment in the field today. [50:13] Should we close out with some rapid-fire questions? Yeah, let's do it. [50:16] Okay, who do you admire most in AI? [50:19] I mean, François Chollet is a bit of a cop-out answer. I wouldn't have co-founded ARC Prize if I didn't admire and respect his work over the last four years. [50:26] I mean, I think the two [50:28] people that I have learned the most from directly, and who have inspired my own belief and work, um, are Rich Sutton and Chollet. Both of them published [50:35] papers in 2019, right? Rich Sutton published "The Bitter Lesson," which I think is fairly well known in the industry at this point. [50:41] Um, [50:41] I think his idea set is quite right there, with maybe the one asterisk that, [50:45] um, [50:46] uh, [50:47] you know, the, uh, [50:48] the one aspect that has not yet had scale applied to it is architecture search itself. [50:53] We certainly have unbiased search and learning applied on the inference side and the training side, but [50:57] every architecture still has a very human, handcrafted story and journey to it, which is an important insight, I think, [51:03] about that. And then [51:04] "On the Measure of Intelligence" from Chollet in 2019. And I think both of these papers are,

51:09-52:41

[51:09] or in like, or I guess maybe, um, Sutton's was more of a blog post, [51:13] But both of these pieces of writing [51:15] I think are very important because history has proven them right as time has gone on. [51:19] Language models, transformer scale has sort of shown Sutton's ideas to be even more true than they were in 2019. And I think the endurance of ARC has shown François's definition of AI to be [51:29] um, [51:30] more and more true as time has gone on. [51:32] What is your most contrarian point of view in AI? [51:38] I feel like everything we were talking about today, most people don't agree with me on, so... [51:41] All right, we'll count it. [51:44] Um... [51:46] Scale is not all you need. [51:48] New ideas are needed. [51:51] What's your favorite AI app other than Zapier? [51:56] Let me look and see. I have a handful. I think I'm not going to surprise you with anything. So I've got [52:01] ChatGPT, Perplexity, and Claude. [52:04] And I'm a paying user of all three of those services. You know, [52:07] One interesting thing, actually, I'll add. I have gotten... [52:11] way more value out of [52:13] language model based tooling over the last, like, call it six months [52:17] than [52:18] Um [52:19] I ever did in the first aspect when I was starting to work on it at Zapier. [52:22] And [52:23] It's because one of the things they're really, really good at, the thing they're perhaps like best at, is summarizing... [52:29] tons of unstructured text [52:31] and helping be like an educational tutorial for you. So it's significantly like ramped up my learning rate [52:37] on actually building with AI, like learning these fundamentally different architectures,

52:42-54:13

[52:42] you know, starting to do like model training. That's something Zapier hasn't done, but I started to do myself over the last six months to get [52:48] a sense of that type of work. [52:51] and [52:52] Yeah, language models and AI tooling have definitely accelerated my learning process. [52:55] Awesome. [52:56] All right, last question. [52:58] Let's do something optimistic, something that we can all dream about. [53:01] What change in the world are you most excited to see over the next five or ten years as a result of AI? [53:09] I think the... [53:12] I've always wanted to live in the future. I think that's maybe something that has always driven me towards working on frontier tech. I've always... [53:19] you know, bought the latest gadget, always [53:20] tried the latest app. I think it's [53:22] led me to work on Zapier and AI and [53:27] It's one of the reasons I'm working on AGI right now, because I think it's like the biggest thing that you can, I can potentially have an influence on trying to pull forward, pull forward that future. [53:35] I personally get [53:39] I think one of the things that [53:41] feels very limited from AI right now is that [53:44] with the narrow form of AI that we have, if we never get to AGI, [53:48] what that will mean is that we will always be [53:51] rate-limited [53:53] on developing things by the human that's in the loop. [53:57] And that means... [53:59] We'll never have [54:00] AI systems that can invent and discover... [54:04] and [54:06] sort of innovate alongside humans [54:08] and really help pull forward and push forward [54:11] the frontier in...

54:13-55:11

[54:13] I think a lot of really interesting ways, like, you know, understand more about the universe, invent, discover new pharmaceutical... [54:18] Thanks. [54:20] you know, and discover new physics, [54:23] discover how to build [54:25] AI. [54:25] You know, I think [54:28] we're always going to be rate-limited by the human [54:30] today. And I think if you just sort of care about living in the future and you want to pull forward the good aspects of the future, [54:35] some form of AGI is necessary to do that. [54:38] Awesome. [54:39] Thank you, Mike. [54:40] Thank you both for having me.
