In this episode of Hidden Layers, Ron Green and Michael Wharton dive into the latest advancements in artificial intelligence, with news from DeepMind, OpenAI, Waymo, Apple, and more. They discuss OpenAI's rumored Strawberry initiative, which promises enhanced reasoning capabilities for ChatGPT, the groundbreaking potential of AlphaProteo for computational biology, and new developments in autonomous driving with Waymo. The discussion also covers Apple's AI-powered advancements in their new iOS release, and the implications of California's new AI regulation bill. Whether you're a tech enthusiast or AI researcher, this episode offers insights into the rapidly evolving world of AI and its applications.
Resources:
Can LLMs Generate Novel Research Ideas? https://www.arxiv.org/abs/2409.04109
Waymo Safety https://waymo.com/safety/impact/
Claude Artifacts available to all users: https://www.anthropic.com/news/artifacts
Ron Green
Welcome to Hidden Layers, where we explore the people and the technology behind artificial intelligence. I'm your host, Ron Green. We're back with a Decoded episode today where we cover important recent developments in AI.
We're gonna cover major news and research coming out of DeepMind, Apple, Waymo, Stanford, and more. As usual, to help me unpack everything, I'm joined by my colleague and one of the sharpest minds in the business, Michael Wharton.
All right, Michael, you ready to get going? Let's do it. Okay, so I'll kick us off. I wanna talk about the rumored Strawberry initiative coming out of OpenAI. There's actually some breaking news just this morning that it's gonna be released within two weeks.
So Strawberry is this rumored new method being developed by OpenAI to basically enable advanced reasoning in ChatGPT. News about this broke almost a year ago. It was rumored initially under the codename Q*.
The early leaks have said that it's capable of better performance, and the focus is mostly on inference-side improvements. And I'll get into what that means here in a second. So Strawberry is the codename for this internal project that OpenAI is using to not only make existing models better but to train other downstream models.
And one of the rumors is that one of those downstream models is called Orion. It's supposed to be text-only for now. Some people are saying it's moderately better, that it's not as impressive as a lot of people were led to believe maybe nine months ago.
But these are all rumors, I think we have to wait and see. They're also saying it takes about 10 to 20 seconds when it's in this mode to generate output, so it's a little bit slower than expected. So one of the interesting things about this is the idea that instead of putting all our time and money into the training side of model development, let's put some effort into the inference side of development.
And so there's really a lot to unpack here that I wanna talk about. So first off, if you look at the scaling laws—and that is the predicted performance of these foundational models—and let's just stick with language models as an example here.
If you look at the scaling laws, as the parameter count and dataset sizes have grown, the performance has scaled pretty much as predicted across what's now 10 orders of magnitude. It's kind of amazing. But that means that the existing foundational models are incredibly expensive to train and still pretty expensive to run at inference.
And there's increasing focus on squeezing better performance out of the inference side—the predictive side of the model. So one of the hallmarks, one of the defining features of these Strawberry-like approaches, is that it looks like the model's talking to itself, right?
It's very analogous to chain-of-thought reasoning, which we've used on many projects. The idea is that, as part of the prompt, you ask the model not only to complete some objective, but also to explain how it's choosing to solve the problem and to justify its reasoning as it works through the solution.
Basically, chain of thought. There's a lot of evidence that this is a really strong component behind the Strawberry initiative. And this shift towards inference optimization is really interesting to me for a couple of reasons.
One is that if you think about it from a budget perspective, it kind of makes sense. We're putting hundreds of millions of dollars into these foundational models, and then once they're trained, we're putting almost nothing into the inference part.
But what if you could take—one of my favorite examples of this—what if you could take a million dollars and put that into an inference cycle to solve something like the Riemann hypothesis, right?
It would absolutely be worth it. And so I think this balance between training and inference costs is going to be more and more of a focus as these models become ever more important to our daily lives.
A couple more points. It's rumored that the Strawberry initiative can solve unseen math problems. That's new. Language models typically don't do well on those types of use cases.
Leaks also say it can solve the New York Times Connections puzzle, which is pretty fascinating right there by itself. So originally, rumors were saying this was going to be later this year, but as of this morning, they're saying within two weeks.
So we're going to know pretty quickly exactly what's going on here. But I think it makes a ton of sense to focus more on the inference side as these models become more and more prevalent in our day-to-day.
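To make the chain-of-thought idea Ron describes concrete, here is a minimal prompting sketch using the OpenAI Python client. The model name, prompt wording, and example question are illustrative assumptions; this is ordinary chain-of-thought prompting, not OpenAI's actual Strawberry method.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative question; any problem that benefits from step-by-step reasoning works.
question = "A train leaves at 3:40 pm and arrives at 6:15 pm. How long is the trip?"

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {
            "role": "system",
            "content": (
                "Work through the problem step by step. Explain how you are "
                "choosing to solve it and justify each step, then give the "
                "final answer on its own line prefixed with 'Answer:'."
            ),
        },
        {"role": "user", "content": question},
    ],
)

# The reply contains the model's intermediate reasoning followed by the answer.
print(response.choices[0].message.content)
```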
Michael Wharton
That sounds great. Yeah. I'm ready to beat this thing up. There's so much hype involved with all these releases and, you know, OpenAI is continuously raising funding. So they need to build this excitement and momentum, and I get it.
There's still plenty of residual value, but man, it's hard not to just be a little bit skeptical until you get that firsthand experience and get to play with it.
Ron Green
I totally agree. And you know, what's interesting too is for a while, OpenAI had such a lead, and now everybody's starting to catch up. I mean, even if you look at Llama 3.1 from Meta, which has been open-sourced.
It's nipping at OpenAI's heels, and Claude is right there too. So I think there's a lot of pressure on OpenAI to continue to innovate and maintain the lead they once had. All right, I want to throw it to you.
What do you got for us this week?
Michael Wharton
Believe it or not, there's actually a little bit of a tenuous thread between what you just mentioned with the inference budget topic and this next paper. I'll get there. But basically, I wanted to just go over this very new paper. I think both the topics I have are less than a week old, so this is all really fresh. But this one's a paper called Can LLMs Generate Novel Research Ideas? And the idea is that the holy grail is you'd have this research agent that you can either use as a companion or just hit a button and go and say, "Hey, go do some research and come back to me with the results."
But it was looked at in sort of a narrower focus where they took seven different topic areas. I’ve got them written down here. They are: bias, coding, safety, multilingual, factuality, math, and uncertainty—all different subcategories of natural language processing today.
What they would do is choose a topic area. In fact, they actually used Claude 3.5 Sonnet because it's just such a workhorse. They said, "Go with this topic area, generate 4,000 different novel research ideas, describe them in a really short way, and then give them to me. And then I'll do some stuff with it."
So that's the first step. Then they would take those ideas, filter them based on some deduplication—just a light filter, not a ton to do there. After that, they would use multi-step reasoning and multi-step LLM calls to expand on those ideas and turn them into actual conference submissions.
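As a rough sketch of what that generate-then-deduplicate step might look like, here is one way to drop near-duplicate idea descriptions using sentence embeddings. The embedding model, similarity threshold, and function name are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate_ideas(ideas: list[str], threshold: float = 0.85) -> list[str]:
    """Keep an idea only if it is not too similar to any idea already kept."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    embeddings = model.encode(ideas, normalize_embeddings=True)

    kept: list[int] = []
    for i in range(len(ideas)):
        # Vectors are unit-norm, so the dot product is the cosine similarity.
        if all(float(np.dot(embeddings[i], embeddings[j])) < threshold for j in kept):
            kept.append(i)
    return [ideas[i] for i in kept]

# Example: collapse near-duplicate short idea descriptions before expanding them.
ideas = [
    "Estimate uncertainty by checking agreement between multiple reasoning chains.",
    "Use agreement across sampled chains of thought as an uncertainty signal.",
    "Retrieval-augmented prompting for low-resource languages.",
]
print(deduplicate_ideas(ideas))
```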
Ron Green
Hmm, okay.
Michael Wharton
So you have a template with a conference submission where you have a topic, a research question, background, and a proposed experiment plan. All that stuff is really well templatized. They did this blind study where they used a couple of different approaches to create all these AI-generated research questions, and then they used experts in the field to do the same thing. They ran it through a style normalization layer, and then they had some other NLP experts look at these and rate them based on things like novelty, excitement, feasibility—all the stuff that you normally would with a journal, right?
It turns out that on those categories of novelty and excitement, these AI-generated research questions actually scored better with these blind human reviews. Which is super exciting at face value. I mean, with all these things, you have to look at it with a little bit of skepticism. But before I get too deep into that, I’ll just go a little more into detail on how they did this, because I think it’s important to infer what the next steps might be.
What they did was, once they had all these topics, they would re-rank them so they could filter, take the best ones, and present those. They also had a third condition that added a human re-ranking step. The way they would expound on these ideas was just a retrieval-augmented generation (RAG) approach. I've got an example research question they have here.
Ron Green
Okay, that’ll really help.
Michael Wharton
Yeah, I think it’ll just make it a little more solidified. One of these ideas that they generated was “semantic resonance uncertainty quantification.” Huge mouthful.
Ron Green
That is a mouthful.
Michael Wharton
But the idea is that, like you mentioned chain-of-thought reasoning a second ago, they would ask an LLM the same question multiple times, have it give you the chain-of-thought reasoning, and for the answers that were correct, check out how much those steps in reasoning resonate or agree with each other, and use that as a measure of uncertainty, which is a big challenge.
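Here is a minimal sketch of one way that "semantic resonance" idea could be implemented, under a plain reading of Michael's description: collect several chain-of-thought traces for the same question and use their average pairwise similarity as an agreement score. The embedding model and scoring details are assumptions, not anything from the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def resonance_score(reasoning_traces: list[str]) -> float:
    """Average pairwise cosine similarity between chain-of-thought traces.
    Higher resonance (more agreement between chains) is read as lower uncertainty."""
    vecs = embedder.encode(reasoning_traces, normalize_embeddings=True)
    sims = [
        float(np.dot(vecs[i], vecs[j]))
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
    ]
    return float(np.mean(sims))

# Suppose we sampled the same question several times and kept the reasoning
# from each answer (the traces would come from whatever LLM you are using):
traces = [
    "From 3:40 to 6:15 is 2 hours and 35 minutes, so the trip is 2:35.",
    "Subtracting 3:40 from 6:15 gives 2 hours 35 minutes.",
    "The difference between the times is about two and a half hours, 2:35.",
]
print(f"resonance: {resonance_score(traces):.2f}")
```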
Ron Green
Oh, interesting. Okay, let me see if I understand that. So you take multiple outputs for essentially the same prompt, compare the reasoning steps to each other, and look at how similar they are.
Michael Wharton
Yeah, that's exactly right. It seems like a cool idea. And that was one of many different ideas they generated and reviewed. But the reason I think you should couch those expectations a little bit is that the examples I looked at seemed to be that associative kind of creativity, where you have idea A and idea B, you sort of mash them together, and you end up with something pretty novel, but without a lot of hops of reasoning.
Whereas you take something like Einstein’s general theory of relativity, and that all came from some insight he had when he was standing in an elevator thinking about, “I wonder what the difference is between this acceleration and acceleration due to gravity.” Those two things, I think, are in completely different leagues. But like you mentioned, the Riemann hypothesis, that’s a big research question that would require a lot more than this type of analysis, I think. I’m not gonna drone on about all that.
Ron Green
No, no. Okay, so it’s really interesting. I agree. I think there are two types of insights. One is like combining idea A with idea B and just seeing what happens if we merge these together. And then there are just these intellectual leaps, like you mentioned with Einstein, where you have completely new perspectives that are in no way this sort of matching up of ideas.
But what I do think is interesting—and I say this all the time—we are still at such an early stage in artificial intelligence that there are, I think, really low-hanging fruit opportunities to take simple ideas and combine them.
And to me, that's actually a really positive thing. It means that we’re in the early stages of exploration and research, and it means that the acceleration we’re seeing is unlikely to slow down soon.
I don’t think we’ll have exponential growth forever—nothing can grow exponentially forever—but right now we’re on that part of the exponential curve where it is legitimate to take simple ideas and combine them and see what the results are, because very often you see amazing results at this stage in the learning curve.
Michael Wharton
Yep, 100%. And actually, along those lines, this is the last thing I’ll say here. Since they’re using RAG to provide the context for expounding on these ideas, I think it’s limited. But there’s so much additional low-hanging fruit—there’s all this research in these infinite-context transformers.
You know, we as humans can use the summation of all of our experience to generate insights. Things you did when you were a kid are sometimes relevant when you’re doing research. You don’t think of it, but it happens.
Right now, the context is just so narrow that the output is narrow. But when we start to combine these research fields, who knows what will happen?
Ron Green
Yeah, totally agree. It’s going to be awesome.
Ron Green
All right, next item I've got. This week, Apple held their fall event, as usual, and they announced the release date for iOS 18. The reason we're talking about it here is that it's going to include Apple Intelligence.
The timeline’s a little delayed; I’ll speak to that in just a second. No new capabilities, but I think some strong first steps forward. Things like much, much more sophisticated photo semantic search.
So, you can describe something like, "Find all the images of Julie wearing a red dress on the beach," or something like that. And it can understand all of that context. Email summarization, rephrasing, and I think Genmoji, so you can create your own emojis from text, things like that.
What I’m most interested in is the fact that we’re finally getting past stumbling over words. And we’re going to have smartphones that are actually smart for the first time ever. It’s been almost two decades of smartphones, and I think they’re about to actually show some smarts.
What I mean by this is when you are interacting with most AI assistants, and especially Siri, if you even stumble a little bit over your instructions, it just completely falls over. And you have to bail out and cancel.
That’s not the case with the new AI assistant agent coming in October with the new iOS. So, I’m really, really excited about that. And there’s one more thing that I think I should mention. Apple announced that all of this— all of the personal information— will be stored in what they’re calling Apple Private Cloud.
And they’re also going to make that verifiable by independent institutions. So, independent institutions—I don’t know how they’re going to do this exactly—but they claim institutions will be able to verify that all of that personal information within that private cloud infrastructure is indeed private.
Michael Wharton
Do you know how they’re gonna monetize it?
Ron Green
I think they’re going to monetize it the way Apple always does: they, for the most part, give away the software for you to buy the hardware. So, it’s part of the whole ecosystem. You want access to it? Buy a Mac, buy a phone, something like that.
They’re not going to charge for the software.
Michael Wharton
Yep. Okay. Cool. I’m excited. I’m ready to play around with it.
Ron Green
I’m ready to go as well.
Michael Wharton
Do you want a big one or a small one?
Ron Green
Do a small one, do a small one.
Michael Wharton
All right, so I’m sure anyone who’s a developer that’s listening to this has heard about Cursor now. It’s a fork of VS Code and it’s really supposed to be good at generating diffs for code contribution, which I think is great.
I haven’t personally had a chance to play around with it. There are a couple of folks at the company who, on our lab day last Friday, got together and played with it. But I thought this tweet from Andrej Karpathy really sort of sums it all up.
And I just want to read it verbatim here. He said, "Programming is changing so fast. I'm trying VS Code Cursor plus Sonnet 3.5 instead of GitHub Copilot again. I think it's now a net win." You know, and he goes on to describe his workflow a little bit here.
And we could probably share the tweet in the show notes, but he sort of summarizes it and says, “I still don’t think I got sufficiently used to all the features. It’s a bit like learning to code all over again, but I basically can’t imagine going back to unassisted coding at this point, which was the only possibility just about three years ago.”
We’ve found Copilot alone to be a huge boon to our development workflow.
Ron Green
Absolutely.
Michael Wharton
And an endorsement from someone like Andrej—even if there’s a big learning curve—well, there’s a big learning curve to learning how to code in the first place. That’s not scary. But the fact that he’s having some great luck with it, I think is just a good signal that all of us as developers should go back and really take a serious look at this and kick the tires on it. See how it looks.
Ron Green
I couldn’t agree more. I am always mystified when I talk to people who dismiss AI coding assistance. I’ve been developing professional software for almost three decades, and I immediately found it useful because even if all it did was provide a little bit more intelligent autocomplete, that is great.
Just save me the time typing. But it is really, really much more than that once you understand how to leverage it and once you learn how to interact with it in a way that optimizes its output. I think five years from now, it will be a very, very rare case that any major software is developed or maintained without some type of AI coding assistance.
If you’d have told me that, Michael, four years ago, I wouldn’t have believed it.
Michael Wharton
No, me neither. It’s moving way fast.
Ron Green
It's moving really, really fast. All right, so I've got a really exciting announcement that I want to cover. So I'm a big fan of the computational biology field, and there's a new announcement from DeepMind that they've got a new modeling system called AlphaProteo.
AlphaProteo is a major advancement in the field of protein structure prediction and design. And it really builds on the work that DeepMind did with AlphaFold 1, 2, and 3. Let me set the table for this a bit to make sure that everybody understands why this is so important.
Proteins are incredibly important. They are involved in pretty much every biological process within the body, from cell reproduction to the immune system. Proteins are created from sequences of amino acids.
We don’t have to go into the biology too much. All you need to know is there are 20 amino acids, and the way those sequences of amino acids get converted into the protein is incredibly complex. As these proteins are being built up, they actually interact with themselves.
They’ll fold over, and the angles and the chemistry and the physics of it are extremely complex. AlphaFold 2 was released in 2020, and it actually won the CASP 14 challenge. CASP is this every-other-year challenge to predict protein folding that started in 1994.
When I say AlphaFold 2 won it, I mean it destroyed the competition. It’s very much reminiscent of AlexNet back in 2012. If I remember right, I’ve got it in my notes here, yeah, it had the best prediction for 88 out of the 97 targets.
So just a huge, huge leap forward in this fundamental problem within computational biology. But AlphaFold can only predict protein folding. It doesn’t really understand how proteins interact with each other.
And that's where this new AlphaProteo comes in from DeepMind. It is focused on understanding the interplay of proteins with each other. It also understands the dynamics and the functional consequences of mutations, which is kind of staggering.
So what you can do is you can use AlphaProteo to design proteins that will bind to target molecules. Interestingly enough, we at KUNGFU.AI just participated in a BioML challenge a few weeks ago, hosted by the University of Texas here in Austin, where the goal was to design proteins with a predefined number of amino acids.
In this case, it was exactly 80, no more, no less, that would bind to a target molecule associated with cancer. And that would elicit a certain sort of chain reaction response, and you get T-cells that could come in and kill the cancer.
This would have been unbelievably valuable.
Michael Wharton
I was going to say, it was just barely too late, right?
Ron Green
Just barely too late. This would have been unbelievable had we had this. A couple of notes on the training data. One is that it was trained on the Protein Data Bank—that's expected. What's unexpected is that it was also trained on 100 million predicted structures from AlphaFold.
So this is another example—you hear me talk about this all the time—about synthetic training data coming from upstream models to train downstream models. This is part of that exponential positive reinforcement loop that I talk about all the time.
We're in the early, early stages of this. But the work they did on AlphaFold is actually enabling AlphaProteo to do its job at this amazingly high level. It's analogous to what we've talked about several times with Phi-3, the model from Microsoft, that was trained on billions of synthetic tokens generated from ChatGPT.
So this is really important. One last thing I'd like to say is, you know, a lot of people are very excited about AI, and their only example or interaction with it might be something like ChatGPT or Claude. But behind the scenes, we are seeing revolutions in many different fields—computational biology being one of them.
The reason AlphaProteo and AlphaFold are so important is that it currently takes on average about 10 years and over a billion dollars to get a new drug to market. That includes all the research, development, testing, FDA approval, everything. Once we can turn the drug development cycle into an information science, we will be able to develop new drugs orders of magnitude faster.
We'll see giant leaps in genomic-based therapy. You'll get sick, have your entire genome sequenced, and we'll have the DNA sequence of whatever bug you're dealing with. Then, using these types of techniques, we will be able to synthesize on the fly custom therapeutics designed just for you and your body to fight off that exact bacterium or virus that is causing your illness.
We’re in the early stages of this process where biology is becoming more like an information science. We’ll be able to do drug discovery and simulated drug testing experiments entirely within a computer, saving enormous amounts of time and money. And ultimately, this will lead to saving an enormous number of lives.
Michael Wharton
I hope they figure all that out before I get old and sick. That sounds awesome.
Ron Green
I do too. I’m really, really passionate about this stuff. There’s so much going on that’s fascinating. That’s wild. All right, I’ll throw it back to you.
Michael Wharton
All right, so I got a little—I'll start with a very brief anecdote. Recently, I was in East Austin hanging out with Jack, one of our client partners, and then someone named Spencer who used to work here. We were just hanging out having a drink, and then Spencer says, “Hey, I got these Waymo credits. You want to go try this thing out and see what happens?” So we were hanging out at this one bar, and we’re like, “Hey, let’s go barhopping. It'll be fun.”
This Waymo pulls up, and I think it was like 10-15 minutes until it got there. We went over to it, tried to open the door, but the thing was locked. We were like, “What the heck's going on?” We stayed there for about five minutes, and eventually the thing just kind of sped off without us. No explanation, just left. So then we called another one, and I think it was about half an hour this time before it arrived.
We got in, and it feels like you’re in a fishbowl. There are cameras and stuff all over the place, and it’s a research and development vehicle, so lots of extra sensors. It took us on this weird route. We weren’t going far—just two or three miles. But it took these really weird roads and got us within a block of the place we were going to.
The problem was, it was this really narrow street with street parking on either side. It was supposed to be two-way, but there was barely enough room for two cars. So I think the Waymo thought it was a single-lane road. It ended up going into oncoming traffic and getting stuck. It turned on its hazards and just kind of gave up. The app told us that it needed assistance, but we just got out because we were close enough to walk the rest of the way.
As we were walking, there were these poor souls—like six cars by the time we left—just stuck behind this thing. The only reason I bring that up is that it’s still got some kinks to work out, right? It’s not perfect technology. But in the last couple of days, Waymo released some really interesting safety data about their fleet in aggregate.
They operate in Los Angeles, Austin, Phoenix, and San Francisco. In terms of miles driven, Phoenix and San Francisco are orders of magnitude more—millions to tens of millions of miles driven versus just thousands in Austin and LA. But the big takeaway is that they have an 85% reduction in incidents per million miles compared to human-driven miles.
Ron Green
Mmm.
Michael Wharton
So, on the outside, looking at this aggregate statistic, it seems like these things are safer than human drivers, and I think that’s incredible. I think it was maybe two or three Decoded episodes ago where we were talking about how at some point these automated technologies are going to be safe enough that it’s negligence not to use them. Because in aggregate, you're saving lives, reducing injury, and all that stuff, which I think is awesome.
But, with everything, you need to take this with a little bit of skepticism. Not all miles are created equally. I mentioned that story where we were just taking weird back roads with low speed limits and nothing crazy going on. In this study, there are actually a couple of papers we can link in the show notes, but they didn’t control for things like road type. So, when you're on the highway, a lot of human drivers are on the highway at faster speeds, but there’s an increased level of risk when you're operating at those speeds.
So, it’s not necessarily a one-to-one comparison. These rider-only miles, as Waymo calls them, are not exactly comparable. But I’m just excited because the fact that it’s even in the same ballpark or the same order of magnitude of injuries you get with human-driven vehicles is a step in progress, and we’re on our way to the future.
Ron Green
That's really funny—what a crazy mixed story! The first one comes, won't let you in. The second one abandons you short of the destination and causes a traffic jam, but no incidents, and they're safer in aggregate. Baby steps—that's funny, man. That's really good.
Michael Wharton
I thought it was cool.
Ron Green
That is cool. I still haven’t tried one. I’m going to do that very, very soon.
Okay, my last little piece of news—and this is not insignificant—is the new bill from California, SB 1047, that is about AI regulation. This has been really, really controversial in different sectors—within research, the commercial sector, and across various parts of the tech field.
The idea is that as these AI systems become more powerful, there’s worry that they could become dangerous, and they want to start regulating that within California.
This is important for the rest of the world—not just because it’s a California law, but because so much of AI research and development is done in California, and most of the major AI companies are based there. This will have a disproportionate impact.
There are many AI luminaries lined up in support of it, including Geoffrey Hinton, Yoshua Bengio, and Stuart Russell. But there are others who are strongly against it. Most of the tech industry—including OpenAI, Facebook, and others—is not on board.
The law has undergone quite a bit of change. There was a lot of early controversy, specifically around some of the thresholds being too low and the idea of creating a new regulatory agency around this.
Most of that has been removed or watered down. The current provisions are around transparency, accountability, privacy, safety, and public reporting. It’s kind of hard not to support those general thrusts, but opponents are mostly worried about the anti-competitive aspects—especially that big incumbents might gain a competitive advantage and it could stifle innovation.
It also didn’t help that there was some misinformation about the bill initially, saying there would be criminal liability associated with it.
The one thing I find really interesting is that the way the bill is structured right now, it may not pass. It may actually get vetoed by the governor because there’s apparently a lot of political backlash against it. It’s just something to keep an eye on. I think we’re going to see a lot of AI regulation coming in the future as this technology becomes more and more important.
Michael Wharton
Yeah, we’ll see what happens. I’m waiting for OpenAI to move its headquarters to Texas like everyone else.
Ron Green
I know. It feels like everybody’s moving to Austin.
Michael Wharton
Yeah, sure does.
Ron Green
Cool.
Michael Wharton
All right, I got one other quick note here. I’ll just sort of group these together. They’re on the same theme of the big players. So, OpenAI, they hit a million paid subscriptions recently, and they have about 200 million weekly users, apparently. That’s incredible. I think it’s amazing that they got to that level with such a new product. I think it’s less than two years old.
Ron Green
Yeah, it’ll be two years in November.
Michael Wharton
Just as a point of comparison, I looked up these numbers for other really big streaming or subscription-based services. Netflix has 278 million global subscribers. Spotify has about 246 million, and Amazon Prime has around 200 million. So, in terms of how quickly they got to those numbers, OpenAI is just in a different league altogether.
I’m still trying to figure out exactly what the long-term business prospects are for a company like OpenAI. I mean, there are a lot of smart people investing a lot of money there. But I don’t know how much of this is hype and how much of it is sustainable, especially with their shift toward more inference compute, right?
Ron Green
That’s exactly right.
Michael Wharton
I’m cautiously optimistic, but they’re going to have to figure out the business model soon.
Ron Green
Yeah, no kidding. And then, you know, the other big player, Claude—you know, Anthropic—their Artifacts feature is now available to all users, which is awesome.
Michael Wharton
It really is. I know you’ve played with it. I played with it by your recommendation, and it’s fun.
Ron Green
Yeah, it’s fantastic. I love that. I think the artifact capabilities will probably be incorporated into all of the foundational models here pretty soon because it’s just so useful.
Michael Wharton
Yeah, yeah.
Ron Green
Okay, that was a lot, Michael. I think we raced through that as quickly as we could. I really appreciate your time, man. We’ll do this again next month.
Michael Wharton
Sounds good. Thanks for having me.