Join Ron and AI experts Michael Wharton and Reed Coke from KUNGFU.AI as they unpack the latest news in artificial intelligence. Together they discuss Mamba-2 and Kolmogorov-Arnold Networks (KANs), GPT-4 passing the bar exam in 2023, the dissolution of the superalignment team at OpenAI, and whether or not expert jobs are at risk of being replaced.
Ron Green: Welcome to Hidden Layers, where we explore the people and the technology behind artificial intelligence. I'm your host, Ron Green. We're back today with another episode of Decoded, where we take a look at recent developments in the AI industry. As usual, we've got a lot to discuss today. To help me pick through this, I'm joined by two of my fantastic Kung Fu AI colleagues, our VP of Engineering, Michael Wharton, and Director of Engineering, Reed Coke. Welcome, guys.
Reed Coke: Glad to be here.
Ron Green: All right, so let's just jump in. I'm going to kick things off. I want to talk first about the big breakup of the OpenAI superalignment team. It was really interesting. I had my interview with Scott Aaronson literally two weeks ago. We're sitting there, talking about his work and his collaboration with Ilya Sutskever, co-founder, former board member, and chief scientist at OpenAI. We spent the whole conversation talking about alignment, and like two hours later, we hear the news that Ilya and Jan Leike, who co-led the superalignment team, had both resigned.
So this is not only interesting but really important for a bunch of reasons. One is, Jan Leike's group pioneered the use of reinforcement learning from human feedback on InstructGPT, and candidly, I would have doubted it would work as well as it does. It's phenomenal, and you can point to that technique as a large part of why GPT-4 is especially useful to humans. It's not just focused on being a good token generator or sentence completer; it's really good at being useful to humans, largely because of the work done on reinforcement learning from human feedback. Ilya has not said much since he departed, but Jan has.
Jan said he'd been having disagreements with OpenAI leadership for some time about the company's priorities, and they finally reached a breaking point a couple of weeks ago. He believes much more of the company's bandwidth should be spent getting ready for the next generation of models: on security, monitoring, preparedness, safety, adversarial robustness, superalignment, confidentiality, societal impact, and related topics. These problems are hard to get right, and he's concerned we aren't on a trajectory to get there. He also thinks AGI (artificial general intelligence) is much closer than a lot of people think.
When I was talking to Scott, we discussed OpenAI's public promise last year to set aside 20% of their compute for aligning superintelligent systems. Jan mentioned that over the past few months, his team had been struggling for compute, making it harder to get crucial research done. He didn't specify if they were getting the 20%, but he implied that it might not be enough.
The reason this is so important is that OpenAI is doing some of the most important work in AI in the entire world, and they're on the bleeding edge of the most powerful models. They were leading the effort, at least optically, on superintelligence alignment. Jan also said, "To all OpenAI employees, I want to say learn to feel the AGI." The sentiment is that AGI is closer than people realize, and you need to feel that to take it seriously enough. Jan is now at Anthropic, which is probably a better fit for him, as they're taking alignment much more seriously. This isn't just some minor issue that will affect a product timeline; we're talking about how to keep AI systems that are as intelligent as or more intelligent than humans aligned with our ethics and values.
Michael, what's happened recently in AI that caught your attention?
Michael Wharton: A lot, honestly. But if I had to narrow it down, I've been following the idea of Kolmogorov-Arnold networks. These are interesting because they're one of the few real alternatives proposed to the multilayer perceptron, the densely connected network we're all familiar with. Instead of stacking fully connected linear layers with fixed activations the way MLPs do, these networks are based on the Kolmogorov-Arnold representation theorem, which says that any multivariate continuous function can be represented as a composition of sums of univariate functions. So, instead of taking a linear combination of the inputs, they route each input through its own spline, and those splines are the trainable parameters.
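For readers who want the formal statement behind that description, the Kolmogorov-Arnold representation theorem (paraphrased here, not quoted from the KAN paper) says that any continuous function of n variables on a bounded domain can be built from sums and univariate functions alone:

```latex
\[
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \varphi_{q,p}(x_p) \right),
\]
```

where the Φ_q and φ_{q,p} are continuous functions of a single variable. In a KAN, those univariate functions become learnable splines, stacked into layers the way an MLP stacks linear maps and activations.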
This approach has some interesting properties. The authors claim these networks are parameter-efficient, potentially a couple of orders of magnitude more efficient than MLPs in some cases, which could be significant for mobile applications or constrained computing environments. They are also much more interpretable, offering a closed-form symbolic expression of outputs. Another property is grid extension, where small changes in a local neighborhood don't dramatically impact the rest of the network, unlike polynomials or MLPs. You can add granularity and continue training without the shared representation issue of MLPs.
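As a concrete, toy illustration of "the splines are the trainable parameters," here is a minimal NumPy sketch of a single KAN-style layer. It is not the paper's reference implementation: the uniform knot grid, the random initialization, and the omission of the base/residual activation the authors use are all simplifications for readability.

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions of the given degree at points x
    via the Cox-de Boor recursion. Returns an array of shape (len(x), n_bases)."""
    x = np.asarray(x)[:, None]          # (N, 1)
    t = np.asarray(knots)[None, :]      # (1, G)
    # Degree-0 bases are indicator functions on each knot interval.
    B = ((x >= t[:, :-1]) & (x < t[:, 1:])).astype(float)
    for d in range(1, degree + 1):
        left = (x - t[:, :-(d + 1)]) / (t[:, d:-1] - t[:, :-(d + 1)])
        right = (t[:, d + 1:] - x) / (t[:, d + 1:] - t[:, 1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B                             # (N, G - 1 - degree)

class KANLayer:
    """Minimal Kolmogorov-Arnold layer: each edge (input i -> output j) applies
    its own learnable spline to input i, and output j sums its incoming edges."""
    def __init__(self, in_dim, out_dim, grid_min=-2.0, grid_max=2.0,
                 n_intervals=8, degree=3, seed=0):
        h = (grid_max - grid_min) / n_intervals
        # Extend the knot grid by `degree` intervals on each side so the bases
        # fully cover [grid_min, grid_max].
        self.knots = grid_min - degree * h + h * np.arange(n_intervals + 2 * degree + 1)
        self.degree = degree
        n_bases = len(self.knots) - 1 - degree
        # One spline-coefficient vector per edge: shape (in_dim, out_dim, n_bases).
        rng = np.random.default_rng(seed)
        self.coef = 0.1 * rng.standard_normal((in_dim, out_dim, n_bases))

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        outs = []
        for i in range(x.shape[1]):
            B = bspline_basis(x[:, i], self.knots, self.degree)  # (batch, n_bases)
            outs.append(B @ self.coef[i])                        # (batch, out_dim)
        return np.sum(outs, axis=0)

layer = KANLayer(in_dim=3, out_dim=2)
x = np.random.default_rng(1).uniform(-1.0, 1.0, size=(4, 3))
print(layer(x).shape)  # (4, 2)
```

In this picture, grid extension just means refining `self.knots` and refitting the spline coefficients locally, which is the property Michael mentions about adding granularity without disturbing the rest of the network.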
What are your thoughts?
Reed Coke: I was doing some beginner-level reading on this and found that the closed-form equations for input to output are appealing for mobile cases where you need fast response times, like 10 milliseconds. These networks take longer to train but recoup that on inference, which could be an order of magnitude or more faster. They have different strengths and might be useful in various areas.
Ron Green: I dug into why they're using splines. Splines are smooth and continuous in first and second derivatives, have better local control, and are numerically stable. There are solid reasons behind their choice. It's still in its infancy, and we'll see how it plays out.
Reed Coke: So, unsurprisingly, I'm all about evaluation. Coming from NLP, it's hard to tell if words are good or not. About a year ago, there was a paper where GPT-4 took the Uniform Bar Exam and was reported to score in the 90th percentile against test takers from that February. This seemed like a better way to evaluate language systems than perplexity. However, an MIT paper later called the evaluation methodology of that bar study into question. The original comparison used the February administration, which is dominated by people who failed in July and are retaking the exam, and they generally do worse. The MIT paper compared GPT-4 to July test takers, and the numbers dropped to the 69th percentile overall and the 48th in writing. Compared to first-time test takers from July, it was in the 62nd percentile overall and the 42nd in writing. Compared to those who passed, it was in the 48th percentile overall and the 15th in writing.
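For readers who haven't run into it, perplexity (the standard definition, nothing specific to this study) is the exponentiated average negative log-likelihood a model assigns to held-out text:

```latex
\[
\mathrm{PPL}(x_{1:N}) \;=\; \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right) \right).
\]
```

Lower is better, but as the conversation goes on to note, a model can keep improving this number without that telling you whether its output is actually useful or correct.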
There are a lot of headlines about GPT's incredible achievements, but I'm curious about the best comparison point or the most valid one. What do you think?
Ron Green: It's difficult to evaluate these systems. Perplexity used to be a good measure, but it's much harder to judge models at this level now. Regarding the bar, how are they evaluating it? Is a human looking at the results?
Reed Coke: I'm not sure, but the MIT paper mentioned methodological concerns with the essay score calculation, suggesting it might not be the exact standard.
Ron Green: This segues perfectly into my next topic around datasets. The FineWeb-Edu dataset was released recently. It's a subset of Common Crawl: cleaned, deduplicated English web data for large language model training. Filtered using Llama 3, it focuses on high-quality educational content. There's increasing evidence that higher quality data is disproportionately valuable. Andrej Karpathy tweeted that his open-source project, llm.c, is performing better than its GPT-2 counterparts despite training on fewer tokens. The idea that textbooks are all you need for training is gaining empirical evidence.
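The sketch below is hypothetical, not the actual FineWeb-Edu pipeline (which reportedly used a small classifier trained on Llama 3 quality annotations), but it shows the general shape of the idea Ron is describing: score each page for educational quality and keep only the pages above a threshold. `score_educational_quality` is a placeholder for whatever scorer you have.

```python
from typing import Iterable, Iterator

EDU_THRESHOLD = 3  # hypothetical cutoff on a 0-5 educational-quality scale

def score_educational_quality(text: str) -> float:
    """Hypothetical scorer: in practice this would be an LLM prompt or a small
    classifier distilled from LLM annotations. Here it's a trivial keyword heuristic."""
    keywords = ("theorem", "tutorial", "lesson", "explain", "course")
    return min(5.0, sum(text.lower().count(k) for k in keywords))

def filter_corpus(pages: Iterable[str]) -> Iterator[str]:
    """Yield only the pages whose educational-quality score clears the threshold."""
    for page in pages:
        if score_educational_quality(page) >= EDU_THRESHOLD:
            yield page

sample = [
    "Buy cheap widgets now!!!",
    "This lesson explains the spectral theorem with a worked tutorial example.",
]
print(list(filter_corpus(sample)))  # keeps only the second page
```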
To tie it back to your question, Reed, higher quality datasets filtered by LLMs could mean custom-trained LLMs evaluating the output of other LLMs, bootstrapping and standing on each other's shoulders.
Reed Coke: It sounds effective but expensive. I have an alternate idea: think about what you're doing and pick something matched to that. Regarding the bar results, which comparison point is the most valid?
Ron Green: How about taking the blended average?
Michael Wharton: It depends on the downstream use case. For monotonous paralegal work, the initial performance metric might be fine. It's important to have intellectual honesty when reporting evaluation metrics. The research community focuses on intellectual honesty, while popular media sensationalizes.
Reed Coke: I view what OpenAI offers more as capabilities than as products or solutions. Solving alignment in a general case, without knowing how something will be used, is challenging. Regarding the bar, it depends on the use case: monotonous paralegal work or defending yourself in court. It feels disingenuous to just say 90th percentile and throw a thumbs up.
Michael Wharton: Sensationalism feels good. 90 is a great number. It's bigger than 60 and not 100, so it looks good.
Ron Green: I noticed that the Atlantic recently partnered with OpenAI despite an article from the editor-in-chief of The Information arguing against partnering with tech companies. OpenAI benefits from high-quality training data and real-time access to reputable resources. Considering we're in an election year, having rapid access to long-form content with nuanced opinions could be a good thing. What do you all think?
Reed Coke: I have mixed emotions. The idea that AI companies use data to train models and then restrict its use for downstream tasks seems unfair. We don't know how things will play out, but there's a sense of urgency. Apple is rumored to announce a partnership with OpenAI, which says a lot about OpenAI's unique position and head start.
Reed Coke: Some companies realize they didn't get what they hoped for from partnering with OpenAI, and others flubbed their releases trying to compete. It's hard to tell what's moving too fast and what's falling behind. I don't love the idea of news getting wrapped up in AI, but adding this will make models better. It could be good or scary.
Michael Wharton: Potentially very good or very scary describes all of AI right now. Reed, do you have another topic?
Reed Coke: There's a lot of buzz about AGI and displacing skilled workers. How at risk do you think expert roles are for being replaced?
Michael Wharton: People often see this as a false dichotomy—either no expert work is replaced, or all human labor disappears. If people rush to cut humans out of the loop, it can fall apart, and there's cultural backlash. A smooth continuum toward automation will see more success and adoption. Human augmentation is more productive and realistic than the sci-fi scenario where everyone's jobs are gone overnight.
Ron Green: At Kung Fu AI, we counsel clients to build AI systems to make jobs more efficient, faster, or easier, not to take people's jobs. Empowering people through human augmentation is more appealing. Down the line, having global access to the best medical or legal advice would be amazing, but specifics matter. For minor evaluations, AI expert advice is more amenable in the short term than for something lethal like cancer diagnosis.
Ron Green: AI systems threatening professional jobs is new. Historically, lower-paid jobs bore the brunt of disruption. This time, it's different, and we don't fully understand it yet.
Reed Coke: I've been thinking about the Industrial Revolution and factory automation. Adding a mechanical arm for a narrow, well-defined task is different from other kinds of work, which present more meaningful challenges. Also, the concept of P99 latency from web systems shows how quickly worst-case outcomes stack up once a task has many steps, which makes full automation more complex than it looks.
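A quick back-of-envelope version of that P99 point, assuming independent sub-steps (an assumption, but it shows why worst cases stack up):

```python
# If each sub-step has a 1% chance of hitting its worst-case (P99) behavior,
# a task made of many sub-steps hits at least one worst case far more often than 1%.
p_slow = 0.01
for n_steps in (1, 10, 100):
    p_at_least_one = 1 - (1 - p_slow) ** n_steps
    print(f"{n_steps:3d} sub-steps -> {p_at_least_one:.0%} chance of at least one P99-level miss")
```

So a system that is "right 99% of the time" per step can still deliver a bad outcome on most long, multi-step tasks, which is part of why full automation is harder than the per-step numbers suggest.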
Ron Green: If an AI system does a potentially dangerous job, it must be much better than humans. People will feel that a human probably would have gotten it right. We need much lower failure rates for autonomous systems or lethal disease diagnosis to become predominant.
Michael Wharton: Once AI systems reach a high-performance threshold, not using them becomes negligent. People would lose their lives more often without them, making their use a foregone conclusion.
Reed Coke: The reaction to adoption differs if the error profile is similar to a human's versus a system that does things a human never could but flubs other tasks.
Ron Green: I have one more topic. The Mamba-2 paper came out recently, called "Transformers are SSMs," by Tri Dao and Albert Gu. They show that the transformer's self-attention can be viewed as a form of structured state space model. The quadratic nature of transformers means that as the input context grows, computation increases non-linearly. This structured state space approach doesn't have that problem, which matters for very long contexts like genomics, where human DNA has about 3 billion nucleotides. The paper shows the deep connection between transformers and structured state space models.
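To make that connection slightly more concrete (a loose paraphrase of the structured state space duality idea, not the paper's notation): causal attention mixes tokens with an explicit pairwise weight for every earlier position, while an SSM carries a recurrent state,

```latex
\[
y_i = \sum_{j \le i} \alpha_{ij}\, v_j
\qquad \text{vs.} \qquad
h_t = A_t h_{t-1} + B_t x_t, \quad y_t = C_t^{\top} h_t .
\]
```

Unrolling the recurrence (with the state initialized to zero) gives

```latex
\[
y_t = \sum_{s \le t} C_t^{\top} \big( A_t A_{t-1} \cdots A_{s+1} \big) B_s\, x_s ,
\]
```

so the SSM is also, implicitly, a big token-mixing matrix, but a structured (semiseparable) one that the recurrence evaluates in time linear in the sequence length, rather than quadratically by materializing every pairwise weight. That shared form is the duality in the paper's title.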
Michael Wharton: This could be interesting for large assemblies with lots of sensor data, like cars. Historically, making inferences across long time periods is computationally expensive, but this approach could help.
Reed Coke: There was a moment when Word2Vec was big, and a paper showed it was mathematically equivalent to LDA if certain parameters were fixed. It didn't hold back Word2Vec, and it's interesting how new things turn out to be largely the same but still valuable.
Ron Green: Everything's been tried before. State-space models go back to the 50s. Everything old is new again. Thank you, Michael and Reed, for joining me today. We had a lot more to discuss, but we ran out of time. We'll do another one of these in about a month.
Reed Coke: Fifty new things at least. It'd be a pleasure.
Ron Green: Thanks.