
Hidden Layers: Decoded (Microsoft’s Florence-2, GraphRAG, Claude 3.5, NVIDIA’s NIM, and more) | EP.25

In this episode, Ron Green and VP of Engineering Michael Wharton discuss the latest developments in artificial intelligence. They explore Microsoft's Florence-2 model and GraphRAG, Anthropic's Claude 3.5 Sonnet, NVIDIA's AI Enterprise offerings, and much more. The conversation also touches on Runway's Gen-3 text-to-video model, Salesforce's Moirai time series forecasting foundation model, and recent changes to OpenAI's board. The episode wraps up with a discussion of the hype and reality of AI advancements and the future of AI across industries.

Ron Green: Welcome to Hidden Layers, where we explore the people and the technology behind artificial intelligence. I'm your host, Ron Green. We're back today with the Decoded episode, where we take a look at important recent developments within AI.

We're gonna talk about news coming out of Microsoft, Nvidia, Anthropic, Runway, and more. To help me break everything down, I'm joined by one of the sharpest minds in the industry, our VP of Engineering, Michael Wharton. Michael, you ready to go?

Michael Wharton: So ready.

Ron Green: All right, I want to start off and talk a little bit about this really exciting new model from Microsoft. It's called Florence-2, and there are several things that make it interesting. One is that it's surprisingly tiny. Most models people are familiar with these days, like ChatGPT, are rumored to have trillions of parameters. Both variants of this model are sub-1 billion, so just hundreds of millions of parameters, which today qualifies as small. But what's really interesting is its broad set of capabilities. It can do image captioning, segmentation, object detection, and visual grounding, all through a prompt interface. It's a text-in, text-out large vision-language model, meaning it supports both images and text, and it performs at a really high level.

The models come in two sizes: a base model with 230 million parameters and a large one with 770 million. It was trained on a large dataset of about 5.4 billion annotations, and it has incredible capabilities. It supports zero-shot use across these mixed vision-language tasks, and although I haven't fine-tuned it myself yet, people who've played with it say it fine-tunes exceptionally well on other domains. Put this all together and you've got a high-performance, multi-faceted, large vision-language model that could run on your phone.
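To make the prompt interface concrete, here is a minimal sketch of driving Florence-2 through its task tokens, following the general pattern on the Hugging Face model card. The model id, task tokens, and post-processing helper are assumptions that may differ by release; verify against the card before relying on them.

```python
# Minimal sketch of Florence-2's task-token prompt interface (hedged:
# based on the Hugging Face model card pattern, not a verified recipe).
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # ~230M-parameter variant (assumed id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg")
task = "<OD>"  # object detection task token; "<CAPTION>" etc. also exist
inputs = processor(text=task, images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )

raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Convert the generated token string into boxes/labels for the chosen task.
print(processor.post_process_generation(raw, task=task, image_size=image.size))
```

Swapping the task token is all it takes to move between captioning, detection, segmentation, and grounding, which is the "unified, prompt-based representation" discussed below.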

I want to talk a little bit about the architecture, because I think everybody's gonna find this really interesting. It has a unified, prompt-based representation. Have you seen many mixed-capable vision language models like this before?

Michael Wharton: No, I think there were some rumors about that with a closed-source version of SAM from Meta, which had some open vocabulary prompting and text input, but nothing in the open-source world that's widely available that I've seen.

Ron Green: OK, and by SAM, you're talking about the segment anything model from Meta?

Michael Wharton: Yeah, that's right. It's got a unified prompt-based representation, which I find fascinating. The vision backbone is DaViT, the Dual Attention Vision Transformer, feeding a standard transformer encoder-decoder. The dual-attention design combines short-range and global attention, a unique approach that came out in 2022. It supports variable-length encoding, so simpler images tokenize into fewer tokens and take less compute, which reduces complexity. That's one reason it can run on a mobile device. The text embeddings come straight out of a BERT model, which is interesting because it's going back old school. But if it ain't broke, don't fix it, right?

Ron Green: You put this all together, and you have an amazing set of capabilities. Anyone looking for an off-the-shelf, open-source model with low computational demands for mobile or IoT devices should look at this, especially if you need zero-shot or fine-tuning capabilities. We're going to be using this on projects all over the place in the future. This would have been unbelievable five years ago, right?

Michael Wharton: Oh my God, so much time spent just trying to curate data sets. We've done entire labeling runs where we outsourced to tons of people. This is amazing.

Ron Green: And part of how they achieved this was by taking other models, like YOLOv8, that are highly trained on one domain such as object detection or segmentation, and using them as teachers for this model. They leveraged the powerful but narrow capabilities of those specialist models. This is the third Decoded episode in a row where I've brought up these smaller, high-performing models. Distillation is something we're going to see more and more of.
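For readers who want the mechanics, here is a generic knowledge-distillation loss in PyTorch. It is a minimal sketch of the technique in general, not Florence-2's actual training recipe, and the temperature and weighting values are illustrative.

```python
# Generic teacher-student distillation loss (hedged sketch, not Florence-2's
# recipe): the student matches the teacher's softened output distribution
# while still learning from the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable
    # Hard targets: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```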

Michael Wharton: Yeah. On a related note, Microsoft very recently open-sourced their implementation of GraphRAG. They've been doing a lot of research on it, and it's impressive and interesting.

Ron Green: What is GraphRAG, for our listeners?

Michael Wharton: Most people listening have probably heard of RAG, retrieval-augmented generation. It's a way to take large corpora of text, encode them, and retrieve the parts relevant to a particular prompt, then put those into the LLM input so it fits the context window. You do this because your corpus is too large to fit in the context window as is, so you need to find only the important bits. Once retrieved, you place that material in the prompt context, which helps avoid hallucinations and produces specific, detailed output.
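To make that concrete, here is a minimal vanilla-RAG sketch: embed the chunks once, retrieve the top matches by cosine similarity, and stuff them into the prompt. The encoder model, sample chunks, and prompt wording are our own illustrative choices, not anything a particular vendor prescribes.

```python
# Minimal vanilla RAG: embed chunks, retrieve top-k per query by cosine
# similarity, then build the LLM prompt from the winners.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice
chunks = [
    "The Q3 report shows a 12% rise in component failures on Line 2.",
    "Maintenance on Line 2 was deferred twice in Q2.",
    "Marketing launched the new brand campaign in July.",
]
chunk_vecs = encoder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity, since vectors are unit-norm
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

question = "Why are failures up on Line 2?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` now goes to whatever LLM you are using.
```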

Ron Green: That was perfect.

Michael Wharton: With graphs, you have that contextual relevance pre-encoded in the data structure itself: the nodes and edges already express what relevance means. If you have a well-built knowledge graph, say one tracking components at a manufacturing company, it helps enormously in the retrieval step, and the quality of your output depends on how well you can retrieve those inputs. Microsoft released a paper earlier this year and an open-source repo under an MIT license, one of the most permissive. It contains their GraphRAG implementation plus code to build knowledge graphs on new datasets, so you can take a large corpus of text and pull out entities and relations automatically.
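Here is a toy illustration of the graph-grounded retrieval idea using networkx. It is our own sketch of the general concept, not Microsoft's GraphRAG code, and the entities and relations are made up.

```python
# Toy graph-grounded retrieval (hedged sketch, not Microsoft's GraphRAG):
# entities are nodes, relations are edges, and retrieval walks the
# neighborhood of an entity mentioned in the query.
import networkx as nx

g = nx.Graph()
g.add_edge("Gearbox", "Assembly Line 2", relation="installed_on")
g.add_edge("Gearbox", "Acme Corp", relation="supplied_by")
g.add_edge("Assembly Line 2", "Plant Austin", relation="located_in")

def graph_context(entity: str, hops: int = 1) -> list[str]:
    facts, frontier = [], {entity}
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            for nbr in g.neighbors(node):
                rel = g.edges[node, nbr]["relation"]
                facts.append(f"{node} --{rel}--> {nbr}")
                nxt.add(nbr)
        frontier = nxt
    return facts

print(graph_context("Gearbox"))  # feed these facts into the LLM prompt

# Node degree is a cheap proxy for aggregate queries like "frequent topics".
print(sorted(g.degree, key=lambda kv: -kv[1])[:3])
```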

Ron Green: So the graph portion of this code base can take your own data, automatically find relationships, and connect the graph nodes?

Michael Wharton: Yeah, they recommend fine-tuning for your dataset, but it comes pre-packaged. Building the graph structure is the crucial step for GraphRAG; once it's built, you can start querying it.

Ron Green: That's a good move. It's open source, but it steers you toward their cloud platform with its built-in graph tooling. You don't have to use that, though.

Michael Wharton: No, you can host it on any infrastructure. Playing with traditional RAG systems, I've noticed that certain kinds of queries, like requests for enumerated lists, fall down because of the flat data structure. With a graph, you can ask about frequent topics and other aggregate questions, because the data structure encodes relevance directly.

Ron Green: Yeah, I love it. The next topic is Anthropic's release of Claude 3.5 Sonnet. Large language models like ChatGPT are what most people think of when they think of AI. Based on the published metrics, Claude 3.5 Sonnet seems to be leading in performance. The normal caveats apply: use the model yourself and judge. But there's real buzz about its performance. They've added new features, like Artifacts, where you can upload, say, a PDF and generate an interactive dashboard, adding tabs, controls, and widgets. People are using it for game development, visualization, graphing, and more. You can even fix code errors just by copying and pasting the error message into the prompt.

They also added a console for developing production-quality prompts, allowing iterative improvement, role settings, chain-of-thought reasoning, and structured templates with XML tags. I used it for a project, and it worked fantastically. Anthropic's policy on data training is strict: they won't train on your data without permission. One employee tweeted that Claude is getting good at coding autonomously, fixing pull requests, and could be doing a lot of our coding within a year.
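As a sketch of that structured-prompt pattern, here is a call through Anthropic's Python SDK using a system role and XML-style tags. The tag names and prompt content are our own assumptions; the general tagging approach is one Anthropic's docs encourage, but this is illustrative, not a prescribed template.

```python
# Hedged sketch: structured prompting against Claude 3.5 Sonnet via the
# official anthropic SDK. The XML tag names below are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system="You are a careful senior engineer reviewing data-pipeline code.",
    messages=[{
        "role": "user",
        "content": (
            "<instructions>Find the bug and propose a minimal fix.</instructions>\n"
            "<code>totals = df.groupby('id').sum()['amount'] / df['count']</code>\n"
            "<format>Think step by step, then give the fix in one line.</format>"
        ),
    }],
)
print(message.content[0].text)
```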

Michael Wharton: I love Anthropic’s balanced approach. They announced funding for third-party evaluations of models, showing genuine interest in building a great product. The features sound incredible, and I can’t wait to check them out.

Ron Green: Metrics are subjective, but I’m excited about Claude. What’s next?

Michael Wharton: NVIDIA AI Enterprise offerings are rolling out this year. Their NVIDIA Inference Microservices platform (NIM) is exciting, like Hugging Face but for production deployment. You can host open-source models with a click, using a standardized format on Docker containers with REST API interfaces. It could become a standard way of deploying models, cloud-agnostic, deployable locally, on-prem, in the cloud, or on edge devices.

Ron Green: Can you deploy on your hardware?

Michael Wharton: Anywhere you can run a Docker container. NVIDIA AI Enterprise packages up a lot of their open-source releases, like Triton Inference Server, an open-source serving framework with declarative config definitions for models. I hope it becomes a standard and saves everyone time reinventing deployment processes.
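For flavor, here is roughly what hitting a locally deployed NIM container looks like. NIM exposes an OpenAI-compatible REST API; the port, route, and model id below are assumptions that vary by container, so treat this as a sketch rather than a verified recipe.

```python
# Hedged sketch: calling a locally running NIM container through its
# OpenAI-compatible chat endpoint (port, route, and model id assumed).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta/llama3-8b-instruct",  # illustrative model id
        "messages": [{"role": "user", "content": "Summarize NIM in one line."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```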

Ron Green: I love that. If they're using Docker with a REST API, count me in. We've been doing that for years.

Michael Wharton: Exactly. Also, Runway's Gen-3 text-to-video model is out, with high-resolution, photorealistic output. It handles unusual transitions well, like a chef cutting a steak that forms into letters, or a bald guy's wig falling off hyper-realistically. You can set keyframes to guide the video, which is impressive. We'll include video examples in the YouTube version.

Ron Green: What’s next?

Michael Wharton: Microsoft's board-seat changes are interesting. They had a non-voting observer seat after Sam Altman's ousting and reinstatement at OpenAI, but they gave it up amid antitrust concerns. Apple was rumored to be joining the board as a non-voting observer but pulled out too. OpenAI announced they won't have investors or partners on their board going forward.

Ron Green: That’s fascinating.

Michael Wharton: Another interesting topic is Salesforce's time series forecasting foundation model, Moirai. It's open source, available for commercial applications, and useful for bootstrapping situations where you have small datasets. Because it has a general understanding of time series data, it can improve performance across different domains.

Ron Green: Time series is still challenging. And in enterprise applications, we care about things like documents. Models like Florence-2 also do OCR, which is amazing.

I have a hot take: AI is simultaneously overhyped and underhyped. It's overhyped because everyone knows about ChatGPT and generative AI and thinks it's synonymous with all of AI. Many companies underestimate the complexity of building and deploying generative AI without a human in the loop. There's investment hype, with every product suddenly being "AI-powered," which dilutes what that even means. And some enterprises see AI as a magic bullet, underestimating the necessary training, data, hyperparameter tuning, and handling of edge conditions.

But AI is also underhyped. We're in the early days of combining simple ideas, like Florence-2 unifying multiple visual capabilities. Most people haven't interacted with state-of-the-art AI yet. There's low-hanging fruit in narrow AI use cases with enormous ROI. Projects like automated loan factoring and cancer risk prediction are game-changing, with life-saving potential. And we're just starting to use capabilities like AlphaFold in drug discovery and new manufacturing materials. We're at the beginning of an amazing time in AI.

Michael Wharton: I agree. There’s a lot of noise, but we’re following the Gartner hype cycle. We’re nearing the trough of disillusionment, where people realize value from well-built, narrow AI applications. Competitors benefiting from AI will push others to adopt it too. We just need to keep producing value.

Ron Green: Exactly. This was fun as usual, Michael. Thanks, and we’ll do this again next month.

Michael Wharton: Sounds good.

Ron Green: Thanks.
