Join Ron Green and Dr. ZZ Si, Co-founder and Distinguished Machine Learning Engineer at KUNGFU.AI, as they explore the fascinating journey of computer vision, from the early days of handcrafted features to the revolutionary impact of deep learning and convolutional neural networks. Dr. Si shares insights from his groundbreaking work at Google and Apple, and they delve into the significance of AlexNet and ImageNet in transforming AI research.
The conversation also covers the rise of transformers and their role in bridging computer vision and natural language processing, as well as the exciting advancements in diffusion models and flow matching. Discover how these innovations are being applied in robotics, healthcare, and more.
Ron Green
Welcome to Hidden Layers, where we explore the people and technology behind artificial intelligence. I'm your host, Ron Green, and today we have a special episode. Joining me is my co-founder, distinguished engineer, and all-around AI wizard, ZZ Si. ZZ and I have had the privilege of witnessing the AI revolution and computer vision firsthand. We've seen it grow from a niche academic discipline into a worldwide transformative technology. In this episode, we're going to share our experiences building early systems and what it was like living through the deep learning revolution. And because we're not just about nostalgia, we're also going to give you a peek into the future of AI and what we're most excited about. ZZ earned his Ph.D. in statistics at UCLA, where his research focused on generative hierarchical models for object recognition. He's published widely, including in the International Conference on Computer Vision, Conference on Computer Vision and Pattern Recognition, and the IEEE Transactions on Pattern Analysis and Machine Intelligence. His paper on generative modeling received the Marr Prize Honorable Mention, one of the highest honors in the computer vision community. His professional work includes algorithmic ad targeting at Apple, search ranking at Google, product search and matching at Impossible Ventures, generative modeling at Vicarious AI, and deep learning applications for image understanding in chatbots at HomeAway Expedia. Before UCLA, he earned his BS in computer science at Tsinghua University. ZZ, thanks for being here, I'm really excited.
Dr. ZZ Si
Yeah, thank you so much, Ron. I think it's fun to reach into our memories and super happy to chat about our vision and our experiences.
Ron Green
So, you are one of the few people I know who's been working in computer vision professionally and academically both before and after the deep learning revolution. You know, I was doing computer vision work back in the 90s, and it was just a completely different game. I got out of the computer vision game a little bit in the 2000s, but you were there. You were there during the wave of transition from sort of, you know, handcrafted features into deep learning-based systems. I thought it would be just a ball to go back and talk about what it was like living through that, arguably one of the most important technological revolutions in human history. So, let me tee this up and let's start at the beginning. How did you get interested in computer vision in the first place?
Dr. ZZ Si
So, my undergrad was from 2002 to 2006, and during my last year, my senior year, I got exposed to computer vision by going to the research lab at Tsinghua, I think it's called the AI lab. We were doing one of the competitions, a NIST competition, an image retrieval task, where I used a bag of words, a bag of visual words, and then an SVM on top of that to predict which images are the most relevant to the query. I immediately got super interested in computer vision, so that's kind of my foray into the field.
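In code, the pipeline ZZ describes looks roughly like the following sketch, assuming OpenCV's SIFT for local descriptors and scikit-learn for the visual vocabulary and the SVM; the image paths and labels are hypothetical placeholders, not the actual competition data.

```python
# Rough sketch of a bag-of-visual-words + SVM pipeline (hypothetical data).
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def sift_descriptors(path):
    """Extract local SIFT descriptors (128-d vectors) from one image."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

train_paths, train_labels = [...], [...]  # placeholder dataset

# 1) Build a "visual vocabulary" by clustering descriptors from all images.
all_desc = np.vstack([sift_descriptors(p) for p in train_paths])
vocab = KMeans(n_clusters=256, n_init=10).fit(all_desc)

# 2) Represent each image as a histogram over visual words.
def bovw_histogram(path):
    words = vocab.predict(sift_descriptors(path))
    hist = np.bincount(words, minlength=256).astype(float)
    return hist / (hist.sum() + 1e-8)

X = np.array([bovw_histogram(p) for p in train_paths])

# 3) Train an SVM on the histograms to score/rank images for a query category.
clf = SVC(kernel="linear").fit(X, train_labels)
```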
Ron Green
Okay, so I think everybody listening today would be kind of curious, what was it like? So you're building image recognition systems back in the early 2000s. Obviously no deep learning techniques. What were you doing? What was the feature engineering work like?
Dr. ZZ Si
There were a lot of very sophisticated features designed by smart scientists and engineers, like SIFT, the Scale-Invariant Feature Transform, and HOG, the Histogram of Oriented Gradients. These are very well-designed features that describe a local region of an image. They take a long time to design, they work really well, and they're fast. After extracting key points from the image, you use the SIFT and HOG descriptors to describe those regions. Then you put a classifier on top. At that time, the popular ones were support vector machines or AdaBoost. There was a lot of tuning on the hyperparameters of the features as well as the support vector machines. Rather than having a very deep understanding of the image, it was really about tweaking the parameters and watching whether your rank on the competition leaderboard would increase or not. It was pretty fun, but I felt like it was very parameter-tuning heavy, which is why I later got into my Ph.D. study under my advisors, Song-Chun Zhu and Ying Nian Wu, to try to build a statistical model for understanding images.
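As a rough illustration of that workflow, here is a sketch of HOG descriptors plus an SVM with the kind of hyperparameter sweep ZZ mentions, using scikit-image and scikit-learn; the images and labels are placeholders, not any particular competition's data.

```python
# HOG + SVM with a grid search over the knobs people used to tune by hand.
import numpy as np
from skimage.feature import hog
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def hog_features(image, orientations=9, pixels_per_cell=(8, 8)):
    """Histogram of Oriented Gradients descriptor for one grayscale window."""
    return hog(image, orientations=orientations,
               pixels_per_cell=pixels_per_cell, cells_per_block=(2, 2))

images, labels = [...], [...]  # placeholder training windows + class labels
X = np.array([hog_features(im) for im in images])

# Much of the day-to-day work was sweeping parameters like these and watching
# whether the validation score (or the leaderboard rank) moved.
search = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1, 10],
                "gamma": ["scale", 0.01]},
    cv=3,
)
search.fit(X, labels)
print(search.best_params_, search.best_score_)
```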
Ron Green
And you were really interested in the deeper understanding of images, not just scoring well on some of these metrics, but in the science and research of how to build computer vision systems that understood them in a deep way. Can you talk about that a little bit?
Dr. ZZ Si
Yeah, absolutely. During that time, I read a paper about image parsing, which is really about understanding image data, like what's in the image. As a human, how do we understand an image? That paper was about parsing or decomposing an image into low-entropy structure, the sketchable shapes, and high-entropy texture, the fun details like the chaotic randomness of different lighting and different little textures. So that really got me into more statistical modeling approaches to computer vision.
Ron Green
What was that like? What were those techniques? I have almost no experience with statistical-based computer vision modeling.
Dr. ZZ Si
In my Ph.D. study, I was trying to teach machines to understand images like humans do. Humans have a very abstract way of understanding visual signals. When you try to describe an owl, a human can draw an owl with just a few strokes. Step one, you draw the circle for the head and the circle for the body. Step two, you draw the eyes, the feet. And then, of course, step three is all the fun details, textures. But even with just steps one and two, those few strokes are enough to tell that it's an owl. We tried to teach the machine to do that. I worked on the sparse model, sparse coding, and also Bayesian inference, where we tried to help the model learn from just a few examples, like three or five images, and try to learn the abstract representation of an object. At that time, I was working with datasets where I collected images of animal heads. My friends would come around and say, "Oh, are you studying zoology?" Yeah, we were trying to learn models to represent natural objects.
Ron Green
And were the images you used something like contour line detection, where you took out everything except for the hard contours to get the outlines for the input data? What were you doing?
Dr. ZZ Si
Yeah, so we used Gabor wavelets, which are filter banks that are like local line strokes at different orientations. We would do convolution, max pooling, another convolution to extract out the parts, and then another max pooling. You might feel like these terms are familiar; they're also used in ConvNets.
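A small sketch of that front end, assuming OpenCV for the Gabor kernels and scikit-image for the pooling; the input file name is a hypothetical placeholder.

```python
# Gabor filter bank -> convolution -> max pooling: the same primitives
# that ConvNets later learned automatically.
import cv2
import numpy as np
from skimage.measure import block_reduce

image = cv2.imread("animal_head.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)  # placeholder

# Filter bank: oriented Gabor wavelets acting like local "line stroke" detectors.
orientations = [i * np.pi / 8 for i in range(8)]
bank = [cv2.getGaborKernel(ksize=(17, 17), sigma=3.0, theta=t,
                           lambd=8.0, gamma=0.5) for t in orientations]

# Convolution: one response map per orientation.
responses = np.stack([cv2.filter2D(image, cv2.CV_32F, k) for k in bank])

# Max pooling: keep the strongest local response, giving some translation
# invariance before the next stage of grouping strokes into parts.
pooled = np.stack([block_reduce(np.abs(r), (4, 4), np.max) for r in responses])
print(pooled.shape)  # roughly (8, H/4, W/4)
```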
Ron Green
It's actually kind of funny, so as part of this statistical modeling, you were still using kernels, convolutions, and translation invariant techniques as part of that, but was that only for the image generation or was any of that used in the modeling as well?
Dr. ZZ Si
We mainly maximized the likelihood on the image, so the goal was to generate and reconstruct the image, not to the fine details like GANs or diffusion models now, but rather to reconstruct very abstract sketches that looked like the animal. And there were also a lot of latent variables, because the shape would deform, so we used latent variables to represent the deformation. There was a lot of the EM algorithm, the expectation-maximization algorithm, which is very beautiful mathematically, but there were a lot of challenges making it work at large scale. The learning wasn't very fast; there was no stochastic gradient descent type of algorithm to scale it to millions of images. That part was really hard. We were able to detect objects like horses and cars at pretty good accuracy, kind of on par with the state of the art at that time, but then it plateaued because we couldn't scale to millions of images.
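To make the EM idea concrete, here is a toy sketch of the loop on a much simpler model, a one-dimensional Gaussian mixture, where the latent variable is which component generated each point (in the models ZZ describes, the latents were part deformations instead). Note that every iteration sweeps the full dataset, which is exactly the scaling pain point he mentions.

```python
# Toy expectation-maximization loop for a 2-component 1-D Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])

# Initialize parameters: mixing weights, means, variances.
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    lik = np.stack([pi[k] / np.sqrt(2 * np.pi * var[k])
                    * np.exp(-(data - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)])
    resp = lik / lik.sum(axis=0, keepdims=True)

    # M-step: re-estimate parameters from the soft assignments.
    # Every iteration touches all of the data -- no minibatch/SGD shortcut here.
    nk = resp.sum(axis=1)
    pi = nk / len(data)
    mu = (resp * data).sum(axis=1) / nk
    var = (resp * (data - mu[:, None]) ** 2).sum(axis=1) / nk

print(pi, mu, var)  # should recover roughly (0.3, 0.7), (-2, 3), (0.25, 1.0)
```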
Ron Green
Is this when you first got interested in generative modeling?
Dr. ZZ Si
Oh, yeah, absolutely. I really have a thing for generative modeling and unsupervised learning. And Ron, I'm also curious about your experience. I really wanted to learn about what things were like in the 90s when you were working with neural nets.
Ron Green
It's a great question. In the 90s, the neural networks we built and trained were tiny in comparison to now. I mean, you know, billions or trillions of times smaller. And we were training on machines without GPUs. All we had were CPUs. I remember being very excited when we got access to a 20 MHz CPU, you know, orders of magnitude slower than your iPhone here. Everything back then was handcrafted. Everything was built in C because, obviously, speed was so critical. We couldn't afford to do anything in higher-order languages. There were no frameworks. There was nothing you really had access to except for starting with an empty file and crafting things from scratch. I think the thing that is most surprising to people I talk to now who work in AI is that because we have these auto-differentiation libraries like PyTorch, TensorFlow, and JAX, you can create arbitrarily complex processing pipelines and transformations, and you don't have to worry about how that will be translated into the gradient descent backpropagation operation, because we have auto-differentiation. Well, back then, any change you made to the pipeline meant you had to literally break out a sheet of paper and a pencil and recalculate the gradients, all the partial derivatives, from scratch. So what that meant was it was really difficult to experiment broadly, because once you committed to certain designs, you kind of felt locked in. And the other thing, when I look back on the 90s, the number one mistake that I think we made, which is just astounding in retrospect, is that we used sigmoid activations for everything. We used them for the hidden layers as well, which we now know is really problematic because it leads to vanishing and exploding gradients. But back then, all of the work within neural networks was predicated, to a greater or lesser extent, on mathematical proofs that had shown they were universal function approximators. And so we felt very confident that we could model anything based upon that paper, but that paper assumed certain architectural constraints. So we didn't want to leave them. We didn't want to try things like ReLU, because it wasn't clear, if you had something that was only piecewise differentiable, whether you might throw away that entire edifice on which you'd built everything. It was sort of unimaginably archaic compared to what we do now. I would never want to go back.
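For contrast, here is what that workflow looks like today with auto-differentiation: a toy two-layer network in PyTorch where the gradients Ron describes deriving by hand fall out of a single backward() call. The network and data are illustrative, not anything from a real project.

```python
# Autodiff in practice: change the pipeline, never re-derive backprop by hand.
import torch

x = torch.randn(32, 10)
y = torch.randn(32, 1)
W1 = torch.randn(10, 64, requires_grad=True)
W2 = torch.randn(64, 1, requires_grad=True)

# Deep stacks of sigmoids squash gradients toward zero (vanishing gradients);
# swapping in torch.relu here requires no manual re-derivation of anything.
hidden = torch.sigmoid(x @ W1)          # try torch.relu(x @ W1) instead
loss = ((hidden @ W2 - y) ** 2).mean()

loss.backward()                          # autodiff computes every gradient
print(W1.grad.shape, W2.grad.shape)      # in the 90s, this step was pencil and paper
```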
Dr. ZZ Si
I feel lucky for the deep learning frameworks. So every time you changed the network architecture, you had to manually derive the gradients on paper and then implement them in C? I guess, what was the tool like back then?
Ron Green
There was no code generation. Now we did have Wolfram, and if you were lucky enough, you might have a copy of that. You could go get a CD and run it locally—there was no web access—or you had MATLAB or something like that. But it was still complex because, you know, we can do all kinds of weird things now as far as how the signal flows through the models and not even have to be concerned at all about how that might affect the final function. Conceptually, it's just some giant function we're computing, some giant nested function. It was hard to understand how changes might disrupt that. Then you could calculate your derivatives incorrectly, and your gradients made no sense. So, alright, I want to ask you, how did you react? I wasn't working in computer vision in 2012 when AlexNet happened. To set the context for anybody who may not be familiar, AlexNet was the first deep learning-based computer vision model that really had a strong impact on the scene because it performed exceptionally well on a competition called ImageNet. What's important about this is that, prior to this 2012 release, most of the architectures that were submitted for competition were sort of hand-tuned, feature-engineered solutions like you were describing a moment ago. AlexNet came in as this neural network-based, quote-unquote deep learning approach—although now we would consider it pretty shallow in hindsight, even though this was only 12 years ago—but it just blew away the competition. All of the features, meaning all of the signal information that the classifier eventually used to make determinations, were calculated automatically through the training process on the model. This was a sea change to best practices at the time. I would love to hear what did you think? What did your colleagues think? What did your professors think when you saw the paper? And I know that you were good friends with the lead author on the ImageNet paper. So just tell me about that whole experience. What was that like, and what were you thinking?
Dr. ZZ Si
Yeah, absolutely. That was wild. I graduated with my Ph.D. in 2011, and when AlexNet came out, I was working at Google. I was actually in a tech talk—I think it was Alex giving a talk—hosted by Google Brain, and Jeff Dean was the host at that time. Alex gave a talk about using ConvNets for street number detection and segmentation, and the accuracy was through the roof compared to the numbers I was used to before. I was really amazed. Another impression was that it took so many resources to train. I think he was using DistBelief at that time, the precursor to TensorFlow, on thousands of machines, I believe. It was very resource-intensive but super accurate. And deep learning with ConvNets means a lot for computer vision. It automates the feature design, which was really painful in the 2010s.
Ron Green
What did you think about that, though? I've talked to people who were a little disappointed because so much of their academic focus was on the feature engineering part. Were you disappointed or excited?
Dr. ZZ Si
Oh, absolutely. There's definitely an emotional component to that. I mean, we talk about AI replacing jobs—well, which jobs were first replaced? It was the very jobs of AI researchers in computer vision. Feature design is automated now, but it's definitely for the good. In the 2010s, one of the hard problems of vision was the hierarchy, like from objects to the primitives, like the edges. There are a lot of layers that you need to go through. You need to group edges into small parts, then bigger parts, and then larger parts, and then to objects. The design space is so large. You need to design the features, make sure your code is bug-free. You need to have a learning algorithm for grouping the features into parts. It was just so complicated, but with deep learning, with ConvNets, a lot of that is automated away, and you don't have to care about that as much, and your design space shifts from features to network architectures and loss functions. So, yeah, definitely a major change.
Ron Green
Did you have any of the people you worked with, either professors, other students, or colleagues at work, who dismissed it out of hand?
Dr. ZZ Si
Oh, absolutely. I think we've both heard stories like that. ConvNets, RNNs, or deep learning were not very popular before 2012. The dominant methods then were like support vector machines, Bayesian graphical models, and a lot of belief propagation. Things like deep learning, with so many parameters trained on even millions of images, still a small dataset compared to the number of parameters—such models shouldn't work. If you asked a statistician, this is like, "How could this happen?"
Ron Green
Right, it's over-parameterized.
Dr. ZZ Si
Yeah, in theory, it shouldn't work; in actuality, it does. And a lot of the things in it that feel hacky, like ReLU, pooling, and other engineering components, are very smart designs, but they aren't necessarily derived from first principles.
Ron Green
They feel like hacks, almost.
Dr. ZZ Si
Yeah, they almost feel like hacks. The popular opinion before seeing the real success was that, "Ah, why should it work? Don't spend time on it."
Ron Green
Yeah, it's funny, in the 90s, that was the same sentiment. There was a sense that if it did work, it was just kind of luck, but it wasn't really principled. It was a sense that a bunch of things were being tried, but it wasn't scientific. It was almost more like experimental scattershot engineering instead, right? And I agree with you. I think things like ReLU—the first time I saw that, I thought it was crazy. I just thought that was nuts. Dropout makes no sense theoretically. This idea of just turning off neurons randomly feels so unprincipled, but those are two tiny algorithmic changes in the grand scheme of things that have generated massive improvements. So it's just really amazing. I want to ask you, what did you think about ImageNet?
Dr. ZZ Si
Oh, that's one of the most influential works in computer vision, and it's funny. The first author, Jia Deng, is my college roommate. He was actually not doing computer vision at that time. He was at Princeton with Kai Li doing, I believe, distributed systems. Then he moved to Stanford to work with Fei-Fei. I think we were actually doing a road trip together at that time. So the first time I learned of his work on ImageNet was at CVPR '09 in Miami, I believe. Andrej Karpathy is also one of the co-authors. At that time, I just felt like, okay, it's great work. It's a much larger dataset than Caltech or CIFAR-100 because it's millions of images, tens of thousands of categories. It's a great dataset, but it's still a classification dataset. What does a larger dataset buy you? It doesn't really solve the vision problem. It will help, but it's probably not the most important thing.
Ron Green
That's because you were focused on deeper understanding, right? You thought the dataset was fine, but you were kind of going in the direction of a deeper understanding of visual understanding.
Dr. ZZ Si
Yeah, exactly. At that time, I felt like the algorithm had to be the first consideration, but it turns out the scaling law holds. Data, compute, lots of experimentation with deep learning, of course, and then you get to much, much higher accuracy. It's such an influential paper, I mean, ImageNet, and that triggered a tsunami of innovation on the algorithm side.
Ron Green
Yeah, it really did. I think without ImageNet, we don't have AlexNet. I don't know how much longer it would have taken until we saw this resurgence in neural networks and deep learning.
Dr. ZZ Si
Yeah, exactly. After that, transfer learning became a thing. We'd been talking about representation learning for a long time; there's the top-down, designed representation and the learned representation. But after ImageNet, you started to see VGG nets, open-sourced and trained on ImageNet. When you apply those models to downstream applications, various computer vision tasks, you usually start from the pre-trained ImageNet weights, for example.
Ron Green
That's right.
Dr. ZZ Si
And that did increase the accuracy a lot.
Ron Green
On the Breast Cancer Project, I seem to recall, did we even start with ImageNet weights on that?
Dr. ZZ Si
Yes, that's one of the hyperparameters. We either started from random weights or from ImageNet. It didn't make sense to me in the beginning because ImageNet is a natural images dataset. It's not like mammograms or breast imaging, right? It doesn't look like that. But still, if you initialize your weights from the ImageNet weights, we saw gains in accuracy.
Ron Green
And I think the reason is what you hinted at a moment ago. It's because when you train a model on that much natural data, you force it to learn certain types of feature extraction capabilities, whether it's contour detection or corner patterns, and things like that. And that's just helpful and useful no matter what you're doing, whatever downstream visual task. That kind of powered and enabled the transfer learning.
Dr. ZZ Si
Yeah, absolutely.
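In code, that initialization choice looks roughly like the sketch below, using torchvision; the two-class head and the commented training loop are hypothetical placeholders for illustration, not the actual breast cancer project code.

```python
# Start a medical-imaging classifier either from random weights or from
# ImageNet weights, then fine-tune. Dataset and DataLoader details omitted.
import torch
import torch.nn as nn
from torchvision import models

use_imagenet_init = True  # the "hyperparameter" ZZ mentions

weights = models.ResNet50_Weights.IMAGENET1K_V2 if use_imagenet_init else None
model = models.resnet50(weights=weights)

# Replace the 1000-class ImageNet head with a task-specific one
# (e.g., benign vs. malignant), then fine-tune end to end.
model.fc = nn.Linear(model.fc.in_features, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# for images, labels in train_loader:        # hypothetical DataLoader
#     loss = nn.functional.cross_entropy(model(images), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```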
Ron Green
I want to change focus and talk about the shift in computer vision over the last few years. So we have the early 2000s up into the early 2010s, where we have hard feature engineering-based work with some type of classifier like a support vector machine or something like that. We see the deep learning revolution take hold, and all of the most popular models within computer vision at that time are some type of variant of a convolutional neural network, right? Which interestingly goes all the way back to the 80s, but we finally had enough data, we finally had enough compute, and we ironed out some algorithmic issues like we've already talked about, like ReLU. So we get these convolutional models, and we're seeing really amazing results. And then transformers. Talk about that. And that was in 2017, right? The attention paper.
Dr. ZZ Si
Yeah, the attention paper in 2017.
Ron Green
You didn't really see transformer-based computer vision models for a few years after that. Talk about that. What are your thoughts on that?
Dr. ZZ Si
Oh, I just realized I misspoke earlier when I said the 2010s. What I meant was the 2000s, before deep learning. In the 2010s, I think we saw a lot of great convolutional net-based models for detection. I mean, VGG and ResNet as backbones, and then for object detection we had RetinaNet, for example, which we used for document understanding. But still, in computer vision people used one type of neural net, while for natural language processing you typically used LSTMs or other recurrent structures. Transformers are what I feel kind of united the two and made the design even simpler, so we see more convergence there. Previously, when we did document understanding, we would train the RetinaNet and tinker with the anchor boxes, literally go into the source code and change the anchor box design to deal with the very small, tiny bounding boxes in high-resolution documents. But when transformers came out, we transitioned into models like LayoutLM, which combine both text and visual features, and that really increased the accuracy a lot. Ron, what do you see about transformers that kind of changed how we work?
Ron Green
Transformers are unbelievably important, obviously. I remember vividly in 2018 or 2019 feeling like we had transfer learning in computer vision, which was wonderful. It was really a game-changer because it meant you weren't starting every project from scratch. But in NLP, if you were doing natural language processing, you were starting every project from scratch. There wasn't really anything, except for maybe Word2Vec weights or something like that. That was really all you had. When transformers hit, all of a sudden, you saw so much of the upstream feature engineering or preprocessing that you would do in NLP just fall away, like we saw in computer vision. Then you had the ability to have a single architectural approach that you could leverage for both computer vision and natural language processing projects. Amazingly, that meant we could do multimodal, like you said. We could combine these modalities and have a single model capable of understanding them both for the first time ever. That just wasn't possible before. Another fallout of that, which is really important, is that it meant all of a sudden, NLP researchers and computer vision researchers, who could barely talk to each other in, let's say, 2005, could collaborate and communicate and build systems together because the lingo, the nomenclature, the tools, and the techniques had consolidated within transformers. I think transformers are, and this is not some great insight, but they are probably the most important architectural innovation within AI of the 21st century so far. But I want to throw it back to you. Diffusion, right? So diffusion as a technique is incredibly important, and we're seeing diffusion techniques—they probably first came on the scene for text-to-image generation, like DALL-E and things like that—but now we're seeing it for everything. We're seeing it for music generation, video generation, and within healthcare. We just had a hackathon last weekend in the biomedical space, and it was all about using diffusion techniques for protein generation. What are your thoughts? You know much more about generative models than me, and I know it's a passionate area for you. What are your thoughts on diffusion?
Dr. ZZ Si
Diffusion was a big surprise to me. It works surprisingly well. Before diffusion, people were doing likelihood modeling directly, trying to estimate the density directly. But diffusion is closely related to score matching. You're not directly matching where the data points are; you're matching the arrows. You're trying to match the gradient, the direction pointing to the densest areas. And that worked really well. Diffusion models are not tied to any specific backbone; they're not tied to ConvNets or transformers, but they work really well with transformers in recent works like DiT, the architecture behind Sora from OpenAI. And with transformers, as long as your data can be tokenized and de-tokenized, encoded and decoded, you can shove it into the transformer and use the diffusion training loop. It's really a training loop where you start from clean data, corrupt it, and then train the model to find the way back. It's crazy to think about, but now you have a neural network that predicts the gradient of the log-likelihood, and that works really well. Diffusion models are fairly recent, from around 2020. Even so, they're starting to feel a little stale now. People are switching to flow matching, which you could argue is a generalization of, or at least closely related to, diffusion models. You're still training the model to find its way back, but the way to do that is more flexible now. You can design the flow that gets back to the clean data.
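Stripped of the architecture details, the training loop ZZ describes is surprisingly small. Here is a toy sketch on 2-D data: a DDPM-style noise-prediction loop in PyTorch, where predicting the added noise amounts to learning the score up to scaling. The schedule, network, and data are illustrative assumptions, not any particular production model.

```python
# Minimal diffusion training loop: corrupt clean data, learn to undo it.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative noise schedule

# Tiny denoiser: takes a noisy sample plus the timestep, predicts the noise.
net = nn.Sequential(nn.Linear(3, 128), nn.SiLU(), nn.Linear(128, 128),
                    nn.SiLU(), nn.Linear(128, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x0 = torch.randn(256, 2) * 0.1 + torch.tensor([2.0, -1.0])  # "clean" toy data
    t = torch.randint(0, T, (256,))
    eps = torch.randn_like(x0)

    # Forward (corruption) process: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps
    ab = alpha_bar[t].unsqueeze(1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps

    # Train the model to find its way back by predicting the added noise.
    pred = net(torch.cat([xt, t.unsqueeze(1).float() / T], dim=1))
    loss = ((pred - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```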
Ron Green
Okay, so you jumped—this was the last thing I wanted to cover, and you teed it up perfectly—which is, I think flow modeling and diffusion models and these techniques are some of the things I'm most excited about. Because, like you said, you can make a process where it may look utterly impossible to go from point A to point Z, but you break it down into steps and train a model to take these little steps one by one, and then it almost seems like magic is the result. I think protein folding might be one of the most important examples because this was hands down one of the most important unsolved problems in biology. We've talked to a lot of biologists recently, and it's just considered to be a solved problem at this point. So, what are your thoughts? We'll wrap up like this: I'm dying to know, what are you most excited about within AI and computer vision? Whether it's some unsolved problem, new techniques, or just an area where you think we're making progress and you're excited to be involved and see where we can take it.
Dr. ZZ Si
That's such a great question, and I think there's probably more than one. There's the multimodal trend, where we went from language models, or LLMs, to vision-language models. And then more recently, I think people call it omni-modal, like any-to-any. There's recent work called Transfusion that combines transformers, meaning autoregressive generation, with diffusion-type generation in one model. That's able to handle inputs like text, images, video, audio, and music, and your output can be text, video, or robotic actions. You just switch modalities on and off. And because most business data has multiple modalities—you have transactions, text, images—you only need one big model to learn all of your business, which is really exciting. Another direction is robotics and embodied vision systems, which we see in companies like Figure and Boston Dynamics. A lot of computer vision models are now being applied to systems like that to make the robots more intelligent, able to navigate and help us in our lives. I'm also super excited about that.
Ron Green
Diffusion, I think, has unlocked robotics in a way that nobody really fully expected. We're seeing giant steps in robotics, I think, because of that process. Right. Any further thoughts? Anything else you're really excited about for the future that you want to discuss as we wrap up?
Dr. ZZ Si
We didn't talk about NeRF yet. That's another really interesting topic, right? Neural Radiance Fields, 3D reconstruction, and Gaussian splats. But, you know, that would probably take a while to dive into. It's another very surprising and exciting development over the last few years.
Ron Green
Well, we'll do another episode soon. You'll come on, and we'll dig into that.
Dr. ZZ Si
Sounds good. Thank you, Ron.
Ron Green
Thank you, ZZ. This was just a ball, and I really appreciate you taking the time.