Multimodal Models, Self-Supervised Learning, and the Quest for Data with Dr. Suyog Jain | EP.9

In Episode 9, we dive into the fascinating world of self-supervised learning with Dr. Suyog Jain in this insightful interview! Discover how this innovative technique is revolutionizing artificial intelligence, especially in complex and sensitive fields like healthcare and pathology. Dr. Jain, a Research Scientist at Facebook AI Research (FAIR), shares his expertise in multimodal learning and his journey from precision medicine at PathAI to geospatial data consulting with KUNGFU.AI.

Ron Green: Welcome to Hidden Layers, where we explore the people and the tech behind artificial intelligence. I'm your host, Ron Green, and I'm excited to be joined today by Dr. Suyog Jain to talk about self-supervised learning, multimodal models, and the idea that in some ways, data is becoming more important to the advancement of AI than model architectures. Suyog Jain is a research scientist at Facebook AI Research (FAIR) working in the area of multimodal learning. Previously, he worked at PathAI, focusing on AI and precision medicine and clinical diagnostics. He also spent time as a consultant at KUNGFU.AI working on geospatial data. He holds a PhD from the University of Texas at Austin with expertise in image and video segmentation. Welcome, Suyog.

Dr. Suyog Jain: Thanks.

Ron Green: Let's talk a little bit about how you got interested in artificial intelligence. How did you get involved in the field in the first place?

Dr. Suyog Jain: Yeah, it's a bit of a funny story. I think in the second year of my undergrad, I started wondering: we would use Google so often to do a text-based search, and it was just an organic thought about why I can't search through images. It sounds silly to me now, but back then I thought I had cracked the next big idea that was going to make such a big impact. Then I started digging into how you start to work on this problem, and it became obvious to me that people had been working on this specific problem for at least 50 years, and that it's such an incredibly hard problem to work on. I then got in touch with some professors who were working in this area and started doing some small research projects around that time. My first foray was to actually work on an image search kind of problem, where you have some query images and you want to retrieve similar images. This was done at a very small scale, and this was 2006. So, yeah, a long time ago.

Ron Green: Oh, the dark ages.

Dr. Suyog Jain: Yeah, and then the interest just kept on growing. I did internships in related fields and then came to grad school with a very clear view that I wanted to pursue a doctoral program in this area. And then, yeah, it just went from there.

Ron Green: I'm kind of curious: in 2006, what kind of techniques were you using to do image search back then?

Dr. Suyog Jain: Yeah, it's amazing. It was definitely only handcrafted features. You would basically look at the raw pixel values and quantize them in some manner, so you would try to extract features representing colors, representing textures. There were some old techniques which even predate something like HOG or SIFT, which came before deep learning. So there are some features which are even older than those. I was using those kinds of features together with some very simple classification or indexing techniques that would work with those features to build that kind of a system.
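
As a rough illustration of the handcrafted-feature approach described here, the sketch below quantizes raw RGB values into a color histogram and retrieves the most similar image by histogram intersection. The function names, the 4-bins-per-channel quantization, and the choice of histogram intersection are illustrative assumptions, not details of the system discussed in the interview.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Quantize (r, g, b) pixels into a normalized joint color histogram."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each channel to a coarse bin, then flatten the 3D bin index.
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def retrieve(query_pixels, database, top_k=1):
    """Return the names of the top_k database images most similar to the query."""
    q = color_histogram(query_pixels)
    scored = [
        (histogram_intersection(q, color_histogram(px)), name)
        for name, px in database.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Toy "images" given as flat lists of RGB pixels (hypothetical data).
database = {
    "mostly_red": [(250, 10, 10)] * 9 + [(10, 10, 250)],
    "mostly_blue": [(10, 10, 250)] * 9 + [(250, 10, 10)],
}
query = [(240, 20, 20)] * 10
best_match = retrieve(query, database)
```

Note that the feature here is fixed by hand; nothing about it is learned from data, which is the separation of feature design from learning that comes up later in the conversation.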

Ron Green: Well, I'm kind of curious, what did you think the first time you heard about deep learning, neural networks, being applied to computer vision?

Dr. Suyog Jain: Yeah, it's interesting, because when I came to grad school, in my first semester, I took a neural networks course, and this was the pre-deep-learning era. I kept hearing that it's something people no longer use, that only a very small set of people are interested in it, that it doesn't work in practice. So even our group was very much focused on non-deep-learning techniques. Then AlexNet, which is the first major breakthrough paper that showed how impactful deep learning can be, came in 2014, which was right at the midpoint of my PhD. So I always think of my PhD as pre-AlexNet and post-AlexNet, because pre-AlexNet, no one was even looking at deep learning in any sense. When AlexNet gave such a high-impact result, we were all very keen, but also skeptical at the same time, because oftentimes there are flashes and then things go away. So we continued to ignore it for a year or so. But then it became clear that this is just so much more powerful than how we were doing things. Pre-deep-learning, the way vision was done was that you would separate feature design from the learning part. You would come up with some ways of hand-designing features. There were some established techniques, and you would come up with newer ones. That was what the vision people were doing, that was their main thing: come up with new features, and then just use some standard classifier, like an SVM, on top of that. But a deep-learning-based approach really demonstrated that feature learning itself should be data driven and not hand designed.
And that is just a paradigm shift in terms of scaling this kind of technique to all sorts of problems, because you cannot hand design features for every domain that comes.

Ron Green: It almost feels like cheating when you move from saying, "I spent all my time doing feature engineering and feature selection," to, "Now I just let the model figure that out." It's almost like cheating.

Dr. Suyog Jain: Yeah, but at the same time, it just feels natural, right? Even from a human perception perspective, we probably have a mechanism of feature learning, but it goes directly from data to features; it's not really a separate process. So my first big takeaway, from the impact perspective, was that this approach of feature learning is going to be the most critical thing in the next few years.

Ron Green: Pretty much everybody I've talked to who was working in computer vision at that time, you know, had that moment where they saw what AlexNet could do and they were a little suspicious. It took a little bit of time, maybe a year or two, to really get on board and then now nobody's looked back. It's pretty much all deep learning since then.

Dr. Suyog Jain: Yeah, totally. I published a couple of papers on deep-learning-based techniques for video segmentation towards the end of my PhD, and I was just amazed at how good those models were. So you could sense that something fundamental had changed. And it has continued since then.

Ron Green: Yeah, unabated. So during your time at PathAI, you were working in a really specialized domain, pathology, and that's a particularly difficult domain in computer vision: the data is scarce, it's complex, and labels are hard to come by. Was that your first exposure to leveraging self-supervised learning in computer vision, or had you done that earlier?

Dr. Suyog Jain: That was the first experience. And I think, even in most of that phase, we were still focusing on supervised learning to a certain degree. But pulling off supervised learning in that kind of a domain came with lots of challenges. For example, when you're training supervised models, you're training very specific models designed for doing something very specific, like detecting a certain type of cancer in a certain limited data set. So it's very difficult to generalize; you have to train a custom model for each different problem that you're working on. On top of that, you can leverage some notion of pre-training, but pretty much you're collecting all the data that you need for solving that specific problem again and again. And when you work in a specialized domain like pathology, getting access to unlabeled data is hard, but getting access to labeled data is even harder. Maybe you can get unlabeled data from your customers or from some other sources, but there's only a handful of experts who can actually provide you the annotations that you need for doing self-supervised learning or supervised learning. The kind of models that you're trying to build in those domains are not equivalent to our everyday vision, where any person can look at the data and tell whether this is this object or that object, or this activity or that activity. There you really require expert knowledge to be able to tell you this patch is this type of cancer or that type of cancer, or this subtype or that subtype. And that makes scaling this kind of model development process very hard.

Ron Green: And you're also dealing with, if I understand correctly, you know, differences in scale. You might have to pay attention to things that are maybe very spread out or very, very isolated and very small and very fine in nature. Is that correct about pathology as well?

Dr. Suyog Jain: Yeah. I think totally. I think if we look at the workflow of a pathologist, like let's say someone's using digital pathology, they would be zooming into various areas, making some analysis, looking at micro patterns, and then zooming out, looking at like macro patterns and like making sort of decisions based on that. This is definitely a lot more of the modeling question as well, not just the data question, like maybe you can sort of have data at multi scale. But then how do you build models that can deal with inputs that come at multiple scales, and then fuse information or like fuse predictions that you do at multiple scales. So those are like some fascinating challenges which come from that specialized domain.
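
To make the multi-scale idea concrete, here is a minimal sketch, under simplifying assumptions, of the two pieces mentioned: producing the same image at several scales and fusing predictions made at each scale. The 2x2 average pooling, the weighted-average fusion, and all function names are illustrative choices, not a description of how any pathology system actually does this.

```python
def downsample_2x(image):
    """Halve resolution by averaging each 2x2 block (image: list of equal-length rows)."""
    h = len(image) // 2 * 2
    w = len(image[0]) // 2 * 2
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


def build_pyramid(image, levels):
    """Return [full resolution, 1/2 resolution, 1/4 resolution, ...]."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


def fuse_predictions(per_scale_scores, weights=None):
    """Weighted average of per-scale class scores (one list of scores per scale)."""
    n = len(per_scale_scores)
    weights = weights or [1.0 / n] * n
    num_classes = len(per_scale_scores[0])
    return [
        sum(w * scores[c] for w, scores in zip(weights, per_scale_scores))
        for c in range(num_classes)
    ]


# Toy 4x4 grayscale "slide" and hypothetical per-scale class scores.
image = [[float(x + y) for x in range(4)] for y in range(4)]
pyramid = build_pyramid(image, levels=3)
fused = fuse_predictions([[0.9, 0.1], [0.5, 0.5]])
```

In practice the per-scale scores would come from a model run on each pyramid level; averaging is only the simplest possible fusion rule.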

Ron Green: Were you using any self-supervised learning at that point in your career?

Dr. Suyog Jain: So we started experimenting with that towards the end. And that was pretty much to solve this problem of needing to label data from scratch whenever you're trying to build any sort of supervised model. It's expensive, right? Your experts are annotating this data. It's expensive both in terms of cost and in that it's hard to scale, because there's a small number of annotators who can do it. So it's not like you can get millions of these images annotated quickly. What self-supervised learning provides you is the ability to learn this feature representation that we talked about previously by just using the properties of the data itself, or of the problem. To give you a simple example, not specific to pathology: when self-supervised learning first came into being, people were doing very simple things. For example, they would rotate an image and then train a model to predict whether the image is rotated or not, or do colorization, where they would convert the image into black and white and then force the network to colorize it. Now, you don't need a human annotator to do this kind of thing; you can generate an infinite amount of data with this sort of method. And the hope is that if you just do supervised learning with this kind of self-supervision, the features that you end up learning will still be generally useful. The field started from there and evolved into all sorts of advanced techniques for doing self-supervision. Some of the most popular ones do masked training, where you mask certain parts of the image and force the network to predict the masked parts of the image.
And it has been shown that if you do really large-scale training with this kind of an approach, the features that you end up learning are universally and generally useful. So we started to explore some of these directions with the hope that, let's say we work on a new problem, we get the unlabeled data, and then we can pre-train our model with this kind of an approach. So even without getting any human labels, it knows a lot about that specific problem or domain. And then we rely on a small handful of human labels to take those features and convert them into meaningful classes, because it already knows the visual patterns. In some sense, what you're only doing is converting those embeddings or visual patterns into names.
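
The point that the labels come from the data itself can be shown with a small sketch. It generates training pairs for two of the pretext tasks mentioned, rotation prediction and masked-patch prediction, without any human annotation. The function names, the four-way rotation label, and the zero-fill masking are assumptions made for illustration; no model is trained here.

```python
import random


def rotate90(image, k):
    """Rotate a 2D list clockwise by k * 90 degrees."""
    image = [list(row) for row in image]
    for _ in range(k % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def rotation_examples(image):
    """Return (rotated_image, label) pairs; label k means a rotation of k * 90 degrees."""
    return [(rotate90(image, k), k) for k in range(4)]


def mask_patches(image, patch, mask_ratio, rng):
    """Zero out a random subset of patch x patch tiles.

    Returns (masked_input, reconstruction_target, hidden_tile_origins).
    """
    h, w = len(image), len(image[0])
    tiles = [(y, x) for y in range(0, h, patch) for x in range(0, w, patch)]
    n_hidden = max(1, int(len(tiles) * mask_ratio))
    hidden = rng.sample(tiles, n_hidden)
    masked = [list(row) for row in image]
    for ty, tx in hidden:
        for y in range(ty, min(ty + patch, h)):
            for x in range(tx, min(tx + patch, w)):
                masked[y][x] = 0
    return masked, image, sorted(hidden)


# Toy 4x4 "image"; a seeded generator keeps the masking reproducible.
rng = random.Random(0)
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pairs = rotation_examples(image)
masked, target, hidden = mask_patches(image, patch=2, mask_ratio=0.5, rng=rng)
```

A network trained to predict `label` from `rotated_image`, or `target` from `masked`, would then be the pre-trained model whose features are mapped to class names using a small number of expert labels.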

Ron Green: Right, right. And to me, this is one of the most fascinating things that's happened in the last decade within deep learning: the idea that you can take a model and train it to perform some self-supervised task, like you mentioned, predicting rotation angle or removing noise or something like that. And for that model to become good at that task, it has to learn things that turn out to be really helpful for other, completely unrelated tasks. So, for example, face detection or identifying objects in an image has nothing to do with noise removal, but it turns out that to understand the world, there are some sort of commonalities, from a feature perspective, that the models have to learn, right?

Dr. Suyog Jain: Yeah, yeah totally.

Ron Green: So I'd love to segue a little bit and have you talk a little bit more about the importance of data. I think that this is somewhat provocative at this stage, maybe increasingly less as time goes by, but there's so much focus right now on model architectures, transformers and other competitors. But within specialized domains like pathology, the data can be the limiting factor, right? You might know exactly how you want to approach it from a model architecture perspective, but you're limited by data. Can you talk a little bit about how you are approaching that in other domains, like maybe multimodal domains, to address this problem of expensive hand labeling of data?

Dr. Suyog Jain: So yeah, there's a couple of aspects to it. I think models are extremely important. Even when convolutional neural networks were there, we thought they were so good. And now we have a transformer architecture that's even better. And I don't doubt that maybe in five years or so, we will have another variant of it that will probably be even better. When you combine that with the power of open source, it's amazing to see that so much of this is being driven through open research, where these architectures, this deep technical knowledge about how to build these architectures and train these models, is widely available in public open source. But still, you see that it's not so easy to go into any domain and build a successful product out of it. And the limiting factor just comes down to data. There are certain domains for which you just don't have the data to begin with. And then there are certain domains where you have data, but it's not accessible. No one is open sourcing data the way people are open sourcing models, right? So you can have the models, but only the people who have the data can actually build great products out of it.

Ron Green: That's a really good way of putting it. We're talking about how important the data is. Proof of that is the fact that companies may spend millions of dollars on a model, designing it, training it, and then open source the weights, but not the data. Because it's a real differentiator.

Dr. Suyog Jain: Yes, totally. I mean, a huge aspect of that is the differentiator. The other aspect is the whole notion of privacy and those kinds of factors; it's just not possible to put that data out there. But I think having access to this data plays the most important role right now. If anyone needs to build a startup in this space, they should not worry about the models, they should worry about the data.

Ron Green: That's fantastic advice. That's terrific. In many ways, the advances within AI were really only made globally public with the release of ChatGPT. And it's really captured everybody's imagination. And it was, for the better part of the first year, pretty much all text input and output. They've since added images. So now we have multimodal modeling. I know that's an area that you're working in. What are you excited about? And what are some of the challenges in the world of multimodal modeling right now?

Dr. Suyog Jain: Yeah, so it's very fascinating when you combine multiple modalities together. When I think about text, when you think about language, it's basically a human-created construct, in some sense, for describing the world. It opens up a lot of applications, but it's also limited in some ways. When people say a picture is worth 1000 words, it sounds cheesy, but there is a lot of truth to it. If you look at a picture, or if you look at a 10-second clip, there's just so much that you can describe about it, that you see in this high-fidelity signal. And so I think we are not truly multimodal yet, because right now, the approach that I see academia and industry generally taking is to align signals coming out of images or videos to what we know about text, because text gives you this amazing ability of generalizing, since we have so much text about everything that exists in the world. So if you can take what is there in an image, its representation, and align it to the text, then if you get that alignment right, it starts to generalize to unseen domains, because it's pivoting on the alignment between the image and the text and then generalizing from there. But my personal intuition is that's limiting in some sense, because you're just aligning the two modalities; you're not necessarily leveraging the richness of this other modality. Again, it comes down to how you generate large-scale descriptions of that image that are independent of web text, right? It's probably pretty expensive to do that kind of thing. But there is a lot more information that exists in this image.
So I think it's still fascinating to see how far you are able to go with this sort of alignment step. But I think there is a lot more to be gained by leveraging this rich content that exists in image and video in a more robust manner.
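
One common way to express this kind of image-to-text alignment is a symmetric contrastive loss over a batch of paired embeddings, as popularized by CLIP-style training. The interview does not name a specific method, so treat the sketch below as an assumption: the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def similarity_matrix(image_embs, text_embs, temperature=0.1):
    """logits[i][j] = cosine(image_i, text_j) / temperature."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Average of the image-to-text and text-to-image losses."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy 2D embeddings: image i is paired with caption i.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[0.9, 0.1], [0.1, 0.9]]
mismatched_texts = matching_texts[::-1]

loss_aligned = contrastive_loss(image_embs, matching_texts)
loss_misaligned = contrastive_loss(image_embs, mismatched_texts)
```

The loss is low when each image embedding sits closest to its own caption and high otherwise. It only rewards agreement with the text, which is one way to read the concern above that alignment alone may not use everything the image contains.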

Ron Green: So when you're training a multimodal model, you're dealing with text and images and audio and video sometimes. What about the challenges of keeping it grounded and dealing with hallucinations as you move between those modalities?

Dr. Suyog Jain: Yeah, and I think this challenge fundamentally comes from the approaches that we are taking right now for multimodal learning. Referring back to the previous conversation about alignment, when we align one modality to another to leverage the power of generalization from, let's say, the text modality, what we end up with is that the text-based part of that multimodal model can always generate a plausible answer for whatever question you ask. But there is really no way of guaranteeing that the answer is actually grounded in this other modality, whether you see that in the image or video or not. Right now, you're implicitly trying to make the model learn to ground the answers: when you're training the model in a multimodal fashion and it's trying to generate a plausible answer, you can design your data sets in a way that forces the network to leverage both modalities and not just rely on the text modality. People have come up with all sorts of approaches to do this. Especially if you look at the LLaVA paper, they come up with so many different ways of generating these question-answer pairs, which are grounded in some sense. But still, at a fundamental level, we are just relying on this kind of training data and then hoping for the best that the model always uses both modalities. I think there's just a lot of work that needs to happen, in this case maybe a lot more on the architecture side as well, or the machine learning side of it. How do we build models that have a deeper understanding of the world, of the semantics that you're seeing in the visual input, and then make sure that the answer you are generating respects those semantics and doesn't simply hallucinate?

Ron Green: Right. I'm tempted to go into a whole other area around reinforcement learning with human feedback and things like that, but we'll save that for another time. We'll have you on the show again. I meant to ask you this earlier, do you find that self-supervised learning techniques are more applicable within computer vision than NLP generally, or are you using them broadly or across both?

Dr. Suyog Jain: No, I think it's broadly across both. I mean, the whole approach of doing masked language pre-training with BERT and GPT and everything brought a paradigm shift in the language world. I think the concepts are kind of similar in that sense. And in some sense, it's more broadly applied in language right now, because you just have a lot of well-designed, structured text written by humans available on the internet, and you can use that. Whereas in vision, we are still trying to operate at the level of pixels in a way, whether you're doing masked auto-encoders or something like that. Your representation is still guided more by low-level image representations than by the semantic sort of space that you see in text. So I think it's definitely broadly applicable in both. But given just the amount of data that exists when you look at very large-scale image data sets or video data sets, I feel that in the long run it will have a huge impact on the vision domain as well.

Ron Green: I'm curious to know, in the next 12 to 18 months, is there an area within AI, whether it's research or application development that you're particularly excited about right now?

Dr. Suyog Jain: Yeah, I mean, there are lots of potentially exciting avenues. When I look at it from an academic perspective, I feel there's so much potential in videos that we haven't explored. Even current multimodal models are very heavily focused on combining text and image. When you go to videos, you get so much more information, but at the same time, it's extremely challenging: with extremely large videos, how do you build models on top of that? So I think that's a very exciting area for me, and I'm very much looking forward to all the advancements that happen from a video learning perspective. But simultaneously, I'm very excited to see what kind of real-world products or applications people successfully launch that go beyond what OpenAI is doing or what these large foundation models are doing. Because building a successful AI product is very hard. You have to go deeper into a domain, build deep expertise in that domain, and get the right data from that domain. So I think people who are working hard on that aspect, where they take some of these advancements of foundation models and combine them with their domain expertise, can achieve amazing things. And so I really hope that in the next 12 months, we see a couple of areas that are just completely transformed using an AI-based product.

Ron Green: That was a perfect closing of the loop, because one of the challenges, as you mentioned, within these specialized domains is the proprietary data, the specialized data, the complex data, and that's one of the challenges with building AI-based products right now. So I'd love to ask you a final question that we always ask our guests here, which is: if you could wave your magic wand and have AI help you in some part of your daily life, what would you pick?

Dr. Suyog Jain: Yeah, so I think one is related to creating memories. I have a three-year-old daughter, and she is always doing something cute, and I never have my phone handy to capture that moment. I wish there was something like a smart camera that, without me ever worrying about capturing those moments, is continuously capturing them and creating something magical for me that I can just look back on. So I don't have to be the active photographer; I'm just part of that moment rather than trying to capture it. So that's one thing. The other thing I'm really excited about is the potential of these foundation models: how are they going to impact how we learn new things or acquire new skills? You can think about it in terms of advanced skills, but even something as simple as elementary education, the way we learn new techniques, the way I read papers or read new things. I would love to have a very good AI assistant that would mine through these gazillion papers.

Ron Green: Read all the white papers, give you a summary?

Dr. Suyog Jain: Yeah, I mean, even if it's not a summary, something that takes me away from this mundane way of just loading a PDF and reading through it, and makes it more fun, more interactive, so I can quiz it. That would be super interesting.

Ron Green: It would be amazing. If you find that, please let me know. Well, thank you so much for coming on today. This was fantastic. I enjoyed it so much and I really appreciate your time. Thank you.