Support the show to get full episodes, full archive, and join the Discord community.
James, Andrew, and Weinan discuss their recent theory about how the brain might use complementary learning systems to optimize our memories. The idea is that our hippocampus creates our episodic memories of individual events, full of particular details, and through a complementary process slowly consolidates those memories within our neocortex via mechanisms like hippocampal replay. The new idea in their work suggests a way for the consolidated cortical memory to become optimized for generalization, something humans are known to be capable of but deep learning has yet to match. We discuss what their theory predicts about how the “correct” amount of consolidation depends on how much noise and variability there is in the learning environment, how their model solves this, and how it relates to our brain and behavior.
- James’ Janelia page.
- Weinan’s Janelia page.
- Andrew’s website.
- Twitter:
- Paper we discuss:
- Andrew’s previous episode: BI 052 Andrew Saxe: Deep Learning Theory
Transcript
Andrew 00:00:04 I guess the jumping off point is this long-running debate about where memories are stored in the brain. And it's a profound question, you know, something that people have struggled with for many decades at this point.
Weinan 00:00:17 Like, this really gave me that insight on, you know, why a lot of episodic memories are kept in the hippocampus and require the hippocampus. A lot of it is because the world is so complex.
James 00:00:32 I definitely think that, you know, we wouldn't have gotten to where we currently are in AI without past generations of theoretical neuroscience research. And I also definitely think that projects like this, where we try to kind of boil it down to the essentials and really analyze everything very rigorously and really try to figure out to what extent it relates to the biological brain, will provide useful seeds for future AI research.
Speaker 0 00:01:01 This is Brain Inspired.
Paul 00:01:14 Hey everyone, it's Paul. Today I have three fine folks on the podcast: James Fitzgerald, Andrew Saxe, and Weinan Sun. James and Weinan are both at the Janelia Research Campus at the Howard Hughes Medical Institute. James is a group leader, and Weinan is a research scientist in Nelson Spruston's lab. And Andrew is a joint group leader at University College London. Andrew's been on the podcast before, when we discussed his work on deep learning theory on episode 52, and you'll learn more about what James and Weinan do in a moment. The reason I'm speaking with them today is their recent theoretical neuroscience preprint with the title Organizing memories for generalization in complementary learning systems. We've discussed complementary learning systems a few times on the podcast before. The general idea is that we have a fast learning system in the hippocampus that rapidly encodes specific memories.
Paul 00:02:13 And we have a slower learning system in our neocortex, where over time, and through mechanisms like replay from the hippocampus, memories get consolidated. In their new paper, they build on complementary learning systems and suggest that these two learning and memory systems might work better together in a way that optimizes generalization, which is a good thing if you want to function well in our topsy-turvy world. One of the big takeaways is that how much consolidation should happen from hippocampus to cortex depends on how predictable the experiences are that your brain is trying to remember. In unpredictable environments, you want to cut off the consolidation pretty early; in predictable environments, you want to let consolidation run for longer. That's of course a very simplified explanation, which gets elaborated during the podcast. So they built a model to explore that hypothesis, and we discuss many topics around their model and related phenomena.
Paul 00:03:17 I link to the paper in the show notes at braininspired.co/podcast/120. There's also a guest question from Mike, a Patreon supporter who actually pointed me to the work that we're discussing. So thanks, Mike, for the question and for pointing me to it, and thank you to all my Patreon supporters, by the way. During my little post-recording check-in with James, Andrew, and Weinan, asking if everything went okay, James mentioned that it felt a lot like a conversation they would have in their regular weekly lab meetings. So if you're wondering what their lab meetings might be like, here you go. Andrew, James, Weinan, thanks for being on the show. So what we're going to do to start off is I'm going to ask you each to introduce yourselves. Andrew, let's start with you, because you were on the podcast before, when we were talking about deep learning theory, and ever since you were on, of course, you've been emailing me: can I please come back on? And finally we have you back on here. So, Andrew, who are you?
Andrew 00:04:24 Yeah, so I'm a joint group leader at the Gatsby Unit and Sainsbury Wellcome Centre at UCL, and I'm interested in deep learning theory and ways that it can inform theories of psychology and neuroscience.
Paul 00:04:41 James, would you like to go next?
James 00:04:43 Sure, yeah. My name's James Fitzgerald. I'm a group leader at Janelia Research Campus, which is part of the Howard Hughes Medical Institute. I'm a theoretical neuroscientist, and I'm actually very broadly interested in a lot of different things. So in addition to the learning and memory stuff we'll talk about today, I also work on small animals, like zebrafish and fruit flies. And I'm very collaborative, so I like to work with diverse people coming from all sorts of different perspectives. I think that's one of the most fun parts of doing science.
Paul 00:05:10 I'm going to ask about the collaboration in just a minute here. Weinan, hi.
Weinan 00:05:15 Paul, it's great to finally meet you. Yeah, you too. My name's Weinan. I'm currently a senior postdoc in Nelson Spruston's lab at Janelia. I joined about four years ago, after doing ion channel biophysics and synaptic transmission studies for seven years. After joining Janelia, I just wanted to step up and get a bigger-picture, more framework-level way of thinking about neuroscience. So I decided, okay, let me do theory and experiments together. So it has been a pleasure to collaborate with theorists on this project.
Paul 00:05:52 Yeah. So this project is pretty much all theory, right? So the title of the paper that we're going to talk about is Organizing memories for generalization in complementary learning systems. Before we talk about the theory, I kind of want to ask, well, how did this collaboration come about? Anyone can jump in and answer. And tied into that is the idea of Janelia itself. I'm curious, because it seems like Janelia specifically is highly collaborative, so I don't know if that's a factor in this as well.
James 00:06:29 Yeah, for sure. I mean, so I think, you know, as you just heard, Weinan is actually in Nelson Spruston's lab at Janelia, so none of us are actually in the same lab anyway. Janelia labs are kept very small to kind of try to encourage collaboration, the idea being that, you know, no one lab has enough people or all the expertise you would want to achieve a project's goals.
Paul 00:06:51 Is that built in, though, as a principle when forming labs?
James 00:06:55 It is, yeah. That's why they've kept the labs small: to really encourage people to interact with each other and collaborate. So Weinan and I kind of arrived at Janelia at almost the same time, and, you know, at the time my lab was completely empty. So Weinan was really kind of my first main postdoc collaborator at Janelia, even though he didn't have any position in my lab. So, I dunno, Weinan, do you want to reflect a little bit on the early days?
Weinan 00:07:21 Oh yeah, that was an interesting story. So in 2017 I joined, and that's exactly when James joined too. I was in a sort of experimental crisis mode. Like, I'd been doing so many experiments, and I was trying to decide what to do here at Janelia. I joined Nelson's lab to study single-neuron computations, but that's when DeepMind released AlphaGo, and it was having a big splash in the world. I was a longtime Go player, and it just really shocked me how humanlike the moves were. And I just started to think, okay, where can I get frameworks to inform all the data I collected? I think I need to collaborate with theorists, and I need to combine neuroscience and AI. So then James and I started to talk, and I was generally interested in complementary learning systems, the seminal work by Jay McClelland. And then we started talking, and James was interested. Okay, so why don't we just start with modeling CA1, because his recent findings showed that CA1 spines are very transient: they turn over every two weeks.
Paul 00:08:40 In the hippocampus, I'll just interject. In the hippocampus. Yeah.
Weinan 00:08:44 Yeah. So, those quick spine dynamics: James suggested that that might be a form of regularization, like weight decay in machine learning. So that's what got us started. And I don't want to make this too long, but later on, and I remember this really vividly, James drew on the board: okay, do you know what happens when you have noise in the data? If you keep training on the training data, the generalization error, that is, your performance on new data, will first go down and then start to rise up. And I remember, not being an expert on this, I was shocked by that fact. I said, why is that? Okay, you need regularization in order for training to not overfit.
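The curve Weinan describes James drawing on the board is easy to reproduce in a toy setting. The sketch below is purely illustrative, not the paper's model: a linear "student" trained by gradient descent on a fixed batch of noisy linear "teacher" data, with all sizes, noise levels, and the weight-decay strength chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noisy "teacher": y = w_bar . x + noise (all parameters illustrative)
d, n_train, n_test = 50, 50, 1000
w_bar = rng.normal(size=d) / np.sqrt(d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_bar + rng.normal(size=n_train)  # noisy labels
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_bar + rng.normal(size=n_test)

def train(weight_decay=0.0, steps=5000, lr=0.05):
    """Gradient descent on the fixed training batch; track test error."""
    w = np.zeros(d)
    test_errs = []
    for _ in range(steps):
        grad = X_train.T @ (X_train @ w - y_train) / n_train + weight_decay * w
        w -= lr * grad
        test_errs.append(np.mean((X_test @ w - y_test) ** 2))
    return np.array(test_errs)

plain = train()
decayed = train(weight_decay=0.1)

# Test error first falls, then rises as the student overfits the noise;
# weight decay (the proposed role of transient spines) tames the late rise.
```

Here `plain.argmin()` lands well before the last step, and the final test error with weight decay stays below the unregularized one, matching the board sketch.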
James 00:09:30 Oh yeah, and to just interject a little bit, I think that's actually a great segue into how Andrew got involved in all this stuff, which is that Andrew and I go back a long, long way. We were grad students together, we knew each other way back then, and then we were actually postdocs together in the same program. So we've known each other for a very long time, but we hadn't been collaborating at all. Andrew, in his postdoc, was working with Madhu Advani, another author on this paper, very precisely analyzing the amount of learning in a batch system that, you know, would be optimal for generalization. So the importance of thinking about regularization was very much on my mind, in part through interactions with Andrew in a non-collaborative way. But then Weinan and I started to think about the benefits that these types of regularization might provide for learning systems.
James 00:10:19 We decided it would be really fun to see if Andrew would also be interested, because we knew that he had overlapping interests too. And that's when we kind of brought Andrew into the thing. We had Andrew come to Janelia, we planned out our project, and then, you know, we teamed up, and we've been working together for the last couple of years on this. So yeah, in the early days, Weinan and I were modeling, you know, we were thinking about the role of transient memories in the hippocampus and what that might do to aid system-wide function in a complementary learning systems type way. But we didn't actually have any kind of explicit cortical model in those early days, and it was only once Andrew started to get involved that we really started to build an integrated model of the whole system, based on the insights that he's had by thinking about learning in deep learning networks.
Paul 00:11:06 Oh, Andrew, you're not at Janelia, so I don't know if you're being pulled in a thousand different directions by lots of people that want to collaborate with you. So yeah, where's your threshold? How'd you get sucked into collaboration with these guys?
Andrew 00:11:25 Yeah. Well, I was just thinking, I'm glad that you two remember how this started. It's sort of ironic, given that we're studying episodic memory, but I really don't remember quite how it all came together. But yeah, I mean, I guess the memories that are recurring are, yeah, thinking about it, I have this structure that I'm hoping will be replicable across many experimental domains. And the idea is to first come up with some basic insights into how deep networks work, just treating them on their own terms. At that point it looks basically like what a computer scientist or physicist would be doing. And then, if you can get some durable insights into those systems, hopefully that offers interesting hypotheses for research in experimental domains. And in this case, we initially did work on generalization error, and then it seemed like this could potentially shed light on memory consolidation. We had a Cosyne poster on this that didn't go nearly as far as this paper, but it was some early seeds of it. And then I think James and I just took it from there. James, you remember?
James 00:12:34 Well, I couldn't actually reconstruct how this got started, even when I looked at the old emails. But I do remember when you visited us at Janelia, and, you know, at that point presumably we already knew we were planning to collaborate, or maybe you were just going to come and give a talk. But in any case, I remember us taking a walk along Selden Island. So Janelia is right on the Potomac, and one of the odd but wonderful things about it is we have a private island in the Potomac, reached by a footbridge, and it's just wild there; there's nothing there except for like a field. And so Andrew and I were walking, and, you know, Weinan and I had already been working on this stuff, and of course Andrew had been doing his stuff with Madhu. And we were just discussing, like, well, you know, everybody talks about complementary learning systems, and it's kind of implicit in a lot of what people say that they think the point of having the cortex is for generalization, but do people actually realize the danger of overfitting in these systems?
James 00:13:25 And we were kind of debating this back and forth a lot, because on the one hand we were like, well, it seems like they kind of should, if they think about it from the viewpoint of machine learning. But at the same time, it didn't actually seem like anybody had been thinking through the consequences of that: that you then really need to regulate the amount of transfer, that you can't just fully transfer memories from the hippocampus to the neocortex. And then, in terms of Weinan, I can remember on that same visit him getting very excited by the generalization angle. You know, before that visit we were thinking about many different things about the benefits of kind of transient hippocampal traces, also in terms of things like memory capacity in the hippocampus and stuff like that. But I think after that visit, we all kind of consolidated around this fundamental importance of, you know, if you build the system for generalization, there are going to be some new requirements that people have not been thinking through from the perspective of learning.
Andrew 00:14:19 And I think one of the most fulfilling aspects of this collaboration is that at this point the ideas are so jointly held. It really was one of those wonderful times where everyone's riffing off of each other, and somehow it comes out to be this thing that's greater than the sum of its parts.
Paul 00:14:40 Well, that's interesting. I didn't know about the connection, James, between you and Andrew, because I didn't do my homework reading your CVs, I suppose. But I believe it was Weinan actually who first recommended Andrew to come on the podcast way back when, so I know there's a connection there as well. It's great to hear. Well, let's talk about the big idea in the paper, and then we can kind of unpack it from there, because you've already hinted at some of what it's about. So I don't know who would like to start and give a really high-level overview, and then we can talk about complementary learning systems and just go on down the list there.
Andrew 00:15:26 Sure, well, maybe I can. So I guess the jumping off point is this long-running debate about where memories are stored in the brain. And it's a profound question, you know, something people have struggled with for many decades at this point. And the data from neuropsychology are riveting. The patient H.M., for instance, lost his hippocampus and other MTL structures and just couldn't form new memories. But even more striking, if you look at the memories from before he had this resection operation, a lot of those memories were damaged as well. So the damage went back into the past, basically; that's retrograde amnesia. And what that suggests is that there's some process by which memories might initially be stored in the hippocampus but ultimately transfer out, or duplicate themselves, in other parts of the brain. But this just raises all kinds of questions.
Andrew 00:16:29 Why would you have a system set up like this? Why do you need multiple memory systems to start with? Why couldn't you have these memories stored in all places across the brain? And there's a raging debate about this topic. So when we were looking at this, we were trying to find ideas that might have been overlooked, and looking at machine learning, you can see there's this very interesting phenomenon: if you're training from a fixed set of data, a fixed batch of data, and you're going through it again and again, the idiosyncrasies of that data can cause you to learn spurious relationships. So too much learning from the same fixed batch of data can be counterproductive. And we thought maybe this was relevant to systems consolidation. If you think of that batch of data as the experiences stored in your hippocampus, what it's saying is that there's only so much replay you should do to try to encode those memories into neocortex, because if you did too much, you would not be learning the general rule that's in that data set. And so that, we think, can start to make sense of a lot of these empirical puzzles that have been out there.
James 00:17:45 Yeah, so maybe to elaborate on that a little bit. One of the big empirical puzzles is that, because of patients like H.M., there's been what's called the standard theory of systems consolidation, which says that in the beginning everything is encoded in the hippocampus, but then over time everything is consolidated into the cortex and becomes hippocampus independent. And there's a lot of data to support that, from both humans and animals. But over time there's also been a growing body of literature that conflicts with that and suggests that, in both humans and animals, certain types of memories do permanently require the hippocampus. And so there's been a conceptual shift in the field, where people start to think about consolidation not just as something that happens over time, but as something that has to do with the content of the memory. And there are all sorts of conceptual ideas about what that content might be and why it is that certain things require the hippocampus permanently. But again, from this perspective of generalization in neural networks, we thought we might be able to make this very concrete: when you can, and when you can't, take something that was initially encoded in the hippocampus and gradually make it hippocampus independent.
Paul 00:18:55 So you've taken complementary learning systems, which we've already talked about a bit, and essentially the theory is generalization-optimized complementary learning systems. Is that the name of the theory or the model setup, or is that interchangeable?
Andrew 00:19:14 There are actually two levels at which you can view this work. The first is a formalism that lets you model many different kinds of consolidation theories. So we have this particular mathematical framework: you can instantiate the standard theory, you can instantiate generalization-optimized complementary learning systems, and there may be others. And so at that level, it lets you understand the consequences of different choices of how these memory systems interact. But then the one that looks good to us, just looking at the data, yes, is generalization-optimized complementary learning systems.
Paul 00:19:51 So what is the take-home of what's new about generalization-optimized CLS versus the original? You may be repeating yourself, so I apologize.
James 00:20:01 So maybe the way I'd characterize the original CLS idea is that there are benefits from having a rapid learning system and a slow learning system. And a lot of the benefits that were highlighted in the context of the original complementary learning systems idea came from having a fast learning system that can record and memorize examples, which allows the slow learning system to interleave those examples during learning and prevent what they called catastrophic interference. So you'd be able to use the fast system to record the memory as it comes, but then slowly train up a slow learning system based on those experiences. The idea would then be that in this slow learning system, you get some sort of representation that generalizes over the various training examples that you've seen. So in some sense, I think generalization has always been an important part of thinking about the complementary learning systems framework.
James 00:21:05 But what is new in how we set things up is that we have an explicit generative model for the environment, which allows us to consider the possibility that there are unpredictable elements in that environment, or noise, if you set it up in kind of an abstract way. And what we show, which had not been considered in earlier complementary learning systems models, is that in the presence of noise, it's not always ideal for the slow learning system to learn forever. At some point you actually have to stop learning to avoid overfitting to this noise, which, again, from the viewpoint of machine learning makes a lot of sense. But in the setting of conventional complementary learning systems problems, you're learning from data that don't actually have any noise; it's just these very reproducible, very reliable cognitive relationships.
James 00:21:53 And as a consequence, there's no tension between what we'd call generalization and what we'd call memorization: the ability of the system to recall those exact examples versus to deal with, you know, cognitively or semantically similar examples going forward. But once you add noise, you break that, and you actually have to make a choice: what is it that you want this slow learning system to do? Do you want it to be able to faithfully reproduce the past, which is what we would call the standard model of systems consolidation? Or do you want it to do as well as it possibly could in anticipating new experiences from the environment that could occur in the future? And that's what we mean by generalization.
Paul 00:22:33 So I had Uri Hasson on a while back, and he had written this paper, I believe called Direct Fit to Nature. The idea was that our cortex essentially has so many neurons and synapses, AKA so many parameters, that it's so big it can memorize everything, and of course this has been shown in deep learning networks as well. Is that the right way to think about the cortex, then? So, James, what you were just describing was sort of a larger picture, a normative framework for what you would want as a generalization-geared organism. And one of the things I thought in reading the paper was, well, is it right to think of cortex then as trying its damnedest to fit everything perfectly, and there are regulatory systems that are preventing it from this direct fit that Uri talks about? So at the organism or brain level, I suppose we should think about that system as separate, as a kind of control mechanism for this otherwise rampantly memorizing cortex. Sorry, that was a mouthful.
Andrew 00:24:00 Yeah, no, I think you can do better by, for instance, stopping training early, even in a large deep network. And so if this is something that the brain takes advantage of, then it would be generalizing better. There are circumstances, like Uri is saying, where if you have a giant network, you're not really going to overfit dramatically, so maybe it's not a huge benefit if you stop early, but it's still there. And in some regimes that effective early stopping is incredibly important. If the amount of experience is roughly matched to the number of degrees of freedom in your model, then that's the point where you could get a lot of benefit from replaying that data if it's noise free, because you could just perfectly determine what the whole mapping should be. But if it's noisy, it's also the point where you can do as badly as possible.
Andrew 00:24:54 And so regularization is very important. And maybe just to highlight with an example, because these things can sound abstract: the patient H.M., for instance. Suzanne Corkin, the MIT professor who did a lot of work with him, was asking him if he had any memories of his mother, specific memories of his mother, and his response was, well, she's just my mother. And, you know, I think the point there is that this is someone who had his entire cortex intact, right? And he could not come up with a specific memory of even his parents; it's true of his father as well. But he did know all kinds of more reliable facts. He knew that his father, for instance, was from the South, and things like this. And so there's this interesting tension here, where the quality of the memory, the type of memory that you can put in neocortex, seems to be very different.
Andrew 00:26:01 And we think this theory explains some of that, because there are certain components of a memory, or a certain scenario, that you can transfer: the fact that mom is mom is always true, very reliable. But then there are other features of memory which can be very idiosyncratic, like what you did one specific Christmas. So he knew that Christmas trees were things that happened at Christmas time, right? But he didn't have a specific memory of one specific Christmas and what they did. And that's what we're proposing: explaining the character of this transformation as being aimed at generalization and flowing from these properties of learning in networks.
Paul 00:26:48 Maybe we should get into the model, the three models used, and just the whole setup. I was going to say experimental setup, the whole modeling setup. So, Weinan, do you want to describe the different kinds of models used that are supposed to represent different brain areas? Although you use a different vernacular in the paper, because you talk about how these could map onto other brain areas, or it's amenable to other brain areas. So you use student, teacher, and notebook in the paper. Do you want to talk about what the models are and how they map on?
Weinan 00:27:25 Oh yeah. So we thought about how to formalize this learning problem of systems consolidation. To do that, think about a brain that can learn things from the environment. And what is the environment? It can be viewed as a data generator: it produces some kind of input-output mappings, maybe very complex functions. And we want to replicate that with a very simple generative model. In this case, it's a shallow linear neural network, so it just transforms an input vector into a single scalar in most of our simulations. This transformation generates data pairs, x and y input-output pairs, to feed to another similarly architectured student network. So it's another shallow linear network that can take the noisy data generated by the teacher and learn to represent the mapping of the teacher. And the student's learning is aided through a memory module, similar to the external memory idea in current AI systems, that is modeling the hippocampus.
Weinan 00:28:36 So it's a Hopfield network that's bidirectionally connected, all to all, to the student. Its job is really to capture ongoing experiences by one-shot encoding through Hebbian learning. So the hippocampus is supposed to be learning really fast. And then, after capturing that, it has the ability to undergo pattern completion offline, so it can just randomly search for a previous memory and reactivate the student through the feedback weights. This essentially is modeling episodic recall. So you can offline replay what the student was seeing when the teacher gave the student the example. It's like a notebook: you're just reviewing what the teacher said, essentially. And by doing this offline reactivation, the student can learn much more efficiently, as we later show in the paper. So that's roughly the three...
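A minimal sketch of the "notebook" Weinan describes, assuming textbook Hopfield machinery (binary patterns, one-shot Hebbian outer-product storage, sign-function dynamics). The sizes and corruption level are made up, and the paper's actual network, sparseness, and student coupling differ.

```python
import numpy as np

rng = np.random.default_rng(1)

# One-shot Hebbian encoding of a handful of binary "experiences"
N, n_memories = 200, 10
patterns = rng.choice([-1, 1], size=(n_memories, N))
W = patterns.T @ patterns / N
np.fill_diagonal(W, 0)  # no self-connections

def pattern_complete(cue, steps=20):
    """Offline recall: run Hopfield dynamics from a partial/corrupted cue."""
    s = cue.astype(float)
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1.0
    return s

# Corrupt 20% of one stored memory, then let the notebook complete it
cue = patterns[0].copy()
flip = rng.choice(N, size=N // 5, replace=False)
cue[flip] *= -1
recalled = pattern_complete(cue)
overlap = (recalled == patterns[0]).mean()
```

In the model, a pattern recalled this way reactivates the student through the feedback weights, so each offline replay event is effectively another training example drawn from the stored batch.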
Paul 00:29:34 Neural networks, yeah. So you have the environment, which is the teacher; you have the cortex, which is the student; and you have the hippocampus, which is...
James 00:29:46 Just to emphasize one thing: one important difference between the teacher and the student is that the teacher has noise. And because the teacher has noise, what that means is that the mapping provided by the teacher may or may not actually be fully learnable by the student, and the magnitude of that noise is a critical parameter that determines the optimal amount of consolidation in this framework.
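James's point, that the teacher's noise magnitude sets the optimal amount of consolidation, can be illustrated with early stopping in the same toy linear student-teacher spirit. All parameters here are invented for illustration, and "amount of consolidation" is crudely proxied by the gradient-descent step at which generalization error bottoms out.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 40, 40  # student parameters and stored examples (illustrative)

def best_stop_time(noise, steps=3000, lr=0.02, trials=10):
    """Average training step at which error on fresh inputs is lowest."""
    stops = []
    for _ in range(trials):
        w_bar = rng.normal(size=d) / np.sqrt(d)   # the teacher's mapping
        X = rng.normal(size=(n, d))
        y = X @ w_bar + noise * rng.normal(size=n)
        X_new = rng.normal(size=(1000, d))
        y_clean = X_new @ w_bar                   # noise-free targets
        w = np.zeros(d)
        errs = []
        for _ in range(steps):
            w -= lr * (X.T @ (X @ w - y) / n)
            errs.append(np.mean((X_new @ w - y_clean) ** 2))
        stops.append(int(np.argmin(errs)))
    return float(np.mean(stops))

# A noisy (unpredictable) teacher favors cutting consolidation off early;
# a noise-free (predictable) teacher rewards consolidating indefinitely.
early = best_stop_time(noise=2.0)
late = best_stop_time(noise=0.0)
```

With no noise, the best stopping point is simply the last step; with substantial noise, it moves far earlier, which is the simplified prediction described in the episode intro.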
Weinan 00:30:14 Just to highlight one more thing: you asked what's new compared to the original CLS framework. We have an explicit notebook in this model that's directly connected to the student. I think some of the early CLS works just kind of replayed training examples, not by storing them in a neural network, but by just replaying the representations. And this has generated some really interesting insights that we can talk about later, like how having a distributed binary Hopfield network reactivate the student can have some very interesting interference-robustness properties for training the student.
Paul 00:30:51 Great. Andrew, I was going to ask: so you guys are using linear networks, although these are shallow linear networks, and we talked all about your deep linear networks last time you were on. Why the linear networks in this case? Is it just to have principled theoretical tractability?
Andrew 00:31:09 Yeah, I mean, I hope one day we’ll have nonlinear ones, but all of the qualitative features that we wanted to demonstrate came out with shallow linear networks. So it’s just learning linear regression, right. And so my impulse, and I think it’s shared by James and Weinan to some extent at least, is to go as simple as you possibly can and still get at the essential, and what you get in return is greater tractability. So another feature of this framework is that most of our results are sort of mathematical demonstrations, and so, at least for me, it’s easier to get one’s head around them. And another thing that this very simple setting enables is that we can make clearer normative claims: we can optimize everything about these settings. How well could you possibly do if you just had a notebook, or if you just had the student? And then we can show that yes, indeed, you really do do better when you have both interacting.
James 00:32:11 No, I was just going to say, just to add to that: I think another thing that’s really powerful about setting it up in this very simple way and being able to analyze it so comprehensively is that, as we kind of alluded to earlier, one of the big challenges in memory research is to figure out what the key quantity is that determines whether a memory is going to be hippocampus-dependent or not. And within this kind of modeling architecture, we can really solve that problem from the viewpoint of what would optimize generalization. And then going forward, Weinan is an experimenter, so we can actually design experiments directly around that parameter and test the theory very rigorously, asking whether or not this actually does provide empirically meaningful predictions beyond just the theoretical insights. And I think that gets harder and harder the more complicated the model becomes: to really boil down what the critical parameter is and to design an experiment that embodies it.
Paul 00:33:06 Oh, no, Weinan, you’re going to be stuck in experimental crisis still. You’re trying to get out of that.
Weinan 00:33:12 No, I think it’s the perfect combination: theory, and then do the experiment.
Paul 00:33:17 Okay, alright. So who wants to talk about how the model works to generalize the right amount?
Andrew 00:33:26 So the setting that we’re looking at is sort of like this: imagine you’re doing a psychology experiment for an hour and you see a bunch of experiences over the course of that hour. And then you go home, and over maybe many days you have whatever you stored during that hour, and perhaps the notebook could replay that information out to the student to learn from it. And then after some period of time we bring you back into the lab and we test your memory. So it’s this sort of “up front you get a batch of data, how do you make the best use of it” scenario that we’re analyzing. And generalization for us just means: when you come back to the lab, how well will you do on new problem instances drawn from the distribution of problem instances that you were seeing that first time?
Andrew 00:34:23 So it could be you’re learning to distinguish dogs and cats or something like this, and then we show you new images of dogs and cats; how well do you do on that? And the key feature of the framework, just as in deep learning theory (it really is building directly off of deep learning theory and the double descent phenomenon), is that there’s an optimal amount of time that you can train on a fixed batch of data, because otherwise you start picking up on aspects that are just noise. And so, as the predictability of the rule that you’re trying to learn increases, you can train for longer and longer, and you can characterize sort of exactly how long. But that’s the basic idea: as things get more predictable, you can train for longer.
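The early-stopping intuition Andrew is describing (a noisy teacher makes prolonged training on a fixed batch harmful to generalization) can be sketched with plain linear regression trained by gradient descent. All sizes, the noise level, and the learning rate below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
D, P = 50, 40          # input dimension, number of training examples
w_teacher = rng.standard_normal(D) / np.sqrt(D)
noise = 1.0            # teacher noise: part of each label is unlearnable

X = rng.standard_normal((P, D))
y = X @ w_teacher + noise * rng.standard_normal(P)

X_test = rng.standard_normal((2000, D))
y_test = X_test @ w_teacher   # generalization is measured against the noiseless rule

w = np.zeros(D)
lr = 0.01
test_err = []
for t in range(2000):
    w -= lr * X.T @ (X @ w - y) / P   # gradient descent on the fixed batch
    test_err.append(np.mean((X_test @ w - y_test) ** 2))

best_t = int(np.argmin(test_err))
# Test error dips and then rises again: there is an optimal training time,
# and training past it fits the teacher's noise.
print(best_t, test_err[best_t], test_err[-1])
```

Raising `noise` moves the optimal stopping point earlier; with `noise = 0` the rule is fully learnable and you can train as long as you like, which is the qualitative dependence on predictability described above.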
Andrew 00:35:17 And if you can train for longer, you can also memorize the exact examples you’ve seen, so your memory error is decreasing. And that means that more of the specific memory would transfer into your cortex and not just live in the notebook. Maybe one other thing to throw in here before I let someone else jump in: you can compare this, because there are different ways you could generalize. You can try to use the student network, but you could also try to use the notebook. You can just say, let this Hopfield network complete whatever pattern you give it and make its prediction. And one important result here is that in high dimensions, that strategy fails completely. Basically, if you think of high-dimensional vectors, the geometry is very different: any new input is almost surely orthogonal to all of the inputs in all of the experience that you’ve had previously. And because of that, it doesn’t let you generalize. So it’s interesting: you need this notebook to store these examples so that you can replay them to the student, but ultimately it’s the student that’s going to be able to generalize well, and not the notebook.
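The high-dimensional geometry argument is easy to check numerically: random vectors become nearly orthogonal as dimension grows, so a lookup-style notebook has almost no overlap between a new input and any stored experience. A small sketch (the dimensions and counts are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)

for N in (10, 100, 10_000):
    stored = rng.standard_normal((50, N))  # 50 previously stored experiences
    new = rng.standard_normal(N)
    # cosine similarity between a new input and every stored one
    cos = stored @ new / (np.linalg.norm(stored, axis=1) * np.linalg.norm(new))
    print(N, np.max(np.abs(cos)))
```

The largest overlap shrinks roughly like 1/sqrt(N), so in high dimensions even the best-matching stored pattern is nearly orthogonal to the new input, and pattern completion from the notebook carries essentially no information about it.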
Paul 00:36:29 Maybe this is a good time; I have a listener question. So predictability is a key aspect of the generalization performance, and with different levels of predictability, the generalization needs to cut off at different points. Well, you know what, I’ll just play the question. This is from Michael till doll, and then we can back up and talk about the bigger issue after the question, if needed.
Speaker 5 00:36:57 In the discussion section, it’s suggested that replay could be the mechanism that regulates generalization to the neocortex, which seems very probable. But the thing I’m still missing is: do you have any ideas around how the predictability of an experience is determined, as that seems to be a key parameter in the theory?
Paul 00:37:16 Okay. So I know that’s a little ahead of the game here, but I didn’t want to miss the opportunity to play the question before one of you started answering it on your own.
Andrew 00:37:27 Yeah, no, that really is such a good question. And we don’t address that in this paper. What we say is: imagine you had an oracle which could tell you exactly how predictable this experience was; what should you do to be optimal? But we don’t explain how you could estimate that. We do think there are ways you could potentially estimate it, but it’s not part of this theory at present. We’re just saying: suppose you were able to know the predictability. What would that then mean for systems consolidation?
Paul 00:38:05 I was going to reiterate the problem, which is that predictability needs to be estimated by some system to regulate the generalization process.
Weinan 00:38:14 Yeah, just to chime in, this question is such an important one. People always ask: okay, that’s great, but wait, how do you actually estimate the SNR of the experience? A priori, if you get a new batch of data for the first time and you’re learning from no previous knowledge, there’s no way to know whether this batch of data is predictable or not. So you kind of have to learn that through trial and error, but the trial and error can happen on a long-term, evolutionary scale or within a lifetime. So maybe some animals already have built-in predictability estimators from birth; maybe there’s something like that in humans, like certain facial features, or certain animals that, if you see them, the brain will just treat as reliable signals no matter what. Or, within our lifetime, I think probably the main way we learn how to estimate predictability is through lifelong meta-learning.
Weinan 00:39:15 So when you are a child, you experience a lot of things, and you make predictions all the time. And then gradually you learn what sort of information is good to consolidate. I always give this example: people typically know the reliability of a source of information. So for example, I give you an article from the New York Times, and then I give you an article from The Onion, and you get a really visceral feeling about which one to trust more. Another example is my daughter. I see that she has almost no idea what’s predictable and what’s not. One time I was holding her while I was cooking, and she just wanted to touch the hot stove. I said, okay, that’s not a good idea.
Weinan 00:40:05 You’re going to get burned. And she still reached toward the edge a little bit, so I just said, okay, if you don’t listen to me, you can go ahead and try it. And she touched it, ouch, and then she turned her head back to me. The look in her eyes, I think, was: okay, I really need to listen to this dude in the future. I think that’s when the meta-learning is occurring in her brain, assigning different predictability to different sources. We all trust the authorities in our lives, like teachers, parents, and the friends we trust. Another key aspect is that a lot of people think more frequent things should be consolidated because they’re more reliable, but our theory really decouples predictability from frequency. Nowadays, you know, there’s frequent misinformation online, and it’s not the quantity that can overwhelm your brain and determine what gets transferred. Something a friend you trust told you even once can make a long-lasting impact, but some news outlet can give you the same story again and again and you will not transfer it. So I think that’s a key thing: we learn predictability through experience, through meta-learning.
Paul 00:41:31 But those experiences need to be essentially stored, right, in some system, to be able to be used again. And so is that just a different kind of memory? Is that more of an implicit, procedural memory, outside of the hippocampal-neocortical complementary learning systems framework? Or do we know? Or does it matter?
James 00:41:52 Yeah, that’s a good question. I mean, in our model, the notebook does kind of create a record of the whole memory. And so using the hippocampus, or the notebook in the mathematical framework, you can reactivate the cortical patterns corresponding to that full memory. But that isn’t, I think, part of what we mean by complementary learning systems; you already had the ability to do that in the notebook, so maybe you don’t need to create a new cortical system to do the same thing. The idea would be: well, what can’t the notebook do? And what the notebook can’t do is generalize well to new examples. And to go back to one of Weinan’s points earlier, I don’t mean to say that the neocortex can’t actually aid in memory itself. In fact, there are some examples within our framework where the cortical module is actually able to memorize even better than the notebook.
James 00:42:47 But we think that the really fundamental thing that’s missing from that sort of faithful reproduction of the past, which the notebook can do, is the ability to generalize well. And the amount of consolidation that would optimize for that generalization depends on the predictability. As we’ve been describing, it is a very important unknown within our modeling framework how precisely this gets done. But we think that you first have to recognize that it needs to be done before you think about how it is done. And so in the earliest experiments that we could design and test, we can just configure things so that the predictability is set by us, and then see to what extent the brain in fact does regulate the consolidation process based on that predictability, and then get more into the algorithmic and mechanistic details of how that degree of predictability is determined, and, once determined, how that leads to regulation of the consolidation process itself.
Andrew 00:43:46 There are some quite compelling experiments showing that individuals do sometimes misestimate the predictability, or maybe it’s not even fair to say misestimate; they just estimate it differently. So there are individual differences. Weinan, do you want to explain the generalizer versus discriminator experiment?
Weinan 00:44:08 So I think a key thing about predictability is that, in a way, it’s not a universal, objective predictability; it’s really the inferred predictability. I’m not sure if you guys agree, but my thought is that it’s really about how the agent, how the animal, thinks about what the predictability is. And that depends on a lot of things. So there’s an interesting set of rodent experiments in fear conditioning, where there are real individual differences in the policies of animals doing the same task, whether they generalize or not. So there are certain animals that, if you shock them in a cage, for example, and bring them back two weeks later, will show high freezing; there’s a fear memory. But if you take the same set of animals into a different but similar cage to test their fear generalization, only around half of those animals will freeze.
Weinan 00:45:10 They’ll start to generalize across the different cages, but the other half will just maintain their discrimination between the two cages and not freeze; they know this is a different cage. Surprisingly, for the generalizers, the ones who froze in both cages, the memory is not dependent on the hippocampus. So there’s this evidence that the generalizers do treat the original context as a high-SNR context, and that promoted generalization and systems consolidation, so the memory actually becomes hippocampus-independent. But for the other group, the ones still maintaining the discrimination, lesioning the hippocampus actually impaired the original memory. So I think that means those animals are treating the environment as probably a low-SNR task, and that memory will not transfer, so it maintains its hippocampal dependence. And in figure five of our paper, we have a diagram showing that, okay,
Weinan 00:46:11 even within a single experience, in a single animal, different cognitive processes can change the SNR of the data. So for example, you have a whole scene, and the animal can use its covert or overt attention to focus on only the part of the scene that has predictive value for the outcome. Maybe the animal just pays attention to the general features in the experimental room: the smell may be similar, and the experimenter may be wearing the same lab coat, and that’s highly predictable. On the other hand, if you focus on the different patterns on the wall, that’s highly idiosyncratic, so that will be low SNR. So I think attention is a very key thing, both in determining the signal-to-noise ratio and in regulating consolidation. I just want to add one last thing about the implementation of this regulation, which we suggest in the discussion: replay might be a natural way to do this.
Weinan 00:47:12 You just regulate the amount of replay to modulate consolidation. But it has been shown that replay actually functions in a variety of different ways, for example deepening attractors in the recurrent network, or keeping the hippocampus in register with the neocortex. So it might not be beneficial to stop replay altogether just to prevent overfitting. The brain might still be using replay to repair all the memories, and then you have a predictability module, for example the PFC, that can control offline which part of the cortex gets activated and enable learning in an offline-attention manner. We have this amazing ability, for example, to close our eyes and navigate within our memory, focusing on certain aspects of things. Maybe the brain could tap into that mechanism to mask certain memory components during offline replay for consolidation.
Paul 00:48:18 One of the things that you said: so I kind of have two questions in one here. One is, there are certain situations where I don’t care about predictability because I have to climb up the mountain before I fall, or whatever the classic example is, escape the lion or something like that. Right. And in that case, I guess you would predict that your predictability regulatory system just gets overridden, perhaps because you don’t need to consider how predictable the data is, or maybe it just automatically happens. Because you’re going to remember that event, probably, unless you die.
Weinan 00:49:00 That’s another really, really good question. Many people ask us about our framework: what about emotionally salient memories, something really surprising and really novel? How is that related to the idea of predictability? I think it’s important to keep these two concepts orthogonal to each other. So for example, emotional memories could either be highly idiosyncratic or they could be predictable. What I think the emotional salience is doing, and I’m not sure how much data support there is, is biasing the memory retention of certain memories. So for example, you climbed the mountain and you made a mistake and it was really dangerous, or something surprising happened that’s pretty random. That surprise factor may enhance the memory retention in your hippocampus, and then you can remember that memory for a long time. You can tell the story maybe 20 years later: okay,
Weinan 00:49:58 back 20 years ago I had this terrible event. But then which components get transferred to the neocortex is determined by the predictability of the different memory components, on top of that. So I think that’s almost a first filtering process for which memories survive: we forget almost all of our episodic memories in a few days, and only a few get encoded. Maybe those are held in the hippocampus and weakened by, who knows what. But it just seems that we forget most things, and which things we do remember is modulated by some other neuromodulatory process. Our theory builds on top of that: for the memories that get retained for long-term storage, which components actually route to the neocortex and which components should stay in the hippocampus is determined by the secondary factor, the predictability.
James 00:50:59 I think your question also brings up another interesting and subtle point about predictability. You know, we introduced the teacher as an environment, right, as some sort of generator of experiences. But that’s actually pretty abstract, because all the brain knows is the brain’s activity. And so really what the teacher is, is a generator of brain activity based upon, for example, sensory and motor experiences in the world. So if you think about the teacher not exclusively as an environment, but just as a generator of neural activity, then of course your cognition itself can also contribute to part of what the teacher is, because the teacher is just whatever it is that leads to the patterns of activity that the student is trying to learn to produce without, for example, the involvement of the hippocampus or other modules.
James 00:51:53 And so if you think about the teacher in this way, it could be that, for example, some of these highly emotionally salient or very memorable events are in some sense very predictable, just because you think about them a lot. And because you think about them a lot, you actually do recurrently get these patterns of neural activity that you may want the student, the neocortex, to start to be able to produce on its own. And we think that this may have to do with why human patients, for example, are sometimes able to reproduce highly detailed aspects of their past life that seemed to be highly unpredictable. We think the reason for this is that those patterns were so reproducible, based on the experiences of that person or, for example, on the thinking patterns of that person, that they started to be consolidated, because they started to be predictable in this more general sense of not just what happens in the environment, but what happens in your mind.
Paul 00:52:49 So is it too simple to map the teacher onto the perceptual systems, perceptual cortex, for example?
James 00:52:58 Yeah, I think it could be. I mean, we do think about the teacher, in the simplest setting, as just the perceptual systems. That is the kind of example we provide in the current paper, and that is the setting that is guiding our initial round of experimental design. But I do think that when it comes time to really understand how these abstract neural network models map onto real human cognition and neuroscience, that is too simple. I believe you do need to consider more broadly what it is that’s leading to patterns of neural activity across the entire cortex.
Paul 00:53:35 Yeah. But then you would have the problem that the sensory cortex should also be getting trained as a generalization optimizer.
James 00:53:45 Absolutely. And in fact, that’s a very important part of how we think about it. From the viewpoint of the abstract neural networks, we have very well-defined inputs and outputs. But when we actually think about what that means in terms of the brain, we’re thinking about the neocortex as if it’s some kind of autoencoder, where activity is generated by your sensory and motor experiences in the world and your cognitive processes, and what you’re trying to do is build a cortical system that is able to reliably reproduce those patterns going forward into the future, without needing sensory inputs, without needing the involvement of other parts of the brain. And in this point of view, as you said, some low-level sensory area is not only the input of this framework, it’s also the output. Many of these relationships have to get learned simultaneously, and each one of these relationships could have a very different degree of predictability. And as we emphasize in the paper, based on that high variability in the degree of predictability, there should also be high variability in the amount of consolidation across different types of synapses within the cortical network.
Paul 00:54:57 So, Weinan, I mean, you posited evolutionary architectural constraints versus meta-learning earlier, when talking about how to regulate the system. So you think that there’s room for both, I suppose? Yes. I don’t want to get into it; yeah, no, that’s okay, I just wanted to make it clear. So, wait, so nature and nurture are both factors.
Paul 00:55:30 So one of the things that’s fairly attractive about complementary learning systems is the idea that when you have complementary systems, the whole is greater than the sum of its parts. And actually Steve Grossberg calls this a complementary computing paradigm. He thinks of multiple processes in the brain acting like this: when you have two things working in parallel, neither of which can do well on its own, but when they are paired together, they give rise to what he calls a new degree of freedom, an extra degree of freedom. How would you describe that, in terms of the whole being greater than the sum of its parts, with this Go-CLS architecture?
James 00:56:19 Yeah. So I think the whole is greater than the sum of its parts in at least a couple of interesting ways within our current Go-CLS framework. One: as Andrew mentioned earlier, we’re able to determine what the optimal learning rate for generalization would be if all you had was the student, so no hippocampus or notebook you could use to recall the past. And because we could treat that problem optimally, we could show very rigorously that when you put the two systems together, you do in fact get a cortical network that generalizes more efficiently from the data than you could from online learning. So there’s a really fundamental advantage: if you’re going to have some finite amount of data, you can just make better use of it, period, if you have the ability to record it somewhere and recall it subsequently to guide learning.
James 00:57:08 So that’s one way. Another way, which also came up a little earlier, is that even when it comes to memory there is a benefit, at least if you have a small notebook, a small hippocampal system. What we were able to show there is that in that setting, you actually get some errors when the hippocampus or the notebook is trying to recall those memories. But what’s really amazing is that the nature of those errors is interference with other memories. And so, if what you’re actually trying to do is train the student to memorize, then you can actually do better training on those noisy reactivations than the noisy reactivations themselves would suggest. What ends up happening, quite counterintuitively, is that the training error, the memory performance of the student, can actually outperform what you could get from the notebook alone. So yet again, for both of these cognitive functions, for both memory and generalization, the system works better when you put the two parts together.
Andrew 00:58:09 And just to elaborate on one small piece of that: it also clarifies the regime where you get the benefit. There are regimes where online learning, sort of just having the student, and the replay strategy will look very similar. That’s when you have tons of data: if you have tons of data, it doesn’t matter, you get to the same point either way. Also, if your data are very noisy, then even in the limited-data regime the gap can be fairly small. So I think it delineates the regime, the sorts of worlds basically, where this dual memory system is the most useful. And it happens to be when you have a fairly moderate amount of data and that data is quite reliable; then you see a big advantage from replaying a lot. And arguably that’s a setting a lot of real-world experience falls into.
James 00:59:06 Yeah. And just to follow up on that, what’s incredible as well is that that same regime is where the risk of overfitting is highest in the model. So what’s really interesting, if you step back from the details of the model and think about what it might mean: we’re basically saying that if you’re in the regime where having these two learning systems is complementary, you’re also in the regime where, if you don’t regulate, you’re going to overfit. And we think this is a really important conceptual point, because we do know that biology has built multiple memory systems in the brain, and one of the lessons we can take from our artificial neural network is that the fact that it’s built suggests that it’s good for something. And at least within our framework, it’s very rigorously true that when it’s good for something, you absolutely need to regulate; you can’t just transfer everything to the cortex.
Weinan 00:59:53 Yeah. So this is going back to James’ point about having a notebook replaying samples. Because the notebook has limited capacity, it shows some interference, so the replayed examples are not exactly the training examples; there’s some error. And this error, surprisingly, as James mentioned, does not hurt training the student. I think this is a great story, because it’s so counterintuitive. I was running the simulations, with James and Andrew here at Janelia, and I showed them the curves for the first time. The notebook reactivation has a little bit of error due to interference, and we use this reactivated data to train the student. And it turns out the student training error can drop below that reactivation error. And that’s really strange: you are using the notebook-reactivated examples as labels, and you’d think you can’t possibly do better than the reactivation itself.
Weinan 01:00:53 And later on, that led to a lot of mathematical analysis. To me this is so powerful. They questioned me: okay, wait, you should check your code for bugs. So I did something crazy. I said, okay, no matter what I do, I’ll get a lower error from the student. So what if I just generate random activities in the notebook? And to my surprise, that still trained the student perfectly, with some change in the learning rate. So I think there’s something deep here: maybe, for listeners building future generative models of memory, the reactivation has certain properties that can enhance generalization. Let me give you an example. In machine learning, I later discovered a connection to a data augmentation method called mixup: for example, you’re training on ImageNet or MNIST, and you linearly combine different training examples together and also combine the output probabilities together, adding them together, and you only use the mixed examples to train the network.
Paul 01:02:05 The images themselves are mixed? Like, one image would be a mixture of two separate images?
Weinan 01:02:11 Yes. So you just randomly sample, say, two images and stack one and two together, and then the output probability, if you have 10 outputs, would be 0.5 and 0.5. And you only use these: you get rid of the original pictures and use only the composite images to train the network, and it trains equally well, and sometimes even better. This has become a very powerful data augmentation, even in modern transformers, where it improves performance. And it has also been used, really interestingly, for data privacy. There’s a Princeton study showing that hospitals with records for training some kind of predictive AI model, because of data privacy issues, just randomly merge the patients’ data together and merge the outputs together. It still trains the model equally well, while masking the original data.
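A minimal sketch of the mixup augmentation being described (the `alpha` parameter, shapes, and function name are illustrative of the generic recipe, not code from the paper): inputs and one-hot labels are convexly combined with random partner examples, and only the mixed batch is used for training.

```python
import numpy as np

rng = np.random.default_rng(3)

def mixup(X, Y, alpha=0.4):
    """Return a batch of convex combinations of random training pairs.

    X: inputs (n, d); Y: one-hot labels (n, k). Only the mixed examples
    are returned, mirroring training on composites instead of originals.
    """
    n = len(X)
    lam = rng.beta(alpha, alpha, size=n)   # per-example mixing coefficients
    j = rng.permutation(n)                 # a random partner for each example
    X_mix = lam[:, None] * X + (1 - lam[:, None]) * X[j]
    Y_mix = lam[:, None] * Y + (1 - lam[:, None]) * Y[j]
    return X_mix, Y_mix

# Tiny demo: 4 examples, 2 classes.
X = np.arange(8.0).reshape(4, 2)
Y = np.eye(2)[[0, 0, 1, 1]]
X_mix, Y_mix = mixup(X, Y)
print(Y_mix.sum(axis=1))  # each mixed label is still a valid probability vector
```

The analogy drawn in the conversation is that noisy, interfering notebook reactivations act like these composites: the labels the student trains on are blends of true examples, yet training on the blends can still work as well as, or better than, training on the originals.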
Weinan 01:03:10 So I think this is an interesting connection. The brain is capacity limited, and if you want to store some previous experiences to train up your neocortex, you have this very flexible ability to change your input data and still train the neocortex well. I think the brain might be tapping into this mechanism and have somewhat strange ways of generating examples, instead of just replaying training examples one by one, exactly. There might be random mergers of memories, and that has been supported by experimental data in the hippocampal field showing that certain memories are just prone to errors. For example, you imagine the house you left yesterday, and related things get wrongly placed in the scene: you remember a car parked right in front of you, but it was not there. So this leaky interference between memories could actually be helpful in training the neocortex, because a car is a car, more or less independent of which scene it is in, so this kind of interference might not be so bad for training. And that probably reflects a certain compositional nature of the world: you can have different things merged together and still get good training signal.
Paul 01:04:32 This may be a trivial question, but how do humans perform on the mixed ImageNet dataset?
Weinan 01:04:40 I think we can't tell them apart.
Paul 01:04:43 Really? So then does that run counter to the story you just told? Unless, presumably, that's just an inherent difference with deep learning networks, which we're going to talk about here in a second anyway.
Weinan 01:04:57 Yeah, so that perceptual example might be different. But if you really put different objects together, humans have an amazing ability: if I put, say, a cup, an apple, and, weirdly, a car in the same scene, versus seeing them apart, I think humans will have an easy time picking out how many objects are in the scene and giving each one a label.
Paul 01:05:18 Oh, I thought the pictures were blended, like where you take the RGB values and... oh, they are? So they're not compositional pictures, they're literally blended, where we wouldn't really be able to perform well. Is that true?
Weinan 01:05:32 Yeah, I'll have to look at that. I need to think about this more, but maybe there's something there.
Paul 01:05:37 So predictability is a key, or the key. Is the world really predictable, or is it super noisy? I'm trying to think whether I've ever been in a situation that was completely unpredictable, maybe early on when I was dating. Are there situations where something is just truly unpredictable, and if so, how does the network handle that?
Andrew 01:06:04 Yeah, it's a great question, because so far we've basically been talking about unpredictability as noise, where there's real randomness, coin flips. But in fact, all the same phenomena occur in completely deterministic settings; the equivalent is that the teacher is more complex than the student. If there's something the student can't possibly represent about the teacher, and I think that is definitely a reasonable assumption about the world, since it's very hard to imagine we could predict everything about a physical situation, then that unlearnable component, which in learning theory we call the approximation error, looks like noise from the perspective of the student. And that judgment is completely relative to the student: what can the student actually do? If the teacher is more complex than that, you'll see the exact same effects, the same over-training, all of the same behavior in the learning curves. Another version is that maybe you have a completely deterministic, completely predictable world, but you just don't observe the full input. Imagine the teacher has a hundred input nodes, whereas the student only has fifty. Now the remainder looks like unpredictable information from the perspective of the student, and that will have all the same properties as if it were really a noisy environment. So there are several forms of unpredictability that behave similarly, and they would require the same regulation of transfer between brain areas.
Weinan 01:07:55 Yeah, I just want to add to that. I was trying to think about this intuitively, with some silly examples. For the noisy teacher: imagine someone going to a casino in Vegas, playing roulette, picking the number 27, and putting a lot of money on it. He loses $10,000. Unregulated consolidation would mean you treat number 27 as a bad number forever; you learn that you should never pick 27, even though the outcome was random. And that could be detrimental to your future. For example, if you are dating a girl whose birthday is August 27th and you say, no, no, I'm not going to date this girl, that's going to be bad for generalization.
Weinan 01:08:43 The second example is about the complex teacher. This really gave me insight into why a lot of episodic memories are kept in the hippocampus and require the hippocampus: a lot of it is because the world is so complex. Just imagine you're on a street with different things happening, a lot of them independent processes, each with its own causes and effects. Predicting such complex interactions between so many things is generally impossible for a human brain. An intuitive example is that in movies you often see a tragedy where everything goes wrong at once: the person gets cancer, then gets hit by a truck, then something else happens, and the actor starts to cry: why, why me?
Weinan 01:09:38 At those times it might not be beneficial to really consolidate such complex events. It's better just to remember them, because if you overgeneralize from those complexities, it's going to hurt generalization. And the last case is partial observability. Most of the time our perceptual access to events is really limited. The traditional wisdom is that you should put things into context and not judge by first glance: someone behaves a certain way and you get offended, but maybe that person is having a really, really bad day. You can't conclude from that partial observation that this is just a grumpy, unfriendly person. Maybe you keep that as an episodic memory, and you can build a more accurate representation of this person through long-term interactions. So I think about unpredictability in those three ways.
Paul 01:10:41 Can I throw one more concept into the mix? This seems to me related somehow to concepts like Herbert Simon's satisficing and bounded rationality, and Kahneman and Tversky's heuristics, the use of heuristics in good-enough scenarios. Have you thought about how your work, and its results and implications, overlaps with those sorts of concepts?
Andrew 01:11:10 Yeah, bounded rationality is interesting, because we are making this oracle assumption about predictability, and we're optimizing all parameters in this setting, but we've constrained the system to be this particular neural network with its inherent limitations. So I wonder if that is a version of, I guess, bounded architectural rationality: there's something baked into the architecture that will only take you so far. And in terms of heuristics, I guess you could view it similarly, that you may be forced into a simpler solution to what is actually a complex problem, just because of the resources the student has available to it. But I haven't really thought it through in that connection.
Paul 01:12:02 James, it looked like you were going to add something.
James 01:12:05 No, I think that was a good answer; I wasn't going to add much more than that. I was just going to bring up the same points: this notion of unpredictability, as it relates to approximation error, is really the idea that the cortical network may only be able to do so well. That could have to do with the architecture of the network, or, for example, with what it's able to learn, the learning mechanisms in that network. So it is a notion of bounded rationality, for sure. But how closely it relates to the more famous notion of bounded rationality is a very interesting and deeper question that is harder to answer at the moment.
Paul 01:12:47 Yeah. Because one of the great things about heuristics is that, although they fail in many ways, they're also beneficial in many scenarios. That's why I was wondering: you have to have this predictability estimator, and it needs to be beneficial for the organism. And heuristics, for all their failures, are also very beneficial in certain scenarios.
James 01:13:09 I think you could view whatever the student learns in our framework as a heuristic, because it is going to be, to some degree, an incomplete and inaccurate representation of what the teacher actually is. But as you said, this heuristic is very useful, and in our setting it's useful in a very precise mathematical sense: it nevertheless optimizes generalization given the bounded rationality possible for that system. So it kind of brings these two things together. If you have bounded rationality, some limitation on what the system can do, then obviously you can't do better than that, but the heuristic may be the best possible thing.
Paul 01:13:51 So you were careful not to just map the networks onto brain areas, onto hippocampus and cortex and the environment, or perceptual cortex, for instance. How should I think about this? Should I think about it as theoretical neuroscience, or as artificial intelligence work? And what does it imply? Because, as we alluded to earlier, there are already networks with external memory, and there are meta-learning networks. So what could deep learning, or AI in general, take from these results?
Weinan 01:14:29 Yeah, like you mentioned, there are a lot of memory-augmented neural networks out there, and also memory-based RL agents, which typically couple some kind of cortical module, like an LSTM, to an external memory. Typically the memory is fairly simple: a lot of the time it just appends each new experience as an additional row, so the memory grows into a big matrix. Often there are keys and values; you use the keys to search, and you retrieve a softmax-weighted average as your episodic memory output. That has been very successful on certain problems, like the Neural Turing Machine or the differentiable neural computer work from DeepMind, which can solve vastly different problems than traditional neural networks. Greg Wayne's MERLIN framework, for example, can solve the water maze task that a typical LSTM cannot. But I think there are further inspirations we can take from the hippocampal architecture.
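The append-a-row, key-value retrieval scheme Weinan summarizes can be sketched minimally as follows. This is a generic illustration, not the exact mechanism of any of the architectures he names; the inverse-temperature `beta` is an assumed parameter:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class EpisodicMemory:
    """Append-only key-value memory with softmax-weighted retrieval."""

    def __init__(self, key_dim, val_dim):
        self.keys = np.empty((0, key_dim))
        self.vals = np.empty((0, val_dim))

    def write(self, key, val):
        # Each new experience is appended as one more row of the matrix.
        self.keys = np.vstack([self.keys, key])
        self.vals = np.vstack([self.vals, val])

    def read(self, query, beta=5.0):
        # Keys are matched against the query; the output is a
        # softmax-weighted average over the stored values.
        weights = softmax(beta * (self.keys @ query))
        return weights @ self.vals

mem = EpisodicMemory(key_dim=3, val_dim=2)
mem.write(np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0]))
mem.write(np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0]))
out = mem.read(np.array([1.0, 0.0, 0.0]))  # dominated by the first value
```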
Weinan 01:15:35 First, the experience replay used in RL agents and the use of online external memory modules are two different things: one module does memory storage, and another does online inference. In the mammalian brain, we think the hippocampus is used to predict the future, but we also have evidence that hippocampal replay serves as an experience replay buffer to train up the neocortex. So maybe there is an advantage to merging the two modules together. That's one direction. The second direction is what exactly the memory representations in external memory should be. Instead of just appending different rows, can we use more sparse, distributed representations and a biologically realistic retrieval rule for memory retrieval?
Weinan 01:16:31 There is actually a very interesting piece of work from around the end of last year, where a group showed that a modern continuous Hopfield network is equivalent to the transformer self-attention mechanism. So there are deep connections here; maybe memory and attention are really different aspects of the same thing. So I think there are research directions to be uncovered using hippocampus-inspired architectures, though I don't know exactly what yet. The last thing about the memory module: we use a Hopfield network, which is fairly traditional, whereas current AI external memory modules use more advanced generative models, like variational autoencoders or GANs. One insight that might inspire AI is that we know the anatomical connections between the hippocampus and the rest of the cortical areas, notably the PFC. People doing variational autoencoder work typically assume there is an input going through a series of hierarchies and arriving at the hippocampus.
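The Hopfield-attention equivalence Weinan mentions (likely Ramsauer et al.'s "Hopfield Networks is All You Need") boils down to one line: a retrieval step of a modern continuous Hopfield network is a softmax attention update in which the stored patterns act as both keys and values. A minimal sketch, with `beta` as an assumed inverse-temperature parameter:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hopfield_update(patterns, state, beta=8.0):
    # One retrieval step of a modern continuous Hopfield network:
    # softmax attention where the stored patterns are both keys and values.
    return softmax(beta * (patterns @ state)) @ patterns

patterns = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
noisy = np.array([0.9, 0.1, 0.0, 0.2])  # corrupted version of pattern 0
retrieved = hopfield_update(patterns, noisy)
# One step snaps the state close to the stored pattern [1, 0, 0, 0].
```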
Weinan 01:17:59 The hippocampus then tries to encode a latent representation from which it can reconstruct that input, to reduce the reconstruction error. But maybe there is a way to improve on this. There is an architecture called VAE-GAN, a VAE and a GAN connected together. The idea is that instead of reconstructing the original input, you send your reconstruction to, for example, the PFC, and the PFC serves as a critic, like the discriminator in a GAN, feeding back a signal about whether the reconstruction is realistic or not. And this is a plausible assumption to make, because many patients with schizophrenia who have lesions in the PFC cannot tell the difference between imagination and the real world. So it might be true that the reconstruction from the hippocampus is sent to multiple modules to compute different cost functions: it can be pure reconstruction error, or it can be a discriminator-style judgment of how well it reproduces the sensory stream. I think architectures like this, a multi-head generative model as external memory, would be very interesting.
Paul 01:19:18 Where different parts of cortex serve as different modules for different, well, like you said, cost functions, in the vernacular of AI, but purposes, I suppose, in the vernacular of organisms.
Weinan 01:19:31 Yeah.
Andrew 01:19:32 Yeah. I also think there are maybe less exciting lessons to be learned from this work for AI. If you build a continual learning system that stores its own examples and manages its own learning, then it's going to have to regulate its own replay. That's a very simple point, but I do think it's something that will start emerging. And it's an interesting, broader question: how do you decide when to learn and how to learn? Ultimately, as agents, we decide, oh, this was a learning episode, I'm going to store that. How do you manage those decisions?
Paul 01:20:16 Right, because in the original complementary learning systems, more replay is always better. And that's one of the take-homes here: you have to regulate replay, and knowing when to regulate it is a pretty important factor.
Weinan 01:20:29 Yeah. Also, on that point, modern RL algorithms have a lot of problems generalizing to new tasks, like training on one game and testing on unseen levels or on another game. There is a lot of effort to improve this, but I think those agents would benefit from having a separate module that specifically estimates the predictability of the captured experience. Instead of replaying or training on all past data, you train based on the score of that predictor, so you can filter your experiences and replay only the generalizable, useful components. Maybe that will help RL as well.
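The predictability-gated replay Weinan proposes could be sketched like this. The "predictability estimator" here is a hypothetical stand-in (the R-squared of a linear fit); the point is only the gating logic, where episodes that cannot be predicted above a threshold are excluded from replay:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictability_score(inputs, targets):
    # Crude predictability estimate: R^2 of a linear least-squares fit.
    # A hypothetical stand-in for the estimator discussed in the episode.
    w, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    resid = targets - inputs @ w
    return 1.0 - resid.var() / targets.var()

def filter_replay(episodes, threshold=0.5):
    # Only sufficiently predictable episodes get replayed to train the
    # slow "neocortical" learner; the rest stay purely episodic.
    return [ep for ep in episodes if predictability_score(*ep) >= threshold]

X = rng.normal(size=(200, 5))
structured = (X, X @ rng.normal(size=5))  # predictable episode
noise = (X, rng.normal(size=200))         # unpredictable episode
kept = filter_replay([structured, noise])
# Only the structured episode survives the filter.
```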
Paul 01:21:16 One of the recurring themes these days is that AI is moving super fast, while neuroscience, especially experimental neuroscience, because experimental science in general is slow, lags behind. And even theoretical neuroscience is kind of lagging behind the progress of the engineering in AI, right? This work is theoretical neuroscience, at least partly; we talked about how it's kind of a mixed bag of things. So, backing up, the implication of that difference in the rate of progress is that, right now and for the past 10 years or so, AI has been informing neuroscience a lot more, and the arrow in the other direction, from neuroscience to AI, is slow or lagging or lacking. Do you see theoretical neuroscience as a way to bring more influence from neuroscience into AI?
James 01:22:13 Yeah, I'm curious to hear what everybody else has to say, but for me, I definitely do. It may be slow, but I think theoretical neuroscience, rigorously working out how individual models work and how they relate to the biological brain, provides fundamentally new and fundamentally more robust conclusions than you can get from just numerical experimentation on very large AI systems. So I definitely think that we wouldn't have gotten to where we currently are in AI without past generations of theoretical neuroscience research. And I also definitely think that projects like this, where we try to boil it down to the essentials, analyze everything very rigorously, and really try to figure out to what extent it relates to the biological brain, will provide useful seeds for future AI research.
Andrew 01:23:06 What do you guys think? Do you agree with that? Yes, I mean, I think James and I see this very similarly. The timescale may be long, but ultimately I think theoretical neuroscience and psychology have an enormous role to play in driving AI; you just have to be willing for that impact to happen many years down the line. Imagine, for instance, deep learning: the fact that we're using neural networks at all came from contributions worked out in dialogue with neuroscientists. If what theoretical neuroscientists are doing right now had a similar impact 50 years from now, wouldn't that be amazing? It's something everyone would want to work towards, and I really do think that's possible. It's more inchoate, more uncertain; we don't know what the principles will be that guide us towards even better systems. But having the insights from the brain guiding the inquiry, and showing us some of the problems that maybe we didn't even realize were problems, is really valuable.
Paul 01:24:13 But my perception is that, in large part, the majority of folks working in the AI world would say: well, sure, theoretical neuroscience may eventually provide us with something, but by that time our systems are going to be so advanced that it won't matter, because we've already basically accomplished it. And personally, my opinion, and it's just an opinion, is that that's silly. Do you think it's right that the AI world thinks that? And is it comforting to know that the vast majority is wrong? How do you think about that?
Andrew 01:24:49 I do think it's right to say that that's a very common opinion. I've had AI people ask me, why do you even do mathematical theory? Eventually I'm going to build a theorem prover that will do the mathematical theory and explain it back to us, so just focus on getting the AI system to work. We won't know until we see; the proof will be in the pudding. My own opinion is that it's going to turn out that today's AI systems, as fantastic as they are, will hit roadblocks, and part of getting them unstuck from those roadblocks will be looking again to the brain, just as happened with deep learning. I think this history often gets run roughshod over, but the first author on the backpropagation paper is David Rumelhart, who was a psychologist, and the first paper on the perceptron is by Frank Rosenblatt, also a psychologist.
Andrew 01:25:43 And I remember, as an incoming graduate student, the paper that inspired me towards deep learning, and I'm sure it's different for different people, there were AI people working on it too, but for me it was Tomaso Poggio's work with Maximilian Riesenhuber. They're theoretical neuroscientists, and the reason they were interested in these convolutional network architectures, even though, it's hard to remember now, but the rest of the field was heading in a totally different direction, is that they kept looking at the brain and saying: this is what we see in the brain, so somehow it has to work. I think that promise is still there for the future. And if you think about topics like theory of mind, or causal reasoning, these ideas ultimately came from cognitive science; cognitive scientists pointed to all of them, and now we're seeing that, yes, they really are important, and they require their own methods to address. So I think it will continue to be important going forward.
Weinan 01:26:53 I just want to add to that. I totally agree with both James and Andrew. I think neuroscience still has a lot to offer AI, especially in exciting new directions. For example, Yoshua Bengio has proposed the idea that we should learn from human or animal cognition at the system-two level. Compared to direct perceptual classification tasks, that means long-term deliberate planning and reasoning, where you really have to think things through, and those abilities are not well captured yet in current models. One of the solutions Bengio proposes is a kind of attentional, sparse causal graph that serves as a good world model. You form these causal nodes through learning, whether by semi-supervised learning or other methods, and then you can actually reason within that world model.
Weinan 01:27:55 And one of the key things is this recent debate about whether to learn everything end to end or to build in innate structures, whether to use deep learning or good old-fashioned AI based mainly on symbolic processing. I think there is a trend now showing it's actually beneficial to learn symbol-like, or abstract, representations of entities in the world. It's an old topic in psychology, but more recently, especially in work by Randall O'Reilly and by Jonathan Cohen's group at Princeton, they show that if you couple an LSTM controller to an external memory and only manipulate the memories indirectly through their keys, then through learning, over many, many simulations, you can form symbolic representations in the keys. I think that's a key insight into how the brain might be generating those abstract representations.
Weinan 01:28:55 And it could be that, in the hippocampus, structure from cortex and sensory encodings are bound together through fast plasticity, the cortex learns to manipulate those memories, and eventually symbols are generated by this manipulation. This goes from sensory generalization to the next level, out-of-distribution generalization, a higher, systematic level of generalization that the hippocampus might also offer insights into. So I think this system-two level of cognition is really what is needed for more powerful, or even more human-like, agents in the future.
Paul 01:29:41 Okay, guys, this has been a lot of fun. I want to wrap up. Let's do one round of: what kept you up last night? What were you thinking about that's just at the edge of your knowledge? I know this is really more of a question for the beginning of a conversation, and it can be something unrelated, since I know you all have different lives outside of the topic we've been talking about. Weinan, can you tell me what's been troubling you lately that you can't quite figure out, just beyond your reach?
Weinan 01:30:13 Um, a lot of things, to be frank. Reflecting back on the collaboration, there are really two things that trouble me. Maybe it's just due to the general sense of being inadequate for so many years. Being an experimentalist interacting with Andrew and James, in the first few years I found this interdisciplinary work to be extremely challenging. I was stepping outside of the synaptic transmission world, single-cell computations, into cognitive science. I had never even heard of retrograde amnesia curves before this, to be frank, and I was surprised to find so much diversity of views; I thought these things were all really figured out. And on the other side, I was diving into the mathematics and modern machine learning. Trying to learn multiple things at the same time has been extremely difficult, but also rewarding. It still poses a challenge: I feel I can barely keep up with the minimum amount of knowledge being generated in these fields. I still don't know how to deal with that. Maybe more collaborations, but we humans only have a certain number of hours in the day. So how do you read so many papers and keep up?
Paul 01:31:47 Yeah, you can't read so many papers. So isn't there a perfect interleaving of time spent on various things, going back and forth? Have you figured that out? Because I don't know it.
Weinan 01:31:58 That’s going to be the next model
Paul 01:32:01 Learning strategies.
Weinan 01:32:03 The second thing, just real quickly, is really about this process with Andrew and James. It taught me about the benefit of really talking to people in different fields. That's kind of obvious now, but in the beginning it was not easy. Being an experimentalist, James and Andrew had to explain things very carefully, from very basic knowledge, to build me up, and I really appreciate that. But also, through these interactions, we found, including with my current advisor Nelson, who is also an experimentalist, that communication between theory and experiment can be extremely challenging at times, because we speak different languages. For example, for James and Andrew, the idea of generalization in machine learning is well appreciated, but it has an entirely different meaning in an experimentalist's mind.
Weinan 01:33:01 So that communication is hard. And likewise, when I talk about experimental details, the huge complexity in experimental neuroscience, I sometimes just assume people know that knowledge and that complexity, and that's not true either. So the natural tendency is to give up: theorists work together because they talk well to each other, and experimentalists work together because the communication is effective. But what I learned is that it's really important to be patient and to really try to understand each other. That is still challenging; science communication is super important for multidisciplinary research.
Paul 01:33:44 Is there a role for predictability in this complementary, collaborative-learning, whole-organism system? Sorry, I couldn't help the analogy to CLS here; I was joking about the predictability. All right, that's great. Andrew, do you want to chime in? Anything bothering you, something you're frustrated about and just can't figure out, et cetera?
Andrew 01:34:13 Well, one is very specific deep learning theory, but I actually think it's quite important. When you look at the brain, you see lots of modules, this complex interconnectivity across scales of modules, and we think those modules are kind of specialized for function, and kind of not; it's distributed representations, but also not. Somehow this all matters for generalization and systematic reasoning and cognitive flexibility, and I want to understand why. It still very much bothers me that we don't have a theory of that. But maybe to build on one of Weinan's points, another thing that keeps me up at night is whether this theory we proposed will in fact be tested by an experiment.
Paul 01:34:57 I thought you guys had begun the early stages of thinking about how to test it. I was wondering if you had already begun, but I knew you were thinking about how to test it.
Andrew 01:35:07 Yeah, and if anyone could do it, it would be Weinan; Nelson's lab is perfect for it. I have, however, been in the beginning stages of many experimental tests that have not ultimately panned out. So, as I was saying, it's hard to cross these communities. One of the things I would love to see is how we can create this feedback loop, this virtuous cycle, and actually get it functioning on all cylinders: make the theories concrete enough that they can be falsified, and make those experiments actually happen using all the amazing methods we have now.
Paul 01:35:44 All right. So, James, I'm sorry, we ran out of time and can't include you. No, I'm just kidding. James, do you have something?
James 01:35:51 Sure. I think it's actually super interesting that both Weinan and Andrew highlighted how hard it actually is to really get these feedback loops going robustly. I agree with that, and it often does keep me up at night. In reality, last night there was a big election in Virginia, so maybe that wasn't what was keeping me up, but on many nights it may well be. Maybe also related to that, and going back to the earlier theme about the value of abstract models: when I'm kept up at night by science, it's most often because of a real, concrete math problem, because when you get things to that level, you can think about everything so crisply, and you really get the sense that you're about to solve the problem.
James 01:36:36 Like if I just think about this a little bit longer, I’m going to have that solution. And so actually there’s another one there, and maybe I’ll give a little plug, completely unrelated to this work. Uh, I’m collaborating with two of the very best here at Janelia, trying to analyze the links between structure and function in networks using geometry. And we’ve had a lot of really interesting conversations over the last week where I’ve learned a lot mathematically about how to think about this problem. And that’s what keeps me up at night in the positive way. Maybe the negative way is worrying, like, are we communicating clearly enough that we’re going to be able to break down these barriers? But maybe I’ll also just highlight that sometimes you stay up because you can’t bear to go to sleep, and the solution to the math problem is so cool.
Paul 01:37:14 <inaudible> Is the desire for the solution to the math problem due to your background in physics? Or is it just, uh,
James 01:37:21 I think I would actually flip it. I think my background in physics is probably due to my desire to solve math problems. And I think this is why it gets so hard to cross these boundaries, because, you know, fundamentally, the reasons why individuals in different scientific communities went into science can actually differ. I mean, for me, it probably really is the beauty of solving a math problem, but, you know, for many other scientists who have a lot of valuable expertise to lend me, that’s not why they went into science; they went into science for completely different reasons. And so then how do we, you know, not only communicate with each other to help them understand what we understand, but also what we’re motivated by? And to go to Andrew’s point about, okay, well, why don’t many theories get tested?
James 01:38:05 And I think that a lot of times it happens because the motivations of the theorist and the motivations of the experiment was actually not the same. And so, you know, the theorist may be like, oh, I don’t understand why you don’t want to test this. But then the experiment, I was like, I don’t know why you keep on talking to me about this Boeing theory. That’s all this other stuff going on. Like, you know, I just found this crazy thing, like look at what actually came out of my experiments. And I think that like, you know, getting those motivations aligned, I think is another huge part of what will eventually be needed. I think, to kind of close these disciplinary divides.
Paul 01:38:32 Oh, it’s a challenge. Um, all right, James, well, I’m gonna let you go do some beautiful math. Guys, thanks for talking to me for so long. It’s really cool work, and I appreciate your time here today.
James 01:38:43 Thanks for having us. This was a lot of fun. Thank you, Paul.
0:00 – Intro
3:57 – Guest Intros
15:04 – Organizing memories for generalization
26:48 – Teacher, student, and notebook models
30:51 – Shallow linear networks
33:17 – How to optimize generalization
47:05 – Replay as a generalization regulator
54:57 – Whole greater than sum of its parts
1:05:37 – Unpredictability
1:10:41 – Heuristics
1:13:52 – Theoretical neuroscience for AI
1:29:42 – Current personal thinking