Demystifying Artificial Intelligence
CFR's Technologist-in-Residence Sebastian Elbaum discusses the capabilities, failures, and future of artificial intelligence and its intersections with policy.
DUFFY: Good evening, everyone. Welcome to the Council. It’s so great to see everybody. I am delighted to introduce you all—or, to introduce you all to the Council on Foreign Relations’ first-ever, inaugural technologist in residence, and to welcome you to our roundtable tonight on “Demystifying Artificial Intelligence.”
I wanted to start, as I almost always do, on discussions around AI by demystifying the audience first. So who in the room self-identifies as someone who works on AI—works on AI issues, talks about AI issues? OK, a healthy smattering. All right. Who in the room identifies as someone who feels stuck at cocktail parties because they feel like they should know something about AI but they’re not entirely sure how to weigh in or what to say? Who self-identifies there? You can self-identify across many categories. OK. Who self-identifies as someone who never wants to hear the letters AI again? (Laughs.) And who here worked with AI and used AI today? Who checked the weather today? Some of you, impressive; you must have just really been optimistic about the weather today. Anyone who checked the weather today, all of you were using AI, all right? So part of demystifying artificial intelligence is us really thinking about how long we’ve been using it, how standard so many aspects of it have become in our society, and then what is actually feeling new, and looking ahead what is changing. And that is why I am so delighted and thrilled that we have Sebastian with us at the Council this year.
So, as I said, Sebastian is our inaugural technologist in residence at CFR, thanks to the Eric and Wendy Schmidt Fund for Strategic Innovation. You can read his bio. It is exceedingly long and impressive. I won’t read it at you because you are all CFR members and extremely literate. He is a professor in the Department of Computer Science at UVA, where he cofounded and co-leads the Lab for Engineering Safe Software. He’s also an entrepreneur. And as CFR’s inaugural technologist in residence, he is here with us for a full year really helping us bridge the space between the world of artificial intelligence research, and academia and innovation, and the world of policy research and policy innovation. Sometimes those things are the same and sometimes those things have some pretty significant gaps.
And so, with that, I am going to welcome Sebastian up to give us a bit of a presentation. He’s, you know, full-time, highly beloved professor, so he’s probably pretty good at presentations. And he’s so beloved, in fact, that even though this is not a hybrid meeting we had to arrange a video stream for his fan club at UVA. So shout-out to Anita and all the colleagues at UVA who are streaming in because he’s that popular.
As a reminder, today’s conversation is Chatham House rules, so you can freely discuss, you know, what we talked about here, but the conversation will remain not for attribution. And with that, I wanted to introduce Sebastian for the presentation, and then we’ll have a little bit of a discussion and some Q&A. And please, keep getting up, getting food, getting drinks; enjoy yourselves. Sebastian?
ELBAUM: Thank you. (Applause.)
Well, thank you for being here. Thank you for being here. Thank you, Kat, for the introduction, for having me at CFR. And thank you also to the members who recommended that I give this talk.
So I think let me just switch gears here a little bit. I couldn’t—I could not do a presentation—I’ll promise you that it’s going to be short, but—OK. Let me see if I can—if I can click it and you can see it.
(Pause.)
All right. Thank you. Thank you, Jamie.
Yeah. You know, it is a very exciting time. I’m a professor of computer science. I work on autonomous systems, and AI is a big part of that, so this is a very exciting time. I think at times even I feel like the technology is arriving at a rate that is hard to grasp. AI is a technology that is getting diffused into a lot of the things that we do, in the way that we play, that we operate, that we work. It’s very exciting how it’s going to change a lot of things for really good purposes, like creating new drugs, or helping us diagnose health issues much better than we used to be able to do. At the same time, a lot of us are concerned with, you know, its safety and privacy issues. So there are a lot of potential gains, and also concerns about where and how AI is evolving.
So in this talk, I am going to try to do just three things. I’m going to show you what AI is about, just so that we are all on common ground; what the most advanced AI tools can do today, or some of them. I’m going to kind of lift the veil and show you how they work. And I’m going to do that in five minutes. Some people require a Ph.D. to do that—(microphone volume increases)—hello, hello? Yes. Some of the people that are being trained in this area, my Ph.D. students, spend five years doing this, but you’re such a crowd that we’re going to do it in five to ten minutes. (Laughter.) And then I’m going to try to show you what we can do with it for policy, or the implications that it may have for policy. So, OK.
So, you know, what is AI? The way that I want you to think about it is that when people talk about AI, they are really talking about an umbrella of techniques. Some of these techniques date back to the ’50s, when AI was really thought of as the science of making machines work or think like people and do things that people do. Very general, and that is what artificial intelligence has been about since the definition in ’55.
Now, within that big umbrella, you have machine learning techniques. Machine learning techniques are a type of AI based on statistical algorithms that help you find patterns in data.
Within machine learning, we have deep learning, DL. Now, deep learning is slightly different from general machine learning because it uses structures called neurons that are organized in layers that resemble our brain, and it is really, really effective at scaling up and at analyzing unstructured data.
Now within deep learning, we have what you have heard about in the last year, which are these LLMs, these large language models; or VLLMs, visual large language models, as well. These are the big news of the last year, year and a half, where we have, you know, ChatGPT, Gemini, Copilot, all these tools that you hear about. And they have become such a big part of our conversations because they’re free, they’re accessible, everyone can use them. They have been deployed at a scale and speed faster than any other technological tool that we had in the past, bypassing any other type of tool that you can think of. ChatGPT was faster than any of those to reach most people in the world.
They’re very general. They can be applied to solve many problems. We had a lot of machine learning tools that were effective at, you know, solving particular problems, but this type of tool is very, very general. It can solve many tasks. You can build a lot of things on top of them.
And the last thing is that they are fluent. They have this ability to communicate with us in a way that is so natural that the adoption or the entry bar is very, very low. And you feel comfortable with them, and you feel that you can master them in a short amount of time.
So, again, in order to keep us kind of on the same ground, one of the things that I thought I would do—and this is the worst thing that you can do as a professor, which is try to do things live, but we’re going to give it a try—is show you what this looks like, for the ones that haven’t used it. And this is going to appear in a second, I think.
(As an aside.) We are? We need some technical—yeah.
DUFFY: It would not be a presentation on technology if the tech worked. (Laughter.)
ELBAUM: Yeah.
DUFFY: Not how it happens.
(Pause.)
ELBAUM: Thank you, Jamie. Stay close just in case. (Laughter.)
So there are a couple of—a couple of these tools that I mentioned before. One is—one is ChatGPT. The other one is Copilot. ChatGPT is the one that you have seen the most. But let’s play this exercise. Let’s assume that next year—this is not recorded, right, Kat? So, OK. So let’s assume that next year we ask the CFR CEO to hold this meeting in Hawaii. Let’s suppose that that’s what we want to do. So what we’re going to do is I’m going to say to ChatGPT, “draft”—and I can misspell; it doesn’t matter—“email to CEO to take all attendees to end of the year celebration to Hawaii.” OK.
So this is the email that ChatGPT is generating: “I hope this message finds you well,” blah, blah, blah. “This will be a great opportunity.” It gives you good reasoning. So this email that would have taken me, you know, ten minutes to write carefully, it’s spelled out pretty nicely by ChatGPT.
Now, you will say, well, I can use a template and do that. Let me try a few other things. Let’s suppose that we do the same thing but now we also have a group of people—so I’m originally born in Argentina. We want to generate an email like that to send to our members in Argentina. OK, so same thing, same email. Now to Spanish. We have it done.
Let me do something slightly different. This is another one—another one of the same tools like ChatGPT, Copilot. Let me try it here. Draft email for CEO to take us to Hawaii. OK, so this is part of the setup here. Let me start with ChatGPT then. ChatGPT, I’ll stay here, so. But one of the things that I can do is I can generate the email, I can translate it. I can also—can you generate a cost estimate for it? And it will generate, probably, an estimate of the cost of the trip, just some of the logistics, and generate a little table for me. So for us to go there, it will take about $275,000. It makes some estimate, assuming a hundred attendees.
So throughout the process, the tool made a lot of estimates about how much the airfare costs, what type of accommodation we want, how many people will attend. But the thing is that you realize that in two minutes I was able to create a draft email and create this spreadsheet. And if I were doing this on my own machine, I could actually say, you know, pick the right dates for us to do it, given the CEO’s schedule, my schedule, and your schedule. So the power of these types of tools to really cut corners and help us with a lot of the mundane tasks that take a lot of time is incredible. It’s really incredible.
So let me just go back to the presentation, and we’ll come back to some of these later. But hopefully it convinced you that in five minutes I was able to produce something that, you know, would take us much longer if we did it by hand. And this is just how we can do that in my world today. Now, we can do a lot of things beyond these mundane tasks. And can we shift back to the slide of the office assistant, please? And I promise I won’t do this back and forth anymore, ever. (Laughter.) Oh, I see. This is it. This is it, OK. OK, so—no? All right. That’s it. So we won’t do a live demo again.
Now, how good are these tools today? Well, really, that may be the wrong question. It really depends on the tasks that you use them for. If you were to use them for tasks that were set in 2012 to 2020, they have become pretty good. So you can see this is the human baseline, how the human operates. These are how the tools performed for different tasks. And you can see that every one of these tools is actually approaching or surpassing human performance. So for a task like classifying a picture to determine what type of animal it is, some of the algorithms that analyze images are much better than humans. So they can do at least as well.
Now, as the tasks get more difficult, more challenging, human performance by experts is still better than some of these models. But these models are as good as a non-expert at most tasks today. That’s a scary thought. Two years ago, that was not the case. But today, these tools are as good as average humans or non-expert humans at most tasks. However, if we pick tasks that are difficult—so in this case, this was an example of tasks for chemical engineers that require graduate school—the models are not quite there yet. But the key thing is that they are getting better. You can see the curves going up.
Now that I told you what these tools can do, how they’re getting better very, very fast—you saw the slope of those curves surpassing human performance, surpassing often teams of humans at certain tasks—I want to show you how these work—how these tools work. And I want you to think about them as functions, as functions that take an input and produce an output. So if you—let me introduce a few concepts. And I promise you no equations, but just a few graphs, OK? But bear with me.
A key concept in all machine learning, deep learning, and language models is data sets. And data sets consist of data and labels, X and Y. Let’s think about that with a simple type of label. For every X, I have a Y. So for every ZIP code today I may have a temperature. That type of relationship. And this data set consists of observations here. What most machine learning algorithms attempt to do is identify functions that fit this data, so that you can use that function to predict future instances of X. So in this case, this data has a linear distribution. I have this function that cuts it right in the middle, so that next time I get a value on the X axis I just map it to the red line and I get the Y value.
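As a rough illustration of that idea, here is a minimal sketch in Python (assuming NumPy and made-up observations; this is not part of the talk): it fits a line f(x) = a*x + b to a handful of X, Y points and then uses that function to predict a value it has never seen.

```python
import numpy as np

# Made-up observations: for every X (say, a day) we have a Y (say, a temperature).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# "Learning" here means finding the parameter values a and b of the
# function f(x) = a*x + b that best fit the observations (least squares).
a, b = np.polyfit(x, y, deg=1)

# With good values of a and b, the function can predict a Y for an X
# it has never seen before.
x_new = 6.0
print(f"f(x) = {a:.2f}*x + {b:.2f}")
print(f"prediction for x = {x_new}: {a * x_new + b:.2f}")
```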
The critical part here is that for this function to be useful it should be good at predicting things, even things that it hasn’t seen. And in order to do that, the critical part is to find the right values of A and B. Those are the parameters of this function. If you can find good values of A and B, this function is going to be valuable for you to predict trends in the future. Now let’s pick a more difficult example. Instead of having X and Y, let’s assume that the input X are images of digits, and the output of this function is going to be you tell me what digit you saw. So if I give you an image of a five, it could be any of those fives. But you folks, you know that these were all fives. But how does a machine learning algorithm figure out that this is a five?
Remember, fives can be small or large. They can be slanted. They can be rotated, right? So fives may take many, many different shapes. So one of the tricks that we apply here is that, for machine learning algorithms to work, everything needs to be basically a set of numbers, an array of numbers, a vector of numbers. And the way that we do that is that, for the image of a five, we think about the pixels that make up the five in the image. Each one of those pixels has a brightness value. And each of those brightness values becomes an element in our vector.
So just like before we had an X. Now the X is made out of all these vector values. That’s called the encoding. So now we have a data set. We have an encoding. You’re getting all the right terms, OK? There are five of them. The training process is kind of the interesting part. Before, when I told you about parameters, I was telling you about A and B. Now when you think about this process, think about all the dimensions that you need to consider to identify what’s a five, or a one, or a zero. You have all these pixels in the twenty-eight by twenty-eight image that we had before. That’s a vector of 784 values that may take all different shapes, all different values.
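A minimal sketch of that encoding step, using made-up pixel values in place of a real handwritten digit: the twenty-eight by twenty-eight grid of brightness values becomes one flat vector of 784 numbers.

```python
import numpy as np

# Stand-in for a 28x28 grayscale image of a digit: each entry is a pixel
# brightness between 0.0 (black) and 1.0 (white). Random values are used
# here only so the sketch runs on its own.
image = np.random.rand(28, 28)

# The encoding: flatten the grid into a single vector of 28 * 28 = 784
# brightness values. That vector is the X the network actually sees.
x_vector = image.reshape(-1)
print(x_vector.shape)  # (784,)
```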
So when we look at this, the number of parameters that this neural network needs to figure out is far more than A and B; it’s thousands of them. But the process works in the same way. You have a data set that has data and labels, like all these are zeros, all these are ones. And this machine learning algorithm is going to find the right function for us—because remember, before I said it’s a linear function, but now the algorithm is going to figure out the function and the parameters for us on its own. And the way it’s going to do that is it’s going to run every one of these pictures of numbers through the network. And it’s going to get a result. And then it’s going to compare the result with the label, with the value that it should have gotten.
If there is an error, it’s going to adjust the network to minimize that error. And it’s going to do that iteratively to optimize the values of every single neuron in this network, such that all the parameters help you get the right results that match the labels. I said a lot of things there, but there are a few things to keep in mind. One is that it’s the same thing as X and Y, just in a much, much larger dimension. You have many, many dimensions to consider. The other one is that this process is automated. So it’s able to identify what makes a five a five in the image. What is the shape of the little curve? What are the key pieces that it identified to say that the five is a five?
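To make that loop concrete, here is a heavily simplified sketch in Python (my illustration, with random stand-in data and a single layer of parameters rather than a deep network): run the inputs through, compare the outputs with the labels, and nudge the parameters to shrink the error, over and over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data set: 100 flattened 28x28 "images" (random pixels here)
# and their digit labels 0-9. A real data set would be something like MNIST.
X = rng.random((100, 784))           # data
y = rng.integers(0, 10, size=100)    # labels

# The parameters the training loop will adjust: one weight per
# (pixel, digit) pair plus one bias per digit.
W = np.zeros((784, 10))
b = np.zeros(10)

def forward(X):
    """Run inputs through the (single-layer) model: scores -> probabilities."""
    scores = X @ W + b
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(scores)
    return p / p.sum(axis=1, keepdims=True)

# The training loop: predict, compare with the labels, adjust, repeat.
for step in range(200):
    probs = forward(X)               # what the model currently says
    target = np.eye(10)[y]           # what the labels say it should say
    error = probs - target           # the mismatch between the two
    W -= 0.5 * (X.T @ error) / len(X)   # nudge each parameter to shrink the error
    b -= 0.5 * error.mean(axis=0)

print("accuracy on the training data:", (forward(X).argmax(axis=1) == y).mean())
```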
So once you have that, the network has been trained. That’s why it’s this color here. And when the number comes in as a five, we will pixelate it, we’ll get all the pixel values, we’ll make it a vector, we’ll push it through the network, and we get that it was a five. That’s how a neural network works. And I see a lot of staring faces. This is exactly how the students look at me in class, so I’m very familiar with it. Feel free to ask questions afterwards. I’m happy to answer them.
Q: Can I just ask you a very quick question?
ELBAUM: Yeah, absolutely.
Q: Fives sometimes look like S’s.
ELBAUM: Yes.
Q: So how does the network distinguish between a five and an S when actually they’re identical—(off mic)?
ELBAUM: So let me answer in a few ways. First of all, this network is just identifying digits. It’s going to actually give you an answer. The only answers that it can give you are: What is the likelihood that this input is a five? What is the likelihood it’s a four, a three, a two, a one, or a zero? It doesn’t know about Ss. If you train a network that considers digits and letters, it will give you probabilities. And it may tell you in that case that the input has a probability of being a five of 51 percent and of being an S of 49 percent. And that’s why these networks can make mistakes, because if things look pretty much alike, you know, it will pick the one that has the highest probability, but it may not necessarily be right. These are probabilistic estimates.
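A tiny numerical sketch of that point, with made-up scores: the model turns its raw scores into probabilities and reports whichever label comes out on top, even when the margin is razor thin.

```python
import numpy as np

# Made-up raw scores from a model trained on both digits and letters,
# for one ambiguous input that could be a "5" or an "S".
labels = ["5", "S", "3", "8"]
scores = np.array([2.10, 2.06, -1.0, -0.5])

# Softmax turns the scores into probabilities that sum to 1.
probs = np.exp(scores) / np.exp(scores).sum()
for label, p in zip(labels, probs):
    print(f"{label}: {p:.0%}")

# The model picks the highest probability, however narrow the margin,
# which is how a close call can still come out wrong.
print("prediction:", labels[probs.argmax()])
```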
So let’s do one more example, and this will bring us back to LLMs. Now the input to the function that I want to create is not an X, it’s not an image; it’s actually a piece of text. And what the function is going to try to do is estimate what is the next word that should appear after this text. So if I ask it to predict the most likely next word, and I say, “I like my coffee with,” the network that comes up with the function is going to say, well, sugar has 45 percent, cream has 40 percent, “a” has 1 percent, and there are some other smaller percentages here.
Now the network has been trained—just like I trained the network for the images—to guess what the next word is going to be. It may not be right, but let’s suppose that the user confirms that, you know, that’s the right word, or that should have been the right word. So I’m giving it a little more context. Then, in this case, the next word may be “pastry,” or may be something else. But I think the point here is that now you can see that LLMs, these language models like the one I was showing you writing an email to the CEO, use the text that we’re providing, the prompt, plus other context, to come up with the next word. And once they come up with the next word, they come up with the next word and the next word and the next word.
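Here is a toy sketch of that loop (entirely made-up probabilities, nothing like a real model's scale): given a context, look up a distribution over the next word, pick one, append it, and repeat.

```python
import random

# A toy "language model": for a given context, a hand-written probability
# distribution over the next word. A real LLM computes these distributions
# with billions of learned parameters, but the generation loop is the same shape.
next_word_probs = {
    "I like my coffee with": {"sugar": 0.45, "cream": 0.40, "a": 0.10, "milk": 0.05},
    "I like my coffee with cream": {"and": 0.6, "in": 0.2, "every": 0.2},
    "I like my coffee with cream and": {"a": 0.5, "sugar": 0.3, "pastry": 0.2},
}

def next_word(context):
    """Sample the next word from the model's distribution for this context."""
    dist = next_word_probs.get(context, {"...": 1.0})
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

# Generation: predict a word, append it to the prompt, and predict again.
text = "I like my coffee with"
for _ in range(3):
    text = text + " " + next_word(text)
print(text)
```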
And that’s kind of the essence of how these LLMs work. Instead of estimating, you know, what is this digit, they try to estimate what is the next word. There are a couple of critical differences that make these LLMs, like ChatGPT, so powerful and so amazing. First of all is the scale. I was showing you before a data set that had, I don’t know, fifty, a hundred values. When they train an LLM like ChatGPT, they scrape the whole internet, or a good chunk of it. That’s how much data they have. It’s massive. That’s a critical difference, because now your probability estimations are much richer.
Second, they have billions of parameters. Remember, when I started with X and Y, we had A and B, two parameters. These are billions. Actually, the smallest of these large language models have a couple of billion parameters. They’re just ginormous. They have an interesting neural architecture that is called a transformer. And what these transformers allow you to do—they’re so smart in that they can actually look at not just the previous words, but how the words are combined, and in what way they appear in the text. So it’s much richer how they can pay attention to the right things to make the next prediction.
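For the curious, this is the core operation inside a transformer, stripped of everything else (no learned projections, no multiple heads, random stand-in word vectors): each word scores its relevance to every other word and blends their values accordingly.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position weighs every position
    in the context by relevance, then takes a weighted blend of the values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise relevance scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the context
    return weights @ V                               # blended, context-aware vectors

# Stand-ins: five words, each already encoded as a vector of eight numbers.
# In a real transformer these come from learned embeddings and learned
# query/key/value projection matrices.
rng = np.random.default_rng(0)
words = rng.random((5, 8))
print(attention(words, words, words).shape)   # (5, 8)
```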
They also have a human-in-the-loop process that reinforces—that lets you train these models to make the right predictions, to push the right predictions over the bad ones, or to correct bad ones and kind of steer the model away from bad behaviors. And, as you saw, they have a lot of user prompting as part of these interactions. But the big thing is the scale. The scale of training these networks is unbelievable. So much so that the cost of building these models, as you can see, is going up quite quickly. So these models may take millions of dollars to train. And a lot of the big companies are training tens, hundreds of these models a day. So the cost of actually building these is, you know, staggering.
So now that you have seen how these work, and I know this was a very conceptual explanation, I think you get the key concepts of why these models can get things wrong. The models are getting much better. I mean, when ChatGPT came out, the type of mistakes it could make was really ridiculous in comparison with what it does now. This is very, very effective. But part of it is, like, the incompleteness of the data, right? The more points that you had in that X and Y diagram, the more chance that you have of coming up with the right function; if I don’t give you enough data, you may not.
Hallucinations, again: the model is trying to predict things that it hasn’t seen before, to predict for inputs that it hasn’t seen before. So it may hallucinate things that are not quite right, either because it doesn’t have enough data, or because it doesn’t know exactly how to bound what it knows. And I’ll show you a few of the other examples here—and I’m sorry, I’ll read it for the people in the back. But incompleteness: I asked Copilot how many people work at the Council on Foreign Relations. Because that email, I wanted to make it not for a hundred people; I wanted to make it for all the employees at CFR. And it told me 457 employees, which I think is quite high for the number of employees at CFR.
But perhaps more interesting is that I asked it again a little bit later from my phone, so I mistyped a lot of the words, like “they” and “people.” And the answer was between 1,000 and 5,000. So we have an order-of-magnitude spread in the estimate. And, you know, in part maybe that’s because the number of employees at CFR is perhaps not anywhere to be found on the internet, or it varies quite a bit. This number is probably pretty good for the number of members instead of employees. So it may be confusing the two—you know, it’s hallucinating a little bit, in that members look like employees. There’s some uncertainty there about that. But obviously this doesn’t inspire as much confidence. The purpose of showing you this example is to make you aware that even though the email that it generated for my CEO is great, I better check it before I send it, right, because it can have some sorts of errors like this.
The other interesting part—and this is a clear hallucination—is that I asked it to create a map. And this shows that these tools can create text, but they can also create spreadsheets, and they can also create figures. So I asked it to create a map of the potential locations that we’re going to visit in Hawaii. And some of the parts of the map are right, and they look like Hawaii. Some of them definitely do not. There are volcanoes that are not there. And a lot of the words are in a language that we don’t recognize, perhaps because it mixes sources of information from all over the world that may have letters in different languages. Actually, I’m making this up. I don’t know why it would come up with those letters, except that I’m conjecturing that it’s because of that. But really, explaining why and how the model got to create this image is something that we’re working on, that we don’t always have a good answer for.
OK, let me show you another example of what it did today in terms of biases. I asked it: Can you make up a picture of a nurse that is tired after a long day of work? And it showed me two pictures, actually four pictures, of a nurse that was tired after a day of work. And they look very realistic. These are not real pictures. These are not real people. And then I asked it to make a picture of a CTO that is about fifty years old. And it came up with these two good-looking gentlemen over here. But you notice the bias, right? So if you ask for a CTO, you get four males. If you ask for a nurse, you get a female. And, you know, part of that is because of how these systems are trained. They are trained on bodies of data, and they reflect the biases in the data that they’re trained with.
Models are getting better at that. So when I asked for a CEO picture, I got diversity. I had to play a little bit just to show you this example. So my research at UVA is really about autonomous vehicles and how to make them safe. It’s not about LLMs. But we use LLMs, actually, to understand the context in which these vehicles are deployed. So we collect millions and millions of pictures and point clouds, so one thing is the pictures, the other is radar-like point clouds of where the vehicle operates. And we get all this data, and we try to analyze it automatically. Say, hey, what did the car see at this point where it actually passed a stop sign? Can you describe it for me, so I can understand what’s going on without looking at these point clouds? And it’s able to do that. And so what I’m trying to do here at CFR, maybe we’ll talk more about this, is to connect my experience in building these systems and trying to make them safe with where policy is in terms of AI and autonomous systems in government today.
Just a little more. How do we connect what I’m talking about to CFR and policy? And I think this is part of the effort that I’m working on here as well: trying to see how we bring these technologies to, for example, predict what the next policy change is that we’re going to see toward the war between Russia and Ukraine. So I tried a few things here just to illustrate how people can use this type of tool for policy research. So I can say, can you create a timeline for the Russia-Ukraine war? And I’m not an expert in this area, but it looked reasonable. And it actually gave me the sources that it used for creating that timeline. I asked it, what are the likely outcomes of the war? And it gave me these four outcomes. And we don’t have our Ukraine-Russia expert here, but I checked one of her articles and this seems actually quite aligned with the outcomes that she had described, at least a few months ago.
I pushed it a little bit. I was trying to make it make mistakes. And I said, please assign probabilities to each outcome. I wanted to know, like, what is the likelihood that we have a long war, a frozen conflict, a victory, or a defeat, and it came up with this. And interestingly enough, I looked at the sources, and some of the sources actually had these percentage estimates in them, some of the Wilson Center reports, but not for all the points. So it actually made some estimates on its own about this. And just going further in understanding, I was trying to analyze similarities—historical similarities, cases where we have seen similar types of situations. And in this case, the similarities between the German invasion of Poland in the Second World War and the Russian invasion of Ukraine in ’22.
So again, this is just to illustrate where this type of tool can be used for starting to kick the tires of an idea that you have in terms of foreign policy. It’s not enough to just ask a question and get the answer, because of all the issues that I pointed out before—because now you know how it works and how it can err because of the lack of data, or the misuse of the data, or just the lack of capacity of the model to actually encode all the right parameters. But you can see all the potential of this type of tool in the future to accelerate, you know, policy studies, analysis of what-if scenarios, comparisons of different strategies. It’s very, very efficient at that.
The last piece I wanted to mention is that there’s a global race, an AI race, going on. And it has many dimensions. I’ll just mention a couple here. One is how many nations are creating these foundation models. The U.S. is in an extremely strong position right now, leading the creation of foundation models. There are also many countries trying to institute laws or containment policies to regulate AI. These are just two of the dimensions of this multidimensional race. Creating the human capital that can actually create this AI is something that China is leading today. So I think there are tons of interesting dimensions to these problems to think about that are relevant to CFR today.
So I think with that, I’ll stop. And maybe we can handle a few more questions. Thank you. (Applause.)
DUFFY: All right. Thank you so much, Sebastian. I hope you don’t miss the classroom too much. So I have a question for our audience first. Who here is most interested in the kind of divide that exists between the world of AI researchers and academic experts and the world of policymakers and building that bridge? Because that’s one area we could sort of lean into and focus on. So who in the room would like to kind of lean in on that? Raise your hands. Who would like to lean in more on the kind of hard science behind AI, how it’s working, what it’s going to be good at, what it’s not necessarily going to be good at? Who’s feeling that? OK, so that’s about half and half. Slightly more on the other side. And who is interested in just, like, AI bubble, like, what’s the hype? What’s not the hype? (Laughter.) Like, like signal and noise? Who would like to lean into signal and noise? Is anybody feeling that? OK.
So we’re about half and half then on the first two. So what we’re going to do is—it’s already—let’s see. We only have about twenty-two more minutes. Let’s just start by who has questions? Let’s go on and get—oh my gosh. So many questions. Maybe we just jump into questions, and then we’ll close it out maybe with some final thoughts. OK. Why don’t we start right here, and we’ll take two at a time, maybe to batch? We’ll start with you and you guys. And then we’ll just move progressively across the room.
Q: Thanks. That was so helpful.
I just had a question. (Coughs.) Sorry. Towards the end you said that between U.S. and China, China was sort of more advanced. It sounded like you were talking about human capital. But I was just curious what that was. I just wanted to kind of understand that a little bit better. Thank you.
ELBAUM: Yeah. Thank you for asking that. So—
DUFFY: We’re going to batch.
ELBAUM: Oh, OK.
Q: Hi. Trooper Sanders.
So we’re roughly between the State Department and the National Security Council—two complicated, weird organizations. How do you think about integrating AI into the high-stakes tasks and work that the people in those organizations do?
ELBAUM: OK.
DUFFY: So essentially two questions about human capital. But we’ll start with China, and then we’ll start with implementation. And as a foreign service spouse, may I just say it is weird. You’re right. It’s a weird organization.
ELBAUM: So what I was referring to with China was primarily two things. One is, you know, the U.S. has the models, has the leading companies, has the leading foundational models released in the wild. What I was referring to is actually, like, the number of scientists, the Ph.D. production of China, which renders ours tiny in comparison. So that is a significant difference that we need to be able to somehow equalize in some way. In the past, we have done it through, you know, immigration and other mechanisms. But I don’t think we can train our way out of this one quickly enough.
Q: Can I follow up just a quick question on that? Is there—is it more the number of papers in China or the quality of the papers coming out of the PRC?
ELBAUM: Yeah. I was talking particularly about the production of Ph.D. students, not about the papers. But let me just address the papers. The number of papers in AI is such today that we don’t have the reviewing capacity to identify the top papers anymore. I mean, it’s that bad. But if you look at the top AI conferences, which have double-blind submissions, meaning the authors don’t know who the reviewers are and the reviewers don’t know who the authors are, there is a significant number of papers from Chinese institutions at those conferences. So even in that sense, the balance is shifting already.
DUFFY: And then going to implementation for NSC, for State Department. What is it?
ELBAUM: Yeah.
DUFFY: What are you seeing when you—the more—especially, when you meet with policy professionals and get a sense of what they’re working with?
ELBAUM: So let me put in the caveat that I have been here for two months. That’s my policy experience. I can talk about CS a lot, but not so much about this. But I have found a few interesting things. First of all, there’s this huge gap that I see between the use cases that are listed or pursued at those organizations and where the technology is at. The technology is, you know, at a place that is far, far away from where I see it being implemented. The use cases that have been described feel like 2016-2017. Not just at those organizations, but the use cases listed at most organizations.
So I think that’s a challenge. The second difficulty that I see is, you know, even though you have a lot of directives to put a responsible AI chair and technology in place, and all of that, it feels like there’s such an asymmetry between the capacity of the personnel in place and what needs to be deployed in terms of technology, that simply saying we’ll have a responsible AI person who comes up with a framework—there’s just a huge distance between that and actually knowing how to do it and having the people who can actually put teeth to it. So I think the distance between the directives and the implementation is where I see the challenge, along with the difference between the use cases and where the technology is, where it feels like, you know, we solved that problem already. So it’s not as much a technology challenge as a, you know, more social, organizational challenge, in a sense.
DUFFY: Yeah. What is it, culture eats strategy for breakfast? Yeah. Well, I think it’s also—you know, we talk a lot about AI as a dual-purpose technology. But I think what’s so interesting in your question is also that it’s a dual-purpose theme, right? Because, for example, if you’re the chief AI officer at State, that’s overwhelmingly been a job about how State uses AI to do State’s job. Which is a fundamentally different question than what State’s foreign policy should be about AI. If you are going to be implementing AI at the Department of Education, that is a fundamentally different job, and question, and skill set than how the Department of Education should be putting out guidance for educators and schools about the use of AI, in terms of theory and principles of education, and then also in terms of implementation.
So you always end up in this land of what is the theoretical policy landscape. Like, what are our ethics and norms around the use of this technology? And then within those ethics and norms, how would you actually do it? And I think what’s so challenging is that, just like generative networks, right, that’s a generative question. And the theory and the norms will shift as the technology capacity shifts as well. And I don’t know that—I would say that—you know, one of the things that I would say the policy space really struggles with is the dynamism that that requires. Because policy has to—is deliberative by nature. And so the speed at which one deliberates, I think, is currently undercut by the speed with which that transformation is occurring.
ELBAUM: Yeah. I think that the pain points in some of these agencies are going to drive the adoption and the diffusion of this technology. So, you know, DOJ may be using the tools of 2017 until it starts seeing cases with the technology of today. You know, with deepfakes generating a lot of, you know, fake information, with logging into bank accounts with your voice no longer being secure. So I think that, like everything else, necessity and the pain points are going to drive the adoption. It may be a little bit painful because it will have to happen very, very fast.
DUFFY: And it could be old AI and still a dramatic improvement.
ELBAUM: Hmm, that’s true.
DUFFY: Yes. Sorry, I promised to go to the middle. So I’ll go here and here.
Q: Thank you for that very interesting presentation. I’m one of the people that raised my hand for, like, looking kind of under the hood at how these things work. I’m curious. You said that AI models have billions of parameters. Who creates the parameters? Is this—have people created a billion parameters? Or are we using AI to create—is AI training AI?
ELBAUM: Yeah. So—yeah.
DUFFY: You, sir.
ELBAUM: Thank you. Sorry.
Q: Thank you. I wonder if you’d comment on AI for rapid decisions. For example, in response to an attack that requires a response within a few seconds—a few minutes, rather. And, secondly, detection of what the other guy is up to. If there were an arms control agreement that prohibited killer robots or that made it harder to have a cyber apocalypse, could you tell—if you had an agreement—could you tell whether the other fellow is keeping it?
ELBAUM: OK. So the—
DUFFY: Is the AI training AI, first of all. Which feeds, I think, specifically then into the second question.
ELBAUM: Sounds good. I think the short answer is, no. This process is not manual. I think the developer—before, developers were creating if-then rules. You can think about programming that way. If A happens, do B. If B happens, do C. Now these models learn from data. The developer’s job is to pick the right architecture, OK? And that means how many neurons you want, how do you want them to be structured? But the training process that I described, which is you put data here, you run it through a network, you get some results, you compare that with the expected results, you compute an error, and based on that you adjust the parameters through a process called back propagation. All of that happens automatically.
It’s just that, even though it happens automatically, and you have super-duper GPUs computing these, it takes a lot of cycles to do this, even for big machines. Which is why the process is so expensive, and why it requires so much energy to actually do it. For these large models, it’s not something that you can do in an organization like this, or even in a larger lab like we have at UVA; it’s a challenge for us to build a model that is a tenth of that. So I hope that answers your question. So, short answer: everything is automated. You can think about the developer designing something, but the implementation is all automated. It has to be. The scale of this is something that we cannot fathom. I mean, it’s that big.
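As a minimal sketch of what "adjust the parameters automatically" means mechanically (one made-up training example, one parameter pair, plain Python): compute the error, use the chain rule to see how each parameter contributed, and nudge each one accordingly. Repeated over billions of parameters and enormous batches of data, this is what the GPUs are grinding through.

```python
# One training example: an input x and the label it should map to.
x, y_true = 3.0, 10.0
# Current parameter values of the simplest possible "network": f(x) = w*x + b.
w, b = 0.5, 0.0

# Forward pass: run the input through and compare with the expected result.
y_pred = w * x + b
error = y_pred - y_true

# Backward pass (backpropagation): the chain rule on the squared error
# tells us how sensitive the error is to each parameter.
grad_w = 2 * error * x
grad_b = 2 * error

# Adjust each parameter a little in the direction that shrinks the error.
learning_rate = 0.01
w -= learning_rate * grad_w
b -= learning_rate * grad_b
print(w, b)   # parameters after one automatic adjustment step
```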
I’ll move to your question. So I think you mentioned several things that trigger thoughts, but let me start with one of them. When we think about our military today and the scenarios that they envision, and the systems that we’re developing in terms of autonomy and decision power, and what we see in Ukraine with drones, to get to that level of automation I think the role of the human is going to be outside of this execution loop. For these short, seconds-long decisions that you’re talking about, I don’t think having a human is going to be pragmatic. And I’m not even sure it’s going to be ethical to have a human in the loop when the decisions need to be made in such a short time, and by people that may be under a lot of pressure, that personally I would not want a person making those decisions.
I would like to have a person in the design of the system and in the assurance that the system is correct before deployment. But this idea—even with some DOD directives of having a human in the loop to make decisions like this—I think it’s not realistic for a war context, with the fog of war, and all the pressures and the cognitive limits that we have as humans, and the designs or the scenarios for which these systems are designed. I know this is controversial. But I just see where the technology is today and where the vision is to use it. And I just see that as a significant inconsistency that we’ll have to come to grips with, and really stop having this illusion that having a human in the loop will make everything OK. And instead, really put the focus on constructing these systems well, and building trust in the systems and the people that operate them at the right level.
DUFFY: And I think, you know, one of the conversations that Sebastian and I had very early on was on this question of human in the loop and accountability. And I think, from a policy side, a lot of people say human in the loop because they want to know where human accountability sits. Because human accountability is a core component of how we function societally. And when I say accountability in a war context, I am very much thinking of rule of law, law of war, international humanitarian law, Geneva Convention. Like, who would be accountable? Was there a war crime? Who signed off on it? What does that accountability look like?
When Sebastian hears accountability, he thinks auditability, right? Like he thinks about auditability of the system. So I was saying accountability and he was saying accountability, and we were talking about fundamentally different concepts. And so we had to take a moment to be, like, wait, what are you saying? What are you saying? Oh, we’re having a totally different conversation. We think we’re talking about the same thing, and we’re not. So when I—so also in this sort of broader discussion of how we bridge these spaces, we have a very immature lexicon at the moment for navigating a very complex and unprecedented terrain.
ELBAUM: I’m happy to follow up. I know your questions had different parts. But I’m happy to stay and follow up on the recognition of—
Q: You’re putting a lot of weight on the person—on the people who create the training—the training data set to include an awful lot of contextual factors, if you’re going to keep a human out of the loop for rapid decisions. We didn’t talk about the detectability of what—
ELBAUM: Yeah. Yeah. And let me just say, I am putting a lot of responsibility on those, absolutely right. At the same time, when you need to make a decision in one second or in seconds, the context that the human can have compared with the machine, I don’t believe that that’s going to be an advantage going forward. Yeah.
DUFFY: So, yeah, we’ll go here and here. And then we’ll go over to this side.
Q: When you showed the slide about Ukraine earlier, you gave some probabilities of various outcomes, or the AI gave some probabilities of various outcomes, based, according to the slide, on information from Chatham House and the Wilson Center. There must be some way in which this system has evaluated multiple sources of information and decided that those matter and others don’t. How do we, as consumers, know how the model makes that choice? And how do we, as potential consumers, know what biases are built into the model in this way when we’re buying it?
DUFFY: That’s a great question.
Q: Hi. I had a question that was more on the kind of commercial outlook of the industry. You talked about foundational models, about the incredible resources required. And obviously, as we all have read in the media, that means that there will be a few winners, at least at the foundational level, that have the capital to invest in developing these models. Do you think that is—we’re looking at an industry that’s dominated just by a few players that have access to capital? Or is there an application layer that’s going to develop, just like it did with, you know, Google Maps being the foundation and so many apps that developed around it?
ELBAUM: Yeah, excellent.
DUFFY: So you have citeability and provenance, essentially, and then the commercial domination.
ELBAUM: Yeah. So I think the first question was really about how much trust we put in, or how we actually assess, the quality of the information we get. So for some of the stuff that I was showing you here, I was able to constrain the sources of information to a few. So that’s one recourse that you have as a user in terms of the scope of information and the sources that it retrieves. But this is still a probabilistic model. And that’s why I showed you that, you know, it’s predicting the next word. It’s not actually analyzing whether I trust this think tank more than that think tank. It’s really about probabilities, given a pool of data that it considers.
So critical to this is that if you’re going to use these types of large language models that have been trained on data that we’ll never know—and you are not creating your own model where you can really curate what goes into the model—then I think you have a responsibility to create prompts and queries, multiple queries, that analyze the data in multiple ways, that slice it in different ways. Because in a sense, you’re making a query, and you’re testing the outcome. And the more ways that you have to triangulate that the output is sound, that is what should give you confidence in these, as a human process on the back end.
But it’s an excellent question. I think models that are so general are going to have holes that are not going to be found by the company. They’re not going to be found by the NIST organization on AI safety. They’re going to be found by users. And I think educated users should be aware of these errors, or the potential failure modes. It’s really important, so that you can actually check the correctness. I mean, my students are writing programs using ChatGPT these days, right? So in a sense, they’re becoming masters of generating code instead of writing their own. And part of my job is actually helping them generate better prompts and making sure that, if they’re going to use ChatGPT code, they have really good tests.
Let me go to your question. And, yes, it is concerning that the field is dominated by five or six large companies. And they are the usual suspects. And that has a lot of ramifications. But I think you’re correct, at least in saying that there is a good possibility of building a lot of technology on top of these models to do very, very good targeted tasks. So you can think about this as an architecture that has a general layer provided by these types of tools, where you can specialize with more layers on top, and more layers on top. Now, with this architecture, where you’re building on top of things, I think the difference between that and Google Maps is that Google Maps was actually very sliced, very well defined in scope.
I think with these architectures being so broad, being applied to so many things, and having holes in places that we don’t know about, the challenge is that if the high-level system that you build on top of several layers of these AI components fails, then it’s hard to track the failure. It’s hard to attribute it to the source of the failure. And that makes the development of this stack much, much harder.
DUFFY: And then we don’t know who to sue. And nothing is more American than knowing who to sue, right? (Laughter.) OK. We are almost at time, and CFR is very much into punctuality. So we’re going to go here quickly, and we’ll do one last quick question wrap up, but then Sebastian and I can hang out for a little bit for anyone who wants to stay. I just—CFR has drilled into me that we must let everyone go exactly on time. So.
Q: Thank you.
I’m interested in your thoughts on the military industrial angle to this. And especially given the dominance of the private sector in developing these technologies, and the complete decimation of government participation in that field—I mean, you think back to the Manhattan Project and the wholesale sort of envelopment of private sector academic efforts into a government effort. Given sort of your comments about the military role of this technology, have you encountered AI technology that you, yourself, think is dangerous in the private sector? That you, yourself, look at and sort of think, this should be classified. This is something that average people shouldn’t have access to?
ELBAUM: Well, let me answer. I think there were two questions there. And I think you’re right that the cutting-edge technology is not in government anymore. I would have to say, though, that in the last few years, I think, like, the DIU program at DOD is particularly interesting to me as an innovation model, where you can actually bring a matching partner from industry and team them up with a group at DOD that has a particular problem to solve, where it is very clear what the pain points are and what the business opportunity is, to basically cut down this cycle from technology creation to deployment in the military. I think that sounds very auspicious.
Have I encountered AI technology that I’m concerned with? Look, my team has generated millions of tests for autonomous vehicles. I don’t know, do I need to say more? No. (Laughter.) So I think there are deployments that seem premature, absolutely. I think there is technology out there that has not been vetted properly. And we find that out afterwards. And I think, with general AI like this, the bar where I put it with these LLM models, for example, is that if you can Google something and find it, then that’s the same as these LLMs providing you the answer.
I think the question is when they get—and they’re not too far from being able to provide answers by putting together data from many other sources to give you answers that you couldn’t just find by searching the web for it. You know, like weapons or things like that. And I think the big concern here is that can we have the containment and the guide rail mechanisms that will stop that type of information from being assembled and delivered to users? Because, like I said, ChatGPT’s successful everywhere.
DUFFY: And with that, I need to let everyone go because of CFR’s very strict punctuality rules. But Sebastian and I will be around for anyone. I would say, Sebastian’s going to be with us all year here at CFR, and is specifically looking at autonomous systems and the national security and the military implications. So especially for any of you who are working in those areas or have friends or contacts or know members who are working in those areas, if you have any recommendations for Sebastian as he boldly ventures into policy-landia, then also please do volunteer those. We would love to hear them.
ELBAUM: Thank you, Kat.
DUFFY: And with that, I want to thank Sebastian so much for taking the time. Thank all of you. (Applause.)
ELBAUM: Thank you.
DUFFY: It was wonderful to see you all.
(END)
This is an uncorrected transcript.