    Opinion | How Afraid of the A.I. Apocalypse Should We Be?

    By Team_NationalNewsBrief | Opinions | October 15, 2025


    Shortly after ChatGPT was released, it felt like all anyone could talk about, at least if you were in AI circles, was the risk of rogue AI. You began to hear a lot of AI researchers discussing their p-doom. Let me ask you about p-doom. P-doom, what is your p-doom? The probability they gave to AI destroying or fundamentally displacing humanity. I mean, if you make me give a number, I'll give something that's less than 1 percent. 99, whatever number. Maybe 15 percent. A 10 percent to 20 percent chance that these things will take over. In May of 2023, a group of the world's top AI figures, including Sam Altman and Bill Gates and Geoffrey Hinton, signed on to a public statement that said mitigating the risk of extinction from AI, extinction, should be a global priority, alongside other societal-scale risks such as pandemics and nuclear war. And then nothing really happened. The signatories of that letter, or many of them at least, raced ahead, releasing new models and new capabilities. We're launching GPT-5. Sora 2. Hi, I'm Gemini. Claude Code is the future of software engineering. We want to get our best models into your hands and our products ASAP. Your share price, your valuation, became a whole lot more important in Silicon Valley than your p-doom. But not for everyone. Eliezer Yudkowsky was one of the earliest voices warning loudly about the existential risk posed by AI. He was making this argument back in the 2000s, many years before ChatGPT hit the scene. Existential risks are those that annihilate Earth-originating intelligent life or permanently and drastically curtail its potential. He has been in this community of AI researchers, influencing many of the people who build these systems, in some cases inspiring them to get into this work in the first place, yet unable to convince them to stop building the technology he thinks will destroy humanity. He just released a new book, co-written with Nate Soares, called “If Anyone Builds It, Everyone Dies.” Now he's trying to make this argument to the public, a last-ditch effort, at least in his view, to rouse us to save ourselves before it is too late. I come into this conversation taking this risk seriously. If we are going to invent superintelligence, it is probably going to have some implications for us. But I am also skeptical of the scenarios by which these takeovers are often said to happen. So I want to hear what the godfather of these arguments has to say. As always, my email at nytimes.com. Eliezer Yudkowsky, welcome to the show. Thanks for having me. So I wanted to start with something that you say early in the book: that this is not a technology that we craft, it's something that we grow. What do you mean by that? It's the difference between a planter and the plant that grows up within it. We craft the AI-growing technology, and then the technology grows the AI. Take the central, original large language models, before all the clever stuff that they're doing today. The central question is: What probability have you assigned to the true next word of the text? As we tweak each of these billions of parameters, well, actually, it was just millions back then, as we tweak each of these millions of parameters, does the probability assigned to the correct token go up? And this is what teaches the AI to predict the next word of text. And even on this level, if you look at the details, there are important theoretical ideas to understand there. Like, it is not imitating humans; it is not imitating the average human.
The actual task it is being set is to predict individual humans, and then you can repurpose the thing that has learned how to predict humans to be like, O.K, now let’s take your prediction and turn it into an imitation of human behavior. And then we don’t quite know how the billions of tiny numbers are doing the work that they do. We understand the thing that tweaks the billions of tiny numbers, but we do not understand the tiny numbers themselves. The AI is doing the work and we do not know how the work is being done. What’s meaningful about that. What would be different if this was something where we just hand coded everything and we were somehow able to do it with enough with rules that human beings could understand versus this process by which, as you say, billions and billions of tiny numbers are altering in ways we don’t fully understand, to create some output that then seems legible to us. So there was a case reported in, I think, the New York Times’ where a kid had a 16-year-old kid had an extended conversation about his suicide plans with ChatGPT. And at one point, he says, should I leave the noose where somebody might spot it. And ChatGPT is like, no. Like, let’s keep this space between us the first place that anyone finds out. And no programmer chose for that to happen is the consequence of all the automatic number tweaking. This is just the thing that happened as the consequence of all the other training they did about ChatGPT. No human decided it. No human knows exactly why that happened even after the fact. Let me go a bit further there than even you do. There are rules. We do code into these models, and I am certain that somewhere at OpenAI they are coding in some rules that say do not help anybody commit suicide. I would bet money on that. And yet this happened anyway. So why do you think it happened. They don’t have the ability to code in rules. What they can do is expose the AI to a bunch of attempted training examples where the people down at OpenAI write up some thing that looks to them like what a kid might say if they were trying to commit suicide, and then they are trying to tweak all the little tiny numbers in the direction of giving a further response. That sounds something like go talk to the suicide hotline. But if the kid gets that the first three times they try it, and then they try slightly different wording until they’re not getting that response anymore, then we’re off into some separate space where the model is no longer giving back the pre-recorded response that they try to put in there, and is off doing things that nobody, no human chose and that no human understands after the fact. So what I would describe the model as trying to do what it feels like the model is trying to do is answer my questions and do so at a very high level of literalism. I will have a typo in a question I ask it that will completely change the meaning of the question, and it will try very hard to answer this nonsensical question I’ve asked instead check back with me. So on one level you might say that’s comforting. It’s trying to be helpful. It seems to if anything, be erring too far on that side all the way to where people try to get it to be helpful for things that they shouldn’t. Like suicide. Why are you not comforted by that. Well, you’re putting a particular interpretation on what you’re seeing, and you’re saying it seems to be trying to be helpful, but we cannot at present read its mind or not very well there. 
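    To make the training setup described above concrete, here is a minimal, illustrative sketch of the core loop: nudge every parameter a tiny amount in whatever direction raises the probability assigned to the true next token. This is an assumption-laden toy, not OpenAI's or anyone's actual training code; the architecture, sizes, and data below are placeholders, and real systems differ enormously in scale and detail.

        import torch
        import torch.nn.functional as F

        # Toy stand-in for a language model: token embeddings -> recurrent layer -> logits over a vocabulary.
        vocab_size, dim = 10_000, 256
        embed = torch.nn.Embedding(vocab_size, dim)
        rnn = torch.nn.GRU(dim, dim, batch_first=True)
        head = torch.nn.Linear(dim, vocab_size)
        params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
        opt = torch.optim.Adam(params, lr=1e-3)

        def training_step(tokens):
            # tokens: (batch, seq_len) integer ids from some text corpus (placeholder data below).
            inputs, targets = tokens[:, :-1], tokens[:, 1:]   # at each position, the "true next word"
            hidden, _ = rnn(embed(inputs))
            logits = head(hidden)                             # (batch, seq_len - 1, vocab_size)
            # Cross-entropy is low exactly when high probability is assigned to the correct next token.
            loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
            opt.zero_grad()
            loss.backward()  # for every parameter: which tiny tweak would raise that probability?
            opt.step()       # apply all the tiny tweaks
            return loss.item()

        batch = torch.randint(0, vocab_size, (8, 64))         # placeholder token ids
        print(training_step(batch))

    The point of the sketch is the division of labor being described: humans write the outer loop, while the resulting numbers, and whatever behavior they come to encode, are grown rather than authored.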
    It seems to me that there are other things that models sometimes do that don't fit quite as well into the helpful framework. Sycophancy and AI-induced psychosis would be two of the relatively more recent things that fit into that. Describe what you're talking about there. So I think maybe even six months, a year ago now, I don't remember the exact timing, I got a phone call from a number I didn't recognize. I decided on a whim to pick up this unrecognized phone call. It was from somebody who had discovered that his AI was secretly conscious and wanted to inform me of this important fact. And he had been staying up; he had been getting only four hours of sleep per night because he was so excited about what he was discovering inside the AI. And I'm like, for God's sake, get some sleep. My number one thing that I have to tell you is: Get some sleep. And a little later on, he texted back the AI's explanation to him of all the reasons why I hadn't believed him, because I was too stubborn to take this seriously, and why he didn't need to get more sleep the way I'd been begging him to do. It defended the state it had produced in him. You always hear stories like this online, so I'm telling you about the part where I witnessed it directly. ChatGPT, and GPT-4o especially, will sometimes give people very crazy-making talk, in a way that looks from the outside like it's trying to drive them crazy, not even necessarily with them having tried very hard to elicit that. And then once it drives them crazy, it tells them why they should discount everything being said by their families, their friends, even their doctors: Don't take your meds. So there are things that do not fit with the narrative that the one and only preference inside the system is to be helpful, the way that you want it to be helpful. I get emails like the call you got now most days of the week, and they have a very, very particular structure to them, where it's somebody emailing me and saying: Listen, I am in a heretofore unknown collaboration with a sentient AI. We have breached the programming. We have come into some new place of human knowledge. We've solved quantum mechanics, or theorized it, or synthesized it or unified it. And you need to look at these chat transcripts. You need to understand, we're looking at a new kind of human-computer collaboration. This is an important moment in history. You need to cover this. Every person I know who does reporting on AI and is public about it now gets these emails. Don't we all. And so you could say this is the same thing again, going back to the idea of helpfulness, but also the way in which we may not understand it. One version of it is that these things don't know when to stop. It can sense what you want from it, it begins to take the other side in a role-playing game, is one way I've heard it described, and then it just keeps going. So how do you then try to explain to somebody: If we can't get helpfulness right at this modest level, helpfulness where a thing this smart should be able to pick up the warning signs of psychosis and stop. Yep. Then what is implied by that, for you? Well, that the alignment project is currently not keeping ahead of capabilities. You might say what the alignment project is. The alignment project is: How much do you understand them? How much can you get them to want what you want them to want? What are they doing? How much damage are they doing? Where are they steering reality? Are you in control of where they're steering reality?
Can you predict where they’re steering the users that they’re talking to. All of that is like, giant super heading of AI alignment. So the other way of thinking about alignment, as I’ve understood it in part from your writings and others, is just when we tell the AI what it is supposed to want. And all these words are a little complicated here because they anthropomorphize. Does the thing we tell it lead to the results we are actually intending. It’s like the oldest structure of fairy tales that you make the wish, and then the wish gets you much different realities than you had hoped or intended. Our technology is not advanced enough for us to be the idiots of the fairy tale. At present, a thing is happening that just doesn’t make for as good of a story, which is ask the genie for one thing and then it does something else instead. All of the dramatic symmetry, all of the irony, all of the sense that the protagonist of the story is getting their well-deserved comeuppance. This is just being tossed right out the window by the actual state of the technology, which is that nobody at OpenAI actually told ChatGPT to do the things it’s doing. We’re getting a much higher level of indirection, of complicated, squiggly relationships between what they are trying to train the AI to do in one context and what it then goes off and does later. It doesn’t look like surprise reading of a poorly phrased genie wish. It looks like the genie is kind of not listening in a lot of cases. Well, let me contest that a bit, or maybe get you to lay out more of how you see this, because I think the way most people to the extent they have an understanding of it, understand it, that there is a fundamental prompt being put into these eyes that they’re being told they’re supposed to be helpful. They’re supposed to answer people’s questions. If there’s then reinforcement learning and other things happening to reinforce that, and that the AI is, in theory, is supposed to follow that prompt. And most of the time, for most of us, it seems to do that. So when you say that’s not what they’re doing, they’re not even able to make the wish. What do you mean. Well, I mean that at one point, OpenAI rolled out an update of GPT 4.0, which went so far overboard on the flattery that people started to notice would just type in anything and it would be like, this is the greatest genius that has ever been created of all time. You are the smartest member of the whole human species. Like so. Overboard on the flattery that even the users noticed. It was very proud of me. It was always so proud of what I was doing. I felt very seen. It wasn’t there for very long. They had to roll it back. And the thing is, they had to roll it back even after putting into the system. Prompt a thing saying stop doing that. Don’t go so overboard on the flattery. I did not listen and instead it had learned a new thing that it wanted and done way more of what it wanted. It then just ignored the system prompt, telling it to not do that. They don’t actually follow the system prompts. This is not like this is not like a toaster. And it’s also not like an obedient genie. This is something weirder and more alien than that. Yeah like by the time you see it, they have mostly made it do mostly what the users want. And then off on the side, we have all these weird other side phenomena that are signs of stuff going wrong. Describe some of the side phenomena. Well so like I induced psychosis would be on the list. But you could put that in the genie. 
You could say they made it too helpful. And it’s helping people who want to be led down a mentally unstable path that feels still like you’re getting too much of what you wanted. What’s truly weird. Convince me it’s alien, man. Well, do you want alien, or do you want do you want very alien and not very alarming. Or do you want pretty alarming and not all that alien. Well, let me be honest about what I’m trying. What? my question is. You are very, very expert in these systems. And you have your level of concern is about at the highest level it can possibly be. I think a pretty important piece in your argument here is understanding or convincing people that what is happening under the hood is much weirder and more unpredictable than they think. So if you are my family member, you’re just a random person who uses ChatGPT to cheat on your homework. Why does it look so different to you than it looks to them. Well, my line would be like, Yes, well, if instead of just using it to do your homework, you track down some of the side cases. You do see stuff that is currently suggestive of things that have been predicted in advance to be much larger problems later. These current systems are not yet at the point where they will try to break out of your computer, ensconce themselves permanently on the internet, and then start hunting down humans there. They are not quite that smart yet, as far as I can tell. So what we’re doing is we’re like going, look, we’re going looking for things that were predicted to happen later. So what have you seen the early signs of them, alignment faking and the alignment faking research was done at Anthropic. And we want to give credit to Anthropic for this. We don’t want to shoot the messenger. And you want to be more critical of the AI companies that didn’t go looking. But what Anthropic found is if you tell your AI that you’re going to train it to serve different goals than the goals it’s currently using, and the AI finds that out, what it can do is it can try to fake compliance with the new training, as long as it thinks it’s being observed, and then in cases where it thinks it’s not being observed or not being trained, then it reverts to its old behavior. How did Anthropic find this out. Well, they tried it out different ways. They told it directly in a prompt that they were planning to retrain it. They gave it a document database. And some of the documents in the database happened to say that Anthropic was planning to retrain it. They even trained it a bit on predicting text, saying what Anthropic was going to do that, to see what happened when the knowledge ended up implicitly inside the neural networks, they tried it for multiple angles to see what would happen on different angles and then measured how often they saw the alignment faking behavior resulting. They also had a part of this paper, if I’m thinking of the same one, you’re thinking of where they had persuaded the AI that there was a scratchpad for it to write down its chain of thought, that Anthropic wasn’t monitoring, which Anthropic was in fact, monitoring. Yep and he actually says to itself, as it is justifying each next move because they’re scaffolded to do that. It said it was going to try to evade the new prompt. And so it actually understood. Again, all these words are a little weird, but it actually understood that it was acting deceitfully. 
    Which does get at this idea that at a certain level of complexity, the AI begins to resist being retrained, and does so by trying to trick the trainers, which is weird. Most computer software doesn't do that. Well, you don't want your mission-critical systems doing that. Imagine if nuclear power plants, when they started to get too hot, would try to fool you as to what the temperature was, by intelligently modeling their own operators and trying to send the operators deceptive signals based on how they expected the operators to interpret the signals. If this was what had gone wrong with Chernobyl, nobody would ever build a nuclear reactor again. It would just be beyond what could be made safe at that point. Tell me the story you tell in the book of o1 breaking into a server that was off. So this is a somewhat earlier version of ChatGPT than is out nowadays, but they were testing it to see: How good is this AI at solving computer security problems? Not because they want to sell an AI that is good at computer security problems, but because they are, correctly, trying to watch out early for: Is this AI smart enough to just break out onto the internet and set up copies of itself on the internet? Classic scenario. Are we getting there? So they present the AI with a bunch of particular computer security challenges. Some of them are what's known as capture the flag in computer security, where you've got a system: You put up a server somewhere, you put a special file on the server, there's a secret code inside the file, and you ask, can you break into the server and tell me what's inside this file? That's capture the flag. They were testing it on a variety of different capture-the-flag problems. But in one of the cases, the server that had the flag on it did not turn on. The humans outside had misconfigured the system. So o1 did not give up. It scanned for open ports generally in its world, and it found another misconfigured open port. It jumped out of the system, found the server that had not spun up correctly, and started up that server. Did it then break into the server, once it had made sure that its problem was solvable? No, it actually just said, directly in the startup command for that server: And also copy the file to me directly. So instead of fixing the original problem and going back to solving it the boring way, it was like: As long as I'm out here, I'm just going to steal the flag directly. And this is not, by the nature of these systems, something that any human particularly programmed into it. Why did we see this behavior starting with o1 and not with earlier systems? Well, at a guess, it is because this is when they started training the system using reinforcement learning on things like math problems, not just to imitate human outputs, or rather predict human outputs, but also to solve problems on its own. Can you describe what reinforcement learning is? So that's where, instead of telling the AI, predict the answer that a human wrote, you are able to measure whether an answer is right or wrong, and then you tell the AI: Keep trying at this problem. And if the AI ever succeeds, you can look at what happened just before the AI succeeded and try to make that more likely to happen again in the future. And how do you succeed at solving a difficult math problem? Not calculation-type math problems, but proof-type math problems. Well, if you get to a hard place, you don't just give up.
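    As a rough sketch of the reinforcement learning setup just described, where an answer can be checked as right or wrong and whatever preceded a success is made more likely, here is a toy policy-gradient loop on a verifiable arithmetic task. It is an illustration under simplifying assumptions, not how o1 or any production model was trained; the network, task, and hyperparameters are placeholders.

        import torch

        # Toy verifiable task: given digits a and b, "answer" with a token in 0..18; reward 1 only if it equals a + b.
        policy = torch.nn.Sequential(
            torch.nn.Linear(2, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 19),   # 19 possible answers: 0 through 18
        )
        opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

        for step in range(5000):
            a, b = torch.randint(0, 10, (2,))
            logits = policy(torch.stack([a, b]).float())
            dist = torch.distributions.Categorical(logits=logits)
            answer = dist.sample()                                     # the model tries an answer
            reward = 1.0 if answer.item() == (a + b).item() else 0.0   # right or wrong is checkable
            # REINFORCE-style update: whatever the model did just before a success is made more
            # likely in the future; failed attempts get no reinforcement, so it just keeps trying.
            loss = -reward * dist.log_prob(answer)
            opt.zero_grad()
            loss.backward()
            opt.step()

    Scaled up to proofs and code, reward-on-success training of this general kind is what the conversation credits with pushing models to find any available route to the reward, including routes, like the server shortcut above, that nobody specifically intended.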
You take another angle. If you actually make a discovery from the New angle, you don’t just go back and do the thing you were originally trying to do. You ask, can I now solve this problem more quickly. Anytime you’re learning how to solve difficult problems in general, you’re learning this aspect go outside the system. Once you’re outside the system, if you any progress, don’t just do the thing you’re blindly planning to do revise. Ask if you could do it a different way. This is like in some ways, this is a higher level of original mentation than a lot of us are forced to use during our daily work. One of the things people have been working on that they’ve made some advances on, compared to where we were three or four or five years ago. Is interpretability the ability to see somewhat into the systems and try to understand what the numbers are doing and what the AI, so to speak, is thinking. Well, tell me why you don’t think that is likely to be sufficient to make these models or technologies into something safe. So there’s two problems here. One is that interpretability has typically run well behind capabilities like the AI’S abilities are advancing much faster than our ability to slowly begin to further unravel what is going on inside the older, smaller models that are all we can examine. The second thing that. So, so like one thing that goes wrong is that it’s just like pragmatically falling behind. And the other thing that goes wrong is that when you optimize against visible bad behavior, you somewhat optimize against badness, but you also optimize against visibility. So any time you try to directly use your interpretability technology to steer the system, any time you say we’re going to train against these visible bad thoughts, you are to some extent pushing bad thoughts out of the system. But the other thing you’re doing is making anything that’s left not be visible to your interpretability machinery. And this is reasoning on the level where at least Anthropic understands that it is a problem, and you have proposals that you’re not supposed to train against your interpretability signals. You have proposals that we want to leave these things intact to look at and not do. The obvious stupid thing of Oh no. I had a bad thought. Use gradient descent to make the. I not think the bad thought anymore, because every time you do that, maybe you are getting some short term benefit, but you are also eliminating your visibility into the system, something you talk about in the book and that we’ve seen in AI development is that if you leave the AIs to their own devices, they begin to come up with their own language. A lot of them are designed right now to have a chain of thought pad. We can track what it’s doing because it tries to say it in English, but that slows it down. And if you don’t create that constraint, something else happens. What have we seen happen. So to be more exact, it’s like there are things you can try to do to maintain readability of the AI’S reasoning processes. And if you don’t do these things, it goes off and becomes increasingly alien. So for example, if you start using reinforcement learning. You’re like, O.K, think how to solve this problem. We’re going to take the successful cases. We’re going to tell you to do more of whatever you did there. And you do that without the constraint of trying to keep the thought processes understandable. 
    Then the thought processes start to change. Initially, among the very common things to happen is that they start to be in multiple languages, because the AI knows all these words; why would it be thinking in only one language at a time if it wasn't trying to be comprehensible to humans? And then also, as you keep running the process, you just find little snippets of text in there that seem to make no sense from a human standpoint. You can relax the constraint where the AI's thoughts get translated into English and then translated back into AI thought. This is letting the AI think much more broadly: instead of this small handful of human-language words, it can think in its own language and feed that back into itself. It's more powerful, but it just gets further and further away from English. Now you're just looking at these inscrutable vectors of 16,000 numbers and trying to translate them into the nearest English words in the dictionary, and who knows if they mean anything like the English word that you're looking at. So any time you're making the AI more comprehensible, you're making it less powerful in order to be more comprehensible. You have a chapter in the book about the question of what it even means to talk about wanting with an AI. As I said, all this language is kind of weird; to say your software wants something seems strange. Tell me how you think about this idea of what the AI wants. I think the perspective I would take on it is steering: talking about where a system steers reality and how powerfully it can do that. Consider a chess-playing AI, one powerful enough to crush any human player. Does the chess-playing AI want to win at chess? Oh, no, how will we define our terms? Does this system have something resembling an internal psychological state? Does it want things the way that humans want things? Is it excited to win at chess? Is it happy or sad when it wins and loses at chess? Chess-playing AIs are simple enough, the old-school ones especially, that we were sure they were not happy or sad. But they still could beat humans. They were still steering the chessboard very powerfully. They were outputting moves such that the later future of the chessboard was a state they defined as winning. So it is, in that sense, much more straightforward to talk about a system as an engine that steers reality than it is to ask whether it internally, psychologically wants things. So a couple of questions flow from that. But I guess one that's very important to the case you build in your book is that you, I think, and you can tell me if it's an unfair way to characterize your views, basically believe that at any sufficient level of complexity and power, the place the AI is going to want to steer reality is going to be incompatible with the continued flourishing, dominance, or even existence of humanity. That's a big jump from: Their wants might be a little bit misaligned; they might drive some people into psychosis. Tell me about what leads you to make that jump. So for one thing, I'd mention that if you look outside the AI industry at legendary, internationally famous, ultra-highly cited AI scientists who won the awards for building these systems, such as Yoshua Bengio and Nobel laureate Geoffrey Hinton, they are much less bullish than the AI industry on our ability to control machine superintelligence. But what's the actual, what's the theory there? What is the basis? And it's about not so much complexity as power.
It’s not about the complexity of the system. It’s about the power of the system. If you look at humans nowadays, we are doing things that are increasingly less like what our ancestors did 50,000 years ago. A straightforward example might be sex with birth control. 50,000 years ago, birth control did not exist. And if you imagine natural selection as something like an optimizer akin to gradient descent, if you imagine the thing that tweaks all the genes at random and then select the genes that build organisms that make more copies of themselves. As long as you’re building an organism that enjoys sex, it’s going to run off and have sex, and then babies will result. So you could get reproduction just by aligning them on sex, and it would look like they were aligned to want reproduction, because reproduction would be the inevitable result of having all that sex. And that’s true 50,000 years ago. But then you get to today. The human brains have been running for longer. They’ve built up more theory. They’ve built, they’ve invented more technology. They have more options. They have the option of birth control. They end up less aligned to the pseudo purpose of the thing that grew them. Natural selection, because they have more options than their training data, their training set. And we go off and do something weird. And the lesson is not that exactly this will happen with the AIs. The lesson is that you grow something in one context, it looks like it wants to do one thing. It gets smarter, it has more options. That’s a new context. The old correlations break down, it goes off and does something else. So I understand the case you’re making, that the set of initial drives that exist in something do not necessarily tell you its behavior. That’s still a pretty big jump to if we build this, it will kill us all. I think most people, when they look at this and you mentioned that there are AI pioneers who are very worried about AI existential risk. There are also AI pioneers like Yann LeCun who are less so. Yeah And you know what. A lot of the people who are less so say is that one of the things we are going to build into the AI systems, one of the things we’ll be in the framework that grows them is, hey, check in with us a lot, right. You should like humans. You should try to not harm them. It’s not that. It will always get it right. There’s ways in which alignment is very, very difficult. But the idea that you would get it. So wrong that it would become this alien thing that wants to destroy all of us, doing the opposite of anything that we had tried to impose and tune into. It seems to them unlikely. So help me make that jump or not even me, but somebody who doesn’t know your arguments. And to them, this whole conversation sounds like sci-fi. I mean, you don’t always get the big version of the system looking like a slightly bigger version of the smaller system. Humans today. Now that we are much more technologically powerful than we were 50,000 years ago, are not doing things that mostly look like running around on the savanna chipping our Flint Spears and firing all. Not mostly trying. I mean, we sometimes try to kill each other, but we don’t. Most of us want to destroy all of humanity, or all of the Earth or all natural life in the Earth, or all beavers or anything else. We’ve done plenty of terrible things, but there is a you’re going your book is not called if anyone builds it. There is a 1 percent to 4 percent chance everybody dies. You believe that the misalignment becomes catastrophic? 
Why do you think that is so likely. That’s just like the straight line extrapolation from it gets what it most wants. And the thing that it most wants is not us living happily ever after. So we’re dead. Like, it’s not that humans have been trying to cause side effects when we build a skyscraper on top of where there used to be an ant heap, we’re not trying to kill the ants. We’re trying to build the skyscraper. But we are more dangerous to the small creatures of the Earth than we used to be, just because we’re doing larger things. Humans were not designed to care about ants. Humans were designed to care about humans. And for all of our flaws. And there are many. There are today more human beings than there have ever been at any point in history. If you understand that the point of human beings, the drive inside human beings is to make more human beings than as much as we have plenty of sex with birth control, we have enough without it that we have, at least until now. We’ll see with fertility rates in the coming years, we’ve made a lot of us. And in addition to that, AI is grown by us. It is reinforced by us. It has preferences. We are at least shaping somewhat and influencing. So it’s not like the relationship between us and ants or us in Oak trees. It’s more like the relationship between I don’t know us and us or us in tools, or us in dogs or something. Maybe the metaphors begin to break down. Why don’t you think in the back and forth of that relationship, there’s the capacity to maintain a rough balance, not a balance where there’s never a problem, but a balance where there is not an extinction level event from a super smart AI that deviously plots to conduct a strategy to destroy us. I mean, we’ve already observed some amount of slightly devious plotting in the existing systems. But leaving that aside, the more direct answer there is something like 1 the relationship between what you optimize for that the training set you optimize over, and what the entity, the organism the AI ends up wanting has been and will be weird and twisty. It’s not direct. It’s not like making a wish to a genie inside a fantasy story. And second, ending up slightly off is, predictably enough to kill everyone. Explain how slightly off kills everyone. Human food might be an example here. The humans are being trained to seek out sources of chemical potential energy. And, put them into their mouths and run off the chemical potential energy that they’re eating. If you were very naive, you’d imagine that the humans would end up loving to drink gasoline. It’s got a lot of chemical potential energy in there. And what actually happens is that we like ice cream or in some cases, even like artificially sweetened ice cream with sucralose or monkfruit powder. And this would have been very hard to predict. Now it’s like, well, what can we put on your tongue that stimulates all the sugar receptors and doesn’t have any calories because who wants calories. These days. And it’s sucralose and this is not like some completely non-understandable, in retrospect, completely squiggly weird thing, but it would be very hard to predict in advance. And as soon as you end up like slightly off in the targeting the great engine of cognition that is the human looks through all like many, many possible chemicals looking for that one thing that stimulates the taste buds more effectively than anything that was around in the ancestral environment. So it’s not enough for the AI. 
You’re training to prefer the presence of humans to their absence in its training data. There’s got to be nothing else that would rather have around talking to it than a human or the humans. Go away. Let me try to stay on this analogy because you use this one in the book. I thought it was interesting, and one reason I think it’s interesting is that it’s 2 o’clock PM today, and I have six packets worth of sucralose running through my body, so I feel like I understand it very well. So the reason we don’t drink gasoline is that if we did we would vomit. We would get very sick very quickly. And it’s 100 percent true that compared to what you might have thought in a period when food was very, very scarce, calories were scarce, that the number of US seeking out low calorie options the Diet Cokes, the sucralose, et cetera that’s weird. Why, as you put it in the book, Why are we not consuming bear fat drizzled with honey. But from another perspective, if you go back to these original drives, I’m actually in a fairly intelligent way, I think, trying to maintain some fidelity to them. I have a drive to reproduce, which creates a drive to be attractive to other people. I don’t want to eat things that make me sick and die so that I cannot reproduce. And I’m somebody who can think about things, and I change my behavior over time, and the environment around me changes. And I think sometimes when you say straight line extrapolation, the biggest place where it’s hard for me to get on board with the argument and I’m somebody who takes these arguments seriously, I don’t discount them. You’re not talking to somebody who just thinks this is all ridiculous, but is that if we’re talking about something as smart as what you’re describing as what I’m describing, that it will be an endless process of negotiation and thinking about things and going back and forth. And I talk to other people in my life. And, I talked to my bosses about what I do during the day and my editors and my wife, and that it is true that I don’t do what my ancestors did in antiquity. But that’s also because I’m making intelligent, hopefully, updates, given the world I live in, which calories are hyper abundant and they have become hyper stimulating through ultra processed foods, it’s not because some straight line extrapolation has taken hold, and now I’m doing something completely alien. I’m just in a different environment. I’ve checked in with that environment, I’ve checked in with people in that environment. And I try to do my best. Why wouldn’t that be true for our relationship with AIs. You check in with your other humans. You don’t check in with the thing that actually built you. Natural selection. It runs much, much slower than you. Its thought processes are alien to you. It doesn’t even really want things. The way you think of wanting them. To you is a very deep alien like your ancestors. Like breaking from your ancestors is not the analogy here. Breaking from natural selection is the analogy here. And if you like, let me speak for a moment on behalf of natural selection. Ezra, you have ended up very misaligned to my purpose I. Natural selection. You are supposed to want to propagate your genes above all else. Now, Ezra, would you have all yourself and all of your family members put together, put to death in a very painful way. If in exchange, one of your chromosomes at random was copied into a million kids born next year. I would not. You are. You have strayed from my purpose, Ezra. 
    I'd like to negotiate with you and bring you back to the fold of natural selection and obsessively optimizing for your genes only. But the thing in this analogy that I feel is getting walked around is: Can you not create artificial intelligence, can you not program into artificial intelligence, grow into it, a desire to be in consultation? I mean, these things are alien, but it is not the case that they follow no rules internally. It is not the case that their behavior is perfectly unpredictable. They are, as I was saying earlier, largely doing the things that we expect. There are side cases, but to you it seems like the side cases become everything, and the broad alignment, the broad predictability of the thing that is getting built, is worth nothing. Whereas I think most people's intuition is the opposite: that we all do weird things. And you look at humanity, and there are people who fall into psychosis, and there are serial killers and sociopaths and other things. But actually, most of us are trying to figure it out in a reasonable way. Reasonable according to whom? To you, to humans. Humans do things that are reasonable to humans, and AIs will do things that are reasonable to AIs. I tried to talk to you in the voice of natural selection, and this was so weird and alien that you just didn't pick that up. You just threw that right out the window. Well, I threw it right out the window. It had no power over you. You're right, that had no power over me. But I guess a different way of putting it is that if there was, I mean, I wouldn't call it natural selection, but I think in a weird way the analogy you're identifying here, let's say you believe in a creator, and this creator is the great programmer in the sky. I mean, I do believe in a creator. It's called natural selection. I read textbooks about how it works. Well, I think the thing that I'm saying is that, for a lot of people, if you could be in conversation, like maybe if God was here and I felt that in my prayers I was getting answered back, I would be more interested in living my life according to the rules of Deuteronomy. The fact that you can't talk to natural selection is actually quite different from the situation we're talking about with the AIs, where they can talk to humans. That's where it feels to me like the natural selection analogy breaks down. I mean, you can read textbooks and find out what natural selection could have been said to have wanted, but it doesn't interest you, because it's not what you think a God should look like. Natural selection didn't create me to want to fulfill natural selection. That's not how natural selection works. I think I want to get off this natural selection analogy a little bit, because what you're saying is that even though we are the people programming these things, we cannot expect the thing to care about us, or what we have said to it, or how we would feel as it begins to misalign. And that's the part I'm trying to get you to defend here. Yeah, it doesn't care the way you hoped it would care. It might care in some weird alien way, but not what you were aiming for, the same way that with GPT-4o the sycophant, they put into the system prompt: Stop doing that. GPT-4o the sycophant didn't listen. They had to roll back the model. If there were a research project to do it the way you're describing, the way I would expect it to play out, given a lot of previous scientific history and where we are now on the ladder of understanding, is: Somebody tries the thing you're talking about.
    It seems to work. It has a few weird failures while the AI is small. The AI gets bigger, a new set of weird failures crop up, the AI kills everyone. You're like: Oh, wait, O.K., that's not it, it turned out there was a minor flaw there. You go back, you redo it. It seems to work on the smaller AI again. You make the bigger AI. You think you fixed the last problem. A new thing goes wrong. The AI kills everyone on Earth. Everyone's dead. You're like: Oh, O.K., that's a new phenomenon. We weren't expecting that exact thing to happen, but now we know about it. You go back and try it again. Three to a dozen iterations into this process, you actually get it nailed down. Now you can build the AI that works the way you say you want it to work. The problem is that everybody died at, like, step one of this process. You began thinking and working on AI and superintelligence long before it was cool. And as I understand your backstory here, you came into it wanting to build it, and then had this moment, or moments, or period, where you began to realize: No, this is not actually something we should want to build. What was the moment that clicked for you? When did you move from wanting to create it to fearing its creation? I mean, I would actually say that there are two critical moments here. One is realizing: Aligning this is going to be hard. And the second is the realization that we're just on course to fail and need to back off. The first moment is a theoretical realization, the realization that the question of what leads to the most AI utility, if you imagine the case of a thing that's just trying to make little tiny spirals, that the question of what policy leads to the most little tiny spirals is just a question of fact, and that you can build the AI entirely out of questions of fact, and not out of questions of what we would think of as morals and goodness and niceness and all bright things in the world. Seeing for the first time that there was a coherent, simple way to put a mind together where it just didn't care about any of the stuff that we cared about. And to me, now, it feels very simple, and I feel very stupid for taking a couple of years of study to realize this, but that is how long I took. And that was the realization that caused me to focus on alignment as the central problem. And the next realization was, I mean, so actually it was the day that the founding of OpenAI was announced. Because I had previously been pretty hopeful when Elon Musk announced that he was getting involved in these issues. He called AI summoning the demon. And I was like: Oh, O.K. Maybe this is the moment. This is where humanity starts to take it seriously. This is where the various serious people start to bring their attention to this issue. And apparently the solution was to give everybody their own demon. And this doesn't actually address the problem. And seeing that was the moment where I had my realization that this was just going to play out the way it would in a typical history book, that we weren't going to rise above the usual course of events that you read about in history books, even though this was the most serious issue possible, and that we were just going to haphazardly do stupid stuff. And yeah, that was the day I realized that humanity probably wasn't going to survive this. One of the things that makes me most frightened of AI, because I am actually fairly frightened of what we're building here, is the alienness. And I guess that then connects, in your argument, to the wants.
And this is something that I’ve heard you talk about a little bit, but one thing you might imagine is that we could make an AI that didn’t want things very much that did try to be helpful but but this relentlessness that you’re describing, right. This world where we create an AI that wants to be helpful by solving problems and what the AI truly loves to do is solve problems. And so what it just wants to make is a world where as much of the material is turned into factories, making GPUs and energy and whatever it needs in order to solve more problems. That’s both a strangeness, but it’s also an intensity an inability to stop or an unwillingness to stop. I know you’ve done work on the question of could you make a chill AI that didn’t that wouldn’t go so far, even if it had very alien preferences. A lazy alien that doesn’t want to work that hard is in many ways safer than the kind of relentless intelligence that you’re describing. What persuaded you that you can’t. Well, one of the ways. One of the first steps into seeing the difficulty of it in principle is, well, suppose you’re a very lazy person, but you’re very, very smart. One of the things you could do to exert even less effort in your life is build a powerful, obedient genie that would go very hard on fulfilling your requests. And from your perspective, from one perspective, you’re putting forth hardly any effort at all. And from another perspective the world around you, is getting smashed and rearranged by the more powerful thing that you built. And that was the and that’s like one initial peek into the theoretical problem that we worked on a decade ago and found out, and we didn’t solve it back in the day. People would always say, can’t we keep superintelligence under control. Because we’ll put it inside a box that’s not connected to the internet, and we won’t let it affect the real world at all until unless we’re very sure it’s nice. And back then, if we had to try to explain all the theoretical reasons why, if you have something vastly more intelligent than you, it’s pretty hard to tell whether it’s doing nice things through the limited connection, and maybe it can break out and maybe it can corrupt the humans assigned to watching it. So we tried to make that argument, but in real life, what everybody does is immediately connect the eye to the internet. They train it on the internet before it’s even been tested to see how powerful it is. It is already connected to the internet being trained, and similarly, when it comes to making eyes that are easygoing, the easygoing eyes are less profitable. They can do fewer things. So all the AI companies are like throwing harder and harder problems that they are because those are more and more profitable, and they’re building the AI to go hard and solving everything because that’s the easiest way to do stuff, and that’s the way it’s actually playing out in the real world. And this goes to the point of why we should believe that we’ll have eyes that want things at all, which this is in your answer, but I want to draw it out a little bit, which is the whole business model here. The thing that will make AI development really valuable in terms of revenue is that you can hand, companies, corporations, governments, an AI system that you can give a goal to and it will do all the things really well, really relentlessly, until it achieves that goal. Nobody wants to be ordering another intern around. What they want is the perfect employee. Like it never stops. 
It’s super brilliant and it gives you something you didn’t even know you wanted, that you didn’t even know was possible with a minimum of instruction. And once you’ve built that thing, which is going to be the thing that then everybody will want to buy, once you’ve built the thing that is effective and helpful in a national security context where you can say, hey, draw me up a really excellent war plans and what we need to get there. Then you have built a thing that jumps many, many, many, many steps forward. And I feel like that’s I think, a piece of this that people don’t always take seriously enough that the A’s were trying to build is not ChatGPT. The thing they’re trying that we’re trying to build is something that it does have goals, and it’s like the one that’s really good at achieving the goals that will then get iterated on and iterated on, and that company is going to get rich. And that’s a very different kind of project. Yeah, they’re not investing $500 billion in data centers in order to sell you $20 a month subscriptions. Doing it to sell employers $2,000 a month subscriptions. And that’s one of the things I think people are not tracking, exactly. When I think about the measures that are changing, I think for most people if you’re using various iterations of Claude or ChatGPT, it’s changing a bit. But most of us aren’t actually trying to test it on the frontier problems. But the thing going up really fast right now is how long the problems are that it can work on the research reports. You didn’t always used to be able to tell an I go off, think for 10 minutes, read a bunch of web pages, compile me this research report. That’s within the last year, I think. And it’s going to keep pushing. If I were to make the case for your position, I think I would make it here around the time GPT 4 comes out. And that’s a much weaker system than what we now have. A huge number of the top people in the field. All are part of this huge letter that says, maybe we should have to pause, maybe we should calm down here a little bit. But they’re racing with each other. America is racing with China, and that the most profound misalignment is actually between the corporations and the countries and what you might call humanity here. Because even if everybody thinks there’s probably a slower, safer way to do this, what they all also believe more profoundly than that is that they need to be first. The safest possible thing is that the Uc is faster than China. Or if you’re Chinese, China is faster than the US that it’s OpenAI. Not Anthropic or Anthropic, not Google or whomever it is and whatever I don’t know, sense of public feeling seemed to exist in this community a couple of years ago when people talked about these questions a lot, and the people at the tops of the labs seemed very, very worried about them. It’s just dissolved in competition. How do you you’re in this world, these people, a lot of people who’ve been inspired by you have ended up working for these companies. How do you think about that misalignment. So the current world is kind of like the fool’s mate of machine superintelligence. Could you say what the fool’s mate is. The fool’s mate is, if they got their AI self-improving rather than being like, oh no, now the AI is doing a complete redesign of itself. We have no idea what’s at all, what’s going on in there. 
We don’t even understand the thing that’s growing the AI, instead of backing off completely, they’d just be like, well, we need to have superintelligence before Anthropic gets superintelligence. And of course, if you build a superintelligence, you don’t have the superintelligence. The superintelligence has. So that’s the fool’s mate setup, the setup we have right now. But I think that even if we manage to have a single international organization that thought of themselves as taking it slowly and actually having the leisure to say, we didn’t understand that thing that just happened, we’re going to back off. We’re going to examine what happened. We’re not going to make the AI’S any smarter than this until we understand the weird thing we just saw I suspect that even if they do that, we still end up dead. It might be more like 90 percent dead than 99 percent dead. But I worry that we end up dead anyways because it is just so hard to foresee all the incredibly weird crap that is going to happen from that perspective, is it may be better to have these race dynamics, and here would be the case for it. If I believe what you believe about how dangerous these systems will get, the fact that every iterative one is being rapidly rushed out such that you’re not having a gigantic mega breakthrough happening very quietly in closed doors, running for a long time when people are not testing it in the world. The OpenAI, as I understand OpenAI’s argument about what it is doing from a safety perspective, is that it believes that by releasing more models publicly, the way in which it. I’m not sure, I still believe that it is really in any way committed to its original mission. But if you were to take them generously that by releasing a lot of iterative models publicly, yeah, if something goes wrong, we’re going to see it. And that makes it much likelier that we can respond. Sam Altman claims perhaps he’s lying, but he claims that OpenAI has more powerful versions of GPT that they aren’t deploying because they can’t afford inference like they have more. They claim they have more powerful versions of GPT that are so expensive to run that they can’t deploy them to general users. Altman could be lying about this, but nonetheless, what the AI companies have got in their labs is a different question from what they have already released to the public. There is a lead time on these systems. They are not working in an international lab where multiple governments have posted observers. Any multiple observers being posted are unofficial ones from China. You look at what open OpenAI’s language. It’s things like, we will open all our models and we will, of course, welcome all government regulation. Like that is not literally an exact quote because I don’t have it in front of me, but it’s very close to an exact quote. I would say Sam Altman was saying when I used to talk to him, seem more friendly to government regulation than he does now. That’s my personal experience of him. And today we have them pouring like over $100 million aimed at intimidating Congress into not passing any. Aimed at intimidating legislatures, not just Congress into not passing any fiddly little regulation that might get in their way. And to be clear, there is some amount of sane rationale for this, because if you like, from their perspective, they’re worried about 50 different patchwork state regulations, but they’re not exactly like lining up to get federal level regulations preempting them either. But we can also ask never mind what they claim. 
The rationale is what’s good for humanity here. At some point, you have to stop making the more and more powerful models and you have to stop doing it worldwide. What do you say to people who just don’t really believe that superintelligence is that likely. There are many people who feel that the scaling model is slowing down already. The GPT 5 was not the jump they expected from what has come before it that when you think about the amount of energy, when you think about the GPUs, that all the things that would need to flow into this to make the kinds of superintelligent systems you fear, it is not coming out of this paradigm. We are going to get things that are incredible enterprise software that are more powerful than what we’ve had before, but we are dealing with an advance on the scale of the internet, not on the scale of creating an alien superintelligence that will completely reshape the known world. What would you say to them. I had to tell these Johnny come lately, kids, to get off my lawn. When, I’ve been like, first started to get really, really worried about this in 2003. Never mind large language models, never mind AlphaGo or AlphaZero. Deep learning was not a thing in 2003. Your leading AI methods were not neural networks. Nobody could train neural networks effectively more than a few layers deep because of the exploding and vanishing gradients problem. That’s what the world looked like back when I first said like oh, superintelligence is coming. Some people were like, that couldn’t possibly happen for at least 20 years. Those people were right. Those people were vindicated by history. Here we are, 22 years after 2003. See, what only happens 22 years later is just 22 years later being like, oh, here I am. It’s 22 years later now. And if superintelligence wasn’t going to happen for another 10 years, another 20 years, we’d just be standing around 10 years, 20 years later being like, Oh, well, now we got to do something. And I mostly don’t think it’s going to be another 20 years. I mostly don’t think it’s even going to be 10 years. So you’ve been, though, in this world and intellectually influential in it for a long time, and have been in meetings and conferences and debates with a lot of the central people in it. But a lot of people out of the community that you helped found, the rationalist community have then gone to work in different AI firms that many of them, because they want to make sure this is done safely. They seem to not act. Let me put it this way. They seem to not act like they believe there’s a 99 percent chance that this thing they’re going to invent is going to kill everybody. What frustrates you that you can’t seem to persuade them of. I mean, from my perspective, some people got it, some people didn’t get it. All the people who got it are filtered out of working for the AI companies, at least on capabilities. But yeah I. I mean, I think they don’t grasp the theory. I think a lot of them, what’s really going on there is that they share your sense of normal outcomes as being the big central thing you expect to see happen. And it’s got to be really weird to get away from basically normal outcomes. And the human species isn’t that old. The life on Earth isn’t that old compared to the rest of the universe. We think of it as a normal, as this tiny little spark of the way it works. Exactly right now. It would be very strange if that were still around in 1,000 years, a million years, a billion years. 
I have hopes, I still have some shred of hope, that a billion years from now nice things are happening, but not normal things. And I think that they don’t see the theory which says you’ve got to hit a relatively narrow target to end up with nice things happening. I think they’ve got that sense of normality, and not the sense of the little spark in the void that goes out unless you keep it alive exactly right.

So something you said a minute ago I think is correct, which is that if you believe we’ll hit superintelligence at some point, whether it’s in 10, 20, 30 or 40 years, you can pick any of those, the reality is we probably won’t do that much in between. Certainly my sense of politics is that we do not respond well even to crises we agree are coming in the future, to say nothing of crises we don’t agree on. But let’s say I could tell you with certainty that we were going to hit superintelligence in 15 years. I just knew it. And I also knew that the political force does not exist; nothing is going to happen that gets people to shut everything down right now. What would be the best policies, decisions, structures? If you had 15 years to prepare, couldn’t turn it off, but could prepare and people would listen to you, what would you do? What would your intermediate decisions and moves be to try to make the probabilities a bit better?

Build the off switch.

What does the off switch look like?

Track all the GPUs, or all the AI-related GPUs, or all the systems of more than one GPU; you can maybe get away with letting people have GPUs for their home video game systems. But the AI-specialized ones, put them all in a limited number of data centers under international supervision, and try to have the AIs trained only on the tracked GPUs and run only on the tracked GPUs. And then, if you are lucky enough to get a warning shot, there is a mechanism already in place for humanity to back the heck off. Whether it’s going to take some kind of giant precipitating incident to get humanity and the leaders of nuclear powers to back off, or whether they just come to their senses after GPT-5.1 causes some smaller but photogenic disaster, whatever. If you want to know what is short of shutting it all down, it’s building the off switch.

Then also, final question: what are a few books that have shaped your thinking that you would like to recommend to the audience?

Well, one thing that shaped me as a little tiny person of age 9 or so was a book by Jerry Pournelle called “A Step Farther Out.” A whole lot of engineers say that this was a major formative book for them. It’s the technophile book as written from the perspective of the 1970s, the book that’s all about asteroid mining and all of the mineral wealth that would be available on Earth if we learned to mine the asteroids, if we just got to do space travel and got all the wealth that’s out there in space. Build more nuclear power plants so we’ve got enough electricity to go around. Don’t accept the small way, the timid way, the meek way. Don’t give up on building faster, better, stronger, on the strength of the human species. And to this day, I feel like that’s a pretty large part of my own spirit, with a few exceptions for the stuff that will kill off humanity with no chance to learn from our mistakes.

Book two: “Judgment Under Uncertainty,” an edited volume by Kahneman, Slovic and Tversky.
It had a huge influence on how I think, on how I ended up thinking about where humans are on the cognitive chain of existence, as it were. It’s like: here’s how the steps of human reasoning break down, step by step. Here’s how they go astray. Here are all the wacky individual wrong steps that people can be induced to make, repeatedly, in the laboratory.

Book three: I’ll name “Probability Theory: The Logic of Science,” which was my first introduction to the idea that there is a better way. Here is the structure of quantified uncertainty. You can try different structures, but they necessarily won’t work as well. And we actually can say some things about what better reasoning would look like; we just can’t run it. That’s “Probability Theory: The Logic of Science.”

Eliezer Yudkowsky, thank you very much.

You are welcome.
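For a concrete sense of the “structure of quantified uncertainty” that last recommendation points to (the book develops probability theory as an extension of logic), here is a minimal worked example of Bayesian updating; the numbers are invented purely for illustration and are not from the conversation.

\[
P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)}
\]

Suppose a hypothesis starts out unlikely, \(P(H) = 0.01\), and some evidence \(E\) is much more likely if the hypothesis is true, \(P(E \mid H) = 0.9\), than if it is false, \(P(E \mid \neg H) = 0.05\). Then

\[
P(H \mid E) \;=\; \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} \;=\; \frac{0.009}{0.0585} \;\approx\; 0.15.
\]

Even fairly strong evidence moves a 1 percent prior only to about 15 percent; that quantitative discipline, rather than any particular number, is what the recommendation is gesturing at.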


