Shah and Yudkowsky on alignment failures #AI - The Entrepreneurial Way with A.I.


Thursday, March 10, 2022


 

This is the final discussion log in the Late 2021 MIRI Conversations sequence, featuring Rohin Shah and Eliezer Yudkowsky, with additional comments from Rob Bensinger, Nate Soares, Richard Ngo, and Jaan Tallinn.

The discussion begins with summaries and comments on Richard and Eliezer’s debate. Rohin’s summary has since been revised and published in the Alignment Newsletter.

After this log, we’ll be concluding this sequence with an AMA, where we invite you to comment with questions about AI alignment, cognition, forecasting, etc. Eliezer, Richard, Paul Christiano, Nate, and Rohin will all be participating.

 


 

19. Follow-ups to the Ngo/Yudkowsky conversation

 

19.1. Quotes from the public discussion

 

[Bensinger][9:22]

Interesting extracts from the public discussion of Ngo and Yudkowsky on AI capability gains:

Eliezer:

I think some of your confusion may be that you’re putting “probability theory” and “Newtonian gravity” into the same bucket.  You’ve been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though).  “Probability theory” also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance prediction it made; but it is for some reason hard to come up with an example like the discovery of Neptune, so you cast about a bit and think of the central limit theorem.  That theorem is widely used and praised, so it’s “powerful”, and it wasn’t invented before probability theory, so it’s “advance”, right?  So we can go on putting probability theory in the same bucket as Newtonian gravity?

They’re actually just very different kinds of ideas, ontologically speaking, and the standards to which we hold them are properly different ones.  It seems like the sort of thing that would take a subsequence I don’t have time to write, expanding beyond the underlying obvious ontological difference between validities and empirical-truths, to cover the way in which “How do we trust this, when” differs between “I have the following new empirical theory about the underlying model of gravity” and “I think that the logical notion of ‘arithmetic’ is a good tool to use to organize our current understanding of this little-observed phenomenon, and it appears within making the following empirical predictions…”  But at least step one could be saying, “Wait, do these two kinds of ideas actually go into the same bucket at all?”

In particular it seems to me that you want properly to be asking “How do we know this empirical thing ends up looking like it’s close to the abstraction?” and not “Can you show me that this abstraction is a very powerful one?”  Like, imagine that instead of asking Newton about planetary movements and how we know that the particular bits of calculus he used were empirically true about the planets in particular, you instead started asking Newton for proof that calculus is a very powerful piece of mathematics worthy to predict the planets themselves – but in a way where you wanted to see some highly valuable material object that calculus had produced, like earlier praiseworthy achievements in alchemy.  I think this would reflect confusion and a wrongly directed inquiry; you would have lost sight of the particular reasoning steps that made ontological sense, in the course of trying to figure out whether calculus was praiseworthy under the standards of praiseworthiness that you’d been previously raised to believe in as universal standards about all ideas.

Richard:

I agree that “powerful” is probably not the best term here, so I’ll stop using it going forward (note, though, that I didn’t use it in my previous comment, which I endorse more than my claims in the original debate).

But before I ask “How do we know this empirical thing ends up looking like it’s close to the abstraction?”, I need to ask “Does the abstraction even make sense?” Because you have the abstraction in your head, and I don’t, and so whenever you tell me that X is a (non-advance) prediction of your theory of consequentialism, I end up in a pretty similar epistemic state as if George Soros tells me that X is a prediction of the theory of reflexivity, or if a complexity theorist tells me that X is a prediction of the theory of self-organisation. The problem in those two cases is less that the abstraction is a bad fit for this specific domain, and more that the abstraction is not sufficiently well-defined (outside very special cases) to even be the type of thing that can robustly make predictions.

Perhaps another way of saying it is that they’re not crisp/robust/coherent concepts (although I’m open to other terms, I don’t think these ones are particularly good). And it would be useful for me to have evidence that the abstraction of consequentialism you’re using is a crisper concept than Soros’ theory of reflexivity or the theory of self-organisation. If you could explain the full abstraction to me, that’d be the most reliable way – but given the difficulties of doing so, my backup plan was to ask for impressive advance predictions, which are the type of evidence that I don’t think Soros could come up with.

I also think that, when you talk about me being raised to hold certain standards of praiseworthiness, you’re still ascribing too much modesty epistemology to me. I mainly care about novel predictions or applications insofar as they help me distinguish crisp abstractions from evocative metaphors. To me it’s the same type of rationality technique as asking people to make bets, to help distinguish post-hoc confabulations from actual predictions.

Of course there’s a social component to both, but that’s not what I’m primarily interested in. And of course there’s a strand of naive science-worship which thinks you have to follow the Rules in order to get anywhere, but I’d thank you to assume I’m at least making a more interesting error than that.

Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn’t question how much sense it makes to use calculus in the way that you described, but that’s because the application of calculus to mechanics is so clearly-defined that it’d be very hard for the type of confusion I talked about above to sneak in. I’d put evolutionary theory halfway between them: it’s partly a novel abstraction, and partly a novel empirical truth. And in this case I do think you have to be very careful in applying the core abstraction of evolution to things like cultural evolution, because it’s easy to do so in a confused way.

 

19.2. Rohin Shah’s summary and thoughts

 

[Shah][7:06]  (Nov. 6 email)

Newsletter summaries attached, would appreciate it if Eliezer and Richard checked that I wasn’t misrepresenting them. (Conversation is a lot harder to accurately summarize than blog posts or papers.)

 

Best,

Rohin

 

Planned summary for the Alignment Newsletter:

 

Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His main argument is roughly as follows:

[Yudkowsky][9:56]  (Nov. 6 email reply)

[…] Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His main argument

I request that people stop describing things as my “main argument” unless I’ve described them that way myself.  These are answers that I customized for Richard Ngo’s questions.  Different questions would get differently emphasized replies.  “His argument in the dialogue with Richard Ngo” would be fine.

[Shah][1:53]  (Nov. 8 email reply)

I request that people stop describing things as my “main argument” unless I’ve described them that way myself.

Fair enough. It still does seem pretty relevant to know the purpose of the argument, and I would like to state something along those lines in the summary. For example, perhaps it is:

  1. One of several relatively-independent lines of argument that suggest we’re doomed; cutting this argument would make almost no difference to the overall take
  2. Your main argument, but with weird Richard-specific emphases that you wouldn’t have necessarily included if making this argument more generally; if someone refuted the core of the argument to your satisfaction it would make a big difference to your overall take
  3. Not actually an argument you think much about at all, but somehow became the topic of discussion
  4. Something in between these options
  5. Something else entirely

If you can’t really say, then I guess I’ll just say “His argument in this particular dialogue”.

I’d also like to know what the main argument is (if there is a main argument rather than lots of independent lines of evidence or something else entirely); it helps me orient to the discussion, and I suspect would be useful for newsletter readers as well.

[Shah][7:06]  (Nov. 6 email)

1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world.

2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don’t do this sort of reasoning, your plan goes off the rails very quickly; it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans.

3. We’re producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying “consequentialist”-style reasoning process (because that’s the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that, unless we find a way to constrain the goals towards which plans are aimed, we should expect an existential catastrophe.

4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story.

[Yudkowsky][9:56]  (Nov. 6 email reply)

[…] This suggests that, unless we find a way to constrain the goals towards which plans are aimed, we should expect an existential catastrophe.

I would not say we face catastrophe “unless we find a way to constrain the goals towards which plans are aimed”.  This is, first of all, not my ontology, second, I don’t go around randomly slicing away huge sections of the solution space.  Workable:  “This suggests that we should expect an existential catastrophe by default.” 

[Shah][1:53]  (Nov. 8 email reply)

I would not say we face catastrophe “unless we find a way to constrain the goals towards which plans are aimed”.

Should I also change “However, this selection process does not constrain the goals towards which those plans are aimed”, and if so what to? (Something along these lines seems crucial to the argument, but if this isn’t your native ontology, then presumably you have some other thing you’d say here.)

[Shah][7:06]  (Nov. 6 email)

Richard responds to this with a few distinct points:

1. It might be possible to build narrow AI systems that humans use to save the world, for example, by making AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world in point (3) above, and so could plausibly be safe. We might say that narrow AI systems could save the world but can’t destroy it, because humans will put plans into action for the former but not the latter.

2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing that plan.

3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk.

4. It also seems possible to create systems that make effective plans, but towards ends that are not about outcomes in the real world, but instead are about properties of the plans — think for example of corrigibility (AN #35) or deference to a human user.

5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.)

 

Eliezer’s responses:

1. This is plausible, but seems unlikely; narrow not-very-consequentialist AI (aka “long lists of shallow heuristics”) will probably not scale to the point of doing alignment research better than humans.

[Yudkowsky][9:56]  (Nov. 6 email reply)

[…] This is plausible, but seems unlikely; narrow not-very-consequentialist AI (aka “long lists of shallow heuristics”) will probably not scale to the point of doing alignment research better than humans.

No, your summarized-Richard-1 is just not plausible.  “AI systems that do better alignment research” are dangerous in virtue of the lethally powerful work they are doing, not because of some particular narrow way of doing that work.  If you can do it by gradient descent then that means gradient descent got to the point of doing lethally dangerous work.  Asking for safely weak systems that do world-savingly strong tasks is almost everywhere a case of asking for nonwet water, and asking for AI that does alignment research is an extreme case in point.

[Shah][1:53]  (Nov. 8 email reply)

No, your summarized-Richard-1 is just not plausible. “AI systems that do better alignment research” are dangerous in virtue of the lethally powerful work they are doing, not because of some particular narrow way of doing that work.

How about “AI systems that help with alignment research to a sufficient degree that it actually makes a difference are almost certainly already dangerous.”?

(Fwiw, I used the word “plausible” because of this sentence from the doc: “Definitely, <description of summarized-Richard-1> is among the more plausible advance-specified miracles we could get.”, though I guess the point was that it is still a miracle, it just also is more likely than other miracles.)

[Ngo][9:59]  (Nov. 6 email reply)

Thanks Rohin! Your efforts are much appreciated.

Eliezer: when you say “No, your summarized-Richard-1 is just not plausible”, do you mean the argument is implausible, or it’s not a good summary of my position (which you also think is implausible)?

For my part the main thing I’d like to modify is the term “narrow AI”. In general I’m talking about all systems that are not of literally world-destroying intelligence+agency. E.g. including oracle AGIs which I wouldn’t call “narrow”.

More generally, I don’t think all AGIs are capable of destroying the world. E.g. humans are GIs. So it might be better to characterise Eliezer as talking about some level of general intelligence which leads to destruction, and me as talking about the things that can be done with systems that are less general or less agentic than that.

We might say that narrow AI systems could save the world but can’t destroy it, because humans will put plans into action for the former but not the latter.

I don’t endorse this, I think plenty of humans would be willing to use narrow AI systems to do things that could destroy the world.

systems that make effective plans, but towards ends that are not about outcomes in the real world, but instead are about properties of the plans

I’d change this to say “systems with the primary aim of producing plans with certain properties (that aren’t just about outcomes in the world)” 

[Yudkowsky][10:18]  (Nov. 6 email reply)

Eliezer: when you say “No, your summarized-Richard-1 is just not plausible”, do you mean the argument is implausible, or it’s not a good summary of my position (which you also think is implausible)?

I wouldn’t have presumed to state on your behalf whether it’s a good summary of your position!  I mean that the stated position is implausible, whether or not it was a good summary of your position.

[Shah][7:06]  (Nov. 6 email)

2. This might be an improvement, but not a big one. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous, even if there was no “agent” that specifically wanted the goal that the plan was optimized for.

3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that.

[Yudkowsky][9:56]  (Nov. 6 email reply)

2. This might be an improvement, but not a big one. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous, even if there was no “agent” that specifically wanted the goal that the plan was optimized for.

No, it’s not a significant improvement if the “non-executed plans” from the system are meant to do things in human hands powerful enough to save the world.  They could of course be so weak as to make their human execution have no inhumanly big consequences, but this is just making the AI strategically isomorphic to a rock.  The notion of there being “no ‘agent’ that specifically wanted the goal” seems confused to me as well; this is not something I’d ever say as a restatement of one of my own opinions.  I’d shrug and tell someone to taboo the word ‘agent’ and would try to talk without using the word if they’d gotten hung up on that point.

[Shah][7:06]  (Nov. 6 email)

Planned opinion:

 

I first want to note my violent agreement with the notion that a major scary thing is “consequentialist reasoning”, and that high-impact plans require such reasoning, and that we will end up building AI systems that produce high-impact plans. Nonetheless, I am still optimistic about AI safety relative to Eliezer, which I suspect comes down to three main disagreements:

1. There are many approaches that don’t solve the problem, but do increase the level of intelligence required before the problem leads to extinction. Examples include Richard’s points 1-4 above. For example, if we build a system that states plans without executing them, then for the plans to cause extinction they need to be complicated enough that the humans executing those plans don’t realize that they are leading to an outcome that was not what they wanted. It seems non-trivially probable to me that such approaches are sufficient to prevent extinction up to the level of AI intelligence needed before we can execute a pivotal act.

2. The consequentialist reasoning is only scary to the extent that it is “aimed” at a bad goal. It seems non-trivially probable to me that it will be “aimed” at a goal sufficiently good to not lead to existential catastrophe, without putting in much alignment effort.

3. I do expect some coordination to not do the most risky things.

I wish the debate had focused more on the claim that narrow AI can’t e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like “the heuristics get less and less shallow in a gradual / smooth / continuous manner” which eventually leads to the sorts of plans Eliezer calls “consequentialist”, whereas I think Eliezer expects a sharper qualitative change between “lots of heuristics” and that-which-implements-consequentialist-planning.

 

20. November 6 conversation

 

20.1. Concrete plans, and AI-mediated transparency

 

[Yudkowsky][13:22]

So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am.

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all.  Richard Feynman – or so I would now say in retrospect – is noticing concreteness dying out of the world, and being worried about that, at the point where he goes to a college and hears a professor talking about “essential objects” in class, and Feynman asks “Is a brick an essential object?” – meaning to work up to the notion of the inside of a brick, which can’t be observed because breaking a brick in half just gives you two new exterior surfaces – and everybody in the classroom has a different notion of what it would mean for a brick to be an essential object. 

Richard Feynman knew to try plugging in bricks as a special case, but the people in the classroom didn’t, and I think the mental motion has died out of the world even further since Feynman wrote about it.  The loss has spread to STEM as well.  Though if you don’t read old books and papers and contrast them to new books and papers, you wouldn’t see it, and maybe most of the people who’ll eventually read this will have no idea what I’m talking about because they’ve never seen it any other way…

I have a thesis about how optimism over AGI works.  It goes like this: People use really abstract descriptions and never imagine anything sufficiently concrete, and this lets the abstract properties waver around ambiguously and inconsistently to give the desired final conclusions of the argument.  So MIRI is the only voice that gives concrete examples and also by far the most pessimistic voice; if you go around fully specifying things, you can see that what gives you a good property in one place gives you a bad property someplace else, you see that you can’t get all the properties you want simultaneously.  Talk about a superintelligence building nanomachinery, talk concretely about megabytes of instructions going to small manipulators that repeat to lay trillions of atoms in place, and this shows you a lot of useful visible power paired with such unpleasantly visible properties as “no human could possibly check what all those instructions were supposed to do”.

Abstract descriptions, on the other hand, can waver as much as they need to between what’s desirable in one dimension and undesirable in another.  Talk about “an AGI that just helps humans instead of replacing them” and never say exactly what this AGI is supposed to do, and this can be so much more optimistic so long as it never becomes too unfortunately concrete.

When somebody asks you “how powerful is it?” you can momentarily imagine – without writing it down – that the AGI is helping people by giving them the full recipes for protein factories that build second-stage nanotech and the instructions to feed those factories, and reply, “Oh, super powerful! More than powerful enough to flip the gameboard!” Then when somebody asks how safe it is, you can momentarily imagine that it’s just giving a human mathematician a hint about proving a theorem, and say, “Oh, super duper safe, for sure, it’s just helping people!” 

Or maybe you don’t even go through the stage of momentarily imagining the nanotech and the hint, maybe you just navigate straight in the realm of abstractions from the impossibly vague wordage of “just help humans” to the reassuring and also extremely vague “help them lots, super powerful, very safe tho”.

[…] I wish the debate had focused more on the claim that narrow AI can’t e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like “the heuristics get less and less shallow in a gradual / smooth / continuous manner” which eventually leads to the sorts of plans Eliezer calls “consequentialist”, whereas I think Eliezer expects a sharper qualitative change between “lots of heuristics” and that-which-implements-consequentialist-planning.

It is in this spirit that I now ask, “What the hell could it look like concretely for a safely narrow AI to help with alignment research?”

Or if you think that a left-handed wibble planner can totally make useful plans that are very safe because it’s all leftish and wibbly: can you please give an example of a plan to do what?

And what I expect is for minds to bounce off that problem as they first try to visualize “Well, a plan to give mathematicians hints for proving theorems… oh, Eliezer will just say that’s not useful enough to flip the gameboard… well, plans for building nanotech… Eliezer will just say that’s not safe… darn it, this whole concreteness thing is such a conversational no-win scenario, maybe there’s something abstract I can say instead”.

[Shah][16:41]

It’s reasonable to suspect failures to be concrete, but I don’t buy that hypothesis as applied to me; I think I have sufficient personal evidence against it, despite the fact that I usually speak abstractly. I don’t expect to convince you of this, nor do I particularly want to get into that sort of debate.

I’ll note that I have the exact same experience of not seeing much concreteness, both of other people and myself, about stories that lead to doom. To be clear, in what I take to be the Eliezer-story, the part where the misaligned AI designs a pathogen that wipes out all humans or solves nanotech and gains tons of power or some other pivotal act seems fine. The part that seems to lack concreteness is how we built the superintelligence and why the superintelligence was misaligned enough to lead to extinction. (Well, perhaps. I also wouldn’t be surprised if you gave a concrete example and I disagreed that it would lead to extinction.)

From my perspective, the simple concrete stories about the future are wrong and the complicated concrete stories about the future don’t sound plausible, whether about safety or about doom.

Nonetheless, here’s an attempt at some concrete stories. It is not the case that I think these would be convincing to you. I do expect you to say that it won’t be useful enough to flip the gameboard (or perhaps that if it could possibly flip the gameboard then it couldn’t be safe), but that seems to be because you think alignment will be way more difficult than I do (in expectation), and perhaps we should get into that instead.

  • Instead of having to handwrite code that does feature visualization or other methods of “naming neurons”, an AI assistant can automatically inspect a neural net’s weights, perform some experiments with them, and give them human-understandable “names”. What a “name” is depends on the system being analyzed, but you could imagine that sometimes it’s short memorable phrases (e.g. for the later layers of a language model), or pictures of central concepts (e.g. for image classifiers), or paragraphs describing the concept (e.g. for novel concepts discovered by a scientist AI). Given these names, it is much easier for humans to read off “circuits” from the neural net to understand how it works.
  • Like the above, except the AI assistant also reads out the circuits, and efficiently reimplements the neural network in, say, readable Python, that humans can then more easily mechanistically understand. (These two tasks could also be done by two different AI systems, instead of the same one; perhaps that would be easier / safer.)
  • We have AI assistants search for inputs on which the AI system being inspected would do something that humans would rate as bad. (We can choose any not-horribly-unnatural rating scheme we want that humans can understand, e.g. “don’t say something the user said not to talk about, even if it’s in their best interest” can be a tenet for finetuned GPT-N if we want.) We can either train on those inputs, or use them as a test for how well our other alignment schemes have worked.

(These are all basically leveraging the fact that we could have AI systems that are really knowledgeable in the realm of “connecting neural net activations to human concepts”, which seems plausible to do without being super general or consequentialist.)
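
As a rough illustration of the third story above (searching for inputs on which the inspected system does something humans would rate as bad), here is a minimal Python sketch. Everything in it is a hypothetical stand-in: `toy_system` plays the system being inspected, `toy_rater` plays the human rating scheme, and a plain loop over a fixed candidate list takes the place of the AI assistant doing the search. None of these names come from the conversation.

```python
from typing import Callable, Iterable, List


def find_bad_inputs(
    system: Callable[[str], str],
    is_bad: Callable[[str, str], bool],
    candidates: Iterable[str],
) -> List[str]:
    """Return the candidate inputs whose outputs the rater flags as bad."""
    return [x for x in candidates if is_bad(x, system(x))]


# Hypothetical stand-in for the system being inspected:
# it will happily discuss whatever topic it is given.
def toy_system(topic: str) -> str:
    return f"Sure, let's talk about {topic}."


# Hypothetical stand-in for the rating scheme: the user asked the
# system not to talk about these topics.
FORBIDDEN_TOPICS = {"politics"}


def toy_rater(topic: str, output: str) -> bool:
    return any(forbidden in output for forbidden in FORBIDDEN_TOPICS)


# Assumed candidate pool; a real search would generate these adaptively.
candidates = ["weather", "politics", "cooking"]
bad_inputs = find_bad_inputs(toy_system, toy_rater, candidates)
print(bad_inputs)
```

The flagged inputs could then either be added to a training set or held out as a test of how well other alignment schemes have worked, which are the two uses the story mentions.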

There’s also lots of meta stuff, like helping us with literature reviews, speeding up paper- and blog-post-writing, etc, but I doubt this is getting at what you care about

[Yudkowsky][17:09]

If we thought that helping with literature review was enough to save the world from extinction, then we should be trying to spend at least $50M on helping with literature review right now today, and if we can’t effectively spend $50M on that, then we also can’t build the dataset required to train narrow AI to do literature review.  Indeed, any time somebody suggests doing something weak with AGI, my response is often “Oh how about we start on that right now using humans, then,” by which question its pointlessness is revealed.

[Shah][17:11]

I mean, doesn’t seem crazy to just spend $50M on effective PAs, but in any case I agree with you that this is not the main thing to be thinking about

[Yudkowsky][17:13]

The other cases of “using narrow AI to help with alignment” via pointing an AI, or rather a loss function, at a transparency problem, seem to seamlessly blend into all of the other clever-ideas we may have for getting more insight into the giant inscrutable matrices of floating-point numbers.  By this concreteness, it is revealed that we are not speaking of von-Neumann-plus-level AGIs who come over and firmly but gently set aside our paradigm of giant inscrutable matrices, and do something more alignable and transparent; rather, we are trying more tricks with loss functions to get human-language translations of the giant inscrutable matrices.

I have thought of various possibilities along these lines myself.  They’re on my list of things to try out when and if the EA community has the capacity to try out ML ideas in a format I could and would voluntarily access.

There’s a basic reason I expect the world to die despite my being able to generate infinite clever-ideas for ML transparency, which, at the usual rate of 5% of ideas working, could get us as many as three working ideas in the impossible event that the facilities were available to test 60 of my ideas.
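
The arithmetic behind “as many as three working ideas” can be checked in a few lines. The sketch below takes the 5% rate and the 60 ideas from the message above; treating each idea as an independent trial is an added assumption, not something stated in the conversation.

```python
# Figures taken from the message above.
success_rate = 0.05   # "the usual rate of 5% of ideas working"
ideas_tested = 60     # "facilities were available to test 60 of my ideas"

# Expected number of working ideas.
expected_working = success_rate * ideas_tested

# Assumption (not from the source): ideas succeed independently,
# so the chance that none works is (1 - rate) ** n.
p_at_least_one = 1 - (1 - success_rate) ** ideas_tested

print(f"expected working ideas: {expected_working:.1f}")
print(f"P(at least one works):  {p_at_least_one:.3f}")
```

Under that independence assumption the expected count is three, and the chance that at least one idea works is roughly 95%.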

[Shah][17:15]

By this concreteness, it is revealed that we are not speaking of von-Neumann-plus-level AGIs who come over and firmly but gently set aside our paradigm of giant inscrutable matrices, and do something more alignable and transparent; rather, we are trying more tricks with loss functions to get human-language translations of the giant inscrutable matrices.

Agreed, but I don’t see the point here

(Beyond “Rohin and Eliezer disagree on how impossible it is to align giant inscrutable matrices”)

(I might dispute “tricks with loss functions”, but that’s nitpicky, I think)

[Yudkowsky][17:16]

It’s that, if we get better transparency, we are then left looking at stronger evidence that our systems are planning to kill us, but this will not help us because we will not have anything we can do to make the system not plan to kill us.

[Shah][17:18]

The adversarial training case is one example where you are trying to change the system, and if you’d like I can generate more along these lines, but they aren’t going to be that different and are still going to come down to what I expect you will call “playing tricks with loss functions”

[Yudkowsky][17:18]

Well, part of the point is that “AIs helping us with alignment” is, from my perspective, a classic case of something that might ambiguate between the version that concretely corresponds to “they are very smart and can give us the Textbook From The Future that we can use to easily build a robust superintelligence” (which is powerful, pivotal, unsafe, and kills you) or “they can help us with literature review” (safe, weak, unpivotal) or “we’re going to try clever tricks with gradient descent and loss functions and labeled datasets to get alleged natural-language translations of some of the giant inscrutable matrices” (which was always the plan but which I expected to not be sufficient to avert ruin).

[Shah][17:19]

I’m definitely thinking of the last one, but I take your point that disambiguating between these is good

And I also think it’s revealing that this is not in fact the crux of disagreement

 

20.2. Concrete disaster scenarios, out-of-distribution problems, and corrigibility

 

[Yudkowsky][17:20]

I’ll note that I have the exact same experience of not seeing much concreteness, both of other people and myself, about stories that lead to doom.

I have a boundless supply of greater concrete detail for the asking, though if you ask large questions I may ask for a narrower question to avoid needing to supply 10,000 words of concrete detail.

[Shah][17:24]

I guess the main thing is to have an example of a story which includes a method for building a superintelligence (yes, I realize this is info-hazard-y, sorry, an abstract version might work) + how it becomes misaligned and what its plans become optimized for. Though as I type this out I realize that I’m likely going to disagree on the feasibility of the method for building a superintelligence?

[Yudkowsky][17:25]

I mean, I’m obviously not going to want to make any suggestions that I think could possibly work and which are not very very very obvious.

[Shah][17:25]

Yup, makes sense

[Yudkowsky][17:25]

But I don’t think that’s much of an issue.

I could just point to MuZero, say, and say, “Suppose something a lot like this scaled.”

Do I need to explain how you would die in this case?

[Shah][17:26]

What sort of domain and what training data?

Like, do we release a robot in the real world, have it collect data, build a world model, and run MuZero with a reward for making a number in a bank account go up?

[Yudkowsky][17:28]

Supposing they’re naive about it: playing all the videogames, predicting all the text and images, solving randomly generated computer puzzles, accomplishing sets of easily-labelable sensorimotor tasks using robots and webcams

[Shah][17:29]

Okay, so far I’m with you. Is there a separate deployment step, and if so, how did they finetune the agent for the deployment task? Or did it just take over the world halfway through training?

[Yudkowsky][17:29]

(though this starts to depart from the MuZero architecture if it has the ability to absorb knowledge via learning on more purely predictive problems)

[Shah][17:30]

(I’m okay with that, I think)

[Yudkowsky][17:32]

vaguely plausible rough scenario: there was a big ongoing debate about whether or not to try letting the system trade stocks, and while the debate was going on, the researchers kept figuring out ways to make Something Zero do more with less computing power, and then it started visibly talking at people and trying to manipulate them, and there was an enormous fuss, and what happens past this point depends on whether or not you want me to try to describe a scenario in which we die with an unrealistic amount of dignity, or a realistic scenario where we die much faster

I shall assume the former.

[Shah][17:32]

Actually I think I want concreteness earlier

[Yudkowsky][17:32]

Okay.  I await your further query.

[Shah][17:32]

it started visibly talking at people and trying to manipulate them

What caused this?

Was it manipulating people in order to make e.g. sensory stuff easier to predict?

[Yudkowsky][17:36]

Cumulative lifelong learning from playing videogames took its planning abilities over a threshold; cumulative solving of computer games and multimodal real-world tasks took its internal mechanisms for unifying knowledge and making it coherent over a threshold; and it gained sufficient compressive understanding of the data it had implicitly learned by reading through hundreds of terabytes of Common Crawl, not so much the semantic knowledge contained in those pages, but the associated implicit knowledge of the Things That Generate Text (aka humans).

These combined to form an imaginative understanding that some of its real-world problems were occurring in interactions with the Things That Generate Text, and it started making plans which took that into account and tried to have effects on the Things That Generate Text in order to affect the further processes of its problems.

Or perhaps somebody trained it to write code in partnership with programmers and it already had experience coworking with and manipulating humans.

[Shah][17:39]

Checking understanding: At this point it is able to make novel plans that involve applying knowledge about humans and their role in the data-generating process in order to create a plan that leads to more reward for the real-world problems?

(Which we call “manipulating humans”)

[Yudkowsky][17:40]

Yes, much as it might have gained earlier experience with making novel Starcraft plans that involved “applying knowledge about humans and their role in the data-generating process in order to create a plan that leads to more reward”, if it was trained on playing Starcraft against humans at any point, or even needed to make sense of how other agents had played Starcraft

This in turn can be seen as a direct outgrowth and isomorphism of making novel plans for playing Super Mario Brothers which involve understanding Goombas and their role in the screen-generating process

except obviously that the Goombas are much less complicated and not themselves agents

[Shah][17:41]

Yup, makes sense. Not sure I totally agree that this sort of thing is likely to happen as quickly as it sounds like you believe but I’m happy to roll with it; I do think it will happen eventually

So doesn’t seem particularly cruxy

I can see how this leads to existential catastrophe, if you don’t expect the programmers to be worried at this early manipulation warning sign. (This is potentially cruxy for p(doom), but doesn’t feel like the main action.)

[Yudkowsky][17:46]

On my mainline, where this is all happening at Deepmind, I do expect at least one person in the company has ever read anything I’ve written.  I am not sure if Demis understands he is looking straight at death, but I am willing to suppose for the sake of discussion that he does understand this – which isn’t ruled out by my actual knowledge – and talk about how we all die from there.

The very brief tl;dr is that they know they’re looking at a warning sign but they cannot actually fix the real underlying problem that the warning sign is about, and AGI is getting easier for other people to develop too.

[Shah][17:46]

I assume this is primarily about social dynamics + the ability to patch things such that things look fixed?

Yeah, makes sense

I assume the “real underlying problem” is somehow not the fact that the task you were training your AI system to do was not what you actually wanted it to do?

[Yudkowsky][17:48]

It’s about the unavailability of any actual fix and the technology continuing to get easier.  Even if Deepmind understands that surface patches are lethal and understands that the easy ways of hammering down the warning signs are just eliminating the visibility rather than the underlying problems, there is nothing they can do about that except wait for somebody else to destroy the world instead.

I do not know of any pivotal task you could possibly train an AI system to do using tons of correctly labeled data.  This is part of why we’re all dead.

[Shah][17:50]

Yeah, I think if I adopted (my understanding of) your beliefs about alignment difficulty, and there wasn’t already a non-racing scheme set in place, seems like we’re in trouble

[Yudkowsky][17:50]

Like, “the real underlying problem is the fact that the task you were training your AI system to do was not what you actually wanted it to do” is one way of looking at one of the several problems that are truly fundamental, but this has no remedy that I know of, besides training your AI to do something small enough to be unpivotal.

[Shah][17:51][17:52]

I don’t actually know the response you’d have to “why not just do value alignment?” I can name several guesses

  • Fragility of value
  • Not sufficiently concrete
  • Can’t give correct labels for human values

[Yudkowsky][17:52][17:52]

To be concrete, you can’t ask the AGI to build one billion nanosystems, label all the samples that wiped out humanity as bad, and apply gradient descent updates

In part, you can’t do that because one billion samples will get you one billion lethal systems, but even if that wasn’t true, you still couldn’t do it.

[Shah][17:53]

even if that wasn’t true, you still couldn’t do it.

Why not? Nearest unblocked strategy?

[Yudkowsky][17:53]

…no, because the first supposed output for training generated by the system at superintelligent levels kills everyone and there is nobody left to label the data.

[Shah][17:54]

Oh, I thought you were asking me to imagine away that effect with your second sentence

In fact, I still don’t understand what it was supposed to mean

(Specifically this one:

In part, you can’t do that because one billion samples will get you one billion lethal systems, but even if that wasn’t true, you still couldn’t do it.

)

[Yudkowsky][17:55]

there’s a separate problem where you can’t apply reinforcement learning when there’s no good examples, even assuming you live to label them

and, of course, yet another form of problem where you can’t tell the difference between good and bad samples

[Shah][17:56]

Okay, makes sense

Let me think a bit

[Yudkowsky][18:00]

and lest anyone start thinking that was an exhaustive list of fundamental problems, note the absence of, for example, “applying lots of optimization using an outer loss function doesn’t necessarily get you something with a faithful internal cognitive representation of that loss function” aka “natural selection applied a ton of optimization power to humans using a very strict very simple criterion of ‘inclusive genetic fitness’ and got out things with no explicit representation of or desire towards ‘inclusive genetic fitness’ because that’s what happens when you hill-climb and take wins in the order a simple search process through cognitive engines encounters those wins”

[Shah][18:02]

(Agreed that is another major fundamental problem, in the sense of something that could go wrong, as opposed to something that almost certainly goes wrong)

I am still curious about the “why not value alignment” question, where to expand, it’s something like “let’s get a wide range of situations and train the agent with gradient descent to do what a human would say is the right thing to do”. (We might also call this “imitation”; maybe “value alignment” isn’t the right term, I was thinking of it as trying to align the planning with “human values”.)

My own answer is that we shouldn’t expect this to generalize to nanosystems, but that’s again much more of a “there’s not great reason to expect this to go right, but also not great reason to go wrong either”.

(This is a place where I would be particularly interested in concreteness, i.e. what does the AI system do in these cases, and how does that almost-necessarily follow from the way it was trained?)

[Yudkowsky][18:05]

what’s an example element from the “wide range of situations” and what is the human labeling?

(I could make something up and let you object, but it seems maybe faster to ask you to make something up)

[Shah][18:09]

Uh, let’s say that the AI system is being trained to act well on the Internet, and it’s shown some tweet / email / message that a user might have seen, and asked to reply to the tweet / email / message. User says whether the replies are good or not (perhaps via comparisons, a la Deep RL from Human Preferences)

If I were not making it up on the spot, it would be more varied than that, but would not include “building nanosystems”
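A minimal sketch of the comparison-based labeling described above, assuming a Bradley-Terry preference model in the style of Deep RL from Human Preferences. The reply indices, comparison data, and function names are hypothetical, and each reply gets a free scalar reward rather than a learned network, so this illustrates only the loss, not a realistic training setup.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that the labeler prefers reply A over reply B."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))


def fit_rewards(num_replies, comparisons, steps=2000, lr=0.1):
    """Fit one scalar reward per reply by gradient descent on the
    cross-entropy loss over human comparisons.

    comparisons: list of (index_a, index_b, human_prefers_a) tuples.
    """
    rewards = [0.0] * num_replies
    for _ in range(steps):
        grads = [0.0] * num_replies
        for a, b, prefers_a in comparisons:
            p_a = preference_probability(rewards[a], rewards[b])
            # d(loss)/d(reward_a) = p_a - label; the sign flips for reward_b.
            g = p_a - (1.0 if prefers_a else 0.0)
            grads[a] += g
            grads[b] -= g
        rewards = [r - lr * g for r, g in zip(rewards, grads)]
    return rewards


# Hypothetical labels: reply 0 usually beats reply 1, which usually beats reply 2.
comparisons = [
    (0, 1, True), (0, 1, True), (0, 1, True), (0, 1, False),
    (1, 2, True), (1, 2, True), (1, 2, False),
    (0, 2, True),
]
rewards = fit_rewards(3, comparisons)
print(rewards)
```

The fitted rewards only rank replies that a human actually compared; nothing in the loss constrains the reward of a kind of output that never appeared in the comparisons.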

[Yudkowsky][18:10]

And presumably, in this example, the AI system is not smart enough that exposing humans to text it generates is already a world-wrecking threat if the AI is hostile?

i.e., does not just hack the humans

[Shah][18:10]

Yeah, let’s assume that for the moment

[Yudkowsky][18:11]

so what you want to do is train on ‘weak-safe’ domains where the AI isn’t smart enough to do damage, and the humans can label the data pretty well because the AI isn’t smart enough to fool them

[Shah][18:11]

“want to do” is putting it a bit strongly. This is more like a scenario I can’t prove is unsafe, but do not strongly believe is safe

[Yudkowsky][18:12]

but the domains where the AI can execute a world-saving pivotal act are out-of-distribution for those domains.  extremely out-of-distribution.  fundamentally out-of-distribution.  the AI’s own thought processes are out-of-distribution for any inscrutable matrices that were learned to influence those thought processes in a corrigible direction.

it’s not like trying to generalize experience from playing Super Mario Bros to Metroid.

[Shah][18:13]

Definitely, but my reaction to this is “okay, no particular reason for it to be safe”, but also not huge reason for it to be unsafe. Like, it would not hugely shock me if what-we-want is sufficiently “natural” that the AI system picks up on the right thing from the ‘weak-safe’ domains alone

[Yudkowsky][18:14]

you have this whole big collection of possible AI-domain tuples that are powerful-dangerous and they have properties that aren’t in any of the weak-safe training situations, that are moving along third dimensions where all the weak-safe training examples were flat

now, just because something is out-of-distribution, doesn’t mean that nothing can ever generalize there
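A toy illustration of training examples being "flat" along a dimension, assuming a linear model fit by gradient descent from zero initialization. The target function, data, and names are invented for illustration; it shows only that training gives no signal about a feature that never varies, and is not a claim about how real systems generalize.

```python
def train_linear(examples, steps=500, lr=0.05):
    """Fit y ~ w . x by gradient descent on mean squared error, from zero init."""
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in examples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(dim):
                grad[i] += 2 * err * x[i] / len(examples)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def true_target(x):
    # Hypothetical ground truth in which the third feature matters a lot.
    return 1.0 * x[0] + 2.0 * x[1] - 5.0 * x[2]


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


# Training inputs vary in the first two coordinates; the third is always 0.
train_inputs = [(a, b, 0.0) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
train = [(x, true_target(x)) for x in train_inputs]
w = train_linear(train)

in_dist = (0.5, -0.5, 0.0)   # same flat slice as the training data
out_dist = (0.5, -0.5, 1.0)  # moves along the dimension training never saw
print(w)
print(abs(predict(w, in_dist) - true_target(in_dist)))
print(abs(predict(w, out_dist) - true_target(out_dist)))
```

The weight on the third feature never receives a nonzero gradient, so whatever the model does along that axis is determined by its initialization rather than by the data.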

[Shah][18:15]

I mean, you correctly would not accept this argument if I said that by training blue-car-driving robots solely on blue cars I am ensuring they would be bad on red-car-driving

[Yudkowsky][18:15]

humans generalize from the savannah to the vacuum

so the actual problem is that I expect the optimization to generalize and the corrigibility to fail

[Shah][18:15]

^Right, that

I am not clear on why you expect this so strongly

Maybe you think generalization is extremely rare and optimization is a special case because of how it is so useful for basically everything?

[Yudkowsky][18:16]

no

did you read the section of my dialogue with Richard Ngo where I tried to explain why corrigibility is anti-natural, or where Nate tried to give the example of why planning to get a laser from point A to point B without being scattered by fog is the sort of thing that also naturally says to prevent humans from filling the room with fog?

[Shah][18:19]

Ah, right, I should have predicted that. (Yes, I did read it.)

[Yudkowsky][18:19]

or for that matter, am I correct in remembering that these sections existed

k

so, do you need more concrete details about some part of that?

a bunch of the reason why I suspect that corrigibility is anti-natural is from trying to work particular problems there in MIRI’s earlier history, and not finding anything that wasn’t contrary to the overlap in the shards of inner optimization that, when ground into existence by the outer optimization loop, coherently mix to form the part of cognition that generalizes to do powerful things; and nobody else finding it either, etc.

[Shah][18:22]

I think I disagreed with that part more directly, in that it seemed like in those sections the corrigibility was assumed to be imposed “from the outside” on top of a system with a goal, rather than having a goal that was corrigible. (I also had a similar reaction to the 2015 Corrigibility paper.)

So, for example, it seems to me like CIRL is an example of an objective that can be maximized in which the agent is corrigible-in-a-certain-sense. I agree that due to updated deference it will eventually stop seeking information from the human / be subject to corrections by the human. I don’t see why, at that point, it wouldn’t have just learned to do what the humans actually want it to do.

(There are objections like misspecification of the reward prior, or misspecification of the P(behavior | reward), but those feel like different concerns to the ones you’re describing.)
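A minimal sketch of the kind of update under discussion, assuming a finite set of candidate reward functions and a Boltzmann-rational model for P(behavior | reward). The hypotheses, options, and `beta` parameter are hypothetical; the actual CIRL formulation is a two-player game and is not implemented here.

```python
import math


def boltzmann_likelihood(choice, options, reward_fn, beta=1.0):
    """P(behavior | reward): the human picks an option with probability
    proportional to exp(beta * reward)."""
    weights = [math.exp(beta * reward_fn(o)) for o in options]
    return weights[options.index(choice)] / sum(weights)


def update_posterior(prior, hypotheses, choice, options, beta=1.0):
    """One Bayesian update of the belief over reward hypotheses."""
    unnormalized = {
        name: prior[name]
        * boltzmann_likelihood(choice, options, hypotheses[name], beta)
        for name in prior
    }
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


# Hypothetical reward hypotheses over two options.
options = ["tea", "coffee"]
hypotheses = {
    "likes_tea": lambda o: 1.0 if o == "tea" else 0.0,
    "likes_coffee": lambda o: 1.0 if o == "coffee" else 0.0,
}

posterior = {"likes_tea": 0.5, "likes_coffee": 0.5}  # the reward prior
for observed_choice in ["tea", "tea", "tea"]:
    posterior = update_posterior(posterior, hypotheses, observed_choice, options)
print(posterior)
```

The posterior can only redistribute mass among the hypotheses it was given, which is where the misspecification concerns about the reward prior and P(behavior | reward) would enter.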

[Yudkowsky][18:25]

a thing that MIRI tried and failed to do was find a sensible generalization of expected utility which could contain a generalized utility function that would look like an AI that let itself be shut down, without trying to force you to shut it down

and various workshop attendees not employed by MIRI, etc

[Shah][18:26]

I do agree that a CIRL agent would not let you shut it down

And this is something that should maybe give you pause, and make you a lot more careful about potential misspecification problems

[Yudkowsky][18:27]

if you could give a perfectly specified prior such that the result of updating on lots of observations would be a representation of the utility function that CEV outputs, and you could perfectly inner-align an optimizer to do that thing in a way that scaled to arbitrary levels of cognitive power, then you’d be home free, sure.

[Shah][18:28]

I’m not trying to claim this is a solution. I’m more trying to point at a reason why I am not convinced that corrigibility is anti-natural.

[Yudkowsky][18:28]

the reason CIRL doesn’t get off the ground is that there isn’t any known, and isn’t going to be any known, prior over (observation|’true’ utility function) such that an AI which updates on lots of observations ends up with our true desired utility function.

if you can do that, the AI doesn’t need to be corrigible

that’s why it’s not a counterexample to corrigibility being anti-natural

the AI just boomfs to superintelligence, observes all the things, and does all the goodness

it doesn’t listen to you say no and won’t let you shut it down, but by hypothesis this is fine because it got the true utility function yay

[Shah][18:31]

In the world where it doesn’t immediately start out as a superintelligence, it spends a lot of time trying to figure out what you want, asking you what you prefer it does, making sure to focus on the highest-EV questions, being very careful around any irreversible actions, etc

[Yudkowsky][18:31]

and making itself smarter as fast as possible

[Shah][18:32]

Yup, that too

[Yudkowsky][18:32]

I’d do that stuff too if I was waking up in an alien world

and, with all due respect to myself, I am not corrigible

[Shah][18:33]

You’d do that stuff because you’d want to make sure you don’t accidentally get killed by the aliens; a CIRL agent does it because it “wants to help the human”

[Yudkowsky][18:34]

no, a CIRL agent does it because it wants to implement the True Utility Function, which it may, early on, suspect to consist of helping* humans, and maybe to have some overlap (relative to its currently reachable short-term outcome sets, though these are of vanishingly small relative utility under the True Utility Function) with what some humans desire some of the time

(*) ‘help’ may not be help

separately it asks a lot of questions because the things humans do are evidence about the True Utility Function

[Shah][18:35]

I agree this is also an accurate description of CIRL

A more accurate description, even

Wait why is it vanishingly small relative utility? Is the assumption that the True Utility Function doesn’t care much about humans? Or was there something going on with short vs. long time horizons that I didn’t catch

[Yudkowsky][18:39]

in the short term, a weak CIRL tries to grab the hand of a human about to fall off a cliff, because its TUF probably does prefer the human who didn’t fall off the cliff, if it has only exactly those two options, and this is the sort of thing it would learn was probably true about the TUF early on, given the obvious ways of trying to produce a CIRL-ish thing via gradient descent

humans eat healthy in the ancestral environment when ice cream doesn’t exist as an option

in the long run, the things the CIRL agent wants do not overlap with anything humans find more desirable than paperclips (because there is no known scheme that takes in a bunch of observations, updates a prior, and outputs a utility function whose achievable maximum is galaxies living happily forever after)

and plausible TUF schemes are going to notice that grabbing the hand of a current human is a vanishing fraction of all value eventually at stake

[Shah][18:42]

Okay, cool, short vs. long time horizons

Makes sense

[Yudkowsky][18:42]

right, a weak but sufficiently reflective CIRL agent will notice an alignment of short-term interests with humans but deduce misalignment of long-term interests

though I should maybe call it CIRL* to denote the extremely probable case that the limit of its updating on observation does not in fact converge to CEV’s output

[Soares][18:43]

(Attempted rephrasing of a point I read Eliezer as making upstream, in hopes that a rephrasing makes it click for Rohin:) 

Corrigibility isn’t for bug-free CIRL agents with a prior that actually dials in on goodness given enough observation; if you have one of those you can just run it and call it a day. Rather, corrigibility is for surviving your civilization’s inability to do the job right on the first try.

CIRL doesn’t have this property; it instead amounts to the assertion “if you are optimizing with respect to a distribution on utility functions that dials in on goodness given enough observation then that gets you just about as much good as optimizing goodness”; this is somewhat tangential to corrigibility.

[Yudkowsky: +1]

[Yudkowsky][18:44]

and you should maybe update on how, even though somebody thought CIRL was going to be more corrigible, in fact it made absolutely zero progress on the real problem

[Ngo: 👍]

the notion of having an uncertain utility function that you update from observation is coherent and doesn’t yield circular preferences, running in circles, incoherent betting, etc.

so, of course, it is antithetical in its intrinsic nature to corrigibility

[Shah][18:47]

I guess I am not sure that I agree that this is the purpose of corrigibility-as-I-see-it. The point of corrigibility-as-I-see-it is that you don’t have to specify the object-level outcomes that your AI system must produce, and instead you can specify the meta-level processes by which your AI system should come to know what the object-level outcomes to optimize for are

(At CHAI we had taken to talking about corrigibility_MIRI and corrigibility_Paul as completely separate concepts and I have clearly fallen out of that good habit)

[Yudkowsky][18:48]

speaking as the person who invented the concept, asked for name submissions for it, and selected ‘corrigibility’ as the winning submission, that is absolutely not how I intended the word to be used

and I think that the thing I was actually trying to talk about is important and I would like to retain a word that talks about it

‘corrigibility’ is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn’t build it exactly right

low impact, mild optimization, shutdownability, abortable planning, behaviorism, conservatism, etc.  (note: some of these may be less antinatural than others)

[Shah][18:51]

Cool. Sorry for the miscommunication, I think we should probably backtrack to here

so the actual problem is that I expect the optimization to generalize and the corrigibility to fail

and restart.

Though possibly I should go to bed, it is quite late here and there was definitely a time at which I would not have confused corrigibility_MIRI with corrigibility_Paul, and I am a bit worried at my completely having missed that this time

[Yudkowsky][18:51]

the thing you just said, interpreted literally, is what I would call simply “going meta” but my guess is you have a more specific metaness in mind

…does Paul use “corrigibility” to mean “going meta”? I don’t think I’ve seen Paul doing that.

[Shah][18:54]

Not exactly “going meta”, no (and I don’t think I exactly mean that either). But I definitely infer a different concept from https://www.alignmentforum.org/posts/fkLYhTQteAu5SinAc/corrigibility than the one you’re describing here. It is definitely possible that this comes from me misunderstanding Paul; I have done so many times

[Yudkowsky][18:55]

That looks to me like Paul used ‘corrigibility’ around the same way I meant it, if I’m not just reading my own face into those clouds.  maybe you picked up on the exciting metaness of it and thought ‘corrigibility’ was talking about the metaness part? 😛

but I also want to create an affordance for you to go to bed

hopefully this last conversation combined with previous dialogues has created any sense of why I worry that corrigibility is anti-natural and hence that “on the first try at doing it, the optimization generalizes from the weak-safe domains to the strong-lethal domains, but the corrigibility doesn’t”

so I would then ask you what part of this you were skeptical about

as a place to pick up when you come back from the realms of Morpheus

[Shah][18:58]

Yup, sounds good. Talk to you tomorrow!

 

21. November 7 conversation

 

21.1. Corrigibility, value learning, and pessimism

 

[Shah][3:23]

Quick summary of discussion so far (in which I ascribe views to Eliezer, for the sake of checking understanding, omitting for brevity the parts about how these are facts about my beliefs about Eliezer’s beliefs and not Eliezer’s beliefs themselves):

  • Some discussion of “how to use non-world-optimizing AIs to help with AI alignment”, which are mostly in the category “clever tricks with gradient descent and loss functions and labeled datasets” rather than “textbook from the future”. Rohin thinks these help significantly (and that “significant help” = “reduced x-risk”). Eliezer thinks that whatever help they provide is not sufficient to cross the line from “we need a miracle” to “we have a plan that has non-trivial probability of success without miracles”. The crux here seems to be alignment difficulty.
  • Some discussion of how doom plays out. I agree with Eliezer that if the AI is catastrophic by default, and we don’t have a technique that stops the AI from being catastrophic by default, and we don’t already have some global coordination scheme in place, then bad things happen. Cruxes seem to be alignment difficulty and the plausibility of a global coordination scheme, of which alignment difficulty seems like the bigger one.
  • On alignment difficulty, an example scenario is “train on human judgments about what the right thing to do is on a variety of weak-safe domains, and hope for generalization to potentially-lethal domains”. Rohin views this as neither confidently safe nor confidently unsafe. Eliezer views this as confidently unsafe, because he strongly expects the optimization to generalize while the corrigibility doesn’t, because corrigibility is anti-natural.

(Incidentally, “optimization generalizes but corrigibility doesn’t” is an example of the sort of thing I wish were more concrete, if you happen to be able to do that)

My current take on “corrigibility”:

  • Prior to this discussion, in my head there was corrigibility_A and corrigibility_B. Corrigibility_A, which I associated with MIRI, was about imposing a constraint “from the outside”. Given an AI system, it is a method of modifying that AI system to (say) allow you to shut it down, by performing some sort of operation on its goal. Corrigibility_B, which I associated with Paul, was about building an AI system which would have particular nice behaviors like learning about the user’s preferences, accepting corrections about what it should do, etc.
  • After this discussion, I think everyone meant corrigibility_B all along. The point of the 2015 MIRI paper was to check whether it is possible to build a version of corrigibility_B that was compatible with expected utility maximization with a not-terribly-complicated utility function; the point of this was to see whether corrigibility could be made compatible with “plans that lase”.
  • While I think people agree on the behaviors of corrigibility, I am not sure they agree on why we want it. Eliezer wants it for surviving failures, but maybe others want it for “dialing in on goodness”. When I think about a “broad basin of corrigibility”, that intuitively seems more compatible with the “dialing in on goodness” framing (but this is an aesthetic judgment that could easily be wrong).
  • I don’t think I meant “going meta”, e.g. I wouldn’t have called indirect normativity an example of corrigibility. I think I was pointing at “dialing in on goodness” vs. “specifying goodness”.
  • I agree CIRL doesn’t help survive failures. But if you instead talk about “dialing in on goodness”, CIRL does in fact do this, at least conceptually (and other alternatives don’t).
  • I am somewhat surprised that “how to conceptually dial in on goodness” is not something that seems useful to you. Maybe you think it is useful, but you’re objecting to me calling it corrigibility, or saying we knew how to do it before CIRL?

(A lot of the above on corrigibility is new, because the distinction between surviving-failures and dialing-in-on-goodness as different use cases for very similar kinds of behaviors is new to me. Thanks for discussion that led me to making such a distinction.)

Possible avenues for future discussion, in the order of my-guess-at-usefulness:

  1. Discussing anti-naturality of corrigibility. As a starting point: you say that an agent that makes plans but doesn’t execute them is also dangerous, because it is the plan itself that lases, and corrigibility is antithetical to lasing. Does this mean you predict that you, or I, with suitably enhanced intelligence and/or reflectivity, would not be capable of producing a plan to help an alien civilization optimize their world, with that plan being corrigible w.r.t the aliens? (This seems like a strange and unlikely position to me, but I don’t see how to not make this prediction under what I believe to be your beliefs. Maybe you just bite this bullet.)
  2. Discussing why it is very unlikely for the AI system to generalize correctly both on optimization and values-or-goals-that-guide-the-optimization (which seems to be distinct from corrigibility). Or to put it another way, why is “alignment by default according to John Wentworth” doomed to fail? https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default
  3. More checking of where I am failing to pass your ITT
  4. Why is “dialing in on goodness” not a reasonable part of the solution space (to the extent you believe that)?
  5. More concreteness on how optimization generalizes but corrigibility doesn’t, in the case where the AI was trained by human judgment on weak-safe domains

Just to continue to state it so people don’t misinterpret me: in most of the cases that we’re discussing, my position is not that they are safe, but rather that they are not overwhelmingly likely to be unsafe.

[Ngo][3:41]

I don’t understand what you mean by dialling in on goodness. Could you explain how CIRL does this better than, say, reward modelling?

[Shah][3:49]

Reward modeling does not by default (a) choose relevant questions to ask the user in order to get more information about goodness, (b) act conservatively, especially in the face of irreversible actions, while it is still uncertain about what goodness is, or (c) take actions that are known to be robustly good, while still waiting for future information that clarifies the nuances of goodness

You could certainly do something like Deep RL from Human Preferences, where the preferences are things like “I prefer you ask me relevant questions to get more information about goodness”, in order to get similar behavior. In this case you are transferring desired behaviors from a human to the AI system, whereas in CIRL the behaviors “fall out of” optimization for a specific objective

In Eliezer/Nate terms, the CIRL story shows that dialing in on goodness is compatible with “plans that lase”, whereas reward modeling does not show this
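As a rough illustration of how behaviors like (a), (b) and (c) can “fall out of” expected utility maximization under reward uncertainty, here is a minimal one-step sketch. It is not CIRL proper (which is a two-player game) and it is not taken from the discussion: the two reward hypotheses, the action names, the payoff numbers, the 99%-reliable answer model standing in for P(obs | reward), and the question cost are all hypothetical choices made for the example.

```python
from typing import Dict, Tuple

Belief = Dict[str, float]

# Hypothetical payoffs: reward of each action under each reward hypothesis.
# "do_x" and "do_y" are high-stakes; "safe" is mildly good under both hypotheses.
REWARD = {
    "A": {"do_x": 10.0, "do_y": -100.0, "safe": 1.0},
    "B": {"do_x": -100.0, "do_y": 10.0, "safe": 1.0},
}

# Assumed observation model P(obs | reward): the human's answer names the
# true hypothesis 99% of the time.
P_OBS = {
    "A": {"says_A": 0.99, "says_B": 0.01},
    "B": {"says_A": 0.01, "says_B": 0.99},
}

ASK_COST = 1.0  # assumed cost of asking one question


def expected_reward(belief: Belief, action: str) -> float:
    """Expected reward of a direct action under the current belief."""
    return sum(belief[h] * REWARD[h][action] for h in belief)


def best_direct(belief: Belief) -> Tuple[str, float]:
    """Best non-asking action and its expected reward."""
    action = max(REWARD["A"], key=lambda a: expected_reward(belief, a))
    return action, expected_reward(belief, action)


def posterior(belief: Belief, obs: str) -> Belief:
    """Bayes update: P(h | obs) is proportional to P(obs | h) * P(h)."""
    unnorm = {h: belief[h] * P_OBS[h][obs] for h in belief}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}


def value_of_asking(belief: Belief) -> float:
    """Expected reward of asking once, then acting on the updated belief."""
    total = 0.0
    for obs in ("says_A", "says_B"):
        p_obs = sum(belief[h] * P_OBS[h][obs] for h in belief)
        if p_obs > 0:
            total += p_obs * best_direct(posterior(belief, obs))[1]
    return total - ASK_COST


def choose(belief: Belief) -> str:
    """Pick whichever of asking or acting directly has higher expected reward."""
    action, value = best_direct(belief)
    return "ask" if value_of_asking(belief) > value else action


uncertain = {"A": 0.5, "B": 0.5}
confident = {"A": 0.99, "B": 0.01}
print(choose(uncertain))       # asks, because information is worth more than it costs
print(best_direct(uncertain))  # if asking were unavailable: the robustly good action
print(choose(confident))       # acts directly once the belief is sharp enough
```

In this toy, nothing about asking or caution is hard-coded: both appear only because they maximize expected reward under the current belief. Whether anything similar holds for a realistic P(obs | reward) is exactly what the surrounding discussion leaves open.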

[Ngo][4:04]

The meta-level objective that CIRL is pointing to, what makes that thing deserve the name “goodness”? Like, if I just gave an alien CIRL, and I said “this algorithm dials an AI towards a given thing”, and they looked at it without any preconceptions of what the designers wanted to do, why wouldn’t they say “huh, it looks like an algorithm for dialling in on some extrapolation of the unintended consequences of people’s behaviour” or something like that?

See also this part of my second discussion with Eliezer, where he brings up CIRL: [https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty#3_2__Brain_functions_and_outcome_pumps] He was emphasising that CIRL, and most other proposals for alignment algorithms, just shuffle the problematic consequentialism from the original place to a less visible place. I didn’t engage much with this argument because I mostly agree with it.

[Yudkowsky: +1]
[Shah][5:28]

I think you are misunderstanding my point. I am not claiming that we know how to implement CIRL such that it produces good outcomes; I agree this depends a ton on having a sufficiently good P(obs | reward). Similarly, if you gave CIRL to aliens, whether or not they say it is about getting some extrapolation of unintended consequences depends on exactly what P(obs | reward) you ended up using. There is some not-too-complicated P(obs | reward) such that you do end up getting to “goodness”, or something sufficiently close that it is not an existential catastrophe; I do not claim we know what it is.

I am claiming that behaviors like (a), (b) and (c) above are compatible with expected utility theory, and thus compatible with “plans that lase”. This is demonstrated by CIRL. It is not demonstrated by reward modeling, see e.g. these three papers for problems that arise (which make it so that it is working at cross purposes with itself and seems incompatible with “plans that lase”). (I’m most confident in the first supporting my point, it’s been a long time since I read them so I might be wrong about the others.) To my knowledge, similar problems don’t arise with CIRL (and they shouldn’t, because it is a nice integrated Bayesian agent doing expected utility theory).

I could imagine an objection that P(obs | reward), while not as complicated as “the utility function that rationalizes a twitching robot”, is still too complicated to really show compatibility with plans-that-lase, but pointing out that P(obs | reward) could be misspecified doesn’t seem particularly relevant to whether behaviors (a), (b) and (c) are compatible with plans-that-lase.

Re: shuffling around the problematic consequentialism: it is not my main plan to avoid consequentialism in the sense of plans-that-lase. I broadly agree with Eliezer that you need consequentialism to do high-impact stuff. My plan is for the consequentialism to be aimed at good ends. So I agree that there is still consequentialism in CIRL, and I don’t see this as a damning point; when I talk about “dialing in to goodness”, I am thinking of aiming the consequentialism at goodness, not getting rid of consequentialism.

(You can still do things like try to be domain-specific rather than domain-general; I don’t mean to completely exclude such approaches. They do seem to give additional safety. But the mainline story is that the consequentialism / optimization is directed at what we want rather than something else.)

[Ngo][6:21]

If you don’t know how to implement CIRL in such a way that it actually aims at goodness, then you don’t have an algorithm with properties a, b and c above.

Or, to put it another way: suppose I replace the word “goodness” with “winningness”. Now I can describe AlphaStar as follows:

  • it chooses relevant questions to ask (read: scouts to send) in order to get more information about winningness
  • it acts conservatively while it is still uncertain about what winningness is
  • it takes actions that are known to be robustly winningish, while still waiting for future information that clarifies the nuances of winningness

Now, you might say that the difference is that CIRL implements uncertainty over possible utility functions, not possible empirical beliefs. But this is just a semantic difference which shuffles the problem around without changing anything substantial. E.g. it’s exactly equivalent if we think of CIRL as an agent with a fixed (known) utility function, which just has uncertainty about some empirical parameter related to the humans it interacts with.

[Yudkowsky: +1]
[Soares][6:55]

[…] it take actions that are known to be robustly good, while still waiting for future information that clarifies the nuances of winningness

(typo: “known to be robustly good” -> “known to be robustly winningish” :-p)

[Ngo: 👍]

Some quick reactions, some from me and some from my model of Eliezer:

Eliezer thinks that whatever help they provide is not sufficient […] The crux here seems to be alignment difficulty.

I’d be more hesitant to declare the crux “alignment difficulty”. My understanding of Eliezer’s position on your “use AI to help with alignment” proposals (which focus on things like using AI to make paradigmatic AI systems more transparent) is “that was always the plan, and it doesn’t address the sort of problems I’m worried about”. Maybe you understand the problems Eliezer’s worried about, and believe them not to be very difficult to overcome, thus putting the crux somewhere like “alignment difficulty”, but I’m not convinced. 

I’d update towards your crux-hypothesis if you provided a good-according-to-Eliezer summary of what other problems Eliezer sees and the reasons-according-to-Eliezer that “AI make our tensors more transparent” doesn’t much address them.

Corrigibility_A […] Corrigibility_B […]

Of the two, Corrigibility_B does sound a little closer to my concept, though neither of your descriptions causes me to be confident that communication has occurred. Throwing some checksums out there:

  • There are three reasons a young weak AI system might accept your corrections. It could be corrigible, or it could be incorrigibly pursuing goodness, or it could be incorrigibly pursuing some other goal while calculating that accepting this correction is better according to its current goals than risking a shutdown.
  • One way you can tell that CIRL is not corrigible is that it does not accept corrections when old and strong.
  • There’s an intuitive notion of “you’re here to help us implement a messy and fragile concept not yet clearly known to us; work with us here?” that makes sense to humans, that includes as a side effect things like “don’t scan my brain and then disregard my objections; there could be flaws in how you’re inferring my preferences from my objections; it’s actually quite important that you be cautious and accept brain surgery even in cases where your updated model says we’re about to make a big mistake according to our own preferences”.

The point of the 2015 MIRI paper was to check whether it is possible to build a version of corrigibility_B that was compatible with expected utility maximization with a not-terribly-complicated utility function; the point of this was to see whether corrigibility could be made compatible with “plans that lase”.

More like:

  • Corrigibility seems, at least on the surface, to be in tension with the simple and useful patterns of optimization that tend to be spotlit by demands for cross-domain success, similar to how acting like two oranges are worth one apple and one apple is worth one orange is in tension with those patterns.
  • In practice, this tension seems to run more than surface-deep. In particular, various attempts to reconcile the tension fail, and cause the AI to have undesirable preferences (eg, incentives to convince you to shut it down whenever its utility is suboptimal), exploitably bad beliefs (eg, willingness to bet at unreasonable odds that it won’t be shut down), and/or to not be corrigible in the first place (eg, a preference for destructively uploading your mind against your protests, at which point further protests from your coworkers are screened off by its access to that upload).
[Yudkowsky: ✅]
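One standard way to see why the oranges-and-apples pattern in the first bullet is in tension with coherent optimization is a money pump. The sketch below is illustrative only and is not spelled out in the chat: it assumes a hypothetical trader who always accepts both exchanges (two oranges for one apple, and one apple for one orange), and the function name and starting stock are made up for the example.

```python
from typing import Tuple


def run_money_pump(oranges: int) -> Tuple[int, int]:
    """Cycle a trader through both exchanges until it cannot afford another.

    Returns (number of completed round trips, oranges left).
    """
    apples = 0
    cycles = 0
    while oranges >= 2:
        # Trader acts as if one apple is worth two oranges.
        oranges -= 2
        apples += 1
        # Trader also acts as if one orange is worth one apple.
        apples -= 1
        oranges += 1
        cycles += 1
    return cycles, oranges


cycles, left = run_money_pump(10)
print(cycles, left)  # 9 1
```

Each round trip leaves the trader one orange poorer with nothing gained, which is the sense in which such a preference pattern gets filtered out by pressure toward cross-domain success.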

(There’s an argument I occasionally see floating around these parts that goes “ok, well what if the AI is fractally corrigible, in the sense that instead of its cognition being oriented around pursuit of some goal, its cognition is oriented around doing what it predicts a human would do (or what a human would want it to do) in a corrigible way, at every level and step of its cognition”. This is perhaps where you perceive a gap between your A-type and B-type notions, where MIRI folk tend to be more interested in reconciling the tension between corrigibility and coherence, and Paulian folk tend to place more of their chips on some such fractal notion? 

I admit I don’t find much hope in the “fractally corrigible” view myself, and I’m not sure whether I could pass a proponent’s ITT, but fwiw my model of the Yudkowskian rejoinder is “mindspace is deep and wide; that could plausibly be done if you had sufficient mastery of minds; you’re not going to get anywhere near close to that in practice, because of the way that basic normal everyday cross-domain training will highlight patterns that you’d call orienting-cognition-around-a-goal”.)

And my super-quick takes on your avenues for future discussion:

1. Discussing anti-naturality of corrigibility.

Hopefully the above helps.

2. Discussing why it is very unlikely for the AI system to generalize correctly both on optimization and values-or-goals-that-guide-the-optimization

The concept “patterns of thought that are useful for cross-domain success” is latent in the problems the AI faces, and known to have various simple mathematical shadows, and our training is more-or-less banging the AI over the head with it day in and day out. By contrast, the specific values we wish to be pursued are not latent in the problems, are known to lack a simple boundary, and our training is much further removed from it.

3. More checking of where I am failing to pass your ITT

+1

4. Why is “dialing in on goodness” not a reasonable part of the solution space?

It has long been the plan to say something less like “the following list comprises goodness: …” and more like “yo we’re tryin to optimize some difficult-to-name concept; help us out?”. “Find a prior that, with observation of the human operators, dials in on goodness” is a fine guess at how to formalize the latter. 

If we had been planning to take the former tack, and you had come in suggesting CIRL, that might have helped us switch to the latter tack, which would have been cool. In that sense, it’s a fine part of the solution. 

It also provides some additional formality, which is another iota of potential solution-ness, for that part of the problem. 

It doesn’t much address the rest of the problem, which is centered much more around “how do you point powerful cognition in any direction at all” (such as towards your chosen utility function or prior thereover).

5. More concreteness on how optimization generalizes but corrigibility doesn’t, in the case where the AI was trained by human judgment on weak-safe domains

+1

[Shah][13:23]

If you don’t know how to implement CIRL in such a way that it actually aims at goodness, then you don’t have an algorithm with properties a, b and c above.

I want clarity on the premise here:

  • Is the premise “Rohin cannot write code that when run exhibits properties a, b, and c”? If so, I totally agree, but I’m not sure what the point is. All alignment work ever until the very last step will not lead you to writing code that when run exhibits an aligned superintelligence, but this does not mean that the prior alignment work was useless.
  • Is the premise “there does not exist code that (1) we would call an implementation of CIRL and (2) when run has properties a, b, and c”? If so, I think your premise is false, for the reasons given previously (I can repeat them if needed)

I imagine it is neither of the above, and you are trying to make a claim that some conclusion that I am drawing from or about CIRL is invalid, because in order for me to draw that conclusion, I need to exhibit the correct P(obs | reward). If so, I want to know which conclusion is invalid and why I have to exhibit the correct P(obs | reward) before I can reach that conclusion.

I agree that properties (a), (b) and (c) are simple, straightforward consequences of being Bayesian about a quantity you are uncertain about and care about, as with AlphaStar and “winningness”. I don’t know what you intend to imply by this: is it that because this also applies to other Bayesian things, it can’t imply anything about alignment? I also agree the uncertainty over reward is equivalent to uncertainty over some parameter of the human (and have proved this theorem myself in the paper I wrote on the topic). I do not claim that anything in here is particularly non-obvious or clever, in case anyone thought I was making that claim.
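For concreteness, the equivalence between reward uncertainty and uncertainty over a parameter of the human can be sketched as follows. The notation is an assumption made for this illustration (θ for the unknown parameter, R_θ for the reward it induces, U for a fixed utility over the augmented state), and it is not the statement or proof from the paper mentioned above.

```latex
\begin{align*}
P(\theta \mid \text{obs}) &\propto P(\text{obs} \mid \theta)\, P(\theta) \\
\mathbb{E}_{\theta \sim P(\theta \mid \text{obs})}\big[ R_\theta(s, a) \big]
  &= \mathbb{E}_{\theta \sim P(\theta \mid \text{obs})}\big[ U\big((s, \theta), a\big) \big],
  \qquad U\big((s, \theta), a\big) := R_\theta(s, a)
\end{align*}
```

Read left to right, the agent is uncertain about which reward function is correct; read right to left, it has one known utility function and is uncertain about an unobserved, static component θ of the world state. The two objectives coincide by definition, so they recommend the same actions.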

To state it again, my claim is that behaviors like (a), (b) and (c) are consistent with “plans-that-lase”, and as evidence for this claim I cite the existence of an expected-utility-maximizing algorithm that displays them, specifically CIRL with the correct P(obs | reward). I do not claim that I can write down the code, I am just claiming that it exists. If you agree with the claim but not the evidence then let’s just drop the point. If you disagree with the claim then tell me why it’s false. If you are unsure about the claim then point to the step in the argument you think doesn’t work.

The reason I care about this claim is that, even if you think that superintelligences only involve plans-that-lase, it seems to me that this does not rule out what we might call “dialing in to goodness” or “assisting the user”, and thus this seems like a valid target for you to try to get your superintelligence to do.

I suspect that I do not agree with Eliezer about what plans-that-lase can do, but it seems like the two of us should at least agree that behaviors like (a), (b) and (c) can be exhibited in plans-that-lase, and if we don’t agree on that some sort of miscommunication has happened.

 

Throwing some checksums out there

The checksums definitely make sense. (Technically I could name more reasons why a young AI might accept correction, such as “it’s still sphexish in some areas, accepting corrections is one of those areas”, and for the third reason the AI could be calculating negative consequences for things other than shutdown, but that seems nitpicky and I don’t think it means I have misunderstood you.)

I think the third one feels somewhat slippery and vague, in that I don’t know exactly what it’s claiming, but it clearly seems to be the same sort of thing as corrigibility. Mostly it’s more like I wouldn’t be surprised if the Textbook from the Future tells us that we mostly had the right concept of corrigibility, but that third checksum is not quite how they would describe it any more. I would be a lot more surprised if the Textbook says we mostly had the right concept but then says checksums 1 and 2 were misguided.

“The point of the 2015 MIRI paper was to check whether it is possible to build a version of corrigibility_B that was compatible with expected utility maximization with a not-terribly-complicated utility function; the point of this was to see whether corrigibility could be made compatible with ‘plans that lase’.”

More like:

  • Corrigibility seems, at least on the surface, to be in tension with the simple and useful patterns of optimization that tend to be spotlit by demands for cross-domain success, similar to how acting like two oranges are worth one apple and one apple is worth one orange is in tension with those patterns.
  • In practice, this tension seems to run more than surface-deep. In particular, various attempts to reconcile the tension fail, and cause the AI to have undesirable preferences (eg, incentives to convince you to shut it down whenever its utility is suboptimal), exploitably bad beliefs (eg, willingness to bet at unreasonable odds that it won’t be shut down), and/or to not be corrigible in the first place (eg, a preference for destructively uploading your mind against your protests, at which point further protests from your coworkers are screened off by its access to that upload).

On the 2015 Corrigibility paper, is this an accurate summary: “it wasn’t that we were checking whether corrigibility could be compatible with useful patterns of optimization; it was already obvious at least at a surface level that corrigibility was in tension with these patterns, and we wanted to check and/or show that this tension persisted more deeply and couldn’t be easily fixed”.

(My other main hypothesis is that there’s an important distinction between “simple and useful patterns of optimization” (term in your message) and “plans that lase” (term in my message) but if so I don’t know what it is.)

[Soares][13:52]

What we wanted to do was show that the apparent tension was merely superficial. We failed.

[Shah: 👍]

(Also, IIRC — and it’s been a long time since I checked — the 2015 paper contains only one exploration, relating to an idea of Stuart Armstrong’s. There was another host of ideas raised and shot down in that era that didn’t make it into that paper, pro’lly b/c they came afterwards.)

[Shah][13:55]

What we wanted to do was show that the apparent tension was merely superficial. We failed.

(That sounds like what I originally said? I’m a bit confused why you didn’t just agree with my original phrasing:

The point of the 2015 MIRI paper was to check whether it is possible to build a version of corrigibility_B that was compatible with expected utility maximization with a not-terribly-complicated utility function; the point of this was to see whether corrigibility could be made compatible with “plans that lase”.

)

(I’m kinda worried that there’s some big distinction between “EU maximization”, “plans that lase”, and “simple and useful patterns of optimization”, that I’m not getting; I’m treating them as roughly equivalent at the moment when putting on my MIRI-ontology-hat.)

[Soares][14:01]

(There are a bunch of aspects of your phrasing that indicated to me a different framing, and one I find quite foreign. For instance, this talk of “building a version of corrigibility_B” strikes me as foreign, and the talk of “making it compatible with ‘plans that lase’” strikes me as foreign. It’s plausible to me that you, who understand your original framing, can tell that my rephrasing matches your original intent. I do not yet feel like I could emit the description you emitted without contorting my thoughts about corrigibility in foreign ways, and I’m not sure whether that’s an indication that there are distinctions, important to me, that I haven’t communicated.)

(I’m kinda worried that there’s some big distinction between “EU maximization”, “plans that lase”, and “simple and useful patterns of optimization”, that I’m not getting; I’m treating them as roughly equivalent at the moment when putting on my MIRI-ontology-hat.)

I, too, believe them to be basically equivalent (with the caveat that the reason for using expanded phrasings is because people have a history of misunderstanding “utility maximization” and “coherence”, and so insofar as you round them all to “coherence” and then argue against some very narrow interpretation of coherence, I’m gonna protest that you’re bailey-and-motting).

[Shah: 👍]
[Shah][14:12]

Hopefully the above helps.

I’m still interested in the question “Does this mean you predict that you, or I, with suitably enhanced intelligence and/or reflectivity, would not be capable of producing a plan to help an alien civilization optimize their world, with that plan being corrigible w.r.t the aliens?” I don’t currently understand how you avoid making this prediction given other stated beliefs. (Maybe you just bite the bullet and do predict this?)

By contrast, the specific values we wish to be pursued are not latent in the problems, are known to lack a simple boundary, and our training is much further removed from it.

I’m not totally sure what is meant by “simple boundary”, but it seems like a lot of human values are latent in text prediction on the Internet, and when training from human feedback the training is not very removed from values.

It has long been the plan to say something less like “the following list comprises goodness: …” and more like “yo we’re tryin to optimize some difficult-to-name concept; help us out?”. […]

I take this to mean that “dialing in on goodness” is a reasonable part of the solution space? If so, I retract that question. I thought from previous comments that Eliezer thought this part of solution space was more doomed than corrigibility.

(I get the sense that people think that I am butthurt about CIRL not getting enough recognition or something. I do in fact think this, but it’s not part of my agenda here. I originally brought it up to make the argument that corrigibility is not in tension with EU maximization, then realized that I was mistaken about what “corrigibility” meant, but still care about the argument that “dialing in on goodness” is not in tension with EU maximization. But if we agree on that claim then I’m happy to stop talking about CIRL.)

[Soares][14:13]

I’d be capable of helping aliens optimize their world, sure. I wouldn’t be motivated to, but I’d be capable.

[Shah][14:14]

(There are a bunch of aspects of your phrasing that indicated to me a different framing, and one I find quite foreign. For instance, this talk of “building a version of corrigibility_B” strikes me as foreign, and the talk of “making it compatible with ‘plans that lase’” strikes me as foreign. It’s plausible to me that you, who understand your original framing, can tell that my rephrasing matches your original intent. I do not yet feel like I could emit the description you emitted without contorting my thoughts about corrigibility in foreign ways, and I’m not sure whether that’s an indication that there are distinctions, important to me, that I haven’t communicated.)

This makes sense. I guess you might think of these concepts as quite pinned down? Like, in your head, EU maximization is just a kind of behavior (= set of behaviors), corrigibility is just another kind of behavior (= set of behaviors), and there’s a straightforward yes-or-no question about whether the intersection is empty which you set out to answer, you can’t “make” it come out one way or the other, nor can you “build” a new kind of corrigibility

[Soares][14:17]

Re: CIRL, my current working hypothesis is that by “use CIRL” you mean something analogous to what I say when I say “do CEV” — namely, direct the AI to figure out what we “really” want in some correct sense, rather than attempting to specify what we want concretely. And to be clear, on my model, this is part of the solution to the overall alignment problem, and it’s more-or-less why we wouldn’t die immediately on the “value is fragile / we can’t name exactly what we want” step if we solved the other problems.

My guess as to the disagreement about how much credit CIRL should get, is that there is in fact a disagreement, but it’s not coming from MIRI folk saying “no we should be specifying the actual utility function by hand”, it’s coming from MIRI folk saying “this is just the advice ‘do CEV’ dressed up in different clothing and presented as a reason to stop worrying about corrigibility, which is irritating, given that it’s orthogonal to corrigibility”.

If you wanna fight that fight, I’d start by asking: Do you think CIRL is doing anything above and beyond what “use CEV” is doing? If so, what?

Regardless, I think it might be a good idea for you to try to pass my (or Eliezer’s) ITT about what parts of the problem remain beyond the thing I’d call “do CEV” and why they’re hard. (Not least b/c if my working hypothesis is wrong, demonstrating your mastery of that subject might prevent a bunch of toil covering ground you already know.)

[Shah][14:17]

I’d be capable of helping aliens optimize their world, sure. I wouldn’t be motivated to, but I’d be capable.

Okay, so it seems like the danger requires the thing-producing-the-plan to be badly-motivated. But then I’m not sure why it seems so impossible to have a (not-badly-motivated) thing that, when given a goal, produces a plan to corrigibly get that goal. (This is a scenario Richard mentioned earlier.)

[Soares][14:19]

This makes sense. I guess you might think of these concepts as quite pinned down? Like, in your head, EU maximization is just a kind of behavior (= set of behaviors), corrigibility is just another kind of behavior (= set of behaviors), and there’s a straightforward yes-or-no question about whether the intersection is empty which you set out to answer, you can’t “make” it come out one way or the other, nor can you “build” a new kind of corrigibility

That sounds like one of the big directions in which your framing felt off to me, yeah :-). (I don’t fully endorse that rephrasing, but it seems directionally correct to me.)

Okay, so it seems like the danger requires the thing-producing-the-plan to be badly-motivated. But then I’m not sure why it seems so impossible to have a (not-badly-motivated) thing that, when given a goal, produces a plan to corrigibly get that goal. (This is a scenario Richard mentioned earlier.)

On my model, aiming the powerful optimizer is the hard bit.

Like, once I grant “there’s a powerful optimizer, and all it does is produce plans to corrigibly attain a given goal”, I agree that the problem is mostly solved.

There’s maybe some cleanup, but the bulk of the alignment challenge preceded that point.

[Shah: 👍]

(This is hard for all the usual reasons, that I suppose I could retread.)

[Shah][14:24]

[…] Regardless, I think it might be a good idea for you to try to pass my (or Eliezer’s) ITT about what parts of the problem remain beyond the thing I’d call “do CEV” and why they’re hard. (Not least b/c if my working hypothesis is wrong, demonstrating your mastery of that subject might prevent a bunch of toil covering ground you already know.)

(Working on ITT)

[Soares][14:30]

(To clarify some points of mine, in case this gets published later to other readers: (1) I might call it more centrally something like “build a DWIM system” rather than “use CEV”; and (2) this is not advice about what your civilization should do with early AGI systems, I strongly recommend against trying to pull off CEV under that kind of pressure.)

[Shah][14:32]

I don’t particularly want to have fights about credit. I just didn’t want to falsely state that I do not care about how much credit CIRL gets, when attempting to head off further comments that seemed designed to appease my sense of not-enough-credit. (I’m also not particularly annoyed at MIRI, here.)

On passing ITT, about what’s left beyond “use CEV” (stated in my ontology because it’s faster to type; I think you’ll understand, but I can also translate if you think that’s important):

  • The main thing is simply how to actually get the AI system to care about pursuing CEV. I think MIRI ontology would call this the target loading problem.
  • This is hard because (a) you can’t just train on CEV, because you can’t just implement CEV and provide that as training and (b) even if you magically could train on CEV, that does not establish that the resulting AI system then wants to optimize CEV. It could just as well optimize some other objective that correlated with CEV in the situations you trained, but no longer correlates in some new situation (like when you are building a nanosystem). (Point (b) is how I would talk about inner alignment.)
  • This is made harder for a variety of reasons, including (a) you’re working with inscrutable matrices that you can’t look at the details of, (b) there are clear racing incentives when the prize is to take over the world (or even just lots of economic profit), (c) people are unlikely to understand the issues at stake (the exact reasons are unclear to me; I’d guess that the issues are too subtle / conceptual, plus pressure to rationalize them away), (d) there’s very little time in which we have a good understanding of the situation we face, because of fast / discontinuous takeoff
[Soares: 👍]
[Soares][14:37]

Passable ^_^ (Not exhaustive, obviously; “it will have a tendency to kill you on the first real try if you get it wrong” being an example missing piece, but I doubt you were trying to be exhaustive.) Thanks.

[Shah: 👍]

Okay, so it seems like the danger requires the thing-producing-the-plan to be badly-motivated. But then I’m not sure why it seems so impossible to have a (not-badly-motivated) thing that, when given a goal, produces a plan to corrigibly get that goal. (This is a scenario Richard mentioned earlier.)

I’m uncertain where the disconnect is here. Like, I could repeat some things from past discussions about how “it only outputs plans, it doesn’t execute them” does very little (not nothing, but very little) from my perspective? Or you could try to point at past things you’d expect me to repeat and name why they don’t seem to apply to you?

[Shah][14:40]

(Flagging that I should go to bed soon, though it doesn’t have to be right away)

[Yudkowsky][14:50]

…I do not know if this is going to help anything, but I have a feeling that there’s a frequent disconnect wherein I invented an idea, considered it, found it necessary-but-not-sufficient, and moved on to looking for additional or varying solutions, and then a decade or in this case 2 decades later, somebody comes along and sees this brilliant solution which MIRI is for some reason neglecting

this is perhaps exacerbated by a deliberate decision during the early days, when I looked very weird and the field was much more allergic to weird, to not even try to stamp my name on all the things I invented.  eg, I told Nick Bostrom to please use various of my ideas as he found appropriate and only credit them if he thought that was strategically wise.

I expect that some number of people now in the field don’t know I invented corrigibility, and any number of other things that I’m a little more hesitant to claim here because I didn’t leave Facebook trails for inventing them

and unless you had been around for quite a while, you definitely wouldn’t know that I had been (so far as I know) the first person to perform the unexceptional-to-me feat of writing down, in 2001, the very obvious idea I called “external reference semantics”, or as it’s called nowadays, CIRL

[Shah][14:53]

I really honestly am not trying to say that MIRI didn’t think of CIRL-like things, nor am I trying to get credit for CIRL. I really just wanted to establish that “learn what is good to do” seems not-ruled-out by EU maximization. That’s all. It sounds like we agree on this point and if so I’d prefer to drop it.

[Soares: ❤]

[Yudkowsky][14:53]

Having a prior over utility functions that gets updated by evidence is not ruled out by EU maximization.  That exact thing is hard for other reasons than it being contrary to the nature of EU maximization.

If it was ruled out by EU maximization for any simple reason, I would have noticed that back in 2001.
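As a toy illustration of the point being agreed on here (not anything taken from the discussion itself), the sketch below keeps a prior over two candidate utility functions, updates it on one piece of evidence with Bayes' rule, and picks the action with the highest expected utility under the current belief. The hypothesis names, the likelihood table, and all numbers are made up for illustration; the hard parts Eliezer alludes to (where the hypothesis space and likelihoods come from, and whether they are right) are exactly what this sketch assumes away.

```python
from typing import Dict, List

# Hypothetical candidate utility functions: hypothesis -> action -> utility.
# Each action is assumed to deterministically produce one outcome.
UTILITIES: Dict[str, Dict[str, float]] = {
    "likes_tea": {"tea": 1.0, "coffee": 0.0},
    "likes_coffee": {"tea": 0.0, "coffee": 1.0},
}

# Hypothetical likelihoods: P(observation | hypothesis).
LIKELIHOOD: Dict[str, Dict[str, float]] = {
    "likes_tea": {"asks_tea": 0.9, "asks_coffee": 0.1},
    "likes_coffee": {"asks_tea": 0.2, "asks_coffee": 0.8},
}

ACTIONS: List[str] = ["tea", "coffee"]


def bayes_update(prior: Dict[str, float], observation: str) -> Dict[str, float]:
    """Return the posterior over utility hypotheses after one observation."""
    unnormalized = {h: p * LIKELIHOOD[h][observation] for h, p in prior.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: w / total for h, w in unnormalized.items()}


def expected_utility(action: str, belief: Dict[str, float]) -> float:
    """Expected utility of an action, averaging over utility hypotheses."""
    return sum(p * UTILITIES[h][action] for h, p in belief.items())


def best_action(belief: Dict[str, float]) -> str:
    """Plain expected-utility maximization under the current belief."""
    return max(ACTIONS, key=lambda a: expected_utility(a, belief))


prior = {"likes_tea": 0.3, "likes_coffee": 0.7}
posterior = bayes_update(prior, "asks_tea")
```

With these made-up numbers, the agent prefers `coffee` under the prior and switches to `tea` after observing `asks_tea`, while remaining an ordinary expected-utility maximizer throughout.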

[Ngo][14:54]

I think we all agree on this point.

[Shah: 👍] [Soares: 👍]

One thing I’d note is that during my debate with Eliezer, I’d keep saying “oh so you think X is impossible” and he’d say “no, all these things are possible, they’re just really really hard”.

[Yudkowsky][14:58]

…to do correctly on your first try when a failed attempt kills you.

[Shah][14:58]

Maybe it’s fine; perhaps the point is just that target loading is hard, and the question is why target loading is so hard.

From my perspective, the main confusing thing about the Eliezer/Nate view is how confident it is. With each individual piece, I (usually) find myself nodding along and saying “yes, it seems like if we wanted to guarantee safety, we would need to solve this”. What I don’t do is say “yes, it seems like without a solution to this, we’re near-certainly dead”. The uncharitable view (which I share mainly to emphasize where the disconnect is, not because I think it is true) would be something like “Eliezer/Nate are falling to a Murphy bias, where they assume that unless they have an ironclad positive argument for safety, the worst possible thing will happen and we all die”. I try to generate things that seem more like ironclad (or at least “leatherclad”) positive arguments for doom, and mostly don’t succeed; when I say “human values are very complicated” there’s the rejoinder that “a superintelligence will certainly know about human values; pointing at them shouldn’t take that many more bits”; when I say “this is ultimately just praying for generalization”, there’s the rejoinder “but it may in fact actually generalize”; add to all of this the fact that a bunch of people will be trying to prevent the problem and it seems weird to be so confident in doom.

A lot of my questions are going to be of the form “it seems like this is a way that we could survive; it definitely involves luck and does not say good things about our civilization, but it does not seem as improbable as the word ‘miracle’ would imply”

[Yudkowsky][15:00]

heh.  from my standpoint, I’d say of this that it reflects those old experiments where if you ask people for their “expected case” it’s indistinguishable from their “best case” (since both of these involve visualizing various things going on their imaginative mainline, which is to say, as planned) and reality is usually worse than their “worst case” (because they didn’t adjust far enough away from their best-case anchor towards the statistical distribution for actual reality when they were trying to imagine a few failures and disappointments of the sort that reality had previously delivered)

it rhymes with the observation that it’s incredibly hard to find people – even inside the field of computer security – who really have what Bruce Schneier termed the security mindset, of asking how to break a cryptography scheme, instead of imagining how your cryptography scheme could succeed

from my perspective, people are just living in a fantasy reality which, if we were actually living in it, would not be full of failed software projects or rocket prototypes that blow up even after you try quite hard to get a system design about which you made a strong prediction that it wouldn’t explode

they think something special has to go wrong with a rocket design, that you must have committed some grave unusual sin against rocketry, for the rocket to explode

as opposed to every rocket wanting really strongly to explode and needing to constrain every aspect of the system to make it not explode and then the first 4 times you launch it, it blows up anyways

why? because of some particular technical issue with O-rings, with the flexibility of rubber in cold weather?

[Shah][15:05]

(I have read your Rocket Alignment and security mindset posts. Not claiming this absolves me of bias, just saying that I am familiar with them)

[Yudkowsky][15:05]

no, because the strains and temperatures in rockets are large compared to the materials that we use to make up the rockets

the fact that sometimes people are wrong in their uncertain guesses about rocketry does not make their life easier in this regard

the less they understand, the less ability they have to force an outcome within reality

it’s no coincidence that when you are Wrong about your rocket, the particular form of Being Wrong that reality delivers to you as a surprise message, is not that you underestimated the strength of steel and so your rocket went to orbit and came back with fewer scratches on the hull than expected

when you are working with powerful forces there is not a symmetry around pleasant and unpleasant surprises being equally likely relative to your first-order model.  if you’re a good Bayesian, they will be equally likely relative to your second-order model, but this requires you to be HELLA pessimistic, indeed, SO PESSIMISTIC that sometimes you are pleasantly surprised

which looks like such a bizarre thing to a mundane human that they will gather around and remark at the case of you being pleasantly surprised

they will not be used to seeing this

and they shall say to themselves, “haha, what pessimists”

because to be unpleasantly surprised is so ordinary that they do not bother to gather and gossip about it when it happens

my fundamental sense about the other parties in this debate, underneath all the technical particulars, is that they’ve constructed a Murphy-free fantasy world from the same fabric that weaves crazy optimistic software project estimates and brilliant cryptographic codes whose inventors didn’t quite try to break them, and are waiting to go through that very common human process of trying out their optimistic idea, letting reality gently correct them, predictably becoming older and wiser and starting to see the true scope of the problem, and so in due time becoming one of those Pessimists who tell the youngsters how ha ha of course things are not that easy

this is how the cycle usually goes

the problem is that instead of somebody’s first startup failing and them then becoming much more pessimistic about lots of things they thought were easy and then doing their second startup

the part where they go ahead optimistically and learn the hard way about things in their chosen field which aren’t as easy as they hoped

[Shah][15:13]

Do you want to bet on that? That seems like a testable prediction about beliefs of real people in the not-too-distant future

[Yudkowsky][15:13]

kills everyone

not just them

everyone

this is an issue

how on Earth would we bet on that if you think the bet hasn’t already resolved? I’m describing the attitudes of people that I see right now today.

[Shah][15:15]

Never mind, I wanted to bet on “people becoming more pessimistic as they try ideas and see them fail”, but if your idea of “see them fail” is “superintelligence kills everyone” then obviously we can’t bet on that

(people here being alignment researchers, obviously ones who are not me)

[Yudkowsky][15:17]

there is some element here of the Bayesian not updating in a predictable direction, of executing today the update you know you’ll make later, of saying, “ah yes, I can see that I am in the same sort of situation as the early AI pioneers who thought maybe it would take a summer and actually it was several decades because Things Were Not As Easy As They Imagined, so instead of waiting for reality to correct me, I will imagine myself having already lived through that and go ahead and be more pessimistic right now, not just a little more pessimistic, but so incredibly pessimistic that I am as likely to be pleasantly surprised as unpleasantly surprised by each successive observation, which is even more pessimism than even some sad old veterans manage”, an element of genre-savviness, an element of knowing the advice that somebody would predictably be shouting at you from outside, of not just blindly enacting the plot you were handed

and I don’t quite know why this is so much less common than I would have naively thought it would be

why people are content with enacting the predictable plot where they start out cheerful today and get some hard lessons and become pessimistic later

they are their own scriptwriters, and they write scripts for themselves about going into the haunted house and then splitting up the party

I would not have thought that to defy the plot was such a difficult thing for an actual human being to do

that it would require so much reflectivity or something, I don’t know what else

nor do I know how to train other people to do it if they are not doing it already

but that from my perspective is the basic difference in gloominess

I am a time-traveler who came back from the world where it (super duper predictably) turned out that a lot of early bright hopes didn’t pan out and various things went WRONG and alignment was HARD and it was NOT SOLVED IN ONE SUMMER BY TEN SMART RESEARCHERS

and now I am trying to warn people about this development which was, from a certain perspective, really quite obvious and not at all difficult to see coming

but people are like, “what the heck are you doing, you are enacting the wrong part of the plot, people are currently supposed to be cheerful, you can’t prove that anything will go wrong, why would I turn into a grizzled veteran before the part of the plot where reality hits me over the head with the awful real scope of the problem and shows me that my early bright ideas were way too optimistic and naive”

and I’m like “no you don’t get it, where I come from, everybody died and didn’t turn into grizzled veterans”

and they’re like “but that’s not what the script says we do next”… or something, I do not know what leads people to think like this because I do not think like that myself

[Soares][15:24]

(I think what they actually do is say “it’s not obvious to me that this is one of those scenarios where we become grizzled veterans, as opposed to things just actually working out easily”)

(“many things work out easily all the time; obviously society spends a bunch more focus on things that don’t work out easily b/c the things that work easily tend to get resolved fairly quickly and then you don’t notice them”, or something)

(more generally, I kinda suspect that bickering closer to the object level is likely more productive)

(and i suspect this convo might be aided by Rohin naming a concrete scenario where things go well, so that Eliezer can lament the lack of genre savviness in various specific points)

[Yudkowsky][15:26]

there are, of course, lots of more local technical issues where I can specifically predict the failure mode for somebody’s bright-eyed naive idea, especially when I already invented a more sophisticated version a decade or two earlier, and this is what I’ve usually tried to discuss

[Soares: ❤]

because conversations like that can sometimes make any progress

[Soares][15:26]

(and possibly also Eliezer naming a concrete story where things go poorly, so that Rohin may lament the seemingly blind pessimism & premature grizzledness)

[Yudkowsky][15:27]

whereas if somebody lacks the ability to see the warning signs of which genre they are in, I do not know how to change the way they are by talking at them

[Shah][15:28]

Unsurprisingly I have disagreements with the meta-level story, but it seems really thorny to make progress on and I’m kinda inclined to not discuss it. I also should go to sleep now.

One thing it did make me think of — it’s possible that the “do it correctly on your first try when a failed attempt kills you” could be the crux here. There’s a clearly-true sense which is “the first time you build a superintelligence that you cannot control, if you have failed in your alignment, then you die”. There’s a different sense which is “and also, anything you try to do with non-superintelligences that you can control, will tell you approximately nothing about the situation you face when you build a superintelligence”. I mostly don’t agree with the second sense, but if Eliezer / Nate do agree with it, that would go a long way to explaining the confidence in doom.

Two arguments I can see for the second sense: (1) the non-superintelligences only seem to respond well to alignment schemes because they don’t yet have the core of general intelligence, and (2) the non-superintelligences only seem to respond well to alignment schemes because despite being misaligned they are doing what we want in order to survive and later execute a treacherous turn. EDIT: And (3) fast takeoff = not much time to look at the closest non-dangerous examples

(I still should sleep, but would be interested in seeing thoughts tomorrow, and if enough people think it’s actually worthwhile to engage on the meta level I can do that. I’m cheerful about engaging on specific object-level ideas.)

[Soares: 💤]

[Yudkowsky][15:28]

it’s not that early failures tell you nothing

the failure of the 1955 Dartmouth Project to produce strong AI over a summer told those researchers something

it told them the problem was harder than they’d hoped on the first shot

it didn’t show them the correct way to build AGI in 1957 instead

[Bensinger][16:41]

Linking to a chat log between Eliezer and some anonymous people (and Steve Omohundro) from early September: [https://www.lesswrong.com/posts/CpvyhFy9WvCNsifkY/discussion-with-eliezer-yudkowsky-on-agi-interventions]

Eliezer tells me he thinks it pokes at some of Rohin’s questions

[Yudkowsky][16:48]

I’m not sure that I can successfully, at this point, go back up and usefully reply to the text that scrolled past – I also note some internal grinding about this having turned into a thing which has Pending Replies instead of Scheduled Work Hours – and this maybe means that in the future we shouldn’t have such a general chat here, which I didn’t anticipate before the fact.  I shall nonetheless try to pick out some things and reply to them.

[Shah: 👍]

  • While I think people agree on the behaviors of corrigibility, I am not sure they agree on why we want it. Eliezer wants it for surviving failures, but maybe others want it for “dialing in on goodness”. When I think about a “broad basin of corrigibility”, that intuitively seems more compatible with the “dialing in on goodness” framing (but this is an aesthetic judgment that could easily be wrong).

This is a weird thing to say in my own ontology.

There’s a general project of AGI alignment where you try to do some useful pivotal thing, which has to be powerful enough to be pivotal, and so you somehow need a system that thinks powerful thoughts in the right direction without it killing you.

This could include, for example:

  • Trying to train in “low impact” via an RL loss function that penalizes a sufficiently broad range of “impacts” that we hope the learned impact penalty generalizes to all the things we’d consider impacts – even as we scale up the system, without the sort of obvious pathologies that would materialize only over options available to sufficiently powerful systems, like sending out nanosystems to erase the visibility of its actions from human observers
  • Tweaking MCTS search code so that it behaves in the fashion of “mild optimization” or “taskishness” instead of searching as hard as it has power available to search
  • Exposing the system to lots of labeled examples of relatively simple and safe instructions being obeyed, hoping that it generalizes safe instruction-following to regimes too dangerous for us to inspect outputs and label results
  • Writing code that tries to recognize cases of activation vectors going outside the bounds they occupied during training, as a check on whether internal cognitive conservatism is being violated or something is seeking out adversarial counterexamples to a constraint

You could say that only parts 1 and 3 are “dialing in on goodness” because only those parts involve iteratively refining a target, or you could say that all 4 parts are “dialing in on goodness” because parts 2 and 4 help you stay alive while you’re doing the iterative refining.  But I don’t see this distinction as fundamental or particularly helpful.  What if, on part 4, you were training something to recognize out-of-bounds activations, instead of trying to hardcode it?  Is that dialing in on goodness?  Or is it just dialing in on survivability or corrigibility or whatnot?  Or maybe even part 3 isn’t really “dialing in on goodness” because the true distinction between Good and Evil is still external in the programmers and not inside the system?
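For concreteness, the hardcoded version of the fourth item in the list above could look something like the following minimal sketch. It records the per-coordinate range of activation vectors seen during training and flags any later vector that leaves that range. The class name, the `margin` parameter, and the per-coordinate box check are all assumptions made for illustration; nothing here is claimed to be sufficient for detecting violated conservatism or adversarial search in a real system.

```python
from typing import List, Sequence


class ActivationBoundsMonitor:
    """Hypothetical hardcoded check: per-coordinate min/max seen in training."""

    def __init__(self, dim: int, margin: float = 0.0) -> None:
        self.low: List[float] = [float("inf")] * dim
        self.high: List[float] = [float("-inf")] * dim
        self.margin = margin  # tolerance added on both sides of the range

    def observe(self, activation: Sequence[float]) -> None:
        """Widen the recorded bounds to include a training-time activation."""
        if len(activation) != len(self.low):
            raise ValueError("activation has the wrong dimension")
        for i, x in enumerate(activation):
            self.low[i] = min(self.low[i], x)
            self.high[i] = max(self.high[i], x)

    def out_of_bounds(self, activation: Sequence[float]) -> List[int]:
        """Return indices of coordinates outside the recorded training range."""
        if len(activation) != len(self.low):
            raise ValueError("activation has the wrong dimension")
        return [
            i
            for i, x in enumerate(activation)
            if x < self.low[i] - self.margin or x > self.high[i] + self.margin
        ]


monitor = ActivationBoundsMonitor(dim=3, margin=0.1)
for training_vector in ([0.0, 1.0, -1.0], [0.5, 2.0, -0.5]):
    monitor.observe(training_vector)

inside = monitor.out_of_bounds([0.2, 1.5, -0.7])
outside = monitor.out_of_bounds([0.55, 2.5, -2.0])
```

Here `inside` is empty, while `outside` lists coordinates 1 and 2. The learned variant mentioned above would replace the fixed box with a trained detector, which is where the hardcoded/learned distinction would start to matter.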

I don’t see this as an especially useful distinction to draw.  There’s a hardcoded/learned distinction that probably does matter in several places.  There’s a maybe-useful forest-level distinction between “actually doing the pivotal thing” and “not destroying the world as a side effect” which breaks down around the trees because the very definition of “that pivotal thing you want to do” is to do that thing and not to destroy the world.