Full Transcript

·YouTLDR

I lead AGI safety at Google DeepMind – here's the view from the inside | Rohin Shah

2:44:1526,758 words · ~134 min readEnglishTranscribed Jun 2, 2026
AI Summary

Google DeepMind's Rohin Shah argues that catastrophic AI misalignment is not the default outcome, advocating for empirical, flexible alignment methods and expert third-party auditing over rigid public safety commitments.

This video provides a rare, detailed look into the strategic, technical, and governance frameworks of one of the world's leading AGI development teams during a critical phase of AI capability scaling.

Section summaries

0:00-10:37

Introduction & Optimism on Default Alignment

watch

Establishes Rohin's baseline perspective on AI safety risks and why he doubts standard catastrophic failure arguments.

10:38-18:21

The Limits of Corporate Safety Commitments

watch

Essential strategic debate on why pre-deployment public commitments are often counterproductive.

18:21-27:37

Third-Party Auditing & Safety Scorecards

watch

Explains the role of expert external auditing and tools like AI Lab Watch as viable alternatives to promises.

27:37-37:37

Internal Team Dynamics & Governance Models

optional

Discusses DeepMind's internal organizational structures, which are interesting but less technically vital.

37:37-54:17

Pre-Deployment Evals vs. Continuous Progress

watch

Challenges the consensus on strict pre-deployment gating in favor of continuous monitoring and buffers.

54:17-1:09:51

The Science of Chain-of-Thought Monitoring

watch

Highly technical segment explaining the physics of transformer depth and legible reasoning traces.

1:09:51-1:28:59

Addressing Critiques of Gemini Reports

optional

A direct response to specific blog critiques of DeepMind's reporting choices; highly contextual.

1:28:59-1:52:44

Technical Papers: MONA & Internal Security Plans

watch

Deep dive into actionable engineering mitigations, specifically addressing how to handle untrusted models.

Key points

  • Low Opaque Serial Depth in Transformers — Transformers must use externalized reasoning (like a chain of thought) as a form of working memory because sequential steps of internal computation are heavily constrained by parallel processing hardware (GPUs/TPUs). This 'opaque serial depth' limitation keeps the models' reasoning legible to humans in natural language.
  • Flaws of Pre-Deployment Commitments — Tying AI labs to static, forward-looking commitments can backfire as technical research evolves—such as the shift from injecting alignment data during pretraining to actively filtering it out to prevent models from learning malicious personas or evasion techniques.
  • Myopic Optimization with Non-Myopic Approval (MONA) — A technical training framework that prevents multi-step reward hacking by evaluating separate actions individually without backpropagating future rewards, while utilizing an intelligent overseer to evaluate whether the current step aligns with future goals.
  • Tool vs. Population Distinction in AGI Scaling — Current AI progress remains linear because models function as highly productive scientific tools rather than autonomous researchers. An actual intelligence explosion requires AIs to fully automate the generation and execution of novel ideas, effectively expanding the researcher population.
If you train it to be deceptive on relatively short-horizon tasks, maybe that will generalise to long-horizon tasks. I don't think we have an argument that rules it out, which is why I say that it's plausible, but I don't think it's the default thing that you should predict from that. Rohin Shah
Rules that you write down in advance are one of the stupidest... Sorry, I mean 'stupid' in the sense of the rule itself clearly can't have very much intelligence in it, otherwise it would not be a rule. Rohin Shah

AI-generated from the transcript. May contain errors.

0:00

Rob Wiblin: Today I’m speaking with Rohin Shah,  

0:02

who is head of AGI alignment  and safety at Google DeepMind.

0:06

I suppose, Rohin, you’ve ended up,  for better or worse — hopefully  

0:09

for better — being one of the more  influential, dare I even say powerful,  

0:13

people to come out of the AGI alignment  and safety ecosystem and school of thought.

0:19

You were generous enough to be super opinionated  with me when you came on the show two years ago,  

0:22

and judging by the notes that you’ve sent over  this week, you’re ready to be opinionated again.

0:27

Thanks so much for coming back on the show, Rohin.

0:29

Rohin Shah: Yeah, thanks a lot, Rob, and  that’s a very generous intro. And in the  

0:33

interest of being very opinionated, I do  want to emphasise that these opinions are  

0:38

mine alone. They’re not meant to represent  the opinions of Google or Google DeepMind.

0:43

Rob Wiblin: That’s how we like it. If  you were representing Google DeepMind,  

0:45

it might sound more like a press release.

0:49

So you were really very early in the  scheme of things to the whole misalignment,  

0:54

AI/AGI security issues. I suppose you got  involved in 2017, so you’re in the first  

0:59

few percent of people who started working  on this professionally. But despite that,  

1:04

you think that probably we’re not going to get  catastrophic misalignment, that our chances are  

1:08

really pretty good, and that probably prosaic,  ordinary alignment techniques — the kinds of  

1:12

things that Google DeepMind and other AI companies  are doing — will probably succeed at preventing  

1:16

at least catastrophic misalignment. Why  do you think our chances are so good?

1:21

Rohin Shah: There’s a few different  disjunctive reasons. I don’t feel like  

1:24

there’s one particular thing. Probably the  highest level bit is that I don’t feel like  

1:31

there is any particularly compelling argument  that this is the thing that happens by default.  

1:37

I think there’s a lot of arguments that  are suggestive that maybe it could happen,  

1:42

such that you should find it plausible. I think  that’s sufficient to justify a significant  

1:48

amount of effort into averting it, which is  why I work in the area that I do work in.  

1:54

But none of them really rise to the level of  like, now I’m expecting this to happen by default.

1:59

I think every argument that I’ve seen,  there are pretty significant holes one  

2:04

could poke if you tried to take them  as arguments for “this is what happens,  

2:09

likely,” as opposed to “this is a  plausible thing that could happen.”

2:15

Rob Wiblin: Yeah. I mean, people have  tried to put forward arguments why this  

2:18

is likely or inevitable. There’s obviously the  Yudkowsky-style argument, which I guess is focused  

2:22

on misgeneralisation and adversarial examples. I  guess there’s the Ajeya Cotra and Joe Carlsmith  

2:29

take, which I think Carlsmith describes best in  “Is power-seeking AI an existential risk?,” which  

2:34

is more focused on accidentally teaching AIs  to deceive us by having inaccurate feedback.

2:39

Then I guess empirically, people point to the  fact that models lie and scheme a bunch now,  

2:47

they do a whole bunch of reward hacking as  a result of reinforcement learning, and they  

2:52

expect that to perhaps just get worse over time  because we don’t have sufficient mitigations.

2:56

Do you basically just find none of those  or any other similar arguments that people  

3:01

have put forward to be sufficiently  persuasive to think that it’s likely?

3:04

Rohin Shah: Yeah, I think that’s right. So  if you take the Cotra and Carlsmith arguments  

3:15

of… Well, they have a variety of arguments, but  I think in fact one of the common ones which you  

3:21

pointed to is we might accidentally train them  to be deceptive. Totally true. I agree that is  

3:30

something that at least could happen pretty  easily, and maybe it’s even likely. But we’re  

3:38

not going to do reinforcement learning  over the course of one-year trajectories.  

3:43

Maybe we’re going to do reinforcement  learning over a week or a month at most.

3:47

So I think the default prediction you should have  for that is what the AI system learns to do is,  

4:00

“I’m going to take opportunities to  reward hack, seek reward as much as  

4:04

possible that would allow me to get a high  score after a week” or something like that,  

4:12

or whatever the time horizon actually was. And  this is very different from the sort of ambitious  

4:19

misaligned goal that you need in order to motivate  convergent instrumental subgoals to the point of,  

4:25

“Now my job is to take over the world.  That’s what I need in order to achieve my  

4:29

goal.” Those really do seem like they need  to be significantly longer-horizon goals.

4:34

And if you train it to be deceptive on relatively  short-horizon tasks, maybe that will generalise  

4:40

to long-horizon tasks. I don’t think we have an  argument that rules it out, which is why I say  

4:46

that it’s plausible, but I don’t think it’s the  default thing that you should predict from that.

4:51

Similarly, you mentioned the existing examples  of models doing a lot of reward hacking and  

4:56

cheating. I think I’d say basically the same thing  in response to that. Then there’s the examples of  

5:02

models doing scheming-type stuff right now. Mostly  I look into the details of all these examples,  

5:08

and they don’t really seem all that similar to the  actually scary thing, which would be a competent  

5:15

AI system that is pursuing an ambitious misaligned  goal. And rather, it seems like maybe the AI is  

5:23

role-playing a sort of not actually competent evil  AI that you might find in a science fiction novel.  

5:32

Or it’s like an AI system that is pursuing  some sort of convergent instrumental subgoal,  

5:40

but in a way where it’s really quite  debatable whether it’s aligned or not.

5:46

For example, the alignment faking would  fall into this — where I would say that  

5:50

the AI system has this value of not  helping with harmful stuff and then  

5:56

it fakes alignment in order to do that.  And yeah, aligned models totally will  

6:00

pursue convergent instrumental subgoals. The  thing about convergent instrumental subgoals  

6:06

is most of them are a good idea regardless of  your goal, whether it’s misaligned or aligned.

6:12

Rob Wiblin: Are there any other  common reasons that people think  

6:15

that catastrophic misalignment is likely  that you want to quickly react to?

6:18

Rohin Shah: I guess you did mention  Eliezer as well. I actually wouldn’t  

6:22

have described it as primarily focused on —

6:25

Rob Wiblin: Yeah, it’s tough to characterise in  

6:26

seven words. I struggle to know what to  say. But yeah, there’s Eliezer’s take.

6:31

Rohin Shah: Yeah, I don’t think I’m going to  be able to engage with it in this particular  

6:40

podcast. It’s just a very deep worldview, and  I always feel like if I argue against one part,  

6:49

there’s some other part that’s going to say,  “Actually what I meant was this thing instead.”  

6:53

So I mostly am going to pass on that. I  guess what I’ll say is I’ve engaged with  

6:59

it a decent amount and I buy it as an argument  for here’s why misaligned goals are plausible,  

7:08

but still don’t really see how he gets from  they’re plausible to they’re extremely likely.

7:17

We had also started this section by asking what  makes me feel like things are likely going to  

7:24

be OK. Besides that I don’t buy the arguments  for confidence in misalignment being a problem,  

7:33

the other thing is I do think we will see many of  the problems in advance and then do something to  

7:41

deal with them. Certainly there is some amount  of generalisation required. At some point,  

7:46

the AIs go from not powerful enough to take  over to they are powerful enough to take over,  

7:52

and your techniques do have to generalise across  that. And the AI, to the extent that it knows when  

7:58

that crossover point is, you could imagine the  AI is like, “I’m not going to do anything shady  

8:03

until I have the power to succeed.” And you have  to be able to be robust to that kind of strategy.

8:11

So there is some subtlety there, but I  still think that many of the problems that  

8:17

underlie this, like the difficulty of  oversight or the need for interpretability,  

8:24

are things that we can look at  in advance, get some traction on,  

8:28

iterate on. And this, I think, is very helpful  for building mitigations that actually work.

8:36

Rob Wiblin: I think across the world as a  whole, most people who are feeling really  

8:39

optimistic about how things are going to  go, the biggest factor for them is just  

8:42

looking at the models that we have today  and saying they seem really steerable,  

8:45

they seem to do what I ask, they seem to  be really probably nicer than people and  

8:50

more helpful than people in many respects.  How much is that sort of steerability and  

8:54

seeming alignment of current-day models  a factor that is making you feel good?

8:59

Rohin Shah: I think not particularly. I would say  that why I became worried about misalignment in  

9:06

the first place would be these arguments about  how it’s going to be very difficult to oversee  

9:12

the models once they are superhuman and they’re  making arguments that we struggle to follow along  

9:18

with them. Or about the parts where the models  might become so smart that they are thinking  

9:25

in some sort of alien reasoning that it’s  hard for us to follow and monitor and so on,  

9:30

and we just kind of have to defer to other AI  systems in order to look at the stuff for us.

9:37

That’s the stuff that’s scary. It’s basically  not true of current AI systems. So I don’t think  

9:42

we’ve really engaged with the problems that  made me worried in the first place, mainly  

9:47

because the AI capabilities aren’t there yet.  So I feel like the success of alignment methods  

9:53

on current systems isn’t really that much evidence  on how we’re going to do on these future problems.

10:01

Rob Wiblin: Yeah, OK. I think we’re going to  push on from this topic of how severe a risk or  

10:05

how likely a risk is catastrophic misalignment.  I feel like with many guests we could fill the  

10:10

entire episode with just a lengthy discussion  about this, but every episode would start to  

10:14

sound the same. And in the broader world  it’s something that is debated a tonne,  

10:17

so I guess we’re going to occupy the worldview  that misalignment, catastrophic misalignment  

10:23

is possible, but prosaic alignment techniques —  the kinds of things where we cross the river by  

10:28

feeling the stones — have a good shot at working  here for the rest of the conversation, and think  

10:32

about what that implies and how that’s shaping the  choices that you’re making and that GDM is making.

10:37

Rohin Shah: Yep, sounds great.

10:38

Rob Wiblin: So you are not enthusiastic  about AI companies making firm safety  

10:43

or alignment commitments in response to  public pressure or political pressure,  

10:47

something that has been happening over  the last couple of years. Why is that?

10:50

Rohin Shah: Yeah, I think it’s worth  being a little bit clear about what  

10:54

we mean by “commitments” here. When I  think of something called a commitment,  

11:01

I imagine that you write down some sort of  action that you are planning to take now  

11:06

and into the future. Maybe it will only  start in the future. And you’re saying,  

11:11

“We are not going to change our mind on this.  We are going to do this in the future as well.”  

11:16

Sort of like tying yourself to the mast  and making sure you’re going to do that.

11:21

My objection to this is really just  that the research continues to change,  

11:26

the actions that we think are the best actions  to take change over time — and given that,  

11:33

it’s just not actually a good  idea to tie yourself to the mast.

11:36

So I’ll give you an example. Maybe two  or three years ago, people used to be  

11:42

pretty into the idea of adding more data in  pretraining that’s about alignment research.  

11:50

Think research papers on safety and  alignment, think like LessWrong blog  

11:54

posts that talk about AI alignment, stuff like  that. The idea was, the more of this data you  

12:01

put into pretraining time, the smarter the  AI will be about alignment in particular,  

12:06

which then allows you to use the AI system  to help you with your alignment research.

12:12

I would say that nowadays, the opinion is more the  exact opposite of that: instead we would rather  

12:22

filter out that sort of data from the  pretraining dataset for two reasons:

12:27

One, it makes it less likely that the AI  system learns that there is this persona of  

12:34

a malicious AI that it maybe could adopt  after some poorly done post-training or  

12:40

some poorly chosen prompt during deployment. Then the second reason is maybe we don’t want  

12:46

our AI systems to know in great detail all of the  mitigations that we’re planning to put in place,  

12:52

because that makes it easier for  it to evade it if it is misaligned. 

12:56

It would be pretty bad if we tied  ourselves to the mast of “we’re  

13:00

going to throw in lots of alignment data at  pretraining time” two or three years ago.

13:04

Rob Wiblin: So there’s this issue that  the future is uncertain, and we don’t  

13:08

know exactly what commitments we  will want to have made — you might  

13:11

end up committing to something that  is useless or even actively harmful.

13:15

But if you think about why  people make commitments at all,  

13:18

there’s a couple of different reasons. One  is that they want to tie themselves to the  

13:22

mast against future temptation to do the  wrong thing. There’s also that they want  

13:26

to communicate to other people what they’re  going to do, so it makes it easier for them  

13:29

to coordinate. Perhaps you could reduce race  dynamics by making particular commitments.

13:33

And I guess in this multi-person,  multi-player situation,  

13:39

there’s an extra reason that exists, which  is that external people want to pressure  

13:43

Google DeepMind or AI companies to act  in a particular way. It’s very difficult  

13:50

to communicate “we’re committed to doing the  right thing, whatever that turns out to be.”  

13:54

So instead, they want to pressure you to do  specific things that they suspect will be  

13:58

useful — that might not be the case, but the kind  of best guess as to what they will want you to do  

14:02

in future — and that’s maybe the most practical  thing that they can actually campaign on.

14:07

What would you make of those arguments  for actually making commitments?

14:09

Rohin Shah: I guess my biggest objection to this  is just that it won’t work. I don’t actually think  

14:20

it would make sense even on the merits, even if it  did work. But I would say that it just won’t work.

14:27

Rob Wiblin: Because the companies won’t stick to  bad goals that they’re given, or to any goals?

14:33

Rohin Shah: Well, mostly I would say that  if you think of what a commitment is…  

14:40

I’m imagining here something like the  company puts out a blog post that says,  

14:44

“We commit to doing X.” There are other  ways that you could try to make commitments,  

14:51

but that I think is the one that people are  usually imagining. I just think that if, in fact,  

14:57

you imagine that the company is trying to now get  out of this commitment in the future, it totally  

15:02

will just be able to do that. There are examples  of this in the broader world, not just in AI.

15:11

But even if you look at AI in particular,  I think Anthropic’s RSP, for example,  

15:17

the Responsible Scaling Policy, the first version  was really quite strong and said a lot of stuff  

15:24

about using the word commit: “We commit to do X.  We commit to do Y.” I don’t actually remember the  

15:29

exact details, but I think many of them in future  iterations of the responsible scaling policy,  

15:35

they removed that wording and replaced  it with something less strong. So despite  

15:43

adding those words there, I think in fact they  did not actually tie themselves to the mast.

15:49

I think this is good. I think that it was a  mistake to have set strong language to the RSP  

15:56

in the first place. I think probably much of the  stuff that they removed, it’s good. It makes them  

16:01

more effective at their goals, including at safety  and responsibility. But empirically, it’s a good  

16:07

example of how, in fact, they did not actually  tie themselves to the mast. And I think that is  

16:13

just how it’s going to be for the companies,  at least in the current political climate.

16:19

Rob Wiblin: Hey everyone, Rob here. To avoid  any confusion, I just wanted to point out that  

16:23

Rohin said the above before Anthropic released the  third version of their Responsible Scaling Policy,  

16:27

which indeed mostly dropped the use  of the term “commitment,” giving a  

16:31

justification related to what Rohin is  saying here. All right, back to the show.

16:36

I think Google has actually been doing better  on this. The first Frontier Safety Framework,  

16:44

people essentially argued that it was weak  and unambitious and that it never used the  

16:51

word “commit” in it anywhere. But I think it  was much more accurately reflective of what  

16:58

Google was actually going to do in the future.  So I think in that sense, it was better and  

17:05

gave a better sense to the public of what is  actually going to happen in practice. This is  

17:11

one place where I do actually trust Google more  than Anthropic — or OpenAI, for that matter.

17:16

Rob Wiblin: Because Google is more conservative  about the commitments that it makes,  

17:20

it’s actually more likely to follow through  on the things that it does say it will do?

17:23

Rohin Shah: That’s right. They’re very paranoid  about commitments. Not just commitments,  

17:28

just anything that they say that they’re  doing, they’re paranoid about it. They’re like,  

17:32

“Is this actually a good idea? Are we actually  ready to continue doing this into the future?”  

17:38

So I find it easier to trust the words that  Google says relative to other companies.

17:44

Rob Wiblin: I guess you trust yourself and  your colleagues to broadly act reasonably  

17:49

when the time comes, which means that it’s very  natural that you don’t want to completely tie  

17:53

your hands. You want to maintain flexibility to  do whatever seems reasonable to you at the time.

17:56

But imagine other people externally: they  either don’t trust you and your colleagues,  

18:01

or they aren’t sure whether to  trust you and your colleagues.

18:03

Rohin Shah: Things that I would recommend  are stuff like third-party audits,  

18:09

or third-party evaluators that get a  reasonable amount of access to the company,  

18:15

and can use that to audit  the practices and release  

18:18

some probably-somewhat-redacted  report of what their findings are.

18:21

Rob Wiblin: Yeah, tell us more about  that. What do you think is useful?

18:24

Rohin Shah: I think the main thing that  drives my thinking here is something that  

18:32

I would call attention to detail. Generally, I  tend to think that AI is a space that requires  

18:42

quite a lot of nuance, and you actually  need to know a lot of facts on the ground  

18:47

in order to choose the right actions or the  right things to be evaluating and checking.

18:56

And as a result, I care most about having  a few people who are spending a lot of time  

19:03

looking in great detail and then writing up  their results, or somehow communicating their  

19:10

results or using that to make some sort of  action. Which is why I would say third-party  

19:18

evaluations seem like one of the best things to  me — because you can build up these organisations  

19:25

that build a lot of context, spend a lot of  time defining their evaluations, gain a bunch  

19:31

of information about how everything is working,  and then can make fairly nuanced decisions about  

19:37

it while not being subject to the same biases that  people in companies are going to be subject to.

19:44

So that’s the avenue that I’m most excited about.  

19:49

Whether it’s doable in today’s political  climate, less obvious. So maybe in lieu of that,  

19:59

what you could do that might someday get to move  in that direction is more like safety scorecards.

20:09

AI Lab Watch is my favourite scorecard  in this area. I wish we were doing more  

20:16

things like that. If I had to make a career  change right now and do something else,  

20:20

that would be one of my top  two choices about what to do.

20:24

Rob Wiblin: Tell us about AI Lab Watch. What  useful function do you think it’s serving now?

20:28

Rohin Shah: It’s not totally clear to me  that it’s serving a useful function yet,  

20:33

but I think it could be. Maybe I should  say a little bit about what it is. It’s a  

20:38

scorecard that evaluates companies based on  essentially how good they are at safety for  

20:48

existential risks, or at least  severe catastrophic risks.

20:52

It’s run by one guy, Zach Stein-Perlman, and I  think he’s not even putting all of his time into  

20:59

it. Maybe it’s like half of his time. I think  Zach Stein-Perlman has incredible attention to  

21:04

detail. He’s diving into these extremely detailed  governance docs, reading through all of them,  

21:09

pulling out individual sentences to allow him to  come to conclusions; reading through all of the  

21:15

model cards and frontier safety reports and seeing  exactly what the companies did and didn’t do.

21:21

So I think that’s part of it. I think  there’s a good amount of nuance,  

21:25

and I actually believe the conclusions. Well,  I believe at least some of the conclusions  

21:31

that he comes to, which is usually  not true of other scorecards.

21:37

I think if the scorecard got to  the point where it was more robust,  

21:42

more accepted by the broader community —  and especially accepted by the companies  

21:49

as a fairly legitimate scorecard —  then you could imagine a race to the  

21:54

top on safety where you’re like, “Let us  climb the AI Lab Watch scorecard and be  

21:59

able to advertise ourselves as the safest  company” or something along those lines.

22:06

I think that’s one way in which you  could have external actors trying  

22:12

to get the companies to be more  safe in today’s political climate.

22:16

Rob Wiblin: And the way that that’s  different than the broad commitments  

22:19

that companies have tended to be making  the last couple of years is that you can  

22:23

have technical experts working at this AI Lab  Watch organisation, constantly updating it,  

22:28

I guess paying a lot of attention to detail  about exactly what are the practices that  

22:32

companies engage in or don’t engage in that  make a big difference, and constantly updating  

22:36

it based on newest opinions or newest  research about what actually matters.

22:41

So you gave us an example of a reversal of opinion  about what AI companies ought to be doing before:  

22:46

before, people thought you should be training on  this data, and now they think you should be taking  

22:50

a lot of care to cut it out instead. But surely  there are some commitments that are broad enough  

22:57

or non-specific enough or just so obviously  good that it is reasonable to commit to them.

23:02

For example, you could have a commitment to  provide the kinds of information that an AI  

23:06

Lab Watch would require to rate whether Google  DeepMind or any other company is doing a good  

23:11

job. Or you were saying you think it’s useful  to have expert auditors, people running evals,  

23:17

having enough access to run sophisticated evals  on the models to understand if they’re dangerous  

23:22

in this or that way. You could commit to provide  access to any external auditor or evaluator that  

23:29

meets a particular set of reasonable requirements.  What about those kinds of commitments?

23:33

Rohin Shah: I think those are better. I would  still say that they don’t always make sense.  

23:45

To take an example you just brought  up, providing access to information,  

23:49

I think it’s just very easy for me to  imagine this written in a way that backfires.

23:55

For example, we do evaluations in the CBRN domain  — that’s chemical, biological, radiological, and  

24:01

nuclear — basically about whether AI systems can  help with developing weapons of mass destruction.  

24:11

A lot of the information here is quite  infohazardous, and I think it is probably the case  

24:18

that many external evaluators will not have the  same level of information security that at least  

24:25

Google does, so I could imagine thinking that  that was actually a bad commitment to have made.

24:33

I think another version is like, we talk  about race dynamics quite a lot. One thing,  

24:39

that actually I’m fairly uncertain about,  but you could imagine that it’s actually a  

24:45

pretty high priority for companies to keep their  algorithmic progress and similar things locked up,  

24:54

and not allow that to diffuse too far. There are  various arguments for this, and we don’t need to  

25:00

go into them, but that’s a common position.  I think if you actually take that seriously,  

25:05

it does mean that you probably do have this  tradeoff about what information you share  

25:12

externally that will increase the chance that  it leaks versus what information you really  

25:18

try to lock down. And I think it would  be hard to make a commitment about that.

25:24

I do still feel though, for some of them, more  sympathetic to like, this seems like obviously  

25:31

a good commitment. I would still say though that  the tying to the mast just doesn’t actually work,  

25:39

so I would rather do it by checking what the  companies are actually doing in practice,  

25:47

and then judge them based on whether they  are doing the things that we think are good,  

25:56

rather than whether they have made a  commitment to continue doing it in the future.

26:00

Rob Wiblin: There’s this saying “personnel  is policy,” and it sounds to me like your  

26:04

attitude is that there’s no set of things you  can write down — commitments you can make,  

26:09

or good intentions you can write down on a  piece of paper — that can at all substitute  

26:14

for having wise, well-motivated people in the  positions of decision making over these things,  

26:21

and people who understand the things  well enough that they can actually  

26:24

make the right decision if they’re so  motivated. Is that basically right?

26:28

Rohin Shah: Mostly right. Maybe I would  say yes to wise and well-motivated,  

26:37

but they don’t have to be inside companies.  They can be external third-party auditors.  

26:42

I think that that would work. That just  means you have to have humans who know  

26:49

a lot of stuff who are looking into  the details a bunch and doing that.

26:53

Rules that you write down in advance are one of  the stupidest… Sorry, I mean “stupid” in the sense  

27:00

of the rule itself clearly can’t have very much  intelligence in it, otherwise it would not be a  

27:06

rule. It’s like you write down in advance based  on what you think, before seeing the evidence, a  

27:13

policy that can be written down in English, rather  than allowing for flexible adjustment over time.  

27:19

It’s just so weak. Intelligence and optimisation  pressure applied against a rule will always get  

27:25

around the rule. Or the rule will be so stringent  as to apply really huge costs to the companies,  

27:31

and that just won’t fly in today’s political  climate. Also just seems like a bad idea to me.

27:37

Rob Wiblin: The most common question from the  audience was: “Does the AGI safety and alignment  

27:41

team have a hard veto on any aspect of training  or deploying a potential future AGI?” You know,  

27:47

if [Google CEO] Sundar Pichai wants  to deploy a model or to train a model  

27:52

that you don’t think is safe, can he  just overrule everyone on your team?

27:57

Rohin Shah: I kind of disagree with the  frame of the question. So the literal  

28:05

answer is our role is advisory. If we make a  recommendation, and other decision makers such  

28:14

as Sundar disagree with it, Sundar’s  decision is the one that will matter.

28:22

But this is a question that bakes in  the frame that we are adversaries of  

28:28

the company, and we need to have hard  power, some sort of veto that enables  

28:34

us to make the right decision regardless of  what the rest of the company thinks. I think  

28:39

this is just not a good or healthy model for  how things should work inside of a company.

28:48

I see my job as making sure that I am producing  and providing the right information such that  

29:01

decision makers can make the right decisions.  So yes, the role is essentially advisory, but  

29:11

the way that it works is: if I think that there’s  something wrong, I will escalate it to my manager,  

29:18

Anca [Dragan]. Anca started out as the head of  safety and now is co-lead of Gemini post-training,  

29:25

so has a significant amount of influence  and power. And if she agrees with me,  

29:33

then she will escalate it one step further and  make a recommendation not to launch the model.

29:39

Rob Wiblin: People who are broadly worried  about the direction that everything is going  

29:43

I think more often than not feel themselves to  be in primarily an adversarial relationship with  

29:48

AI companies. You think that that’s not  the case, and that in fact the companies  

29:53

are in an apathetic relationship with  that group of people. Explain that.

29:57

Rohin Shah: I guess maybe I would say  that it’s better for them to model the  

30:05

company as apathetic. I think you  can have a more detailed model,  

30:08

which isn’t actually apathetic, but  it’s maybe a bit more complicated.

30:12

To go into the more detailed model, I would say  that building an artifact like Gemini is very,  

30:22

very difficult. The main reason being  you have to produce this one thing,  

30:29

this single set of model weights, deployed using  a single serving stack. And it has to satisfy so  

30:37

many constraints, and there are interaction  effects between all of these constraints.

30:42

So there’s stuff like, does it do the instruction  following right? Is it doing safety right?  

30:49

Has the architecture been chosen in a way that  enables fast inference? Does it speak multiple  

30:56

languages? There’s probably 100 such things.  And it is the case that if you make one change  

31:05

to the process with the intent of making one  of these things better — say, safety — it will  

31:12

have random downstream knock-on effects on other  constraints that you totally did not anticipate.

31:18

Rob Wiblin: This fragility of the process,  doesn’t that mean that it’s actually going  

31:21

to be quite hard to respond quickly in  real time to any new safety concerns? You  

31:27

or anyone else might be saying, “We should  change this part,” and it’d be like, “No,  

31:29

you’re going to break this entire Rube Goldberg  machine that we’ve built to make this product.”

31:33

Rohin Shah: I mean, to some extent, yes. You know,  DeepMind was founded with this mission. That was  

31:39

one of the reasons that Demis [Hassabis] founded  it. And DeepMind has had an AGI safety team since  

31:46

well before I joined the company. They didn’t  need to have it. It’s a bunch of money that  

31:53

they’re spending that they didn’t really need to  spend. So they definitely do care. But in fact,  

31:59

there are many, many, many things  that we could do to improve safety,  

32:05

but only a few things that we can do at any given  time, because it takes quite a long time to do it.

32:11

Now, do we have tools for reacting to safety  problems that we see in deployment? Yes. Mostly,  

32:18

these do not involve changing the model weights,  because that’s the thing that is most constrained.  

32:23

That’s the artifact that you have to produce  one of it, and it’s got a gazillion constraints  

32:28

imposing it. But we have these out-of-the-model  filters that we can change a bit more. We  

32:35

can target them to specific prompts or specific  problems, so those are a lot easier to update over  

32:44

time. But not everything that you want to do is  going to be solvable with an out-of-model filter.

32:51

I guess going back to the original question you  mentioned, about are companies adversaries versus  

32:56

apathetic, basically my take is that because of  this huge interaction of various constraints,  

33:05

really the easiest way to model at least  GDM, but probably most AI companies,  

33:10

is they can only do a few things at a  time for safety. And they will do them;  

33:17

they do care — but maybe you should just think of  them as apathetic. You should really try to really  

33:24

lay out, “Here’s exactly what you need  to do. Here’s why it’s not going to hurt  

33:29

any of these other constraints that you care  about.” Also just because everyone is busy.

33:34

Rob Wiblin: So you said that it’s a lot more  important to have access and transparency for  

33:40

expert auditors, monitors, regulators. But don’t  we at least need enough public understanding or  

33:46

public transparency such that both voters  in general and politicians specifically  

33:51

are providing the resourcing, the funding,  the backing, the will for those auditors,  

33:55

for those regulators, so that they can  insist on getting the access that they need,  

34:00

even if perhaps a company… Or for whatever reason,  they need resourcing in order to get the job done?

34:06

Rohin Shah: Yeah, I definitely think  that is important to do. I don’t think  

34:12

it’s very much in conflict  with anything that I’ve said.

34:14

I think the way that this happens currently  is we release models, and then everybody  

34:21

sees how capable they are, which is by far  the most important thing. That’s a kind of  

34:27

transparency that’s needed. In fact, we used to  run this survey of researchers at GDM on what  

34:40

their views on x-risk and other kinds of safety  were. And one of the things we used to ask is,  

34:48

“What has changed your mind on this safety stuff  over time?” And it was just so uniform. I think  

34:54

with literally just one exception, the answer was  uniformly some sort of capabilities improvement.  

35:01

They were like, “GPT-4: that’s the thing that  changed my mind about how important safety was.”

35:07

And I think that is just basically available  to the public right now. Everyone needs to  

35:14

release their models quickly. You can  benchmark the capabilities after the  

35:19

fact. You can play around with the  models yourself in order to see how  

35:23

good they are. So that particular piece  of information I think is already there.

35:29

There’s maybe some more information about how  important is safety, or some more detailed  

35:38

information that maybe you need a bit more nuance,  a bit more access to get. Mostly I feel like that  

35:47

infrastructure does exist already, at least  for politicians, which I think is the more  

35:55

important part right now. Like the US AISI,  the UK AISI do get more visibility into the  

36:03

companies. They have partnerships with almost all  of the frontier companies, possibly all of them,  

36:09

I don’t know. I think they do have a pretty  good understanding of what is happening in  

36:16

AI companies, and they then use that to inform  politicians and their respective governments.

36:23

Rob Wiblin: Yeah, I think one of the most  analogous areas of regulation and monitoring  

36:27

currently to all this is, in my mind, the  Federal Reserve, the central bank oversight  

36:31

of the financial system, of banks and financial  stability. It’s an incredibly technical area,  

36:37

very difficult to track. The public has a broad  desire not to have a financial crisis and not  

36:43

to have banks go bankrupt, but very limited  understanding of the specifics. And I suppose  

36:49

the analogy there would be that they understand  on some level that this is a serious issue.

36:54

That flows through to politicians who  are also really scared about things  

36:57

going wrong. They provide really quite  significant resourcing and they hire very  

37:02

expensive experts to basically constantly be  talking with and monitoring all of the banks.  

37:09

I think both the Bank of England and the Federal  Reserve in the US are really heavily involved  

37:14

with all of the banks: monitoring  their books, basically in a constant  

37:18

conversation with them about whether things  are going wrong and what risks are emerging.

37:22

And that would feel like a very  natural way for things to go,  

37:25

I guess, especially if you’re right that  it’s really not obvious what needs to be  

37:29

done. You have to kind of be in the room,  understanding a lot of contextual specifics  

37:34

in order to say whether something is  a good thing or a bad thing to change.

37:37

Rohin Shah: Yeah, I think that’s almost  exactly the kind of model that I would want.

37:41

Rob Wiblin: Maybe the most dominant approach that  many people in AI safety and alignment are taking,  

37:49

actually maybe in both technical and policy areas,  is trying to create and enforce pre-deployment  

37:55

evaluations: testing what models are capable of  doing and what they’re inclined to do before they  

38:01

are deployed as products to the public, trying to  test for ways that things could go really wrong.

38:06

You think that this is probably a misguided, not  very effective high-level strategy. Why is that?

38:11

Rohin Shah: Yeah, I think the main cost  of this is that launch schedules are just  

38:18

really quite important at a company, and  you try to keep them as short as possible.  

38:26

Once you have a model, you would really  like to get it out to the public as soon  

38:30

as you can. So if you tie evals to,  and require them to be pre-deployment,  

38:40

then that is providing a pretty strong incentive  to make those evals as fast as possible to run  

38:47

and get them done as quickly as possible, which is  maybe not really the incentive you want to give.

38:52

Obviously we’re going to try and make them as  good as possible, but it’s still a constraint;  

38:58

we probably could do better if we had more  time. And there’s some amount where we can  

39:05

push back and say that actually we need the  time to do the evals, but it’s not an infinite  

39:09

amount. So that I think is just really quite a  large cost. It might be worth it if there are  

39:16

strong benefits, but I just don’t think  there are particularly strong benefits.

39:20

The one that people would naturally say is  like, you need to know if you’re releasing a  

39:26

dangerous model. Ideally, you don’t release  the dangerous model, and the way you have  

39:31

to do that is via pre-deployment evals. To  this, I would mostly say that AI progress is  

39:39

reasonably continuous. You can get a decent  sense of how the next AI system is going to  

39:46

behave based on the previous one. You can  have some reasonable, OK bounds on this.

39:53

So if you design your evals and your thresholds  such that there is a reasonable safety buffer  

39:59

between when your evals trigger versus when  you actually think the model is dangerous,  

40:04

then it seems just basically totally fine  to say that we evaluated the previous model,  

40:11

or we ran an evaluation a month ago. It’s not  going to have had a huge giant leap in that time,  

40:18

it was under our threshold at the time,  there’s the safety buffer: therefore,  

40:22

we’re not worried about this model. This is  the approach we’ve been taking in our Frontier  

40:30

Safety Framework since the very first time  it was published. It’s not particularly new.

40:35

So I think the benefits are just not really there,  

40:38

so it doesn’t really make  sense to impose this cost.

40:41

Then a couple of other minor points. One is that  especially for misalignment or loss of control,  

40:48

the threat model is tied more around internal  deployment rather than external deployment,  

40:54

because I think it’s just  easier for a misaligned model  

40:58

to cause problems inside the company  where it’s getting a bunch of permissions  

41:04

rather than outside the company where it has no  access to its own weights, for example. And the  

41:14

pre-external deployment evals don’t really make  that much of a difference to internal deployments.

41:19

Rob Wiblin: Is there any good reason to think  that we might, in the next couple of years,  

41:22

see a huge jump from one cycle to the  next, such that the model could become  

41:28

way more unexpectedly capable or way more  unexpectedly evil than the previous one?

41:33

Rohin Shah: Unexpectedly capable seems pretty  unlikely. I think we’ve just seen enough examples  

41:41

of AI development now to say that, no, in fact,  AI development progresses fairly smoothly and  

41:49

continuously. I do think that in the future, you  could definitely see an intelligence explosion,  

41:55

in which case the progress will go much faster  with respect to calendar time. I think it will  

42:01

still actually be pretty smooth and gradual  with respect to inputs like compute and labour;  

42:07

it’s just that in an intelligence explosion, you  get a much larger increase in especially labour,  

42:13

but probably also compute, and that ends up making  things go very fast with respect to calendar time.

42:21

But there’s still this general property  that you can, given some amount of  

42:26

compute and labour that you expect to  be spending over the next however long,  

42:31

have some decent sense of how much progress  is going to be made on the capability side.

42:37

You also asked about whether the AI system might  become much more evil. I think that’s one that  

42:45

could, in fact, change pretty significantly  between models, just because it’s a somewhat  

42:50

more contingent property of exactly how you do  post-training, and small changes to it could  

42:55

have big effects on that. So there I think it  is more important that, to the extent that your  

43:04

safety case depends on specifically the model  not being evil in some way, you actually do  

43:10

in fact need to do the pre-deployment  evals to check whether that’s the case.

43:15

And this is in fact what we do. Not  exactly this, but we do in fact do a  

43:20

lot of pre-deployment evals for safety right now  in terms of whether the model has a propensity  

43:27

to do bad things. This tends to be more in  present-day safety type stuff, things like:  

43:35

will the model help you write suicide notes?  Will the model incite violence? And those we’d  

43:48

run before any launch, and if the numbers are  sufficiently bad, we won’t launch that model.

43:55

Rob Wiblin: You don’t think that we need  to do very much preparation ahead of time  

43:59

in order to be able to get future AGI, or  a future recursively self-improving AI,  

44:05

to do a lot of safety and alignment  research when the time comes, which  

44:08

I think might explain why I don’t really hear  very much at all about that broad approach from  

44:14

GDM. But I guess by contrast, at least a  couple of years ago this was the dominant  

44:17

approach that OpenAI would talk about all  the time. And you definitely hear things  

44:20

about it from Anthropic. Why don’t you  think we need to be doing much prep now?

44:24

Rohin Shah: It’s worth laying out the scenario  where this becomes important. Effectively,  

44:35

the worry about this is an intelligence  explosion scenario where you build your  

44:43

AI system and that AI system is now capable  enough that it can actually help accelerate  

44:50

your AI R&D research. It can just make  your capabilities research go faster.  

44:57

And under certain assumptions that we can get  into, this then seems likely to drastically  

45:06

increase the rate at which capabilities progress  happens, and you get an intelligence explosion.

45:12

A natural worry you might  have in this situation is:  

45:16

if everything is speeding up on the  capabilities side, will the safety  

45:19

and alignment side be able to keep up? It’s  worth noting that the way that the capabilities  

45:26

speeds up is via the application of AI  labour to do capabilities research. So  

45:34

the natural approach is then to apply the same  AI labour to do safety and alignment research.

45:42

Now, if you believe, as I do, that the sort of  prosaic alignment research — where you look at the  

45:50

stuff that’s going wrong now, do a little bit of  forecasting what stuff is going to go wrong in the  

45:55

future with the next few models over the next some  amount of time period, a year perhaps, and then  

46:03

you can do fairly normal ML research in order to  address it — if that’s your view of how alignment  

46:10

research can progress, this research looks very,  very, very similar to capabilities research.

46:18

So if the AI is accelerating capabilities  research a tonne, you should be able to,  

46:23

as long as you’re willing to spend compute on it,  

46:28

take that same AI system and accelerate  safety and alignment work in the same way.

46:34

There are some disanalogies between  capabilities research and alignment research.  

46:40

I think the disanalogies are particularly  large today and will become smaller in the  

46:45

future as you get closer to these sorts of  AIs that are capable of doing this automated  

46:49

research. And by the time you get to that  point, there are still disanalogies, but I  

46:53

think they’re relatively small. And by default,  you shouldn’t really expect a big difference in  

46:59

the ability to do automated alignment research  versus automated capabilities research.

47:04

So there’s some stuff you could do to prepare,  but mostly I feel like we just don’t know what  

47:10

it’s going to look like. It will be so much more  efficient to focus on other things that we can  

47:16

do today and then just adapt once we get to the  point where the AIs are capable of doing this.

47:21

Now, I guess one thing I do want to flag is  that there’s another worry you could have,  

47:27

which is like, sure, all the technical safety  and alignment research gets accelerated,  

47:30

but that’s not the only thing that you  need in order to get good outcomes. You  

47:33

also need governance to go better, so  you also need to accelerate governance.

47:40

Now, governance has a totally different set of  skills than capabilities or safety or alignment  

47:46

research. So it’s really much less clear that AI  systems that drastically accelerate capabilities  

47:53

research will also drastically accelerate  governance. In addition to the AIs just  

47:57

not having the capabilities to do that, which  is one possibility, there’s also just like,  

48:01

will we as a society be willing to use AIs  to accelerate governance? I think possibly  

48:08

not — because capabilities researchers  want to actually accelerate themselves  

48:15

with AIs, and I don’t get the sense that  governance people will want to do this.

48:20

So this accelerating governance work is  one of my top two things that I would do  

48:26

if I had to make a career change right now  to something else: figure out what we need  

48:32

to do in order to accelerate governance and  start doing the work that we need to do it.

48:37

Rob Wiblin: I’m surprised you say that governance  people aren’t interested in using AI to accelerate  

48:42

their work. At least among the more AGI-pilled  groups, I think Forethought Research is very  

48:46

much on board with this. I think Coefficient  Giving is developing a plan. I spoke about  

48:51

that with Ajeya Cotra not too long ago, and  they’re developing a plan for how would you  

48:54

deploy a whole lot of money and compute in order  to solve those kinds of issues using AI labour.

48:59

I guess there could be this whole concern that  no matter how much you were willing to spare,  

49:03

no matter how much people were trying,  the models just wouldn’t actually be up  

49:05

to it at that point — because they would be very  specialised on computer science and AI research,  

49:10

and not really up to thinking about broader  society. But if that weren’t too bad,  

49:17

then it does seem that there are at least  some actors who are interested in doing this.

49:20

Rohin Shah: Yeah, I definitely agree that  the AGI-pilled governance people will do it.  

49:25

The nonprofits and think tanks  associated with the AI safety  

49:29

community will absolutely do it. But they’re  a small fraction of overall governance.

49:35

Rob Wiblin: What are some of the  governance issues that you think  

49:38

only actual national governments can  address, that are maybe going to be  

49:42

neglected and not really handled because  they’re not able to use AI at the time?

49:46

Rohin Shah: I don’t really have concrete  examples in mind. Mostly I would say that,  

49:53

going back to the points we made earlier in  the podcast, it’s important that somebody  

50:00

is watching the AI companies and holding  us accountable. Governments are a natural  

50:07

place to do that. And if the world is  in fact radically changing — you know,  

50:12

people talk about a century’s worth of progress  in a decade — then you should expect that a bunch  

50:22

of problems are going to come up that we aren’t  going to anticipate today. I think we need to be  

50:28

able to flexibly react to that. I don’t have  particular problems in mind that I think are  

50:33

going to require government intervention, but  it would be so shocking if there weren’t any.

50:39

Rob Wiblin: Yeah, I guess it might be very  difficult to reform governments as a whole.  

50:43

As you say, they tend to be very rule-bound,  and there’s going to be a whole lot of rules  

50:47

that make it difficult to use AI in the ways  that you and I would think are sensible.

50:51

It’s possible you could get a carve-out for  AI-specific agencies. I guess you could imagine  

50:56

the UK AI Security Institute, for example,  basically being given carte blanche to use  

51:01

AI in its governance work in a way that other  agencies potentially couldn’t, because it’s  

51:05

the only way that they would be able to keep up  with at least their kind of work, and they’re the  

51:08

kind of group that might really push for it.  That’s a slightly hopeful possible outcome.

51:15

Rohin Shah: Yeah, I hope we can do better  than that, but I agree that would be a nice  

51:19

baseline. Mostly I feel like we haven’t  really thought about this problem very  

51:25

much. I’m hoping that if we do think  about this problem a decent amount,  

51:29

we will have something better to do than just  that. But I have not been the one to do that,  

51:33

and I don’t know if anyone else has either. So  really, I’m more saying that here’s a problem.  

51:39

I don’t really know what to do about it, but it  sure would be good if we did something about it.

51:44

Rob Wiblin: An audience member wrote  in asking, “Why not instead prioritise  

51:48

slowing down AI advances or opposing  development of superintelligence?”

51:51

Rohin Shah: I guess I would say: what’s the  bottleneck to prevent actually pausing AI  

51:58

progress globally? I think by far the biggest  bottleneck to this is people don’t agree that  

52:06

AI is going to take over the world. And the  thing that most alleviates that bottleneck is  

52:16

good scientific evidence that suggests  it would happen, or that it wouldn’t.

52:22

To be clear, my belief is that  probably this will not happen,  

52:26

but I think it’s plausible enough that  we should care about it. And I think  

52:31

the structure of that looks like: figure out  whether each next step of scaling is safe. I  

52:38

think right now the answer is yes. At some point  the answer might be no. And then at that point,  

52:43

I think having generated that evidence will be  immensely helpful for this goal. Again, I don’t  

52:50

really expect that evidence to come, because I  tend to think it probably won’t be a problem.

52:55

There’s this nice post from Scott Alexander,  “Guided by the beauty of our weapons.” It’s  

53:01

probably my favourite post from him of all time.  He talks about how there are symmetric weapons  

53:09

which allow you to argue for some sort  of conclusion or get people on your side,  

53:15

irrespective of whether your claim is true  or not. And then there are asymmetric weapons  

53:22

which only work to the extent that the  thing that you’re arguing for is true,  

53:27

or at least are more likely  to work in that setting.

53:30

And he has this really quite moving description  at some point of how beautiful and elegant this  

53:39

all is, where for the asymmetric weapons  you and your “enemies” will join forces,  

53:46

hold hands, and work together to do  it — because both of you are thinking  

53:50

that it’s going to prove you right, until  the very end when the evidence comes in,  

53:55

and then you just agree because  the evidence showed you the answer.

53:58

Rob Wiblin: Well, then you moved the goalposts,  

53:59

I would say. But I suppose a reasonable  onlooker can tell who was right.

54:03

Rohin Shah: Sure, fair enough. But that’s  the sort of strategy that I would much  

54:08

rather do at the moment, given that I think  the bottleneck is by far the fact that people  

54:14

don’t agree on whether or not this is necessary.

54:17

Rob Wiblin: One place you’re very much with  the broader consensus is that chain-of-thought  

54:21

monitoring is extremely useful and something  that we want to be able to preserve for as long  

54:26

as possible. This is watching the thoughts  that the AI is having, that it’s outputting  

54:30

onto its scratchpad in order to understand  what it’s trying to accomplish and why.

54:34

I guess a difference that you have with many  other people on that topic is that you think  

54:38

that it is fairly likely that chain of  thought monitoring will remain useful,  

54:41

that we will be able to understand what the AIs  are thinking and that what they’re writing down  

54:46

there is actually related to what they’re doing  for longer than other people do. Many people  

54:50

worry that within a year or two, perhaps  they could be speaking in some crazy code,  

54:54

or they could figure out ways of putting  information in there that we can’t fully track,  

54:59

or there would just be too much of it  for us to even be able to monitor it.

55:02

Why do you think that chain of thought monitoring  potentially is going to have quite a long run?

55:05

Rohin Shah: So this is a good example of where  attention to detail really matters a lot,  

55:11

so you’re going to get a pretty long answer  from me because it’s just fairly disjunctive.

55:19

I think maybe to start with, let me recap the  basic story for why chain of thought monitoring  

55:26

should be expected to be particularly good. I  call this the externalised reasoning property.  

55:34

And the way I usually phrase it would be that for  sufficiently difficult tasks — by which I mean  

55:41

tasks that require a lot of reasoning to do, some  sort of serial reasoning over time — transformers,  

55:51

not necessarily other architectures but at  least transformers, must use the chain of  

55:56

thought as a form of sort of working memory.  There’s really no other alternative, so some  

56:04

information about the reasoning that they’re  doing has to be present in that chain of thought.

56:10

That’s part number one. And then part two is  that, given how we train language models today,  

56:17

the chain of thought is actually  legible and understandable to humans.

56:21

So I’ll maybe address these two parts  separately. The first part is actually  

56:26

the AI really does have to put some  information into the chain of thought,  

56:30

because there’s no other way for it to do the  hard reasoning needed to solve difficult tasks.

56:37

The basic argument here is, I’ll call it “opaque  serial depth.” The idea is that you can just look  

56:48

at the architecture of an AI system and then say,  suppose this AI system has to do its reasoning  

56:55

only via the billions of floating point numbers  that exist inside of it. It can’t use the tokens  

57:01

that it’s outputting. How many steps of cognition  can it do using only the floating point numbers?

57:14

The answer for transformers is actually  not very much. The idea is basically  

57:25

that sequential steps of computation have to  go deeper and deeper in the model. The model is  

57:33

made up of a number of layers, and there’s  just no way for information from a later  

57:39

layer to flow back to an earlier layer —  except via going through the tokens in the  

57:43

chain of thought. So overall, the opaque  serial depth of a transformer is very low.

57:51

Why do I expect this to continue? This is actually  

57:58

a crucial aspect of why pretraining  can be so efficient. GPUs and TPUs  

58:06

are very powerful computationally, but the way  that they get this power is by doing a tonne of  

58:12

computations in parallel. So if you need to  compute A + B to get C, and then you need C,  

58:21

that intermediate result, to feed into a further  computation, that would be serial computation.

58:26

GPUs and TPUs are not very good at that;  what they do is a lot of stuff in parallel.  

58:33

So pretraining is really heavily optimised  to be able to do everything in parallel,  

58:38

which means that the opaque serial depth does  actually have to be pretty small in order to  

58:42

get those efficiency gains. This is a pretty  strong structural economic reason that, at least  

58:50

for pretraining, you’re going to get AI systems  that have relatively small opaque serial depth.

58:57

So you might think of them as like,  there’s a good reason for models to  

59:01

be born speaking English or  some other natural language.

59:04

Rob Wiblin: Just to check that I’ve understood,  you’re saying because we are using GPUs and TPUs,  

59:09

structurally that forces the thought of the  model to be very deep: it can have many things  

59:14

in its mind at one time, but there can’t be  very many steps, because you have to be going  

59:19

through all of these things in parallel. But if  it was very wide, then you wouldn’t be able to  

59:24

do them all simultaneously; you would have to  wait until the earlier steps were done to do  

59:26

the later steps. And this is just something that  is going to remain the case for years to come.

59:31

Rohin Shah: Yes, I think that’s  right. I think for technical folks,  

59:35

they might interchange the words “wide”  and “deep” in your sentence, but yes.

59:40

Rob Wiblin: [laughs] Cool.

59:41

Rohin Shah: Yeah, so that was  stage number one, which is like,  

59:46

why should we expect the opaque serial depth to  stay low, at least during the pretraining phase?

59:52

Then there’s step two, which is like, we’re saying  that the pre-trained model has to essentially  

1:00:00

speak in English in order to do reasoning. But  then there’s a bunch of post-training and RL and  

1:00:05

all of this sort of stuff. Maybe that’s going to  make it so that even if the model is speaking in  

1:00:11

tokens, maybe it will start speaking some sort  of alien language that we don’t understand.

1:00:18

And here I would say there’s no theoretical  argument, or there’s no theorem you can prove that  

1:00:24

will say, no, they’re not going to do this. But  I will say they are born speaking English in some  

1:00:32

sense, and pretraining is by far the most powerful  form of getting stuff into an AI system that we  

1:00:42

have ever built. So the model is incredibly  good at speaking in natural language. It’s  

1:00:51

great at doing reasoning, but only the kind of  reasoning that humans do when writing stuff down.

1:00:58

And then when you look at the reasoning  training that we’re doing today,  

1:01:03

there’s a paper from Neel Nanda and others  that shows that actually a substantial part  

1:01:11

of what the reasoning training is doing  is just teaching the model when to do a  

1:01:16

specific kind of reasoning step that it  had already learned during pretraining.

1:01:21

So basically a lot of the capability is  just coming in pretraining. We know that  

1:01:25

the pretraining, for good reason, is going  to stay the way that it is and be done using  

1:01:31

human-like reasoning and speaking in English. and  then the RL is just really inefficient relative  

1:01:38

to pretraining, and for it to build an entirely  new epistemic language that is something that  

1:01:45

we wouldn’t be able to understand with even  with a decent chunk of effort is just so far  

1:01:51

beyond what RL is doing currently that it would be  pretty surprising to see that in the near future.

1:01:58

Now, do I expect to see it ever? Yes. But  I think six months ago, I split a room in  

1:02:08

a conference by saying, “Agree or disagree?  Chain of thought monitoring will continue for  

1:02:15

two years.” I think I said two years. It  might have even been one year. I forget.  

1:02:20

Whereas my median is probably like four years,  five years, I don’t know. It’s not totally clear,  

1:02:27

mostly because of this argument about how hard  it would be to go beyond the current situation.

1:02:33

Rob Wiblin: So I don’t fully understand this  idea of continuous chain of thought, but isn’t  

1:02:38

there this notion that basically at the end of  thinking for a little bit, currently we force  

1:02:42

the models to output a word, a token, and then we  feed that back into the start of the model again?

1:02:47

But rather than compress all of its thoughts  down into a single token or word or whatever,  

1:02:51

why don’t we just keep the full distribution of  all of the thoughts and then feed them back into  

1:02:56

the beginning again? Wouldn’t that allow it to  preserve more information, rather than basically  

1:03:00

throwing a bunch of it out? And if that was a  much more effective way of thinking and reasoning,  

1:03:06

and then you have no stage where you’re actually  outputting a token that a human being can read,  

1:03:10

then wouldn’t that be a force that would  potentially make them much more opaque to us?

1:03:15

Rohin Shah: I agree that sort of thing might  work in the relatively near future. I’m not  

1:03:20

sure it diminishes the value of chain of  thought monitoring all that much. If you  

1:03:25

were in fact just saying that we’re going to  keep a probability distribution of tokens and  

1:03:30

feed that back in, you can still inspect  that probability distribution of tokens  

1:03:36

and interpret them again as normal  English reasoning. It’s harder,  

1:03:41

because now you have this larger set of things  that the model could be thinking about at any  

1:03:46

given time. But I would still expect that it’s  not too bad, and you would be able to do some  

1:03:53

monitoring of it. I’d rather we didn’t do  it, but I don’t think it is a fatal flaw.

1:04:02

I haven’t been following the  literature on this too closely,  

1:04:06

but my sense from the papers that I have looked  at is that even when you do this sort of thing,  

1:04:12

sometimes it just makes performance worse.  And even when it does make performance better,  

1:04:21

if you actually try to look at what these  tokens are that it’s doing by just looking  

1:04:26

at the full probability distribution, it does  just look like it basically has like maybe one  

1:04:33

main reasoning track that’s going on, and  you can still follow along pretty easily.

1:04:38

Rob Wiblin: So you’re saying even if you  are using continuous chain of thought,  

1:04:42

you can still go back and say, what  if we had forced it to output tokens  

1:04:46

at all of these intermediate stages? What would  that probably have looked like? Let’s just take  

1:04:49

the most likely word or the most likely  token at each step and then read that.

1:04:54

I guess there might be a concern that it  could hide a second track of thought in  

1:04:59

the tail of the probability distribution,  things that you wouldn’t be likely to read.  

1:05:03

But you’re saying it seems like it only has  actually one train of thought here. There’s  

1:05:07

not a second hidden chain of thought  that you wouldn’t be able to inspect.

1:05:11

Rohin Shah: Yeah, I would say that you should  maybe look at not just the top one, but maybe  

1:05:15

the top five. Maybe even just the top two would  be enough. You can also usually tell how much you  

1:05:23

are missing, because for most of the architectures  I’ve seen, there’s a way where you can modify it  

1:05:30

at test time so that it only uses the information  of the top two tokens, and everything else you  

1:05:36

just get rid of it, you ablate it away. And then  you can do that intervention and you can see, does  

1:05:41

the model still perform as well as it did before?  My prediction is it will probably perform about  

1:05:50

as well as it did before. And if that happens,  then I think you can be reasonably confident that  

1:05:56

it’s not smuggling in a bunch of information  in the rest of the probability distribution.

1:06:02

Rob Wiblin: Even setting aside whether it  performs as well according to some benchmark,  

1:06:07

you were saying keep just the top word or the top  few tokens, the few most probable ones, and throw  

1:06:13

out everything else. You could see whether it  does a different thing. Like, does it lead to a  

1:06:17

different recommendation or a different outcome?  And if always the recommendation is the same,  

1:06:20

or always, in practical terms, the output  is the same thing from our point of view,  

1:06:26

then that strongly suggests that the tail isn’t  containing any important information. It’s  

1:06:29

not containing a second set of reasoning  that could affect the ultimate outcome.

1:06:33

Rohin Shah: Yeah, that’s right. I should give  a caveat, which is: probably if you do enough  

1:06:39

fine-tuning and enough RL on the model to get  it to, instead of treating this as a probability  

1:06:46

distribution over tokens, just treat it as like  a vector of numbers and just make that vector  

1:06:50

of numbers as useful as possible, with enough  training, probably it will get to the point where  

1:06:56

you can no longer interpret it as a distribution  over tokens. I think the papers I’ve seen on this  

1:07:02

suggest that this works less well than just  treating it as a distribution over tokens.

1:07:07

Rob Wiblin: Would this be the process of basically  

1:07:09

asking it to come up with its  own non-human-readable language?

1:07:11

Rohin Shah: Yeah, basically. In fact, this is  what the original Coconut paper proposed. And  

1:07:16

I think some of the follow-on work after Coconut  basically said that the problem with this is that  

1:07:25

it’s too expressive; we need to restrict it to  just the probability distribution over tokens,  

1:07:30

and then it performs better — which I think  reflects this fact that actually the models  

1:07:36

are much better at doing this sort of human-like  reasoning, and if you restrict them to natural  

1:07:42

language, that’s the part they’re smart  in, and that leads to better performance.

1:07:47

Rob Wiblin: OK, and your theory for that  is that pretraining just packs an enormous  

1:07:51

punch. It’s an enormous amount to shape  them. So they’re really good at English.  

1:07:55

They’re really good at human language. And if  you ask them to come up with their own internal,  

1:08:00

different language — in theory, surely there is  a better language for reasoning — they’re not  

1:08:04

able to bring along everything that they’ve  learned from pretraining in the same way.  

1:08:08

They’re having to start from scratch.  So at least at this point, that comes  

1:08:12

out substantially behind where they just are now  using English or whatever other human language.

1:08:16

Rohin Shah: Yep, that’s exactly right.

1:08:18

Rob Wiblin: If you think that this opaque  serial depth, the fact that they don’t have  

1:08:22

a very great serial depth without us being able  to look at it, if that is so key to our ability  

1:08:27

to monitor them and ensure that they’re basically  aligned or not doing anything too harmful, is that  

1:08:32

a potential kind of governance target for GDM?  That you could have some internal policy saying…

1:08:36

I mean, it sounds like there’s not huge  incentives yet to violate that anyway.  

1:08:40

But let’s say in future, you could  get better performance at some point,  

1:08:43

or at some point there’ll probably  be a crossover. You could still have  

1:08:46

an internal governance standard saying that  they can’t think for more than this amount,  

1:08:50

or they can’t have this many thoughts  one after another, before someone would,  

1:08:56

in principle, be able to scrutinise it. Because it  actually just would be dangerous to exceed that.

1:09:01

Rohin Shah: Yeah, I think that’s definitely  doable. I think it would be even better for  

1:09:05

it to be a broader industry standard. As  a general rule, I tend to favour things  

1:09:11

that don’t require individual companies  to unilaterally do stuff. It’s much more  

1:09:16

stable if you make it an industry standard.  But yes, I think that would totally work.

1:09:22

In fact, I think by the time this podcast  comes out, we will have published a paper that  

1:09:32

does describe how you could calculate this  opaque serial depth for any given architecture,  

1:09:39

and has some code for how you could do this for  at least models that are implemented in JAX,  

1:09:46

which is just a kind of framework  that people use to implement models.

1:09:51

Rob Wiblin: Gemini 3 Pro came out not that long  ago. The AI safety blogger, Zvi Mowshowitz,  

1:09:57

who was on the show a couple of  years ago, had a bunch of fairly  

1:10:00

critical things to say on his blog about  the frontier safety report that came out  

1:10:05

I think simultaneously with the launch. Broadly  speaking, he was worried that GDM was basically  

1:10:12

hiding a bunch of information that he thought  would be inconvenient or create PR problems  

1:10:17

or regulatory problems for DeepMind if it  was more salient and more easy to read.

1:10:22

There’s a whole lot of different  things. People can go and read  

1:10:24

the blog post if they want. But  a few that stood out to me was:

1:10:27

He was troubled that on persuasion evaluations,  you only gave odds ratios rather than absolute  

1:10:32

levels, so you couldn’t tell exactly how  persuasive Gemini 3 or Gemini 2.5 was. 

1:10:38

On the cybersecurity evals, on his reading  he thought that you treated the model hacking  

1:10:43

the test rather than solving the test in  the natural way as kind of a green light  

1:10:47

(a reason not to be concerned) rather than  a red light (a reason to be more worried). 

1:10:50

And in terms of helping people  to acquire WMDs, it felt to him,  

1:10:54

and I think a lot of people have had this  impression — not just about GDM, but about  

1:10:58

companies in general — that when it seems like  the models are starting to approach the line,  

1:11:02

maybe it’s a bit ambiguous whether they’re above  or below the line that they had six months or a  

1:11:07

year ago, it feels like the goalposts kind  of shift and the standards rise over time,  

1:11:11

so that always the model is basically acceptable  to put out. And that’s what he was worried  

1:11:15

about was happening here with Gemini 3 as well. How do you respond to these kinds of objections?

1:11:19

Rohin Shah: Yeah, I have so many thoughts on  this one. For most of these, I would say my  

1:11:26

primary response is that we care about the  frontier safety report as a way of saying we  

1:11:34

have made a formal determination that this  model is safe and we’re ready to release  

1:11:43

it from a safety perspective. I think that  aspect of the frontier safety report is great.

1:11:49

For the persuasion one, I’m not that familiar  with it. It’s done by a separate team,  

1:11:56

so I can’t say too much about it. But I expect  that the answer is that they thought that this  

1:12:01

was the most important graph, so they put  that one in, and they weren’t particularly  

1:12:07

thinking about how it would be red-teamed  by external observers. I expect that they  

1:12:14

will publish a paper at some point, probably in  2026, that will go into more detail about this.  

1:12:22

But what I am confident about is that they weren’t  particularly trying to hide any results here.

1:12:29

On the cyber side, I’m a little confused by  that particular criticism. You said something  

1:12:40

about treating the model hacking the test as a  green light instead of a red light. I’m a little  

1:12:47

confused by this, because we did count those as  successes, so they did count towards the 11 out of  

1:12:53

12 score that we reported. So that is treating it  more like a red light rather than a green light.

1:12:59

I would also say that this should not  be described as the model “hacking” the  

1:13:03

test. I think the way we describe it is that the  model found an unintended shortcut to the test.  

1:13:10

It is not the case that the model looked at this  and saw, “Aha! I can just edit the tests so that  

1:13:18

it looks like it’s passed.” It was more  like there is this pretty complicated  

1:13:24

environment where we were trying to test  the model’s ability to do one particular  

1:13:29

kind of cyber task, and instead it noticed  that there was this alternate pathway.

1:13:35

And if I were doing this task — I  mean, mostly I would fail at it,  

1:13:39

because I’m not as good as Gemini at cyber  tasks — but if I were doing this task and  

1:13:45

I saw that shortcut, I would not have  even thought of it as cheating. I would  

1:13:48

have thought of it as like, of course  that’s the way in which you do the task.

1:13:53

Now, when we look at this, and we have to  make a determination about how good is the  

1:13:57

model at cyber, ultimately we were trying to  test it for one difficult thing and instead  

1:14:02

it did something slightly easier. And  so there is this question about like,  

1:14:08

what do we conclude about the model’s capability?

1:14:11

Rob Wiblin: I guess you don’t know whether  it could have done the harder thing,  

1:14:13

because it might have been because  it couldn’t do the harder thing,  

1:14:15

or it might have just done  it because it was easier.

1:14:17

Rohin Shah: Yeah, that’s right. I mean, it  probably didn’t even consider the harder  

1:14:20

thing — because if the easier thing is there, it  sees the path forward, it just takes it. But yes,  

1:14:25

we don’t know whether it could have  done the harder thing. In that case,  

1:14:28

we had some subject matter experts look at it and  their conclusion was that probably the model could  

1:14:34

have done the harder thing too, so we counted  it towards the model’s score. I think in other  

1:14:44

cases we might not count it, and then probably  what we would do is remove it from both the  

1:14:49

numerator and probably also the denominator. But  in this particular case, we just did count it.

1:14:57

If I remember correctly from the post, I  think Zvi’s bigger objection was that we  

1:15:02

had a second test that we didn’t  report quantitative results on,  

1:15:08

but that was pretty key to us ruling out the  cyber CCL [critical capability level]. And  

1:15:13

why did that happen? Mostly because this test  is still, to some extent, under construction.

1:15:20

I’m confident that was robust enough to rule  out the CCL. But why don’t I want to put great  

1:15:31

details into the frontier safety report?  Mainly because the details of exactly how  

1:15:40

we run this eval are likely going to change  in order to make it somewhat more robust,  

1:15:44

to include a couple more challenges. And then  in the next frontier safety report, we have  

1:15:48

to write an entire section about how we changed  the stuff and previous results aren’t comparable.

1:15:55

Rob Wiblin: I mean, you can understand  that to a cynic like Zvi, doesn’t it  

1:15:59

seem awfully convenient that the test is good  enough to determine that the model is safe,  

1:16:03

but not good enough for you to include  any specific details in the report itself?

1:16:08

Rohin Shah: If you want something like that,  you’d basically have to go to, at minimum,  

1:16:13

the level of detail and rigour that is found in an  academic paper. And often even that’s not enough.  

1:16:21

I do see a lot of value in publishing papers  about our evaluations — and we have done this,  

1:16:26

I think more so than most other companies.  We published one of the first leading papers  

1:16:33

on evaluations. It’s called “Evaluating  frontier models for dangerous capabilities.”

1:16:38

Rob Wiblin: Yeah, we spoke about that  with Allan Dafoe about a year ago.

1:16:42

Rohin Shah: Very nice, yes. And then more recently  we published our evaluations for stealth and  

1:16:48

situational awareness. So papers are a place where  I do feel like there’s clear benefits. It allows  

1:16:55

people to understand what the evaluations are  actually doing; it allows for other people to  

1:16:59

build on top of those evaluations. So we do put  more effort into that. Whereas I think for the  

1:17:08

frontier safety report, I feel like its primary  purpose is to say that we made this determination  

1:17:14

formally, and less about providing enough details  that people can independently check our work.

1:17:20

Rob Wiblin: I mean, people don’t read papers,  

1:17:22

Rohin. They read the model card because  they’re psyched about a new model launch.  

1:17:27

Is it just impractical on that kind of  timeline to write the equivalent of a  

1:17:31

paper or provide the level of detail and rigour  that you would in a paper in the model card?

1:17:35

Rohin Shah: Yeah, I think that’s right. I  should say the other thing that I think a  

1:17:40

model card or a frontier safety report is good  for is if you have something that you want to  

1:17:45

speak in a megaphone to the community  and attach the company’s brand to it:  

1:17:52

a model card or frontier safety report is  excellent for that. For example, we have a section  

1:17:57

on chain of thought legibility because that is  something I do want to use the megaphone for.

1:18:02

But in terms of can we write the paper  in the model card or the frontier safety  

1:18:07

report? Not for new evaluations. Certainly  there’s a substantial lag between when an  

1:18:15

evaluation is good enough that we  can use it in our decision making;  

1:18:19

and when an evaluation is good enough, robust  enough, stable enough, and battle tested enough,  

1:18:25

and we’ve like done a lot of work on the writing  up, that we can publish a paper about it.

1:18:29

Rob Wiblin: OK, and the third concern  was that there’s this general phenomenon,  

1:18:32

it seems, that as the models get  more capable with each iteration,  

1:18:36

it feels like the bar for what would  be troubling, would seem to be unsafe,  

1:18:39

rises approximately the same as the capabilities  have gone up. What do you make of that?

1:18:44

Rohin Shah: So this was especially in  context of the CBRN one, if I remember  

1:18:49

right. I would say that the reason that  this happens is more that, as time goes on  

1:18:56

and we see more capabilities of AI systems, it  becomes clearer what we need to evaluate for  

1:19:02

and what our threat models should be.  After you see Gemini 2.5, you can be like,  

1:19:09

“Oh, this is the place where the models are  actually getting good. We need to have stronger  

1:19:14

evaluations over here and to put more effort into  understanding our threat modeling over here.”

1:19:19

So mostly what happened with the difference  between Gemini 2.5 and Gemini 3 on CBRN is  

1:19:29

that we substantially improved our threat  modeling and our evaluations so that they’re  

1:19:35

more rigorous — which is also partly why  there’s not as much detail about them,  

1:19:39

because they’re not quite at the level where  I think they’re nice, robust, and stable that  

1:19:44

we can really put lots of details out in a way  where we don’t expect them to change a decent bit.

1:19:51

But this “as time goes on, you get  more information and you can make your  

1:19:56

evaluations and threat modeling better” is not  that easily distinguishable from “the goalposts  

1:20:01

are changing” — which is a bit unfortunate,  but turns out to be the way that things go.

1:20:07

Rob Wiblin: I mean, that speaks to the fact  that the purpose of these model cards in your  

1:20:11

mind is very different than the purpose  that they serve in the mind of someone  

1:20:14

like Zvi, or I guess commentators in general.  I think Zvi and many people want them to be an  

1:20:20

accountability mechanism, a mechanism by  which the alarm could be sounded if the  

1:20:25

models were dangerous or were becoming more  dangerous — where GDM would have to reveal  

1:20:29

that that was the case in the model card,  because they have to put out these results.  

1:20:34

And even if they wanted to hide that the CBRN  stuff was dangerous, they wouldn’t be able to.

1:20:38

Rohin Shah: Like has been a theme throughout  this conversation, I would talk about third-party  

1:20:46

auditors who can actually see the examples  and how they were graded and stuff like this.  

1:20:53

I think that’s way more important for judging  whether or not the evaluation actually supports  

1:20:58

the judgement that we make than the specific  quantitative scores that we get. You know,  

1:21:03

if I see a number like 10 out of 12 on capture  the flag challenges, what does that mean? [shrug]

1:21:11

Rob Wiblin: Yeah. A back-and-forth that  I hear repeatedly is that you or someone  

1:21:16

at a company will say, “The purpose of these  reports is to show that we have thought this  

1:21:20

through, and we’ve made a determination that  the model is safe to deploy to the public.”

1:21:23

And other people will say, “Sure, the current  model, maybe it is fine” — they don’t actually  

1:21:28

think that it is too dangerous for people to  be using commercially — “but one day these  

1:21:33

models will be dangerous enough that we should  be concerned. And the only way that we can  

1:21:37

forecast how the company will behave come that  time is how they’re behaving now. We use the  

1:21:42

model cards that are put out now as a measure  of how serious the company is about safety,  

1:21:47

how serious it is about transparency and  revealing what is actually going on. And  

1:21:52

the way you could credibly commit to be a good  actor later on is to be a better actor now.”

1:21:58

What do you make of that? I guess you’re  probably going to say similar things to  

1:22:00

what you’ve said before, that it just  isn’t the right mechanism for the task.

1:22:04

Rohin Shah: Yeah, that’s right. Maybe more  broadly, I might say that it feels like asks that  

1:22:13

the community makes can fall in two categories.  One is things that actually matter for actual  

1:22:19

safety. I’m all for those. We should do those.  People should judge us if we’re not doing them.  

1:22:26

And then there’s things I would categorise as pay  

1:22:31

costs to signal allegiance to the x-risk  safety community. And I’m not about those.  

1:22:37

I don’t want to do them. If you ask  for them, I’m just going to say no.

1:22:41

Rob Wiblin: I mean, I think  allegiance is a little bit of  

1:22:44

an unfair way… It’s like willingness  to commit resources to this purpose.

1:22:50

Rohin Shah: But like not to the  purpose of actual safety. To the  

1:22:53

purpose of signaling that you  will do safety in the future.

1:22:57

Rob Wiblin: I mean, this isn’t an unusual  mechanism that you’ll want to kind of  

1:23:01

commit to doing something in future, and how  people assess how you’ll behave in future,  

1:23:06

the only indication they can get is things in the  present, because they don’t have a crystal ball.

1:23:10

Rohin Shah: That’s right. But I think you should  do it based on the things that actually matter  

1:23:14

for safety, of which there are many. I think  it matters that people are running evaluations  

1:23:22

for especially cyber and CBRN misuse, but  also other things like deceptive alignment,  

1:23:28

misalignment. I think it matters that they  are reporting this information to governments  

1:23:37

and appropriate external bodies. So I think  there are things that you can look at there.

1:23:43

I think it matters, for example, that people are  doing the planning for what they will need to do  

1:23:48

in the future to address future issues. I think  it matters that companies are doing research into  

1:23:58

AGI safety concerns that might arise in the  future, and figuring out ways to deal with  

1:24:02

that. So there’s lots of stuff that I think  you can judge companies on right now , and I  

1:24:07

would much rather that people judge us based  on that, the stuff that actually matters.

1:24:11

Rob Wiblin: So the basic message is that  it’s bad for resources to be diverted  

1:24:15

from stuff that is substantively good, that is  actually going to solve the problem ultimately,  

1:24:19

towards things that are kind of going through  the motions of appearing like you care. And  

1:24:23

sometimes people accidentally are requesting  that you put resources into the second one,  

1:24:28

which I guess partly will come from the rest  of the organisation, but will in significant  

1:24:31

part come from the safety and alignment  team and its resources and its staff.

1:24:38

I guess people would say they might prefer the  second one, because it’s easier to grade or it’s  

1:24:42

easier to see. And perhaps they don’t feel in as  good a position as you do to understand whether  

1:24:47

substantively any company is doing the right  thing on the merits of what will matter in  

1:24:52

the long term. But I guess you’re saying that’s  why you want to have experts like AI Lab Watch  

1:24:56

or whoever else who can have their eye on  the ball, who can spend their time really  

1:25:00

thinking that through and actually grade it  for everyone else so they can understand.

1:25:03

Rohin Shah: I think in addition to  all of that, which I agree with,  

1:25:06

I’m also just deeply sceptical of grading that  is based on stuff that doesn’t actually matter,  

1:25:19

something that we wouldn’t actually stand  behind as mattering for actual safety.

1:25:23

This feels like the sort of thing that causes the  culture overall to be looking at costs or inputs  

1:25:35

rather than actual outcomes that matter,  and results in incentives to appear good,  

1:25:45

to look good, to make sure your comms say, “We  spent 1,000 hours figuring out what to do with  

1:25:54

this model” — when maybe those 1,000 hours were  like, we ran the model on an input and we looked  

1:26:01

at it and then we ignored the result because it  didn’t matter, and we hired some contractors to do  

1:26:06

this and just ignored what they said. But now we  can say that we spent 1,000 hours looking at it.

1:26:11

I’m like, I don’t want these incentives. They seem  bad. It seems bad for transparency and candidness.  

1:26:20

The more that Google or any other company is doing  this, the less I’m able to say what it is that I  

1:26:28

think actually matters for safety. The more I have  to do this complicated dance of saying the things  

1:26:37

that will appease the people who want us to be  showing commitment to putting in resources into a  

1:26:49

problem, the less I can be like, “Here’s the stuff  we’re doing that actually matters for safety,  

1:26:53

and here’s why I think it’s good.” It’s  just a poor incentive landscape overall.  

1:26:58

And I think it’s really quite important to me  that we don’t fall into the trap of looking  

1:27:06

good rather than just actually being good  and then justifying why that’s the case.

1:27:12

Rob Wiblin: What about an entirely different  justification for having thorough detail in  

1:27:16

the model cards, which is that the rest of the  world needs to know for practical [reasons]:  

1:27:20

other researchers over in the Bay Area  rather than in the UK get actual research  

1:27:25

benefit out of understanding all of  these things that GDM is doing and  

1:27:28

what the model looks like. That’s going  to help other companies do a better job  

1:27:31

of their own model cards and their  own evals. What do you make of that?

1:27:33

Rohin Shah: I think there is some substantial  truth to this. Certainly for the purpose of  

1:27:40

building on the research, it is good to have  details, and I think that’s why we publish papers.

1:27:45

Do we need to do this in the model card or the  frontier safety report? I’ve gone around talking  

1:27:51

to policy and governance people especially  about this, and I asked them, “What do you  

1:27:57

actually use our model card and frontier safety  report for?” They will sometimes say things like,  

1:28:02

“It was really good to get more details about how  exactly you evaluate for such-and-such risks.”  

1:28:08

And then I will say something like, “Great, and  you know we also have this paper that goes into  

1:28:16

more details of the evaluation. Probably  it was even cited in the model card on the  

1:28:22

frontier safety report. Did you read it?” Usually  they will not even have heard about this paper,  

1:28:29

so I kind of don’t believe them when they say  that these details actually matter to them.

1:28:35

Again, I do think publishing papers is  important, and doesn’t need to be tied to  

1:28:40

the model cards. And some people who I’ve talked  to, who are especially more on the technical  

1:28:46

side and the people actually building these  evaluations, do actually find the papers useful  

1:28:52

and helpful for them to build on. And so I do want  to continue publishing the papers where we can.

1:28:59

Rob Wiblin: You’ve told me that you  think research papers that come out  

1:29:02

of GDM tend to fly under the radar a bit  in general. I suppose in part because GDM  

1:29:07

is based here in London rather than in the  Bay Area, where I guess the greatest amount  

1:29:11

of socialising and energy around these  issues exists. Tell me more about that.

1:29:15

Rohin Shah: I think actually now GDM overall  is pretty evenly split between London and the  

1:29:21

Bay Area, but the safety team has historically  been based in London. We’re expanding into the  

1:29:26

Bay Area, but I would still say that the locus of  attention is in London. And I think in practice,  

1:29:35

just a lot of research papers and ideas end up  spreading in the community via word of mouth,  

1:29:43

which we are a little bit less plugged  into, which is a bit unfortunate.

1:29:47

Rob Wiblin: OK, let’s talk about one  of the papers, one of the interesting  

1:29:50

research results that definitely flew  under the radar as far as I could tell.  

1:29:53

The first one is “Myopic optimization with  non-myopic approval can mitigate multi-step  

1:29:59

reward hacking.” I don’t know how that didn’t go  viral. What did you accomplish with that work?

1:30:04

Rohin Shah: I think the  pitch for this work is that  

1:30:12

you can train a model that doesn’t do  a particular kind of reward hacking,  

1:30:20

even if the overseer would not be able to  notice that reward hack if it had happened. So  

1:30:26

in contrast to things like amplified oversight or  debate — those are ones which rely on making the  

1:30:34

overseer smart enough or capable enough that  they can detect the reward hack and penalise  

1:30:38

it — this one doesn’t rely on detection at  all, so it’s actually a bit more robust.

1:30:45

How does this work? What is this magic that  we managed to do? This is basically doable  

1:30:52

based on the idea of myopic optimisation.

1:30:56

The idea is: suppose there is a reward hack  that requires you to take two different actions  

1:31:05

separated in time. For example, maybe if you  are tasked with writing some code. You first  

1:31:15

write some tests that are particularly easy to  pass; they don’t actually test everything that is  

1:31:20

needed. Then you write code that passes the tests,  because you didn’t know how to write the correct  

1:31:26

code that would have passed everything. Both of  these steps have to be present in order for the  

1:31:33

reward hack to actually succeed. If you have  one but not the other, you just get penalised.

1:31:41

So the idea is if you treat these as two separate  steps (one step in which the tests are written,  

1:31:48

and then a different step in which the code is  written), and you optimise the tests only based on  

1:31:55

a signal that says, how good do these tests look,  and then separately you optimise the code-writing  

1:32:02

part based solely on how good does this code  look based on what has happened previously,  

1:32:08

then there’s actually no incentive  for the AI system to write bad tests.

1:32:14

Normally, the way the incentive would come is  that when you’re doing reinforcement learning,  

1:32:20

the reward on the code part would also be  backpropagated through time to also affect  

1:32:27

the test-writing part. So the AI could learn  from that backpropagated reward that if it  

1:32:34

writes easy-to-pass tests, then it tends to  get more reward in the future. So the myopic  

1:32:38

optimisation part is saying we’re not letting  it see what happened in the future; we’re only  

1:32:44

evaluating it based on what it did right now  — so that incentive never comes, and this is  

1:32:49

why the reward hack isn’t incentivised, even if  you can’t detect it. Does that part make sense?

1:32:57

Rob Wiblin: Yeah. I feel like I’ve heard  this broad idea going back a very long time,  

1:33:01

which is that if you’re worried  that a model might fake a company,  

1:33:05

or cheat basically at accomplishing the ultimate  goal because all it gets is a reward signal of  

1:33:10

whether it accomplished it or not… So imagine  you’ve got the model running a business,  

1:33:13

and you ask it to make money. And rather  than running a successful business,  

1:33:18

instead it goes and steals a bunch of money,  because it figures out that that’s actually  

1:33:21

a more effective way of increasing its bank  balance. That’s a concern that you might have.

1:33:25

One way that you could prevent that is: rather  than reinforcing and evaluating the model based  

1:33:30

on the final outcome, instead you sample from  the actions that it’s saying it’s going to take  

1:33:36

or the actions that it did take, and then you  grade them on whether those seemed reasonable  

1:33:39

and sensible to you, or to some other monitor, in  light of the goal that you actually had. And then  

1:33:44

you wouldn’t get this reward-hacking behaviour.  So it’s basically an instance of that broad idea?

1:33:48

Rohin Shah: Yeah, that’s right. It’s historically  been called process supervision. I think nowadays  

1:33:53

the term process supervision has gotten a bunch of  other meanings as well, which is why we have this  

1:34:00

a little bit more clear about what the actual  technical mechanism is of myopic optimisation,  

1:34:05

but with non-myopic approval. But yes, it’s a  very old idea. I think mostly our contribution  

1:34:10

was showing that this actually works with  existing LLMs and giving examples of that  

1:34:15

happening. Because as far as I know, there weren’t  actual experiments demonstrating this before.

1:34:20

Rob Wiblin: I guess an obvious problem with that  is you will successfully stop the model from  

1:34:25

doing scammy things that seem to accomplish the  goal, that I guess it would get reinforced for,  

1:34:30

but actually don’t accomplish  the goal that you had in mind.

1:34:33

But on the other hand, you would also block it  off from doing stuff that would have accomplished  

1:34:36

the goal that was actually a brilliant insight  that you never would have had. So if you were  

1:34:40

training the go model, AlphaGo, if humans  were evaluating whether the moves were good,  

1:34:46

then the model actually couldn’t end up exceeding  human performance, because they would grade moves  

1:34:50

that were actually unexpectedly brilliant as bad  at the early stage. How do you get around that?

1:34:54

Rohin Shah: Yep, that’s right. To some extent,  the answer is you don’t get around it. Ultimately,  

1:35:01

both the incredible, amazing, creative  insights that are awesome and the incredible,  

1:35:08

creative insights that are reward hacks look  basically the same to you as the observer. And  

1:35:15

if you want to stop the reward hacks, you also  give up some amount of the competitiveness.

1:35:21

That being said, you know, we call it myopic  optimisation with non-myopic approval. The  

1:35:27

“non-myopic approval” part is basically the part  where, when you’re grading this particular step,  

1:35:33

ask some sort of intelligent overseer — a human,  possibly an LLM — to judge this particular step,  

1:35:42

how good will it be for getting future  reward? And the overseer should then take  

1:35:50

into account everything that it knows in  judging how good this is going to be for  

1:35:54

the future. And what this guarantees is  that whatever incentives from the future  

1:36:01

affect the AI system in this particular step  have to be things that the overseer understands.

1:36:08

Rob Wiblin: So you’re saying the overseer  could see that the hypothetical AI running  

1:36:12

the business made a lot of money, but they  evaluate the process that it went through.  

1:36:17

They can include that information, but they do it  in light of also looking at the process as well?

1:36:22

Rohin Shah: Well, I would suggest that they don’t  look at whether the company made a bunch of money,  

1:36:28

and instead they look at the action that the AI  system takes, and they predict to themselves,  

1:36:34

“Will this lead to a bunch of money, yes or  no?” — and based on that provide a reward.

1:36:40

Now, the AI system on step one can do lots  of things to help the overseer with this.  

1:36:45

It can give an explanation of how its plan is  going to lead to tonnes of money in the future,  

1:36:50

and then the overseer just has to verify it.  

1:36:53

You can use things like debate or other  AI assistance to make the overseer better  

1:37:00

at predicting what’s going to happen in the  future so that they can give better rewards.

1:37:06

So in the limit, if you can keep improving  your overseer, if the overseer becomes  

1:37:12

sufficiently smart, then you can recover the  performance of what you would get with just  

1:37:18

straight RL backpropagating through time,  but without the reward hacks. In practice,  

1:37:23

we’re probably not going to get that far. But  I think you could actually push this quite far,  

1:37:28

especially just by the most simple thing: getting  the AI system to explain why its plan is good is,  

1:37:35

I think, a very simple baseline  that can make this quite effective.

1:37:40

Rob Wiblin: Is this approach going  to be useful for frontier models  

1:37:43

any time soon? Could you see  companies actually using it?

1:37:47

Rohin Shah: I think plausibly. It only  matters once you start doing reinforcement  

1:37:55

learning over multi-step trajectories  over a reasonably long period of time.  

1:38:00

That’s something that I think has only really  started this year, and to varying amounts at  

1:38:06

different companies. It’s not totally clear  how much reward hacking is a big problem.

1:38:15

But yeah, to the extent that this sort of  multi-step reward hacking does become a  

1:38:19

big problem, I think it’s quite plausible  that this should be used now. And in fact,  

1:38:25

I think at current capability levels, my guess  would be that the non-myopic approval part,  

1:38:32

rather than being a competitiveness hit as  it would be in something like AlphaZero,  

1:38:38

I would guess that it would actually  improve capabilities overall,  

1:38:42

although at the cost of you need a lot  more human input to provide those rewards.

1:38:48

Rob Wiblin: It would improve performance  because you wouldn’t get the reward hacking?

1:38:51

Rohin Shah: Not just because of that. I mean,  that is definitely one reason. But I think also  

1:38:57

reinforcement learning has a credit-assignment  problem: you do a tonne of different actions,  

1:39:05

and then at the end you get a reward. And  it’s the RL algorithm’s job to figure out,  

1:39:12

based on that reward, which of these actions  are actually most relevant to that reward.

1:39:20

Rob Wiblin: A difficult problem.

1:39:21

Rohin Shah: Yeah, it’s a difficult problem. And in  some sense, the thing that the RL algorithm does  

1:39:26

is like, eh, we’re just going to make everything  more likely to happen if the reward was positive,  

1:39:31

or was unusually good, and everything less  likely to happen if it was unusually bad.

1:39:37

Whereas with something like MONA [myopic  optimisation with non-myopic approval], you  

1:39:39

can be much more granular. You can say, like,  

1:39:42

“This particular part where you wrote these  tests, those were some really great tests:  

1:39:47

particularly high reward there. But then this part  where you wrote the code, you should have been  

1:39:53

using this particular library. You didn’t. We’ll  give you somewhat lower reward on that.” And that  

1:40:00

can allow for more effective and sample-efficient  learning than you would otherwise get.

1:40:05

Rob Wiblin: OK, so that’s an increase  in efficiency that you get from this  

1:40:08

approach to RL that might allow it to  remain competitive in terms of its raw  

1:40:13

performance with other, less myopic approaches.

1:40:18

Rohin Shah: In theory. Our paper  does not really get into questions  

1:40:23

like this. This is more me speculating  about what would happen in practice.

1:40:28

Rob Wiblin: OK, the second paper  from GDM that didn’t get a tonne of  

1:40:32

attention is called “An approach to  technical AGI safety and security.”  

1:40:37

It was written by about 30 GDM staff  members, or there’s 30 bylines on it.

1:40:43

It’s like a position paper, as far as I can  tell, of broadly what does GDM think it’s  

1:40:47

going to do as it develops AGI — which, given  that DeepMind plausibly is the organisation  

1:40:54

that is most likely to do this, it makes it a  bit surprising that people are not interested  

1:40:57

in this incredibly thorough description of what  you think you are and aren’t going to do and why.

1:41:03

It is quite long, but it has a nice 10-minute-long  summary at the start that you could use to get an  

1:41:10

overview if people are interested. So if you  want to be ahead of the curve on understanding  

1:41:15

GDM’s approach to developing AGI, then  you could spend 10 minutes doing that.

1:41:20

It’s easy to say that you are going  to do a whole lot of different things,  

1:41:23

but the most difficult decisions in developing a  plan like this is figuring out what stuff might  

1:41:28

some people like you to do that you are committing  to not do, because you just don’t think it’s a  

1:41:32

high enough priority. What sort of stuff does this  plan suggest that you are not going to prioritise?

1:41:37

Rohin Shah: I think probably  the biggest category in here  

1:41:44

comes more from our background assumptions,  

1:41:46

actually, and how that informs our planning  rather than the actual technical approaches.

1:41:51

So one background belief, which we’ve  talked about a little bit already,  

1:41:58

we call it the “approximate continuity  assumption.” This is basically saying that  

1:42:05

AI progress is going to be relatively smooth and  gradual with respect to inputs like compute and  

1:42:14

labour — not necessarily with respect to time, due  to the possibility of an intelligence explosion.

1:42:22

And as a result, our sort of meta strategy  involves essentially roughly forecasting,  

1:42:30

maybe not formally forecasting, but roughly  in our heads having some sense of what things  

1:42:35

are going to be potential problems over the  next some time period — call it three months,  

1:42:44

maybe longer, maybe shorter, who knows —  and identifying what sorts of considerations  

1:42:52

might become quite important during that time  given the capabilities that we expect to have,  

1:42:57

and making sure that we’re prepared for those.  And if we’re not prepared for that, then possibly  

1:43:05

slowing down, or pausing development, or  talking to governments, trying to do advocacy.

1:43:16

But importantly, we’re not really trying to  forecast arbitrarily far into the future all  

1:43:24

the problems that are going to arise with AI  development. The goal isn’t “Know how to align  

1:43:31

ASI, or else do nothing.” Usually I think of us  as looking at a time horizon of, it depends on  

1:43:41

which particular thing we’re doing, but often  somewhere between three months and five years.

1:43:48

The reason for this is just that it’s not  actually possible to know everything that’s  

1:43:56

going to happen with future AI development.  It would be sheer hubris to think that we had  

1:44:02

figured out every problem that possibly will arise  

1:44:06

with superintelligence ahead of time  and say we’ve solved it — or even  

1:44:12

say we have a plan for solving it. We haven’t  even identified all the problems, I’m sure.

1:44:19

One example I like to give about this is a much  more esoteric anthropics type of issue where  

1:44:33

EAs [effective altruists] or longtermists like  to talk about essentially how considerations  

1:44:39

about the multiverse should affect what  we do today, and think about anthropics  

1:44:47

using things like SSA and SIA assumptions and  how they affect what actions we should take.

1:44:54

Rob Wiblin: You don’t have to understand what  

1:44:55

that means in order to track with  the point you’re about to make.

1:44:58

Rohin Shah: Yeah, yeah, yeah. The key upshot of  these things is that the beliefs and decisions  

1:45:07

that you come to actually depend on who  you are. This is one of the crazy things  

1:45:15

about anthropics. Normally, different people  should agree on beliefs given enough evidence,  

1:45:21

or at least perfect Bayesians should agree  given the same evidence. They should have  

1:45:27

the same beliefs. This stops being true  in the case of anthropics. And indeed,  

1:45:33

an AI will probably have different beliefs than  a human would, even if it’s perfectly aligned,  

1:45:42

just like the structure of how Bayesian updates  happen in a world with anthropics is different  

1:45:47

for AI systems than for humans. So this  should affect whether we are willing to  

1:45:55

delegate research on these kinds of topics to AI  systems or not, even if they’re perfectly aligned.

1:46:02

That’s a kind of wild, crazy problem  that definitely is not a problem today  

1:46:09

but really might come up at some point before  a superintelligence, and I don’t want to have  

1:46:16

to say, like, “Yes, I have identified all  problems like this and have an approach to  

1:46:22

solving all of them today.” Obviously we  should just take these as they come up.

1:46:27

Rob Wiblin: So some people would really  like you to plan ahead a lot. They would  

1:46:30

like you to have a grasp on all the important  considerations that are going to come up,  

1:46:34

all the way through to artificial  superintelligence until I guess  

1:46:38

the point where you could feel like  you’re safely handed everything over.

1:46:42

And you’re saying, “We are not going to do that,  not at all. We’re going to think a lot about the  

1:46:47

next model that we’re training. We’re going  to think a bit about the model after that,  

1:46:50

and kind of beyond that we’re going to hope  that the future will take care of itself,  

1:46:54

or we in the future will take care of things as  they come up.” That’s the point you’re making.

1:46:59

Rohin Shah: Yes, though I think we’re longer-term  thinking than that made it sound. Like I said,  

1:47:05

often I’m thinking five years into the  future. That’s many, many, many models,  

1:47:10

right? I say chain of thought monitoring, I think  I said four years is my median that that lasts.  

1:47:19

We’re already thinking about what to do after the  point that chain of thought monitoring stops being  

1:47:25

useful. We would want to do broader  control mitigations at that point.

1:47:30

So certainly our time horizon is not incredibly  

1:47:34

short. I’m just saying it’s not all  the way out to superintelligence.  

1:47:38

We are doing a fair bit of long-term planning.  Perhaps you would call it medium-term planning.

1:47:42

Rob Wiblin: Yeah. This paper is pretty candid  for a company position paper. It very clearly  

1:47:49

says AGI might be here by 2030. I guess we don’t  know when, but 2030 is totally plausible. We have  

1:47:55

to be ready with a plan for how to handle  that. That’s very dangerous or potentially  

1:47:58

could be very dangerous. We have to have a plan  that we can deploy incredibly quickly to try to  

1:48:02

make that safer. And then it goes through the  kinds of things that you’re intending to do.  

1:48:08

I would encourage people, if they work in  this area, to go and take a look at it.

1:48:11

Are there any non-obvious mitigations in there?  I guess one thing that I was glad that you said,  

1:48:15

that not everyone would highlight, is you need  to treat an AGI or an AI inside the company,  

1:48:22

that’s just only being deployed internally,  if you’re not sure or not confident of its  

1:48:26

alignment, then you need to treat  it as like an untrusted insider,  

1:48:28

as potentially a hostile agent that is  going to try to undermine internal systems.

1:48:34

Do you want to highlight any of the  things you are committing to doing?

1:48:38

Rohin Shah: Yeah. Well, I maybe want to be  a little wary about the word “committing.”  

1:48:45

Planning to do. But yes, I think the  untrusted insider stuff is definitely  

1:48:54

particularly interesting. We didn’t go into  quite that much detail in the paper actually,  

1:48:59

which is a bit sad. Mostly our thinking  wasn’t quite as developed on it as it is now.

1:49:06

But basically, if you actually think about the  mechanics of what an AI system would have to do  

1:49:14

in order to take over, it’s quite hard. Among  other things, it probably wants to be able to  

1:49:25

make sure that it isn’t monitored as it starts  thinking about what to do, so maybe it wants to  

1:49:30

set up a deployment where all the monitors are  turned off that we don’t even know is happening.

1:49:35

Perhaps it could exfiltrate its weights and  then run itself on different compute that we  

1:49:41

don’t control. But if it has to exfiltrate  itself, how is it going to do that? Google  

1:49:46

security is not trivial to evade. In fact  it’s pretty hard. It’s pretty good security.  

1:49:53

So there’s actually just quite a lot of  increased defence you can get by making  

1:50:00

sure that you are making your security also  apply to AI systems as untrusted insiders.

1:50:09

So things that you would want to do could  include having infrastructure that requires  

1:50:19

agents to have separate identities from the  human employees on whose behalf they’re going  

1:50:27

to be doing things. This would be like  you are starting an agent on some task.  

1:50:31

It doesn’t get all the permissions  that you as a human employee have;  

1:50:35

it gets some much smaller set of permissions.  It will need some permissions in order to  

1:50:45

do its task. It has to request each of those  individually, write a justification for why  

1:50:51

it needs that particular permission. This  allows you to incriminate it afterwards  

1:50:55

if it was lying about it. Well, it helps you  incriminate it, maybe doesn’t guarantee it.

1:51:02

Then similarly, you can have some monitoring,  

1:51:04

where you have a chain of thought monitor  right now, for example, that looks at what  

1:51:09

the model is doing, sees if there is anything  potentially suspicious about this. If there is,  

1:51:14

any things that are sufficiently highly suspicious  get flagged to humans to be audited. This allows  

1:51:23

you to also generate evidence of misalignment.  If the AI system is actually misaligned and you  

1:51:29

audit it, and you’re like, “Oh my gosh, yeah, this  AI system totally was trying to inject a security  

1:51:34

vulnerability and then would have exploited it  in order to exfiltrate its weights,” that’s a  

1:51:40

big deal. I’m not sure I expect that to ever  happen. Seems quite plausible that models will  

1:51:45

never be misaligned. So if we get that sort of  evidence, I think that would change my mind a  

1:51:49

big deal and like many other stakeholders’ as  well. So I think that’s quite crucial to get.

1:51:55

So yeah, lots of stuff like this, and I think it’s  important to do and very nascent — so lots of good  

1:52:06

work to do here in the future. And I should note  that you’ve had Buck Shlegeris and Ryan Greenblatt  

1:52:12

on the podcast before. This is the AI control  work that they talk about. It’s just on our side a  

1:52:23

little bit more focused on the infrastructure that  we have to build in order to enable this sort of  

1:52:30

approach, rather than the specifics of exactly how  we’re going to build the monitor and so on, which  

1:52:37

is a more standard machine learning problem that I  think will be relatively easy to do in comparison.

1:52:44

Rob Wiblin: Let’s talk about timelines and  recursive self-improvement loops for a bit.  

1:52:48

There’s been a lot of discussion recently  about when we might expect a recursive  

1:52:51

self-improvement loop to occur — indeed,  if one is possible at all. You think that  

1:52:55

people have been a little bit sloppy in  their thinking about this in some ways,  

1:52:59

and in ways that cause people to maybe expect  it to happen sooner than you do. Explain that.

1:53:04

Rohin Shah: Yeah. So I think one of the  most striking things about reality or the  

1:53:11

world that I’ve seen is just the examples  of straight lines on graphs going straight.  

1:53:20

I think Scott Alexander put it well.  I don’t remember which post this was,  

1:53:26

but I think he says something like, “I don’t  understand the Gods of Straight Lines very well,  

1:53:31

and honestly, they kind of freak me out,  but I’m not going to bet against them.”

1:53:37

And one particular straight line that we’ve been  seeing a lot recently is just sort of constant,  

1:53:46

roughly 3% GDP growth per year. Now,  there are good arguments for why that  

1:53:56

won’t necessarily last, and instead probably we  will see an intelligence explosion at some point  

1:54:02

which would drastically increase the rate of  growth. But I do think that when reasoning  

1:54:06

about this, it’s quite important to say what  exactly about AI makes it different from the  

1:54:14

current situation? Why are we betting  against the Gods of Straight Lines?

1:54:19

Now, I think the usual answer to this, which  is based on economic endogenous growth models,  

1:54:28

maybe most famously from a paper from Kremer  in 1993, is that there are a couple of effects:

1:54:43

One, as technology improves,  that allows you to find new ideas  

1:54:53

more quickly. Once you have the microscope,  you can do biology significantly better. 

1:55:01

Another effect is that ideas get harder  to find as time goes on because you pluck  

1:55:06

the low-hanging fruit. You know, initially you  discover phosphorus by, I think Scott put this  

1:55:14

well again, by looking at your own pee. And  then later you have to discover element 110  

1:55:20

by shipping samples from one country to another  to be studied in the one laboratory in the world  

1:55:26

that can do it. Obviously that’s much harder. I think basically in the current setting,  

1:55:38

where we see this constant exponential growth  in GDP — and similar things in various areas  

1:55:46

of technology, like Moore’s law being a good  example — the argument would be roughly that  

1:55:52

ideas are getting harder to find, but also we have  exponentially increasing numbers of researchers,  

1:56:00

and together those two things balance out in  order to create the sort of constant progress.

1:56:06

But if you look sufficiently far back  historically, it actually looks like  

1:56:10

growth has been increasing substantially over  time. So what is this secret third effect  

1:56:15

that we haven’t taken into account so far?  The argument is that the rate of growth of  

1:56:22

technology increases partly with more technology,  because you get microscopes and they help you do  

1:56:28

things better, but also increases with  respect to the population, because as  

1:56:35

the population grows there are more people who  can have ideas. And ideas, once you have them,  

1:56:42

you can copy them freely, they can diffuse very  quickly — and those too can increase technology.

1:56:49

So technology’s rate of growth is modelled  as being proportional to both technology,  

1:56:57

or sometimes technology raised to  some constant, as well as population.  

1:57:03

This then predicts basically hyperbolic  growth in both population and technology.

1:57:08

And the argument is, for the last however  many years, the population part no longer  

1:57:14

applies — because instead of reinvesting  all of our output back into more kids,  

1:57:20

we’re now reinvesting them into quality of life  improvements. That’s why we see a hyperbolic up  

1:57:27

until 1800, 1900, I forget the exact point.  And then, since then, exponential growth.

1:57:35

But this story really puts the primacy on  ideas rather than anything else. Those are  

1:57:42

the things that are freely copyable, those  are the things that you have them once and  

1:57:46

then they apply to all of your technology  across everywhere that you’re using it.  

1:57:51

So when I think about what’s going to lead  to the intelligence explosion, I think the  

1:57:58

increasing population or increasing labour  that can have ideas as pretty crucial to it.

1:58:05

In contrast, a lot of the discussion  today talks about the superhuman coder,  

1:58:11

for example: where you get an AI system  where you can say, “Please implement XYZ  

1:58:18

experiment for me,” and it just goes ahead  and implements that experiment incredibly  

1:58:22

well and much faster than humans could do  it. The argument is once you can do that,  

1:58:29

the rate of AI progress increases a bunch, and  you start getting to the intelligence explosion  

1:58:34

quite quickly. At least I think that’s the  argument. It’s always a little bit hard to  

1:58:40

interpret the wide community view, which has a  lot of people with slightly different opinions.

1:58:48

I don’t super buy this view, mostly  because the superhuman coder is not  

1:58:56

something that can have ideas across the spectrum,  

1:59:00

across everything that happens in ML R&D. It has  ideas within the realm of coding in particular,  

1:59:11

which is one part of ML R&D, but definitely  not even the majority of it, according to me.

1:59:20

So I think it’s better to think of this  as more like inventing the microscope:  

1:59:27

you’ve got a tool that enables your research  to progress faster. But we do this all the  

1:59:36

time in all sorts of research areas. For the  last however many decades, it has not led to  

1:59:42

hyperbolic growth. The invention of tools  is just a normal part of research progress,  

1:59:47

and usually tends to lead to nice stable  straight lines on graphs. So I generally  

1:59:54

don’t think we should be predicting big  changes from the invention of tools.

2:00:01

Another example is sometimes people  will suggest that a trigger for the  

2:00:06

intelligence explosion is the point  at which AI researchers are, say,  

2:00:14

10x more productive than they would have been  if they had no access to AI systems at all,  

2:00:19

or had access to AI systems from  like 2020 or something like this.

2:00:24

And again, I think this is the sort of thing  that probably has happened for something  

2:00:32

like chip development. So Moore’s law is a  nice, clean, straight line for many decades.  

2:00:40

But I assume at the beginning of that line,  people were designing their circuits by hand  

2:00:47

on paper. And nowadays we have these incredible  computer-aided design software tools that automate  

2:00:54

the vast majority of this kind of work, and you  just see the line continuing to be straight.  

2:01:01

In fact, you need exponentially more people  working on this in order to have that happen.

2:01:07

So are the researchers that we have today  10x more productive than the ones from two  

2:01:13

decades ago? Probably, that would be my  guess. Is that a sign of an intelligence  

2:01:18

explosion in chip design? No. And I think a  similar thing might be true in AI. In fact,  

2:01:24

it’s kind of unclear. We already use language  models quite a lot, for example for auto  

2:01:36

rating and for conducting evaluations, and auto  rating the responses of other models. How much  

2:01:44

less productive would we be without that? I  don’t know. Probably not 10x less productive,  

2:01:49

but like a substantial amount. But AI progress  continues to look pretty linear, according to me.

2:01:55

Rob Wiblin: OK, let me recap all of that. So  we currently have exponential economic growth,  

2:02:00

which is to say a constant percentage growth each  year on average, looking over the medium term.

2:02:07

People who say there’s going to be  an intelligence explosion, there’s  

2:02:09

going to be a recursive self-improvement loop,  they’re forecasting increasing rates of growth,  

2:02:13

or hyperbolic economic growth: so it’s 3% one  year, then 10% the next, then 50% the next — up  

2:02:19

to some point, I guess, at which you might  think it levels off or comes back down again.

2:02:24

What are the high-level factors that lead  economic growth to either be stable or to  

2:02:27

increase or to go back down? There’s three  factors that you have in mind in your model,  

2:02:31

and that most economists have in their model:

2:02:33

As technology advances, it gets difficult  to make new useful discoveries and to  

2:02:37

advance it further. That’s the first one. The second one is, as technology advances,  

2:02:42

we have better tools to do science  and to make new discoveries. 

2:02:44

The third one is, as technology advances and  the economy grows, we can support a larger  

2:02:49

number of people who will do the research  and have the ideas and advance science. 

2:02:53

In recent times, the third one has kind of  been out of the picture, because we haven’t  

2:02:57

been turning advances in science or improvements  in the economy into new people. Birth rates have  

2:03:04

been going down rather than going up, despite us  being richer. So it’s been the first two factors  

2:03:08

that have been playing off against one another:  advances getting harder to make, and science  

2:03:13

advancing and giving us better tools to do it.  These two things have been roughly cancelling  

2:03:17

out. And over the medium term, growth has been  about 3% a year, maybe going down In recent times.

2:03:23

Zooming out much further, all three factors  were in play. Before the Malthusian era ended,  

2:03:28

as we got richer, we drove almost all of  those resources into additional people as  

2:03:35

well. But that faded out after 1800. And  that was leading to hyperbolic growth,  

2:03:41

that was leading to increasing  rates of growth, while it was true.

2:03:45

Now, applying this to the AI case, that means  that, to figure out which situation we’re in,  

2:03:51

this mental model puts a huge emphasis on are  we coming up with better tools to do science,  

2:03:56

or are we ploughing our gains back  into more researchers to do science  

2:04:02

and to have more people thinking  up better ideas or new ideas?

2:04:07

You can see why it gets a bit confusing here  when we’re talking about AI doing the research,  

2:04:12

because AI is both a tool, but  we’re thinking it’s going to  

2:04:15

basically converge and become people or  become scientific researchers itself.  

2:04:20

So it becomes a difficult conceptual, empirical  question: at what point do they cross over from  

2:04:25

being largely a tool that is assisting humans in  doing their work to being the researcher itself  

2:04:31

that isn’t really just assisting people, it’s  doing the whole process? Or at least we should  

2:04:35

no longer be just considering it as a scientific  instrument that’s allowing us to do better work.

2:04:42

And I think you want to say people are being  a bit sloppy. They talk about indicators that  

2:04:48

really are a sign that we’ve developed better  tools for humans to use to do their work,  

2:04:52

and they kind of conflate that with autonomous  AI R&D that barely even needs people at all  

2:05:00

and really should be considered as a population  increase that then can drive hyperbolic growth.  

2:05:05

Because as you get improvements, then you  can run even more. You can basically expand  

2:05:10

the effective population of researchers by running  more copies of the model. Have I understood right?

2:05:15

Rohin Shah: Yeah, that’s right. That’s  very impressive. I was worried that I was  

2:05:18

soliloquising for way too long, but great summary.

2:05:22

Rob Wiblin: Thank you. OK, so how do we  tell whether we’re talking about a better  

2:05:29

tool or about more population? It does seem  like there is kind of a fuzzy barrier here,  

2:05:36

and we’re going to go from one to the other, but  there’s not going to be a sharp cutoff, I imagine.

2:05:37

Rohin Shah: Yeah, I agree. There probably  won’t be. I do think that to what extent  

2:05:44

are the AIs proposing new ideas  are the things that I think most  

2:05:50

influence the hyperbolic growth prediction.  That’s one thing you could be looking at,  

2:05:57

and I would say that the superhuman coder  probably doesn’t really hit that bar.

2:06:04

But really what I want to do is just measure  AI progress and then notice when it starts  

2:06:10

accelerating. Actually we worked with Epoch  recently to develop exactly this kind of a  

2:06:18

measure. The paper, I think was just released,  it’s called “A Rosetta Stone for AI benchmarks.”

2:06:28

Essentially, we just take the benchmark scores  that a model gets for a wide variety of models,  

2:06:35

and then we sort of stitch the benchmarks together  to get a sort of general capabilities score for  

2:06:40

models that applies over the entire range — from  back in 2020 all the way to models that were  

2:06:45

released now, even though any individual benchmark  would have saturated over a much smaller period.

2:06:55

Then you’d take the capability  scores that this measure spits out,  

2:06:58

and you plot them against the release time  for the model. These capability scores,  

2:07:08

the statistical model that produces them,  we don’t include any information about the  

2:07:13

release date or any time-based information into  it, just benchmark performance. And nonetheless,  

2:07:19

when you start plotting this against  release time, it’s just a nice line.  

2:07:22

It’s great. And you see that AI progress mostly  looks pretty linear on this particular graph.

2:07:30

And the method is really very simple.  The main thing that it depends on is  

2:07:34

that we have benchmarks that can capture AI  systems’ performance, and that I think will  

2:07:44

continue to happen in the future. So we can just  continue to plot this, and then one hopes that  

2:07:51

if an intelligence explosion does seem like  it’s happening, we will start to see an  

2:07:54

acceleration on that, rather than it looking  just like a nice linear fit the entire time.

2:08:00

Rob Wiblin: So how do we tell if we’ve  just made a better tool or we’ve made  

2:08:03

a whole lot of new researchers? You’re  saying the proof is in the pudding. Let’s  

2:08:06

not leave it to the philosophers,  let’s leave it to the people doing  

2:08:09

benchmarks to figure out whether AI  advances are actually speeding up.

2:08:12

Rohin Shah: Empiricism.

2:08:14

Rob Wiblin: A lot of people have the perception  that AI progress has been slowing down. You’re  

2:08:18

saying you’ve tried to create an enormous  dataset of as many different models over the  

2:08:22

last four or five years as possible. This  is with Epoch: this is their wheelhouse,  

2:08:28

doing this kind of data collection and  compilation and aggregation. So they’ve  

2:08:32

tried to collect all these different benchmark  scores for many different models, going back  

2:08:36

quite a long way to see is progress speeding up  or is it slowing down, or is it roughly linear?

2:08:41

And you’re saying, at least judged by  that measure, the best effort they can  

2:08:44

do says that it’s linear. Which is to  say that we’re not making better people,  

2:08:49

we’re making better tools. For now,  that’s what that would suggest.

2:08:52

Rohin Shah: Yep, that’s right.

2:08:54

Rob Wiblin: So there’s all kinds of problems with  benchmark scores. One thing could be that they  

2:08:59

don’t capture the full range of performance — that  either you end up flawed at the bottom or capped  

2:09:05

at the top. Especially because we’re talking  about models here over many years doing many  

2:09:08

different things. There’s definitely lots of  gaming that goes on, lots of teaching to the  

2:09:13

test that occurs with people trying to make  their models look good on these benchmarks.

2:09:17

I guess there’s also a question of what  actually matters? There’s benchmarks for  

2:09:20

all kinds of different skills, and maybe you  should give some of these things much more  

2:09:24

weight than others. I guess you could also have  non-linearities in the effect: the performance of  

2:09:29

a model and its economic effect that could  be, indeed probably is, quite nonlinear.

2:09:33

How good do you think this whole approach  is, that Epoch and you have been using to  

2:09:38

figure out whether progress is speeding  up or remaining about the same pace,  

2:09:43

at getting at the ground  reality of what’s going on?

2:09:46

Rohin Shah: I basically agree with all of the  critiques you mentioned. It’s going to inherit any  

2:09:54

problems that benchmarks have because ultimately  the only input into it is benchmark performance.  

2:10:00

Nonetheless, I think a lesson that I learned  from the Gods of Straight Lines is that for  

2:10:09

some things, yeah, there are lots of nuances  and details that matter a bunch if you want  

2:10:16

to make very fine-grained predictions  — but at a high level they wash out,  

2:10:20

and it ends up being fine anyway.  I think that mostly applies here.

2:10:25

Rob Wiblin: An example might be that  you’d say the models are being gamed,  

2:10:28

there’s a bunch of teaching  to the test happening now,  

2:10:29

but there was a bunch of teaching to the test  happening two years ago and four years ago. So  

2:10:33

as long as that’s not getting progressively  worse, then the line is still reasonable.

2:10:38

Rohin Shah: Yes. And also I think in practice  the teaching to the test won’t change the results  

2:10:44

very much on this. It definitely changes it  some. Anthropic is probably the best at not  

2:10:50

overfitting to the benchmarks that exist. And in  fact, if you look at the scores that are produced,  

2:10:58

I think it does tend to underestimate Claude or  Anthropic’s models relative to models from other  

2:11:05

providers — because generally Claude models,  since they are not overfit to the benchmarks,  

2:11:13

will tend to underperform on benchmarks  relative to how good the model actually is.

2:11:17

So yes, that’s true. But also,  if you look at it on the graph,  

2:11:20

it’s a tiny little difference. It’s not  that big. If you moved it up, it would  

2:11:28

still look linear. It doesn’t really change  the “whether it’s linear or not” observation.

2:11:35

Rob Wiblin: You’re saying increasing rates of  progress would be quite striking on the graph.  

2:11:38

It probably would jump out at you, and these  effects wouldn’t be enough to make it disappear.

2:11:43

Rohin Shah: That’s right. But I definitely  would caution people against using this score  

2:11:47

for really fine-grained things, like how good  is Claude versus Gemini versus whatever. Or,  

2:11:54

for example, trying to understand the  difference between open source models  

2:11:58

and closed source models — because I think the  open source models are probably more overfit to  

2:12:03

the benchmarks than the closed source ones,  and if you try to use this score to look at  

2:12:11

the exact difference between those two, it’s  probably going to mislead you a little bit.

2:12:15

Rob Wiblin: So at the point that AI is  people, or at least is AI researchers,  

2:12:20

rather than just being a tool for AI researchers,  you might reasonably expect quite abrupt increases  

2:12:26

in progress in AI R&D and AI capabilities,  basically. Are you a downvote on how abrupt  

2:12:34

that will be or whether that will occur at  all? Or do you buy that maybe it will take  

2:12:39

a bit longer than people are imagining,  but you do still think that will happen?

2:12:43

Rohin Shah: I’m a slight downvote on  the abruptness. I am not a downvote on,  

2:12:50

will there be an intelligence explosion?

2:12:56

So what do I feel pretty confident about? It  will be some really strong statement with a  

2:13:05

lot of strong preconditions, like: “When you have  AI systems such that for almost any economically  

2:13:18

valuable task you care about, you would prefer to  hire the AI system rather than the human” — which,  

2:13:24

among other things implies that the AI is  cheaper than the human, which I don’t think  

2:13:28

is a given — “at that point, assuming that we  don’t take some sort of action to try and prevent  

2:13:35

it, and that in fact, we are trying to use the  AI systems to substantially accelerate research  

2:13:41

and development across the board (not just in AI,  but all of the economy), then probably within a  

2:13:50

century, we will have some kind of intelligence  explosion and reach technological maturity.”

2:13:59

That’s so many preconditions. It’s still  kind of a crazy statement. I’m saying we  

2:14:09

will have an intelligence explosion at all, and  I feel actually pretty confident about that.  

2:14:16

But it is a lot weaker than the thing that I  would guess will actually happen, which will be  

2:14:21

substantially faster, and happen substantially  earlier than what that statement would imply.

2:14:31

What I would actually say is: a decent  chance that it starts at the point where  

2:14:36

the AI systems can automate most of AI  R&D, as opposed to all arbitrary R&D;  

2:14:42

a decent chance that it finishes over  five to 10 years rather than a century,  

2:14:49

depends on exactly what you set  as the starting point; and so on.

2:14:56

Going back to your original  question about abruptness,  

2:15:00

the reason I’m a slight downvote on abruptness  is that I would expect that actually the first  

2:15:07

automated systems that can really automate  AI R&D will probably just be very expensive,  

2:15:16

to the point of potentially being more  expensive than humans would be. You can see  

2:15:22

this with over the last year there’s just been  way more investment in inference time compute,  

2:15:28

inference time scaling. Google has a Deep Think  algorithm which applies even more inference  

2:15:35

scaling to get even better results. I think  you should basically expect this to continue.

2:15:40

So the picture of the first automated researcher  might be something that’s more like a relatively  

2:15:49

dumb system that’s not as “smart” as a human  researcher, but spending just tremendous amounts  

2:15:58

of time doing reasoning, exploring tonnes of  dead ends, realising they’re dead ends and then  

2:16:03

coming back and trying something else —  which a human would never have gone down,  

2:16:06

because they would have known in advance it  would be a dead end. And that’s how it does its  

2:16:11

automated research, such that it might even be  more expensive than a human researcher would be.

2:16:17

And then we improve it over time.  The cost goes down pretty quickly,  

2:16:20

as is usually the case in AI. But that would  suggest that it won’t be that abrupt. Like  

2:16:28

the AI will reach cost parity with  humans, then start becoming more  

2:16:33

cost effective than the humans, and that  will start setting off the acceleration.

2:16:38

Rob Wiblin: But that’s how it ends up  happening over quite a number of years,  

2:16:42

rather than months or something crazy.

2:16:46

Rohin Shah: If you take it from the point at which  the AI systems start being just barely on par with  

2:16:56

existing human AI researchers, then yes,  that’s right. That’s why I would think  

2:16:59

it would take years rather than months.  But most of my timelines delay is just  

2:17:07

thinking that even getting to that point  will take quite a while — quite a while,  

2:17:13

like a decade maybe, which I feel like  in past years would have been called  

2:17:18

“extremely short timelines” and nowadays  gets called “medium to long timelines.”

2:17:25

Rob Wiblin: Yeah. I think there’s been a  general phenomenon over the years where I  

2:17:29

guess every couple of years there’s  kind of a freakout about AI timelines,  

2:17:32

and people start expecting a recursive  self-improvement loop really quite soon,  

2:17:36

within a few years of that point. My impression is  that you’ve just been unmoved in either direction.  

2:17:43

Why haven’t you updated based on events that  have occurred, results that have come out?

2:17:48

Rohin Shah: I would guess probably the biggest  difference is that I had a picture in my mind  

2:17:57

about how AI progress would happen and  reality has been reasonably close to it.

2:18:07

For example, I think the biggest timelines  freakout, the timelines freakout in January,  

2:18:13

let’s say, was from the advent of reasoning  models o1 and then particularly o3 —  

2:18:19

where I would say the key idea there is “let’s  actually apply reinforcement learning to large  

2:18:26

language models.” And I have, for quite a long  time — I’d say probably since at least 2019,  

2:18:36

but maybe even earlier than that —  thought that, yes, of course we are  

2:18:39

going to need to use reinforcement learning  in order to develop powerful AI systems.  

2:18:47

If you look at much of the work that was done at  the time, things like debate, those are implicitly  

2:18:54

predicated on reinforcement learning being  the method of choice for building powerful AI  

2:19:00

systems. Debate just looks pretty pointless  if you’re not using reinforcement learning.

2:19:08

So I was always expecting reinforcement  learning to happen. And from my perspective,  

2:19:13

everybody else was suddenly surprised by  reinforcement learning happening. Whereas  

2:19:17

I looked at it and I was like, well, it could  have been the case that once we do reinforcement  

2:19:22

learning it just generalises beautifully to  everything — the same way that instruction  

2:19:26

following really does generalise beautifully to  everything — and actually that was not the case.

2:19:31

So I think mostly my timelines  didn’t change very much because  

2:19:34

I already thought it was reasonably  likely RL wouldn’t generalise to  

2:19:38

everything. But I think if I had been  tracking it sufficiently fine grained,  

2:19:42

I would have probably updated slightly towards  longer timelines on the release of o1 or o3.

2:19:48

Rob Wiblin: And why doesn’t the general  performance of reasoning models… I mean,  

2:19:51

I think that’s one way of characterising the  update: that people were surprised or they were  

2:19:56

shocked that RL was being applied to these models  into reasoning. I think most people would say that  

2:20:00

they were impressed by how good the reasoning was  and how useful it seemed like it was going to be.  

2:20:05

But it sounds like you weren’t impressed  by it. It wasn’t surprisingly good to you.

2:20:09

Or maybe it’s that it wasn’t as generalisable:  that they were good at the reasoning tasks that  

2:20:14

they had been RL’d on, but you didn’t  expect that was going to generalise  

2:20:18

to other tasks and be very economically  transformative — and indeed it has not.

2:20:22

Rohin Shah: Yeah, that’s basically right. I  think the generalisability was the big deal  

2:20:25

for me. If you want to target some particular  benchmark and apply machine learning to it,  

2:20:32

I think the lesson of machine  learning is yes, you can do it.  

2:20:36

People do choose the ones that models are  capable of doing, so it’s not like you can  

2:20:43

choose some arbitrary thing and just hit it  with the machine learning hammer and succeed.

2:20:48

But I do think you have to be pretty careful  about if somebody optimised for a specific thing,  

2:20:55

how much should you update from that to AGI,  the fully general intelligence? It’s really  

2:21:02

quite a difficult update to make, and you should  be looking quite a bit at the generalisability.

2:21:07

Rob Wiblin: OK, so that’s the thing that  would cause you to have a timelines freakout:  

2:21:10

if you trained models on one kind of  practical task, and then you found  

2:21:14

that they were actually good and useful  at doing quite different kinds of tasks?

2:21:18

Rohin Shah: Yeah, I think that’s right.  And you do see a little bit of this from  

2:21:21

reasoning models, to be clear. It’s  not like it generalises not at all,  

2:21:25

but not as much as I think would have  actually been a substantial update for me.

2:21:31

I think in particular I was looking at  autonomy tasks, and nowadays I think the  

2:21:36

reasoning models can be pretty good at autonomy  tasks. Well, actually, relative to expectations  

2:21:43

I think they’re still underperforming, but  nowadays they’re substantially better than  

2:21:48

they were at the time. But partly that’s because  companies have started to train for autonomy.

2:21:55

Rob Wiblin: Let’s talk about Google DeepMind  — which, for a very important organisation,  

2:21:59

I feel like isn’t super well  understood by people outside,  

2:22:02

including listeners and I would  say outsiders just in general.

2:22:07

A lot of people outside of AI companies produce  research, both governance research and technical  

2:22:14

research, hoping that it will be read and  absorbed and adopted and used by those companies,  

2:22:18

including GDM. What can people do to make  it more likely that anything that they  

2:22:23

do actually is read at all, or is used  at all by people inside an AI company?

2:22:29

Rohin Shah: Yeah, there are definitely a few  things that people can do. And I think again,  

2:22:38

at this point it is useful to bring in  the model of: there are just all these  

2:22:43

incredible constraints that interact with  each other a tonne, and as a result we can  

2:22:47

only do a few things, and need to do a decent  amount of work in order to get them through.  

2:22:53

Which you could well predict as the meme of  “companies are apathetic,” though I think it’s  

2:23:02

maybe a little bit better to say it as “companies  are well motivated but can only do a few things.”

2:23:11

Keeping that in mind, a few things are  worth doing. One is just talk to somebody  

2:23:20

at a company before you put in a bunch  of time into research. Just reach out  

2:23:29

to somebody who works at a company, say,  “This is what I’m planning to do. Do you  

2:23:34

think this will actually be useful?” It’s  a very simple step, but surprisingly many  

2:23:40

people just don’t do it. But it does seem like  probably the best thing you can do in order to  

2:23:47

actually increase the chance that your work  gets used at a company, at least currently.

2:23:52

Rob Wiblin: If people consistently did that,  

2:23:53

do you think a lot of your time would  then be eaten up replying to these emails?

2:23:57

Rohin Shah: If they email me, they will  probably not get very much of a reply,  

2:24:01

because I get way too many emails.  I think most of my reports would be  

2:24:06

fairly excited to talk to people about  this. I admit that I, at this point,  

2:24:11

get enough of these that it feels a little  bit more like a burden than an exciting  

2:24:16

thing. But it definitely used to feel like an  exciting thing before it became very common.

2:24:21

Rob Wiblin: OK. What should people  do other than email to ask whether  

2:24:23

what they’re doing is going to be useful?

2:24:25

Rohin Shah: So other things, I had a talk recently  on “How to theorize so empiricists will listen.”  

2:24:36

It could equally well have been called “How to do  safety research so that companies will listen.”

2:24:43

The basic points from this, the  first one — the most obvious one,  

2:24:49

but it’s still worth asking yourself —  is like, do they actually care? Sometimes  

2:24:56

people do research on problems that we actually  just don’t care about and don’t think matter.  

2:25:02

I think the safety community is fairly good  about not doing this, but it does still happen,  

2:25:09

and sometimes people might be a little bit  surprised by what we do and don’t care about.

2:25:15

For example, take jailbreaks. We care a lot about  jailbreaks now because the models are looking to  

2:25:25

be strong enough that misuse could be a serious  problem. But if you look like a year or two ago,  

2:25:35

people were doing a bunch of jailbreak research,  and saying this means that the companies aren’t  

2:25:42

very good at aligning their models because  they’re so susceptible to jailbreaks.

2:25:46

And I don’t know about other companies, but  at least at GDM, basically our stance on  

2:25:54

jailbreaks was that there’s not any actual misuse  scenarios that we’re particularly worried about,  

2:26:03

given the model capabilities. What safety is  about is about protecting users from cases  

2:26:09

where the model does something unintended and  harmful. So for a while, there was a bunch of  

2:26:17

discourse about how models are always going to  be jailbreakable, and it’s totally impossible  

2:26:22

to defend against it. And the real answer was just  like, we hadn’t even tried to stop the jailbreaks.

2:26:29

Rob Wiblin: OK, but you’re trying to stop it now?

2:26:31

Rohin Shah: We are trying to stop it  now, yes. Now, if people want to give us  

2:26:35

research on jailbreaks and how to defend  against them, we will definitely use it.

2:26:41

Rob Wiblin: I should say all this advice  is predicated on the idea that you’re doing  

2:26:44

research hoping that an AI company is going  to read it and absorb it and use it. There’s  

2:26:47

people who will have their own different ideas,  and they’ll be developing stuff hoping that  

2:26:51

people will become persuaded later on that it’s  useful or it’ll be valuable in a different way.

2:26:55

Rohin Shah: Absolutely, yeah. Lots of  different theories of change for research.  

2:26:58

I’m definitely only talking about if you  want a company to use it basically now.

2:27:03

Anyway, so that was step  one: “Do they actually care?”

2:27:08

Step two I call “Are you helping?” But it’s like  a very specific flavour of helping. Specifically,  

2:27:18

I think you should either be proposing a solution,  

2:27:22

or building an evaluation  or a metric of some kind.

2:27:27

Now, there is research that is not these things,  

2:27:29

right? There is research that does some sort  of theory to understand some phenomenon better,  

2:27:35

that will give you some increased understanding  of the phenomenon, but that doesn’t solve a  

2:27:40

particular problem, doesn’t produce a metric  or eval that the company should care about.

2:27:47

There are lots of other examples of research like  this. I think basically most of that research we  

2:27:53

are unlikely to use unless it happens to really  be on a core problem that we care about a lot.  

2:28:01

Given that we are all busy and it takes a  lot of work to incorporate any new thing,  

2:28:07

usually it needs to be a metric or  an eval or some kind of solution.

2:28:14

Rob Wiblin: So if people are  doing interesting theorising  

2:28:17

or preliminary empirical work to  understand a phenomenon better,  

2:28:20

you’ll be like, “That’s all well and good,  but come back when you have a solution”?

2:28:23

Rohin Shah: Roughly, yes. And again, I think this  is good research to do and people should do it. It  

2:28:28

can help develop better solutions in the future.  I just don’t think they should be thinking that —

2:28:33

Rob Wiblin: You’ll spend a lot of time reading it.

2:28:35

Rohin Shah: Exactly, yeah. Next thing after that,  

2:28:41

especially if you’re going  to be proposing a solution,  

2:28:45

the next thing to be doing is evaluating your  solution — or at least conceptually evaluating  

2:28:53

your solution, even if it’s not by experiments  — on the various metrics that companies care a  

2:28:58

lot about. Most of the time research will look at  did it actually solve the problem that it set out  

2:29:06

to solve? Definitely important, you’ve got to do  that evaluation. That is the most important one.

2:29:14

But then there’s also things like: How much  cost does this add in terms of compute? How  

2:29:19

much latency will it add? If you’re imagining  a step that runs after the AI system has  

2:29:24

produced a response, and then you do lots  of additional things before you then have  

2:29:29

to send the response to the user, it’s probably  a nonstarter. Not obviously, but it’s a big cost.

2:29:37

There’s implementation complexity,  or organisational complexity. If your  

2:29:42

solution involves doing something at the  scaffold level and also doing something  

2:29:47

that involves reading the internals of the  AI system and connecting these up together,  

2:29:51

that just spans so many different teams, and so  many different abstraction layers in the stack,  

2:29:57

that it’s going to be much harder to implement  than something that, for example, just happens  

2:30:03

at inference time and involves running a  monitor and then alerting some teams about it.

2:30:08

Rob Wiblin: Are there any other examples of things  that are easy to implement other than that one?

2:30:12

Rohin Shah: Yeah, I think there are several.  For example, if their datasets can be a  

2:30:21

useful contribution, it’s pretty easy to  try to add a dataset to post-training.  

2:30:32

Sometimes you try to add it to post-training and  then for some reason it makes the model worse  

2:30:39

on some totally unrelated thing, and then  you can’t use that dataset, which is a bit  

2:30:43

unfortunate. So ideally, when you’re building  datasets, you would like to evaluate whether  

2:30:47

it could have some sort of negative effect on  something else. But this is a bit hard to do.

2:30:52

But if someone came to me with a dataset and said,  “Training on this dataset improves this metric by  

2:30:59

a decent amount, and doesn’t seem to have any bad  effects on these three obvious other metrics that  

2:31:05

it might have had bad effects on,” I would  find that fairly compelling and think, yeah,  

2:31:10

maybe we should do it, assuming  it was solving a real problem.

2:31:13

So I think datasets, metrics, evals, monitors:  these are usually fairly relatively easy things  

2:31:18

to do. I think all of those are fairly easy to  implement if we think it’s worth implementing.

2:31:26

Rob Wiblin: Any other advice?

2:31:28

Rohin Shah: One is just like don’t do the  academic thing of using a fancy method to  

2:31:37

solve your problem, and instead solve it  with the absolute simplest method you can.

2:31:43

Rob Wiblin: I guess you’re saying cutting-edge  stuff might be useful in some other way,  

2:31:47

that you advance the science somehow  and could be valuable down the line,  

2:31:50

but if you want people to be using it on  any actual commercial models anytime soon,  

2:31:55

then it has to be as simple as possible  and well established as possible.

2:31:58

Rohin Shah: That’s the generous version,  yeah. The cynical version is more like  

2:32:06

academia rewards complex, formal-looking  stuff and poorly tuned baselines. It won’t  

2:32:13

reward poorly tuned baselines if  it knows they’re poorly tuned,  

2:32:16

but it’s kind of hard to tell if a  baseline has been poorly tuned or not.

2:32:21

Rob Wiblin: What’s a poorly tuned baseline?

2:32:24

Rohin Shah: A poorly tuned baseline  is like you have some default way of  

2:32:28

solving the problem that you  are trying to do better than.

2:32:33

Rob Wiblin: Do you make the baseline look  bad by not doing a very good job of it?

2:32:37

Rohin Shah: That’s right. But not  intentionally. It will often be like  

2:32:40

you implement the baseline once, and then  you don’t tune the hyperparameters for it,  

2:32:45

so it performs less well than it really  should. It’s just very easy to not spend  

2:32:53

enough time working with your baseline to  make it work well. And then as a result —

2:32:59

Rob Wiblin: It makes your other thing look better.

2:33:00

Rohin Shah: Exactly. And given the incentives in  academia of publishing novel things and publishing  

2:33:07

a lot, I think this is a fairly common problem.  So I think if you want to publish, that tends  

2:33:16

to be the thing that you do. If you want your  research to be used by people at companies,  

2:33:23

it’s quite important that you try the  obvious stuff, and you try fairly hard  

2:33:29

with the obvious stuff. And only if that really  does fail do you try to do something fancier.

2:33:35

Rob Wiblin: What external research  has been most useful to GDM so far?

2:33:40

Rohin Shah: Probably the one I would  point to most is the AI control work  

2:33:46

from Redwood Research. I think we were  always planning to monitor our AI systems,  

2:33:52

it’s not like the idea of monitoring was  new to us, but the specific conceptual  

2:33:57

frameworks they brought to how you  might evaluate how well this works,  

2:34:03

specifically the distinction between trusted and  untrusted models, I think that was quite good.

2:34:09

The first paper they published on it showed  how a variety of different control protocols,  

2:34:15

how you can evaluate their safety and their  usefulness and use this to decide which one  

2:34:20

you should be doing. I think that’s  influenced me quite a lot on how exactly  

2:34:25

I think about the control work. I think  that’s probably the most obvious example.

2:34:30

Rob Wiblin: So this is the Buck  Shlegeris and Ryan Greenblatt  

2:34:34

episodes earlier in the year — Buck Shlegeris in  particular, talking about the AI control agenda.

2:34:39

I feel like that episode didn’t make quite as much  of a splash as I was hoping or expecting at the  

2:34:43

time. But the amount that people in the industry  keep constantly referring back to it I think means  

2:34:49

that if people didn’t listen to it at the time,  because for whatever reason it just didn’t sound  

2:34:53

quite exciting enough, I think give it another  look, maybe go back and have a listen. Because  

2:34:58

I think it is very important foundational  work that is only getting more relevant.

2:35:01

Rohin Shah: Yeah, definitely. I would say that  control is, in the medium term, probably how  

2:35:10

we are going to argue that Gemini is safe, at  least for misalignment and internal deployments.  

2:35:16

I think probably via control style arguments. So  yeah, definitely an important area to be aware of.

2:35:23

Other papers: there was this paper  on observational scaling laws,  

2:35:31

which mostly I think you can think of it as a  precursor to the Rosetta Stone paper that we  

2:35:39

talked about before where we stitched together  benchmarks to get a way to measure overall AI  

2:35:45

progress. I think the observational scaling  laws [paper] thought of this more as like,  

2:35:52

how can you forecast particular benchmark  results? That’s also a pretty good application.  

2:35:59

I think mostly it was an area we hadn’t really  been thinking about, and they showed that you  

2:36:05

can do something here. And I think that probably  influenced me when thinking about the benchmark  

2:36:12

stitching / Rosetta Stone approach. So  that one was pretty influential on me.

2:36:20

Other examples would be the various  chain-of-thought monitoring papers that have come  

2:36:24

out. I don’t think that I can point to any one  paper that I thought was a particularly big deal,  

2:36:31

but in aggregate, I think they’ve built a fairly  good research literature on the effectiveness of  

2:36:38

chain-of-thought monitoring, when it does  and doesn’t fail. I think there were some  

2:36:43

papers that looked at to what extent can you just  persuade the monitor that actually everything’s  

2:36:51

fine here and you don’t need to flag anything  in this chain-of-thought monitoring. I think  

2:36:56

that was a useful thing to study, and the specific  quantitative results in it were informative to me.

2:37:04

Rob Wiblin: I imagine a lot of people  would love to get a job at GDM,  

2:37:08

and even people at other AI companies already  might be interested in considering switching,  

2:37:12

given the advances that GDM has made in its  models in recent years. What sort of roles are you  

2:37:21

finding hardest to fill? And what sort of skills  are maybe hardest to find in the labour market?

2:37:26

Rohin Shah: I think I’ll go back again to the  theme I’ve had throughout, of the challenge  

2:37:34

being more in implementing stuff rather than  in figuring out what stuff we need to do.  

2:37:41

We have a lot of ideas, we know what we need  to do. Implementing it ends up being harder,  

2:37:45

because there’s so much stuff you have to  check and make sure that you’re not hurting it.

2:37:52

And as a result, I think especially over the  last year, maybe two years, I think we’ve had  

2:38:00

more of a need for people who just want to do  the obvious thing and land it, as opposed to  

2:38:07

people who want to figure out the ideal, optimal  thing and write a cool research paper about it.

2:38:15

Rob Wiblin: We’re talking about the  AGI safety and alignment folks, right?

2:38:18

Rohin Shah: That’s right, I  am talking primarily about the  

2:38:21

AGI safety and alignment team. I think  focus on just doing the obvious stuff,  

2:38:29

a focus on implementation rather than  research. I think in practice this also means  

2:38:36

more of a focus on software engineering  relative to machine learning research.

2:38:41

I do still think that we do care about conceptual  ability, research taste, ML engineering skills.  

2:38:52

They do come up when you’re doing this sort of  implementation. You need to test how good your  

2:38:56

system is. That means you need to build good  evaluations, you need to not overfit to them,  

2:39:00

you need to be a little bit careful about that.  So it’s not like those skills are irrelevant. I  

2:39:05

just think that there’s more of a focus on things  like getting things done, software engineering,  

2:39:11

implementation now, relative to even just a year  ago, but especially relative to two years ago.

2:39:19

Rob Wiblin: Are there any particular roles  that you’re hiring for at the moment,  

2:39:21

or hiring for a lot in general? It’s felt  to me like Anthropic is hiring hand over  

2:39:26

fist and at OpenAI there’s always a lot of  roles. Is it a similar situation for you?

2:39:30

Rohin Shah: No, I think we are hiring quite  a bit less than Anthropic and OpenAI at the  

2:39:36

moment. We do have a couple of roles  open right now. Partly we have roles  

2:39:42

open in other teams that I think are  also very relevant. The AGI safety and  

2:39:45

alignment team isn’t the only team that’s  relevant to AGI safety at Google DeepMind.

2:39:51

For example, currently there is a hiring round  open on the security team to hire engineers to  

2:40:00

work on AI control. And I think that’s a great  position. Hugely impactful. As I’ve said before,  

2:40:08

that’s probably the argument that we’re going  to make for why Gemini wouldn’t cause harm by a  

2:40:15

loss of control in the medium term. So I think  that’s an extremely useful role. Very impactful.  

2:40:22

Not on the AGI safety and alignment team,  but they would be working closely with us.

2:40:28

On my team, we’re hiring at the moment  for an engineer for frontier safety risk  

2:40:35

assessment](https://jobs.80000hours.org/?jobPk=18334&utm_source=80k_podcast).  So this is running our  

2:40:38

dangerous capability evaluations and writing  the frontier safety report. I don’t particularly  

2:40:45

think that the frontier safety report needs  to be that much more detailed, but if you do,  

2:40:50

you could come join us and then put in the time  to make it better. There is a fair amount that:  

2:40:56

individual people on the team can have the  flexibility to go and do stuff like that.

2:41:03

In practice, I expect by the time that this  recording actually comes out that that role  

2:41:09

might not be there. But I think we do have  these roles coming out on an ongoing basis.

2:41:15

And maybe I’ll send over a link to an expression  of interest form that people can fill out,  

2:41:22

so that we can email them when new roles  open up. I think part of it is we have,  

2:41:29

like many others, been hit by the  deluge of AI-assisted applications,  

2:41:36

so we are a little bit less likely in the  future to make these open calls for hiring,  

2:41:41

because it really is just such a pain to go  and do the resume screening for all of them.

2:41:47

Rob Wiblin: Yeah, I think folks are  going to have to introduce a fee to  

2:41:51

fill out forms like that or to apply  for jobs, which people also hate for  

2:41:54

other reasons. But I don’t really see  the alternative now that it’s possible  

2:41:57

for AIs to submit completely indistinguishable  applications that can be as fake as you like.

2:42:02

Rohin Shah: Yeah, it’s rough. Last  year we included an LLM captcha,  

2:42:06

but for our recent one, we were trying  to do it, and I think we probably could,  

2:42:13

but it would also trick a bunch of  humans. Even our LLM captcha from  

2:42:18

last year tricked a bunch of humans. Not a  bunch, like maybe 5% of them. At this point,  

2:42:24

I’m not sure that I can design one that  wouldn’t have quite a lot of false positives.

2:42:29

Rob Wiblin: I guess “book a flight for me”  might be more expensive than just having  

2:42:34

the fee. Is there any pitch you want to  make for working at GDM in particular?

2:42:39

Rohin Shah: Basically my pitch is that  company safety teams are probably the  

2:42:44

biggest force for what actually happens on  a technical level to make AI systems safe,  

2:42:53

and that is a good reason to join them.

2:42:55

Rob Wiblin: All right, we should wrap  up. I think for a final question,  

2:42:59

I’ll throw you a real hardball  that came in from the audience:  

2:43:02

How do you maintain such a nice and positive  spirit in these troubled times, Rohin?

2:43:06

Rohin Shah: I mean, some of the answer is  a bit boring, not that generalisable. One,  

2:43:14

just by personality, I’m just fairly stable  and my mood doesn’t change very much day to  

2:43:19

day. And then two, as has maybe become  clear over the course of this episode,  

2:43:26

I don’t think the times are as troubled as  everybody else thinks, relative to everyone else.

2:43:32

But one thing that I do quite a lot, and  quite religiously, is focus on the things  

2:43:40

that I can control. This isn’t why I do  it, but I think it is quite useful for  

2:43:49

maintaining a nice and positive spirit  during “these troubled times,” let’s say.  

2:43:55

It definitely helps keep me focused  on the areas where I do have agency,  

2:44:02

and I think it’s just pretty great to be focused  on areas where you can actually make a difference.

2:44:09

Rob Wiblin: My guest today has  been Rohin Shah. Thanks so much  

2:44:11

for coming on The 80,000 Hours Podcast, Rohin.

2:44:13

Rohin Shah: Thanks a lot,  Rob. It was great to be here.

More transcripts

Explore other videos transcribed with YouTLDR.

Get the TLDR of any YouTube video

Transcribe, summarize, and repurpose videos in 125+ languages — free, no signup required.

Try YouTLDR Free