Full Transcript

·YouTLDR

Harness Engineering: How to Build Software When Humans Steer, Agents Execute — Ryan Lopopolo, OpenAI

46:05 · 7,968 words · ~40 min read · English · Transcribed Apr 19, 2026
AI Summary

The scarcity of software engineering is shifting from code production to system design and delegation because code is now functionally free to produce and refactor via agents. To scale, engineers must transition into 'harness engineers' who build structural guardrails, documentation, and automated review agents that steer model execution toward high-quality, non-functional requirements.

As AI models transition from simple completions to full-job execution, this video provides a blueprint for managing the resulting 'abundance of code' and the cultural shift from hands-on keyboard to system orchestration.

Section summaries

0:00-2:20

Introduction & The 'Code is Free' Thesis

watch

Establishes the philosophical shift required for modern AI engineering.

2:20-10:30

Systems Thinking & Non-functional Requirements

watch

Explains why engineering skillsets are moving toward delegation and guardrail design.

10:30-18:40

Harnessing Agents & Context Hacks

watch

Contains the most technical advice on file size limits, lints-as-prompts, and reviewer agents.

18:40-23:20

Q&A: The Working Setup

optional

Discusses Ryan's specific internal tools and the 'token billionaire' lifestyle.

23:20-33:50

Q&A: Scaling & Context Scarcity

watch

Deep dive into how to avoid 'over-engineering' harnesses and managing context.

33:50-45:30

Q&A: Large Orgs & Future Roadmap

optional

Discusses monolithic vs modular repo structures and the long-term vision of AGI coding.

Key points

  • Code is a Free, Abundant Resource — Implementation is no longer the bottleneck; the abundance of code is only constrained by GPU capacity and token budgets. This allows teams to execute even low-priority tasks (P3s) in parallel and pick the best solution, rather than triaging by human time.
  • Harnessing the Context Window — In an agentic workflow, context is the primary constraint. Harness engineering involves adapting codebases to be 'context efficient,' such as limiting file sizes (e.g., 350 lines) and using automated linting to inject remediation steps directly into the model's feedback loop.
  • Persona-Based Review Agents — Instead of blocking on human code review, teams can deploy specialized reviewer agents (e.g., security, reliability, or UI personas) that check every PR against durable documentation and ADRs (Architecture Decision Records).
  • The 'One Way' Architecture — To make agent output predictable, repositories should enforce strict architectural uniformity—one way to handle state, one way to write tests, and isolated packages. This creates 'transferable context' for the model across the codebase.
"Implementation is no longer the scarce resource of what it means to do the job of software engineering. Code is free." (Ryan Lopopolo)
"Every time I have to type 'continue' to the agent is like a failure of the harness to provide enough context." (Ryan Lopopolo)

AI-generated from the transcript. May contain errors.

0:00

Our

0:15

next speaker is here to speak about

0:17

harness engineering. How to build

0:20

software when humans steer and agents

0:23

execute. Please join me in welcoming to

0:26

the stage member of technical staff at

0:28

OpenAI, Ryan Lopopolo.

0:40

Good morning, London.

0:46

I'm super excited to be here today. I'm

0:48

Ryan Lopopolo and for the last nine months

0:51

I have had the privilege of building

0:53

software exclusively with agents. Uh I

0:57

am a token billionaire and I believe

0:59

that in order for us to get into our AGI

1:02

future, we want everybody to be token

1:04

billionaires to use the models to do the

1:07

full job. And what that means is to lean

1:13

into the idea that the models are

1:15

capable of being a full software

1:16

engineer. And I've lived that experience

1:18

by banning my team from even touching

1:20

their editors, to have to work through

1:22

the models in order to get the job done.

1:25

And today I'm going to talk to you a

1:27

little bit about what it means to lean

1:28

into that and operationalize the way you

1:31

work, the codebases you live in, and

1:33

the processes on your teams in order to

1:35

get the agents to do the full job.

1:40

I believe I'm preaching to the choir

1:42

here when I say that the way we build

1:44

software has changed. In the last six

1:46

months, we have seen coding agents take

1:48

over the world and capability has

1:51

continually advanced at a super fast

1:54

pace to have these models and the

1:56

harnesses within which they live take

1:58

more complex actions, do more

2:01

complicated work with higher reliability

2:04

over longer time horizons.

2:06

And the place we've gotten to here is

2:09

that implementation is no longer the

2:11

scarce resource of what it means to do

2:13

the job of software engineering. Code is

2:15

free. We have an abundance of code to

2:18

solve the problems that we come across

2:20

in our day-to-day as we run our teams,

2:23

build software, and solve user problems.

2:27

Hiring the hands on the keyboards as

2:30

part of our teams is only constrained by

2:32

GPU capacity and token budgets. And each

2:36

engineer today in this room has access

2:38

to five, 50, or 5,000 engineers worth of

2:42

capacity 24/7, every day of the year. The

2:47

only thing that needs to happen, our

2:49

role, is to figure out how to

2:51

productively deploy these resources into

2:53

our code and into our teams to make use

2:56

of this new capacity.

2:59

And in this world, skill sets are

3:02

shifting more towards systems thinking,

3:04

system design, and delegation in order

3:06

to make use of this abundant capacity to

3:09

produce code to solve problems. And

3:12

there are three reasons that this

3:13

happened. All of which happened in late

3:16

2025.

3:19

For me, the magic moment was GPT 5.2,

3:22

which when it came out was able to do

3:24

the full job of a software engineer. The

3:26

models at this point are good enough

3:28

where they're isomorphic to you and me

3:31

in terms of the ability to produce code

3:33

at high quality that solve real user

3:36

problems in real code bases.

3:39

Code is free. And I know this is maybe a

3:43

scary thing to hear because code carries

3:45

maintenance burden, but it's free to

3:47

produce, free to refactor, and it is not

3:51

a thing to get hung up on anymore.

3:54

We think of code as a burden because

3:57

it's a synchronous attention drain on

3:59

the human engineers on our team. But the

4:02

models are incredibly patient. They are

4:04

infinitely parallel. So the ability to

4:06

produce, maintain, refactor, and delete

4:08

code is no longer a forcing function on

4:12

figuring out how to allocate resources

4:14

on your engineering teams. So to sort of be

4:17

AGI-pilled here is to believe that the

4:20

models are capable of producing every

4:22

line of code we could ever possibly

4:23

need, figuring out when to delete them,

4:26

figuring out when to refactor them or

4:28

make them more reliable. And it's your

4:31

role as software engineers to figure out

4:33

how to unblock your team of agents and

4:36

humans driving those agents from being

4:38

able to drive them over long horizon

4:40

work to do the full job.

4:44

The idea here is that every one of you

4:46

is a staff engineer. You have as many

4:49

team members as you can possibly drive

4:51

concurrently and have tokens to support

4:54

and you need to look one day, one week,

4:58

six months into the future to figure out

5:00

what structures you need to put in place

5:02

to productively harness this infinite

5:05

capacity to produce code.

5:10

The scarce resources in this world that

5:12

we see today are three things: human

5:16

time, human and model attention and

5:20

model context window. And in the world

5:23

where human time and attention is

5:25

scarce, the role is to think about where

5:29

that time is going, figure out ways to

5:31

productively automate it and move that

5:35

synchronous human time into higher

5:37

leverage activities.

5:40

In a world where human time is scarce

5:43

and human time is required to produce

5:46

code, we have a stack rank. Things are

5:48

either P0s or P2s. Those P3s will

5:52

never get done. However, in a world

5:54

where code is free and infinitely

5:56

abundant, all those P3s get kicked off

5:59

immediately, maybe 4x in parallel. We

6:02

pick one that solves the problem and in

6:05

it goes.

6:06

I've had the privilege of building a ton

6:08

of agents internally at OpenAI to

6:11

improve the productivity of my

6:12

co-workers. And when code is free, all

6:16

these internal tools can have good

6:20

localization and internationalization

6:21

from day one. I can make tools that my

6:26

colleagues in London, Dublin, Paris,

6:28

Brussels, Zurich, and Munich are able to

6:30

experience in their native languages

6:33

without really having to trade against

6:36

any of my other teams' capacity in order

6:38

to make high-quality tools.

6:40

We should be working with the assumption

6:42

that the best parts of software

6:44

engineering that we all know, live, and

6:47

breathe are available in any product

6:49

that we could ever build all the time.

6:52

Humans no longer need to concern

6:54

themselves with implementation. The

6:57

important thing is not the code but the

6:58

prompt and the guardrails that got you

7:00

there. This is why leaving breadcrumbs,

7:03

documentation, ADRs, persona-oriented

7:07

documentation around what a good job

7:08

looks like. All the historical logs of

7:11

tickets and code reviews. This is the

7:13

process that got you and your teams to

7:15

the code and products that you have

7:17

today. And this is what needs to

7:19

happen in order to get your agents there

7:21

as well. Your job is to build systems,

7:25

software and structures that enable your

7:28

team to be successful. And to do that,

7:31

we need to make them legible to those

7:34

agents that are driving the

7:35

implementation. That means structuring

7:37

them in a way that's native to the

7:39

agents. Writing them in a way that is

7:41

respecting of scarce context, which is

7:43

this other scarce resource here, and

7:46

figuring out ways to make the tokens

7:48

that are required to do the job easy to

7:50

predict. That means making things the

7:52

same as much as possible so we can limit

7:54

the amount of attention the model needs

7:56

to activate in order to do the job.

7:59

Large scale refactoring in this world is

8:02

free. So making things the same is

8:04

something that you are all able to do.

8:07

There's never going to be a migration

8:09

that hangs open for six months now that

8:11

you can't get the last parts of the

8:12

codebase to do because you can just fire

8:14

off 15 agents to drive that work to

8:16

completion. This is what it means to

8:18

have a migration, right? We can finish

8:20

them now. Come on. That's good. That's

8:22

good. Clap.

8:28

There's sort of this like meta

8:29

epistemological question here about like

8:31

what it means to do a good job and doing

8:35

a good job as a software engineer is

8:37

hard. It requires us years of being in

8:40

the industry to fully internalize what

8:42

it means to write high-quality,

8:44

maintainable reliable code that our

8:47

teammates are able to build on top of

8:49

that is going to accrue leverage to the

8:51

codebase.

8:52

To do a single patch well probably

8:55

requires 500 little decisions along the

8:58

way around the underspecified

9:01

non-functional requirements that go into

9:03

producing good code. The agents, the

9:06

models during their training have seen

9:09

trillions of lines of code that make

9:11

every possible choice of those

9:13

non-functional requirements that you

9:14

could ever imagine. So, it's our job to

9:17

specify those non-functional

9:19

requirements to write them down in a way

9:21

that the agents can see this is what it

9:23

is to do a good acceptable job that's

9:26

going to produce a merged patch. And if

9:28

the agents aren't doing that, it's our

9:31

job to figure out ways to refine and

9:33

restrict their output such that the code

9:36

they write is acceptable. You can just

9:39

simply say do not produce slop. Don't

9:41

accept slop. You won't get slop in your

9:43

codebase. But to do that requires taking

9:46

short-term velocity hits in order to

9:48

back up or double-click into a task to

9:50

figure out what it is the agents are

9:52

struggling with in your environment.

9:55

Put the guardrails in place so they stop

9:57

making those mistakes

10:00

and then figure out ways to step back

10:02

and spend your time on higher leverage

10:04

activities once you solve some of the

10:06

blockers in the short term.

10:09

When I think about empowering my team in

10:11

this way, everyone is an expert in what

10:14

it is they bring. I have a diverse full

10:17

stack team that is experts in front-end

10:19

architecture, backend scalability, being

10:22

product-minded. And each one of those

10:25

different personas fleshes out the skill

10:27

set of my team by bringing a different

10:29

understanding, a different set of solves

10:31

for those non-functional requirements.

10:34

Getting teammates to write those down

10:36

actually means that every engineer

10:38

driving agents gets the best of every

10:41

single person on my team. I don't need

10:43

to block on low signal code review in

10:46

order to learn what it means to write a

10:48

good QA plan. To have one engineer on my

10:52

team document that in a durable way

10:54

means every agent trajectory is going to

10:57

get a good QA plan. And we can do this

10:59

once in a high-leverage way that we're able

11:01

to stack on top of.

11:05

So how can we get the agents to do a

11:07

good job? What are some of the tools and

11:09

techniques we have in order to

11:12

essentially prompt inject our agents and

11:13

continually remind them of what it means

11:15

to make those specific choices that we

11:18

expect around those non-functional

11:20

requirements. And there's a bunch of

11:22

ways we can do this. We can write good

11:24

AGENTS.md files. However, with

11:28

auto-compaction, which is a thing that has

11:29

continued to improve,

11:31

GPT 5.4 and Codex is fantastic at

11:34

auto-compaction. I essentially never have to

11:36

write /new anymore. I've got some

11:39

pictures on my Twitter of me strapping

11:40

my laptop into the back of my car so I

11:42

can continue running inference while

11:44

I'm commuting to and from work. And in

11:47

this world, you have to kind of build

11:50

for that expectation that context will

11:53

get paged out over time. We need to be

11:56

continually refreshing context as the

11:58

agent goes about doing a task. And the

12:00

ways we can do that are by having

12:03

reviewer agents look at the code along

12:05

the way through the lens of what it

12:07

means to be successful. Right? We have

12:10

security and reliability review agents

12:12

in our codebase that are continually

12:14

running as part of every push and CI

12:16

that look at that documentation and

12:18

the proposed patch and do simple things

12:20

like say, are there timeouts and retries

12:24

on this bit of network code? Does the

12:26

code that has been introduced have a

12:28

secure interface that is impossible to

12:30

misuse?

12:32

I'm sure everyone here has been paged at

12:34

some point for network code that failed

12:37

in production causing an outage that

12:39

could have been remediated by a retry

12:41

and a timeout. And I know I'm guilty of

12:45

putting that retry and timeout in

12:47

merging the bug fix and otherwise

12:49

ignoring that. I am not a reliable

12:51

reviewer or author of code with respect

12:54

to this non-functional requirement.

12:56

However, taking the time to write some

12:59

docs, write a lint that is bespoke to my

13:02

codebase that is going to look at every

13:04

time I call fetch to make sure that

13:06

there's a retry and a timeout wrapped

13:08

around it means I've durably solved this

13:10

problem and I'm able to do it because I

13:12

lean on this axiom that code is free

13:15

that the agents are able to do a good

13:17

job that I can completely migrate the

13:19

codebase to solve this problem durably

13:21

once and for all. And in order to kind

13:25

of operate in this way, we need to step

13:28

back and look at the durable classes of

13:31

failures that the agents and the humans

13:33

in the codebase are making time after

13:35

time. Figure out why we're spending time

13:39

on it. Devise a solution to

13:41

systematically eliminate this class of

13:42

misbehavior and then continue to

13:45

observe, refine, and make additional

13:48

choices on those non-functional

13:49

requirements.
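
The fetch lint described above could look roughly like the sketch below. It is a text-based approximation under stated assumptions, not the speaker's actual rule: a production version would walk the AST inside a custom ESLint rule, and `fetchWithRetry` is a hypothetical wrapper name that does not appear in the talk.

```typescript
// Sketch only: a text-based approximation of "every fetch needs a retry and
// a timeout". `fetchWithRetry` is a hypothetical wrapper name.
const APPROVED_WRAPPER = "fetchWithRetry";

interface Violation {
  line: number;
  message: string;
}

function findBareFetchCalls(source: string): Violation[] {
  const violations: Violation[] = [];
  // Matches `fetch(` unless it is part of a longer identifier (`prefetch(`)
  // or a method call (`client.fetch(`).
  const bareFetch = /(^|[^\w.$])fetch\s*\(/;
  source.split("\n").forEach((text, index) => {
    if (bareFetch.test(text)) {
      violations.push({
        line: index + 1,
        message:
          `Bare fetch() call. Network calls must have a timeout and retries; ` +
          `call ${APPROVED_WRAPPER}() instead.`,
      });
    }
  });
  return violations;
}
```

Because it only reads text, this version would also flag `fetch(` inside comments or strings, which is one reason an AST-based rule is the better long-term home for the check.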

13:51

One really neat trick I use here is that

13:55

you can write tests about the source

13:56

code as well that are separate from

13:59

lints. Right? If we know that context is

14:01

limited, we can write a test that limits

14:05

the fact that files are no longer than

14:07

350 lines. We're adapting our codebase

14:10

to the harness to the models to do a

14:13

little bit of engineering to be context

14:15

efficient and squeeze more juice out of

14:17

the model capability that we have today.
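
A minimal sketch of a source-code test that enforces the file-size limit. The 350-line figure comes from the talk; the function names and the example paths are illustrative assumptions. File contents are passed in as a map so the check stays a pure function, where a real test would read them from disk.

```typescript
// The 350-line limit is the figure mentioned in the talk; tune it per codebase.
const MAX_LINES = 350;

// Counts lines without treating a trailing newline as an extra line.
function countLines(contents: string): number {
  if (contents === "") return 0;
  const parts = contents.split("\n");
  return contents.endsWith("\n") ? parts.length - 1 : parts.length;
}

// `files` maps a path to its contents (hypothetical input shape).
function findOversizedFiles(
  files: Record<string, string>,
  maxLines: number = MAX_LINES,
): string[] {
  const failures: string[] = [];
  for (const [path, contents] of Object.entries(files)) {
    const lines = countLines(contents);
    if (lines > maxLines) {
      failures.push(
        `${path} has ${lines} lines (limit ${maxLines}). ` +
          `Split it into smaller modules so it fits in context.`,
      );
    }
  }
  return failures;
}
```

A test runner would assert that `findOversizedFiles(...)` returns an empty array, so the failure text itself tells the agent what to do next.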

14:22

The other things we can think about are

14:25

providing good error messages that give

14:27

actual remediation steps to the model

14:30

and to humans for how to proceed next.

14:33

It's not enough to say we've got a lint

14:36

failure because we're awaiting in a loop

14:39

or that we have an unknown at this deep

14:41

part of the codebase and why is the

14:43

model writing a function called

14:45

isRecord. What we need to do is provide a

14:47

prompt via a lint or a test failure that

14:52

says no no no you shouldn't have an

14:53

unknown here at all because we parse,

14:57

don't validate, at the edge, and you

14:59

certainly have a type here which was

15:00

derived from Zod. Load-bearing

15:02

infrastructure for our AI future.

15:10

You can just prompt things. Everything

15:12

I've talked about here today is a prompt.

15:14

You can do this without touching the

15:16

model weights at all.
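
To make the "error message as prompt" idea concrete, here is a hedged sketch of a check whose output carries the reason and the remediation step, not just the failure. The `Finding` shape, the rule name, and the `src/edge/` convention for where external input gets parsed are all assumptions made for illustration, and none of this is a real ESLint API.

```typescript
// Hypothetical shapes and rule name, for illustration only.
interface Finding {
  file: string;
  line: number;
  rule: string;
  why: string;
  fix: string;
}

// Flags `unknown` annotations outside the boundary layer. The `src/edge/`
// prefix is an assumed convention for where external input is parsed.
function findUnknownTypes(file: string, source: string): Finding[] {
  if (file.startsWith("src/edge/")) return [];
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    if (/:\s*unknown\b/.test(text)) {
      findings.push({
        file,
        line: index + 1,
        rule: "no-unknown-past-the-edge",
        why: "Input is parsed at the edge, so a concrete type already exists here.",
        fix: "Use the type derived from the schema instead of writing a type guard.",
      });
    }
  });
  return findings;
}

// Renders a finding as text the agent can act on directly.
function formatForAgent(f: Finding): string {
  return [`${f.file}:${f.line} [${f.rule}]`, `Why: ${f.why}`, `Fix: ${f.fix}`].join("\n");
}
```

The design choice is that the `why` and `fix` fields travel with every failure, so the instruction reaches the model at the moment it is relevant rather than sitting in up-front context.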

15:19

Kind of a funny digression here is it

15:22

seems like each advancement we've had in

15:24

the complexity of the way we write code

15:27

to interact with these models comes from

15:30

both increasing capability in the models

15:32

and increasingly

15:34

niche ways for injecting prompts into

15:36

those models. Prompts, I'm sure you're

15:39

aware, are prompts. Powers: prompts. Rules

15:42

files: prompts. Skills: prompts. These lint

15:46

error messages that I am talking about:

15:48

prompts. Review agents that inject

15:50

comments onto the PR that we require the

15:52

agent to address before it is able to

15:54

propose it for merge: prompts.

15:57

You're going to find lots of ways to

15:59

insert prompts into your code and one

16:02

way you can do that is by embedding

16:04

agent SDKs into your tests that are

16:07

going to

16:08

review the codebase for acceptability

16:10

using prompts that get embedded into the

16:12

code. And if I find myself spending a

16:15

ton of time writing prompts, we can

16:17

actually shell out to the agent for that

16:19

as well. Uh, I've pointed Codex at all

16:23

of the prompting cookbooks we have on

16:25

the OpenAI developer guide and told it to

16:27

synthesize a skill out of them for how

16:29

to write prompts. Which means when I

16:31

find a need to write prompts in order to

16:33

improve my agent performance locally in

16:35

the code, I use the skill to write

16:37

prompts that I wrote with the agent

16:39

looking at the prompts to write the

16:40

prompts.

16:45

All the leverage that you're encoding

16:47

into your repository, your team, and

16:50

the agents in this way stacks incredibly

16:53

well. To kind of pull back to this idea

16:57

that a single product-minded engineer on

16:59

my team was able to give us a big lift.

17:03

They know what it means to write a good

17:04

QA plan. To write a good QA plan though,

17:07

you have to document all the features

17:09

that you have, the critical user

17:10

journeys, and how users engage with your

17:13

applications, web apps, APIs, and

17:16

services.

17:17

Once you write those down on how to

17:20

write a good QA plan with the

17:22

expectation that all user-facing work has

17:24

a QA plan, now a review agent is able to

17:28

assert expectations around what it means

17:30

to prove that you have effectively

17:31

written the feature. A QA plan indicates

17:34

what media should be attached to the PR

17:37

for the humans and agents to know that

17:39

you've done a good job, which has the

17:41

consequence of me trusting the output

17:43

more, needing to shoulder-surf the

17:45

agent less.

17:47

and removing myself from the loop even

17:49

more to delegate more and more of the

17:52

work to agents. And all of this is just

17:55

making sure the agents have the tools

17:58

and tokens and context

18:01

to do the full job to remove myself from

18:04

the need as a synchronous driver. The

18:07

models crave tokens. We can

18:09

operationalize our codebase to give them

18:12

tokens to drive them forward using

18:14

sub-agents and all these other techniques to

18:16

refine the agent output.

18:20

I'm excited to let you all know today in

18:22

the way you all do that you can just go

18:24

build things. Do not hesitate to remove

18:28

yourselves from the loop by getting the

18:29

agents to do the full job because they

18:31

can. Thank you.

18:34

>> Very excited to bring on our guest.

18:38

We've got Ryan Lopopolo today. He just gave

18:40

the keynote. Um, very exciting speaker.

18:42

The man is full-send harness engineering at

18:45

OpenAI. So, uh, a little bit of

18:48

background. We did a Latent Space episode

18:50

with him. We shipped it the other day.

18:52

The story: he wrote this great

18:54

article called Harness Engineering and

18:56

we're like, "Wow, this is pure gold." We

18:58

had him on the podcast. He's a token

19:00

billionaire spending over a billion

19:02

output tokens a day. That's like over

19:04

$1,000. So, you know, man is really

19:06

living it. Uh, we want to keep this

19:08

exciting. Ask good questions, ask

19:10

interesting stuff, ask things that

19:12

people can learn from. But, you know,

19:13

let's welcome Ryan onto the stage.

19:20

>> Hi folks, how's it going? Excited to be

19:21

here. Uh, London has been fantastic and

19:24

uh, excited to kind of walk through what

19:27

it is, uh, that we do and how we work

19:28

here.

19:29

>> I think you got to come on. This camera

19:31

is just here. So,

19:32

>> I got blinded by the QR code. So, we're

19:35

>> Okay. So, background. We have about an

19:37

hour. Um, scan this QR code. You should

19:40

get Slido. Slido will let you ask

19:43

questions. If you see interesting stuff,

19:45

you can thumbs them up and we'll try to

19:47

get through them. Unfortunately, the

19:49

first one I can't superdo, but let's

19:51

just kick it off. Ryan, can you show us

19:53

your actual working setup with no

19:55

laptop? Um,

19:56

>> uh, yeah. Uh, here. Beach

19:59

Margarita

20:01

linear, right?

20:02

>> Oh, wow.

20:04

Um, I'll say watch the podcast we put

20:07

out. We go through some of the work, but

20:09

if you want to talk about it, I guess

20:10

without actually showing us what's your

20:12

what's your workflow like? What's your

20:13

setup? How do you how do you approach a

20:15

task?

20:16

>> Sure. So, uh, the way me and my team

20:21

work is to start with tickets, right? We

20:24

have chunks of work that we want to do,

20:26

features we want to add to our apps,

20:28

reliability work that we want to do. uh

20:30

we give that ticket to an agent along

20:32

with a couple of skills that enable it

20:34

to manipulate our app. Uh we want the

20:38

entry point to the development process

20:40

to be Codex, not an environment which we

20:43

build around it. So we kind of do things

20:46

um, outside in, right? Like, Codex is the

20:49

entry point the same way you would be

20:50

and we give it tools we give it

20:52

instructions on how to cook. So rather

20:54

than like creating a shell that our app

20:56

and Codex get spawned into, we have a

20:59

skill that teaches Codex how to launch

21:00

the app, that teaches Codex how to spin

21:02

up that local observability stack to

21:04

give it logging and telemetry. We give

21:06

it a skill that enables it to uh boot up

21:10

Chrome DevTools and attach to the

21:13

application with a you know local CLI

21:16

that will connect via some daemon that we

21:18

have. So the whole way we have set up

21:21

the repository and all of the local dev

21:23

tools is for Codex to invoke them

21:25

first. Um that means we have kind of

21:28

like a bunch of little mini harnesses

21:30

within the codebase that make it really

21:31

easy for us to slot in additional

21:33

guardrails. Uh, you know, a big package of

21:37

custom ESLint rules which get wired into

21:39

every pnpm package in the workspace. We

21:42

have another sort of local dev harness

21:44

that allows us to add sort of like

21:46

higher level wholesome tests that assert

21:49

the structure of the code itself rather

21:51

than like either the syntax or the

21:53

behavior of the code. Things like you

21:56

know package privacy dependency edges

22:00

between different layers of our stack.

22:01

these sorts of things. Uh making sure

22:03

that, you know, across multiple files Zod

22:07

schemas are deduplicated, that there's a

22:08

single canonical implementation of like

22:10

our async helpers. Uh these sorts of

22:13

things because you know the way we have

22:15

seen the agents work is to sometimes

22:18

optimize for local coherence of a

22:20

package rather than using like our

22:22

shared utilities and things like that.
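
A structural test of the kind described here, one that asserts dependency edges between layers instead of syntax or behavior, might be sketched as follows. The layer names, the `packages/<layer>/...` path convention, and the `ImportEdge` shape are all hypothetical; a real check would extract import edges from the workspace.

```typescript
// Hypothetical layering: each layer lists the layers it may import from.
const ALLOWED_IMPORTS: Record<string, readonly string[]> = {
  ui: ["domain", "shared"],
  domain: ["shared"],
  infra: ["domain", "shared"],
  shared: [],
};

interface ImportEdge {
  fromFile: string; // e.g. "packages/ui/src/Button.tsx"
  toPackage: string; // layer name of the imported package
}

// Assumes paths look like "packages/<layer>/..."; returns undefined otherwise.
function layerOf(file: string): string | undefined {
  const match = /^packages\/([^/]+)\//.exec(file);
  return match?.[1];
}

function findLayerViolations(edges: ImportEdge[]): string[] {
  const failures: string[] = [];
  for (const edge of edges) {
    const layer = layerOf(edge.fromFile);
    // Skip files outside the workspace layout and imports within one layer.
    if (layer === undefined || layer === edge.toPackage) continue;
    const allowed = ALLOWED_IMPORTS[layer] ?? [];
    if (!allowed.includes(edge.toPackage)) {
      failures.push(
        `${edge.fromFile} (${layer}) must not import from ${edge.toPackage}. ` +
          `Allowed: ${allowed.join(", ") || "none"}.`,
      );
    }
  }
  return failures;
}
```

The same pattern, a pure function over facts extracted from the source tree, would extend to the other checks mentioned, such as finding duplicated schemas or a second copy of a shared helper.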

22:24

So having observed that behavior, we

22:26

kind of have built a bunch of little

22:28

pseudo-linter source code verification

22:31

things that shake out some of that bad

22:34

behavior so the humans don't get

22:35

distracted paying attention to that in

22:37

reviews, stuff like that. But uh the

22:40

setup optimizes for the agent to do the

22:44

job and for the humans to not have to

22:45

keep track of the high churn in the

22:47

codebase. Um we kind of centralize our

22:50

leverage around five to 10 skills. uh we

22:53

don't go super wide on skills preferring

22:55

to make the existing skills better

22:58

because at least I find that the

23:02

infrastructure within the repository all

23:03

the local developer tools change super

23:06

frequently uh and I don't really have

23:08

the bandwidth to keep track of this. So

23:10

we hide all that complexity beneath the

23:13

skills that the human has to invoke and

23:16

let the agent just kind of figure it

23:17

out. One one kind of neat thing here is

23:20

um when we moved from using uh Chrome

23:23

DevTools protocol directly to having

23:24

this like daemon thing, like I didn't know

23:26

that had happened for like three weeks.

23:28

Uh it was like totally fine because

23:30

Codex was able to do the thing uh you

23:32

know with the documentation and things

23:34

that we had in place

23:35

>> and part of this you can get more detail

23:37

in your article. So some background you

23:39

wrote a great piece called harness

23:40

engineering. There's a whole section in

23:42

there on how you thought about skills,

23:44

thousands of skills versus simplifying

23:46

it to just quite a few. But okay, uh,

23:49

continuing on, how do you stop yourself

23:52

from over-engineering harnesses? And a

23:55

little bit of a similar followup is, do

23:58

you often build small tools for

24:00

yourself, if ever? Uh, do you do you

24:02

build custom tools?

24:04

>> Yeah. So, I think this is kind of

24:06

gesturing in the direction of the bitter

24:07

lesson here, right? which is how do I

24:09

make sure the work that I do isn't like

24:12

completely obsoleted by an increase in

24:14

model capability and the way I have

24:17

thought about that is doing sort of the

24:19

bare minimum amount of context

24:21

management to kind of pull in

24:23

requirements uh for the agent to do an

24:26

acceptable job over the course of its

24:28

work and context is a thing that I don't

24:31

think will ever be obsoleted right like

24:33

the the models must be told like the

24:35

requirements of the task which

24:37

guardrails to pay attention to,

24:38

these sorts of things. So a good harness

24:41

is really operationalized around giving

24:44

the model text at the right time so it

24:46

can look at the work it has done and the

24:49

information around what a good job looks

24:50

like and you know fundamentally the

24:54

models are trained to follow

24:55

instructions. All the harness should do

24:57

is surface instructions to the model at

24:59

the right time. So we do want to

25:02

minimize that too, right? You don't want

25:04

to frontload all those instructions

25:05

because then you kind of like overwhelm

25:07

the agent, but all of these sort of

25:10

requirements around what a good job is do

25:12

need to be paid attention to over the

25:13

entire course of a PR, right? So

25:16

figuring out ways to either defer or

25:18

just in time surface those instructions

25:20

is kind of what uh a good harness should

25:23

do, right?

25:24

If you know that uh you want your React

25:29

components, right, to be decomposed so

25:31

that they make good snapshot tests for

25:33

individual more stateless pieces, right?

25:35

You don't need to load that up front.

25:37

Instead, you should kind of let the

25:38

agent cook and prototype and experiment

25:40

with the UI you want to build and then

25:42

at lint or test time say, "Okay, you've

25:45

done the work. In order to finish it,

25:47

you have to break this apart so that

25:48

your components are small and as

25:50

stateless as possible and have local

25:52

dependencies on hooks instead of prop

25:53

drilling or whatever it is that you want

25:55

uh the code to look like. And then the

25:57

agent will say, "Oh, this is a new

25:59

instruction for me. Let me take the

26:01

patch as written, modify it to make sure

26:03

that it adheres to the instructions." And

26:05

then up it goes to GitHub. And this sort

26:07

of thing is not going to be obsoleted by

26:10

increases in model capability. It's

26:12

really just about getting that right

26:14

text, that right context to the agent at

26:16

the right time.
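
The "defer the instruction until lint or test time" idea can be sketched as a set of checks that stay silent while the agent prototypes and return an instruction only when the finished patch fails them. The `Patch` and `DeferredCheck` shapes are hypothetical, and the 80-line component threshold is an arbitrary illustration, not a value from the talk.

```typescript
// Hypothetical shapes: a check inspects the finished patch and, only when it
// fails, returns the instruction the agent needs next.
interface Patch {
  files: Record<string, string>; // path -> new contents
}

interface DeferredCheck {
  name: string;
  run: (patch: Patch) => string | undefined; // instruction, or undefined on pass
}

// The 80-line threshold is an arbitrary illustration.
const smallComponents: DeferredCheck = {
  name: "small-components",
  run: (patch) => {
    const tooLong = Object.entries(patch.files)
      .filter(([path, src]) => path.endsWith(".tsx") && src.split("\n").length > 80)
      .map(([path]) => path);
    return tooLong.length === 0
      ? undefined
      : `Break these components into smaller, mostly stateless pieces ` +
          `that snapshot-test well: ${tooLong.join(", ")}`;
  },
};

// Collects only the instructions that apply to this patch.
function instructionsFor(patch: Patch, checks: DeferredCheck[]): string[] {
  const out: string[] = [];
  for (const check of checks) {
    const instruction = check.run(patch);
    if (instruction !== undefined) out.push(`[${check.name}] ${instruction}`);
  }
  return out;
}
```

A patch that already satisfies every check yields an empty list, so no context is spent on guidance the agent did not need.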

26:17

>> Can we talk about an example of a good

26:20

harness? So, a lot of people are asking

26:22

about the Codex model, the Codex

26:24

harness. How does that compare to other

26:26

harnesses? So, Claude Code, OpenCode.

26:28

Uh, how do you guys take these decisions

26:30

into play? You don't work directly on

26:32

Codex, but if there's

26:34

stuff you can speak about on the

26:36

Codex harness, what you guys see as you

26:38

architect it out.

26:40

>> Yeah. So one thing that I think is super

26:43

powerful is this notion that the labs

26:46

are not just post-training the models

26:48

but post-training the models in the

26:49

context of the harness in which they are

26:51

primarily deployed, right? Like the

26:54

apply patch tool or like the specific

26:56

quoting semantics of how to invoke the

26:58

bash tool are like in the loop for the

27:01

post-training process for the harnesses

27:03

from the labs which means like there is

27:06

leverage to be had by depending on these

27:08

sort of like first-party harnesses

27:09

directly. At least this is what I

27:11

believe. Uh, and as such, kind of being

27:15

able to direct through them via things

27:18

like the SDK or manipulating the Codex

27:20

app server directly means you kind of

27:22

get to ride the wave of all that

27:24

leverage in post-training. Instead,

27:26

focus on the parts that you care about,

27:28

which is like what correct code looks

27:30

like. Um, I kind of have high confidence

27:34

that things like Claude Code and Codex

27:36

will continue to get better. Uh, that is

27:38

the responsibility of like the teams

27:40

working on these coding agents. So in my

27:43

role where I don't really want to focus

27:45

on the coding harness at all is figuring

27:48

out ways to plug into them in ways that

27:51

um kind of like steer the agent. That

27:55

means my job can sort of like move up to

27:58

thinking about differences in model

27:59

behavior between releases rather than

28:02

deeply understanding the nuts and bolts

28:03

of the harness. Instead, I can think

28:05

about what it means to, you know, drive

28:08

the behavior that I want based on

28:10

the observed behavior rather than like

28:12

the inner mechanics of the thing.
>> It's a

28:15

perfect follow-up to the next question,

28:16

which is, uh, do you have any

28:18

recommendations for collaboration

28:20

platform? So, when you're in the

28:21

software development life cycle, is

28:23

there any platform that you use for

28:25

agents, engineers, developers all to

28:28

collaborate on working on anything? Any

28:30

tips, any tools?

28:32

>> Yeah. So

28:35

in this world it has largely been just

28:39

markdown files in the repository and

28:41

GitHub that have been the primary sort

28:43

of hub and spoke sort of thing. If you

28:46

think about collaborating on a document

28:49

like you open Google Docs, you write

28:51

something, you ask for feedback, people

28:53

comment, you apply suggestions, these

28:54

sorts of things. This is kind of like a

28:56

little clean room environment just for

28:59

this work artifact that you're

29:00

producing. A PR kind of has a

29:02

similar purpose. So we kind of treat

29:05

that as a big hub and spoke broadcast

29:07

domain where all of the agents and

29:09

humans collaborate together. Uh and

29:13

because we optimize for throughput, we

29:15

don't block on any sort of like

29:17

contribution to that. Like, folks can

29:18

either review or not. Agents can either

29:20

review or not. The implementation agent

29:23

can acknowledge, defer or reject any

29:26

feedback that it gets. uh really

29:28

allowing each participant in the

29:30

production of diffs to kind of make

29:32

their own judgments around what it means

29:34

to deliver, receive, respond to

29:36

feedback. Uh and this has a nice

29:39

property of like not putting the model

29:41

in a box in a bunch of places. We want

29:44

them to use their good reasoning sort of

29:46

thing. So being super prescriptive

29:48

around like every bit of feedback must

29:50

be addressed can kind of have this like

29:52

catastrophic failure mode of your coding

29:55

agent being bullied by all of the

29:57

reviewers when really we want to bias

29:59

toward code being accepted, not perfect,

30:02

not drowning in minutia and these sorts

30:04

of things.

30:06

>> How should people get started with using

30:08

coding agents? People that have been

30:10

doing a lot of manually

30:12

written code, how do they start to

30:15

transition? What should they offload?

30:17

How do they kind of get over that

30:19

barrier of okay, I'm still checking

30:20

every PR, I'm copy-pasting from Codex.

30:24

How should like the average engineer

30:26

start to use these tools?

30:28

>> I think there's two ways to approach

30:30

this problem. One is to

30:33

start using the coding agents to improve

30:35

your confidence in the code itself as it

30:38

is written today. Right? I think we

30:40

would all agree that like more tests is

30:42

probably a good thing, right? to assert

30:44

that our programs are well specified and

30:47

behave correctly as our users interact

30:49

with them is a good thing. Uh and the

30:51

agents are super good at looking at the

30:54

existing code with some context around

30:56

how it is meant to be used and writing

30:58

tests that assert that behavior. So kind

31:01

of using this to improve your confidence

31:03

in the quality of the code will also

31:06

increase the agent's ability to

31:08

successfully navigate it which means you

31:10

don't have to worry as much around doing

31:13

super detailed review of the agent

31:15

output. The other way to think about

31:17

this is to look at how you are spending

31:20

your time. Is it you know staring at

31:23

your editor writing code? Is it waiting

31:25

for tests to run? Is it waiting for

31:27

human review feedback? Is CI slow and

31:31

you're like waiting on that? Maybe you

31:32

have a ton of flaky tests. Then use the

31:34

agents to incrementally automate the

31:38

parts where you are spending your time

31:40

because ultimately the high-leverage parts

31:42

of our jobs are to define the work that

31:45

must be done, prioritize and schedule

31:47

that work and then effectively empower

31:49

folks on our team to do that work. uh

31:52

and the more and more we can delegate

31:55

and move into sort of this like

31:56

sequencing and orchestration role even

31:59

if you just think about like managing

32:01

your teams right the more parallel and

32:03

the deeper the individual

32:06

executions of those delegations we're

32:07

able to do right if I put primitives in

32:11

place that make it super easy to like

32:13

spin up ways to respond to events on my

32:15

Kafka queue right like I don't really

32:17

need to be in the weeds with every

32:18

engineer making sure they like implement

32:20

a consumer correctly Right. And these

32:23

same sort of like building block style

32:25

techniques apply really well to the

32:27

agents and stack really well too.

32:29

>> A fun one. How do you work with agents

32:31

in your car?

32:34

>> Um so I have not used the new uh voice

32:37

mode that launched in CarPlay uh

32:40

recently. Uh not ready for that. But uh

32:43

usually what I'll do is kick off uh a

32:46

task uh right before I leave the office.

32:49

uh tether my laptop to my phone, buckle

32:52

it into the back seat, and kind of let

32:53

it cook in the 30 minutes it takes me to

32:55

get home. Uh most of the time with the

32:57

skills we invoke that tell the agent,

32:59

you know, you're operating on a task,

33:00

you go until the tests are green. Uh you

33:02

know, I don't have to reach back there

33:04

and poke yes, continue onto the thing.

33:06

Uh and I'm basically able to more fully

33:10

saturate, you know, my day with token

33:13

consumption. Um, the dream here is that

33:16

I actually have 50 agents running 24/7

33:19

and I don't have to interact with them

33:20

at all. Uh, and the way to do that is to

33:23

define the work well, figure out ways

33:25

for it to automatically be scheduled and

33:27

remove myself from having to click the

33:29

button. Right? Every time I have to type

33:31

continue to the agent is like a failure

33:33

of the harness to provide enough context

33:35

around what it means to continue to

33:37

completion.
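
The "go until the tests are green" instruction can be sketched as a small driver loop. This is an illustrative Python sketch under stated assumptions: `agent_step`, `tests_pass`, and `failure_report` are hypothetical callables standing in for whatever the real harness provides, and the `max_turns` cap is an added safeguard that the talk does not mention.

```python
from typing import Callable


def run_until_green(
    agent_step: Callable[[str], None],   # hypothetical: sends one instruction to the agent
    tests_pass: Callable[[], bool],      # hypothetical: runs the suite, True if green
    failure_report: Callable[[], str],   # hypothetical: summarizes current failures
    max_turns: int = 20,                 # assumed safety cap, not from the talk
) -> bool:
    """Drive the agent without a human typing 'continue' between turns."""
    instruction = (
        "Work on the task. You are done only when the test suite is green."
    )
    for _ in range(max_turns):
        agent_step(instruction)
        if tests_pass():
            return True
        # Feed the failures back as the next instruction instead of
        # waiting for a human to nudge the agent onward.
        instruction = "Tests are still failing; keep going.\n" + failure_report()
    return False
```

The completion criterion lives in the loop rather than in a person's head, which is one way to read the point that every manual "continue" signals missing context.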

33:39

>> Wow, good statement at the end. Every

33:41

time you have to interact with the agent

33:43

is a failure. Okay, so the following

33:46

question kind of scales this out, right?

33:47

As your org knowledge map scales, what

33:50

practical steps do you have to like

33:53

enable progressive disclosure? So as you

33:55

have a larger and larger codebase, as

33:57

you have more people, how do you scale

33:59

your agents to work better with this?

34:01

>> Yeah. So

34:03

when I sort of initially started this

34:05

project that I was working on, blank

34:07

repository, create Electron app, right?

34:10

you know, Vite, single package, all this

34:12

sort of stuff. And eventually ended up

34:14

with a mess, right? Because there's no

34:18

package privacy that allows me to

34:19

enforce invariants around what APIs are

34:21

public versus which ones are not. The

34:23

agent didn't have like concrete hooks in

34:26

the file system to determine which

34:29

domains were separate from the other

34:30

ones. So we ended up going like full

34:33

10,000-engineer-organization heavy on

34:35

the architecture:

34:37

750 packages in the PNPM workspace

34:40

isolated by business logic domain or

34:43

layer of the stack, individual small util

34:46

packages that encapsulate reusable

34:49

functionality that we lint on being used

34:51

that we can encode leverage in and I do

34:54

think that like in this world even if

34:57

you don't actually have microservices

34:59

structuring your repositories in ways

35:01

that you can actually scope like the

35:03

directory subtree you are looking in to

35:06

be able to do most of the change helps

35:09

uh and you know code in the file system

35:13

is also text which means it's

35:15

effectively prompts that you're giving

35:16

to your coding agent. Uh, so making the

35:19

code as much the same as possible kind

35:23

of makes it so that regardless of where

35:25

in the repository your agent is looking,

35:27

it develops a ton of transferable

35:29

context, right? Like you should have one

35:32

way to like do a bounded concurrency

35:35

helper. You should have one way to

35:37

construct an observable and instrumented

35:41

side effectful command. You should have

35:43

one ORM, right? Like you should have one

35:45

programming language. You should have

35:46

one way of writing CI scripts. You

35:47

should have one way of adding additional

35:49

lint rules, these sorts of things

35:50

because it means that like the tokens

35:53

that you want the model to produce are

35:54

easier to predict and more consistently

35:56

predicted regardless of where it looks.

35:58

Um, so I would say figure out ways to

36:01

structure the code so it is local to a

36:03

subtree in the repository for most of the

36:06

ways you would interact with that system

36:08

and then figure out a way to use these

36:09

agents to completely migrate the

36:11

codebase to be the same. you know,

36:13

empower someone on your team to be a

36:15

dictator to say this is the way it must

36:17

be done, right? Or, you know, figure

36:18

that out together

36:20

and, you know, write it down, evolve the

36:23

code so that it reflects that reality,

36:25

these sorts of things.
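
As an illustration of "one way to do a bounded concurrency helper," here is a minimal Python sketch using `asyncio.Semaphore`. The name `map_bounded` and its signature are assumptions made for this example; the speaker's codebase is a pnpm workspace and its real helper is not shown in the talk.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int,
) -> list[R]:
    """Apply `worker` to every item with at most `limit` calls in flight.

    Results come back in the same order as `items`.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # The semaphore caps how many workers run at the same time.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

If every package in the repository imports this single helper (and a lint rule flags ad hoc alternatives), an agent that has seen it once can predict how concurrency is handled everywhere else.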

36:26

>> We've got a few questions on code

36:28

review. How do you approach code review

36:30

now that you have such high velocity?

36:32

Uh, do you just not read the code? Do

36:34

you just trust the test coverage?

36:36

Uh, how do you write good tests? How do

36:38

you offload that stigma of like, you

36:41

know, you have a mental blocker. I need

36:42

to manually check everything before I

36:44

merge a PR.

36:46

>> So that same sort of idea where you have

36:49

to look at where you're spending your

36:50

time and figure out ways to spend less

36:53

of it. Uh, you know, when we started,

36:57

right, the first thing to do was figure

36:59

out how to get the agent reliably

37:00

producing code that we would accept. And

37:03

a big challenge we ran into is with each

37:05

engineer producing three to five PRs per

37:08

day, even on a team of three, merge

37:10

conflicts were super miserable, right?

37:13

Because these PRs tended to be pretty

37:14

big. We were working on the same parts

37:17

of the codebase. So that's where we

37:19

moved in two directions. One was to like

37:22

tree out the code a bit more to minimize

37:24

these merge conflicts, but also minimize

37:26

the amount of time PRs were open so that

37:28

we were uh reducing the likelihood of a

37:31

merge conflict actually occurring. And

37:33

the reason PRs were staying open so long

37:36

was because we needed code review uh

37:38

because humans were being the blocker in

37:42

this scenario. So in order to do

37:47

that piece automatically, I essentially

37:49

asked every engineer on the team to take

37:52

one day a week, Fridays, we called it

37:54

garbage collection day, where our entire

37:57

job was to take every bit of slop we had

38:01

observed over the course of the week

38:02

that was making a PR difficult to merge

38:05

and figure out ways to categorically

38:07

eliminate it from ever happening in the

38:09

first place, which is where we kind of

38:11

started closing this loop between the

38:14

feedback that humans were giving on the

38:15

PR indicates some context failure on

38:17

behalf of the agent, getting that into

38:20

the repository and then figuring out

38:22

ways to automatically prompt inject the

38:23

agent so that it would self-heal when it

38:25

produced this bad behavior. And this is

38:27

kind of how you go from synchronous

38:29

human time spent giving feedback as code

38:31

review comments to documentation in the

38:34

repository to automatically surfacing this

38:36

documentation either via a failing test

38:39

or a reviewer agent who is primed to

38:41

review the code as written in the

38:43

context of these docs. But all of that

38:46

happens by putting those docs in a

38:47

single place that all these processes

38:49

are able to attach to. Um, you know, we

38:54

kind of asked folks to basically bucket

38:56

the types of review feedback they were

38:58

giving into like um like the persona

39:00

they were operating as like front-end

39:02

architect, you know, reliability

39:04

engineer, scalability sort of thing. And

39:06

then basically for each of those

39:08

personas, we spun up a review agent that

39:10

gets triggered on every push that says,

39:12

is this code good? Surface any P2s or

39:14

above that would block this PR from

39:16

merging based on this documentation

39:18

that says what good looks like. Uh and

39:21

with that and just continuously

39:23

appending to these files, we started to

39:25

see slop reduce, reduce, reduce.
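
The persona-based reviewers can be sketched roughly as follows, in Python. Everything named here is hypothetical: the `Finding` type, the prompt wording, and the convention that a lower number means higher severity (so "P2 or above" means priority 0, 1, or 2) are assumptions for illustration, and no real agent SDK is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    persona: str    # e.g. "frontend-architect", "reliability-engineer"
    priority: int   # assumed convention: 0 = P0 (most severe), 3 = P3
    summary: str


def build_review_prompt(persona: str, guidelines: str, diff: str) -> str:
    """Compose the prompt a reviewer agent for one persona would receive."""
    return (
        f"You are reviewing this change as a {persona}.\n"
        "Using the guidelines below, surface any P2 or above issues that "
        "would block this PR from merging. Ignore minor nits.\n\n"
        f"Guidelines:\n{guidelines}\n\nDiff:\n{diff}\n"
    )


def blocking_findings(findings: list[Finding], threshold: int = 2) -> list[Finding]:
    """Keep only findings at or above the blocking severity."""
    return [f for f in findings if f.priority <= threshold]
```

In this sketch the `guidelines` text would come from the documentation files that the team keeps appending to, so each new piece of human review feedback automatically tightens every later review.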

39:28

>> People have questions about your billion

39:30

tokens. Where do you think those are

39:32

split up? So how much of it is on code

39:34

review? Where is the majority of

39:37

that usage coming from? And a followup

39:39

for people that are just getting

39:41

started. Say they've jumped

39:43

and done a $200 pro plan, right? If you

39:46

had to cut your usage by a fifth, how

39:48

should people maximize their usage?

39:51

Right? You run into usage limits. Um,

39:53

you know, you don't want to just copy

39:54

paste a million lines of code every six

39:57

hours. No prompt cache hit.

39:59

But how should we think

40:01

about that?

40:02

>> Yeah. So I would say probably

40:07

it's probably a third, a third, a third

40:10

between like planning, ticket curation,

40:15

documentation, implementation, and stuff

40:17

that runs in CI.

40:18

>> Do you use plan mode?

40:20

>> Uh we uh I've used exec plans which was

40:23

kind of like an early version of this

40:25

that we published which is sort of like

40:26

a proto skill that says this is how you

40:28

should structure a plan with milestones

40:30

and acceptance criteria. um haven't

40:33

really used plan mode as part of the

40:35

harness at all. My sort of

40:36

expectation here is that I should be

40:38

able to drop a ticket in and have it do

40:40

the job anyway without diverting through

40:42

a plan. Uh because most of the time I'm

40:44

never going to read it anyway. Uh so I

40:47

find that if you do use a plan and you

40:49

approve it without reading it at all,

40:51

you're actually encoding a bunch of

40:52

instructions that you don't necessarily

40:54

want followed. Uh so if you are going to

40:57

use plans, my recommendation is to push

40:59

those up as single PRs with just the

41:01

plan where you actually have human

41:04

review every line of it and like block

41:06

on human approval before they get merged

41:08

and then kicked off. Uh because

41:11

you're effectively potentially wasting

41:13

your time on a rollout with instructions

41:16

that like are bad. Uh so you want to

41:18

kind of like minimize the time that

41:19

happens. But I do think that uh kind of

41:23

getting tokens to be spent in CI is a

41:26

necessary part here because writing code

41:29

no longer is the hard part. Like getting

41:31

code accepted and advancing the code and

41:34

product forward is like what it takes to

41:36

make that written code be valuable. And

41:39

you know you kind of have all heard the

41:41

aphorism that like you know senior

41:42

engineers give good code reviews like we

41:44

expect our senior engineers as agents to

41:46

do the same.

41:49

>> Uh someone asked: is code a disposable

41:51

build artifact?

41:53

>> Yes.

41:53

>> Yes.

41:55

>> Uh I think we we touch on this with uh

41:57

Symphony which is sort of this agent

41:59

orchestrator that we released. This idea

42:01

that you know we can publish a library

42:05

that's actually a super well-defined

42:07

spec that the code is a compiled

42:10

artifact of. And I think like using an LLM

42:14

as a fuzzy compiler is like an interesting

42:17

mental model to have, right? Like all of

42:20

the context that we're putting in the

42:21

codebase for harness engineering is

42:24

effectively like constraints and

42:26

optimization passes on which code is

42:29

acceptable to build in the first place.

42:31

Uh and this is pretty similar to like

42:33

the static analysis and optimization

42:36

passes that something like LLVM would do

42:38

in the process of compiling Rust code.

42:41

uh and sort of

42:43

swapping out one model for another is

42:46

sort of like changing your code

42:47

generation backend from you know LLVM to

42:50

Cranelift in the Rust compiler, and you

42:53

would expect that all of the sort of

42:55

rules around what acceptable Rust code

42:57

looks like produce valid sound machine

43:01

code out the back even if the generation

43:04

process is different and you end up with

43:05

different x86 instructions. So same sort

43:08

of mindset for LLMs swapping out

43:10

different models sort of thing. We want

43:12

the structure around the code to

43:14

basically limit

43:17

how it is written to things that would

43:19

be acceptable to us.

43:21

>> At a high level, can you give us a

43:24

picture of what future you're building

43:25

for? Does context still matter? How do

43:28

people do engineering, harness

43:30

engineering, context engineering? What

43:32

does the future look like?

43:35

>> Sort of the future that I want to

43:37

build toward here is where

43:40

I'm able to take a token budget and a

43:45

quarter, a half or a year's worth of

43:48

work,

43:50

take the human input to rank what is

43:52

most important success metrics,

43:55

reliability metrics, give it to the

43:57

machines and have them continually work

43:59

and advance my product forward. uh

44:02

without sort of you know my hands

44:04

explicitly on the wheel at all.

44:08

As we have gone through like very early

44:10

prototyping to internal alpha, internal

44:14

beta, external alpha, I kind of have felt

44:17

that like new parts of the software

44:19

engineering process have kind of like

44:21

started from zero and we've had to build

44:22

up capability kind of like these like

44:25

you know pentagonal like personality

44:28

charts right where like I spike in this

44:30

direction maybe I'm weak over here and

44:32

you know when we get to deployed

44:35

software for the first time, right? The

44:37

agents' ability to do like QA smoke

44:39

testing on our built artifacts before

44:42

they're promoted to distribution was

44:43

weak. We hadn't invested any time in

44:45

this. There were no docs. There were no

44:46

tools that the agents could use to like

44:48

download the built artifact, launch it,

44:51

poke around to make sure that our like

44:53

most critical user journeys were well

44:55

validated and tested. So, because I

44:58

don't want to be touching the computer,

45:00

we needed to figure out like ways for

45:02

the agents to build themselves tools to

45:04

do that part. Uh, and

45:07

there's a whole universe of software

45:09

engineering outside of writing code,

45:11

right? Like I am triaging user feedback.

45:13

I'm triaging pages. I am making sure

45:16

that we don't have any PII leaking in

45:19

the logs in production. I'm making sure

45:21

that like the Twitter vibes are good and

45:23

people are enjoying my software that uh

45:26

our user operations staff are supported

45:29

with well written runbooks that allow

45:31

them to triage and mitigate high volume

45:34

user issues and then moving that into

45:36

the code itself so they don't happen in

45:37

the first place and as I no longer have

45:41

to produce code like my mind can shift

45:43

to these other higher level or more

45:46

squishy activities but the agents are

45:48

good enough to do these things too and

45:49

figuring out how to like write down the

45:51

processes and the acceptance criteria

45:53

becomes like the sort of like meta

45:54

programming part of the job using these

45:56

agents.

45:57

>> That's a great way to end it. What an

45:59

exciting future. Give it up for Ryan

46:01

guys.

46:01

>> Thank you folks.
