Full Transcript

·YouTLDR

How does Claude Code *actually* work?

39:258,581 words · ~43 min readEnglishTranscribed Apr 15, 2026
AI Summary

AI coding tools like Claude Code are built on 'harnesses'—simple pieces of software that translate LLM text output into actual file changes and terminal commands. While the AI provides the reasoning, the harness provides the tools and manages the loop of execution and context injection.

Understanding that these 'magical' tools are just a loop of text generation and basic script execution (often less than 100 lines of code) demystifies AI and allows developers to better debug or even build their own custom agents.

Section summaries

0:00-1:00

Introduction to Terms

watch

Defines what a harness is and why model performance varies between them.

1:00-2:00

Sponsor Break

skip

Sponsorship for Macroscope.

2:00-8:00

The Mechanics of Tool Calling

watch

Essential explanation of how 'text' becomes 'action' on your computer.

8:00-15:00

Context Management & Claude.md

optional

Discusses how Claude discovers files and how to use config files to speed it up.

20:00-29:00

Building a Harness in Python

watch

Walks through the actual code required to build a functioning AI agent.

32:00-39:00

Advanced Prompt Steering & Closing

optional

Explains why Cursor is better than raw Claude and answers community questions.

Key points

  • The AI Harness — A harness is the environment and set of tools (like bash or file-reading scripts) that an AI agent uses to interact with a computer. It manages the 'pause and restart' flow where an LLM's text output is executed as code and then fed back into the history as new context.
  • Tool Calling via Syntax — Models cannot 'run' commands; they generate specific text patterns (like XML tags or JSON blocks) that the harness detects. The harness then runs standard code (e.g., Python's subprocess) to execute the command and returns the result to the model.
  • Context vs. Large Windows — Modern tools are moving away from 'repo-dumping' (stuffing the whole codebase into context) because accuracy plummets as context windows grow. Instead, harnesses empower models to use search tools to find and 'bootstrap' only the relevant snippets they need.
The harness is the set of tools and the environment in which the agent operates. Theo - T3.gg
Large context makes the models dumber. The more you stuff in, the worse they behave. Theo - T3.gg

AI-generated from the transcript. May contain errors.

0:00

If I've learned anything from running

0:01

this channel, it's that you guys really,

0:02

really love vague terms that don't

0:04

actually mean anything, like agentic

0:06

coding or vibe coding or all these other

0:08

things. And while I feel like I finally

0:10

understand what an agent is, we have yet

0:12

another new term we have to wrangle,

0:14

harness. And I've been talking about

0:16

harnesses a lot more. And I've been

0:18

doing that because I just put out an app

0:20

called T3 Code that lets you code with

0:22

AI. But it's important to know that T3

0:24

Code is not a harness, but Open Code is.

0:28

and so is cursor and so is claude code

0:30

and codeex but codeex app isn't wait

0:33

what harness is a very specific term

0:35

that means a very specific thing and to

0:37

go a step further your harness is really

0:39

important to the quality of code you're

0:40

going to get out of these tools

0:42

according to Matt Mayer's independent

0:43

benchmark that he recently ran comparing

0:46

different models inside and outside of

0:48

cursor most models saw a meaningful

0:50

performance improvement for opus it went

0:52

from 77% in cloud code to 93% in cursor

0:57

the Only difference here is the harness.

0:59

So, what even is the harness? Not only

1:01

am I about to explain in detail what a

1:03

harness is, I'm also going to build one.

1:05

This is going to be really, really fun.

1:07

I'm super excited to break all of this

1:10

down to go through what a harness is,

1:11

why it matters, what the differences

1:13

between them are, and how to build one

1:14

of your own. I've tried and failed to

1:16

come up with like three different jokes

1:18

for the sponsor transition here. So, uh,

1:20

yeah, quick sponsor break, and then

1:22

we'll break all this down. I'm going to

1:23

ask something weird. I want you to

1:24

ignore the first line on today's

1:26

sponsor's page because that's not what I

1:27

want to talk about. Today's sponsor is

1:29

Macroscope and yes, it does say an AI

1:31

code reviewer and as cool as their code

1:32

reviewer is, that's not what I want to

1:34

talk about. What I love Macroscope for

1:35

is the insights it gives me as the team

1:37

lead on what's going on at my company. I

1:40

can't possibly be in the trenches

1:41

looking at what PRs are merging to try

1:42

and figure out what's going on. And as

1:44

great as my team is at giving me

1:45

updates, they sometimes have too much

1:47

information and are also clogged with

1:49

all the other things that I'm blocking

1:50

them on that I have to catch up with.

1:51

So, if I want to know what's actually

1:53

going on on my teams, I've been relying

1:54

on macroscope. And while their dashboard

1:56

is incredible for this, their new

1:57

Slackbots, even better. It's currently

1:59

Friday and I don't know what my team

2:00

shipped. So, I just asked outright, what

2:02

did the team ship last week? It asked

2:04

which org because I have multiple

2:05

installations. And then it wrote up a

2:07

really good useful report. In T3 Code,

2:09

we rewrote the architecture with effect

2:11

RPC for websockets. We improved the

2:13

performance significantly. We introduced

2:15

multi-provider model systems. The

2:16

context window visibility got

2:18

significantly better. customization and

2:20

UX changes that were important,

2:22

observability and security, and then

2:23

separately a bunch of changes that we

2:25

made for T3 Chat. Do you understand how

2:27

useful this is when your teams are

2:28

shipping quickly? And that's what

2:30

Macroscopes for. They have super quick

2:31

code reviews that my team relies on

2:33

every day. It's become Julius's favorite

2:35

of the options because it's super fast

2:37

and usually very accurate as well. If he

2:39

sees a medium or high severity thing, he

2:41

always hits it because 95% of the time

2:43

it is correct. Let your team ship fast

2:44

with less bugs and more insight at

2:46

soy.cope.

2:47

So, what even is a harness? Not a simple

2:51

question to answer. To put it as simply

2:53

as possible, the harness is the set of

2:56

tools and the environment in which the

2:58

agent operates. What that means is it's

3:00

the thing that the AI can use to

3:02

generate text to do stuff. Let me put it

3:06

simply. Imagine you have a normal chat

3:09

and you say, I don't know, what files

3:12

are in this folder? And you run a

3:14

command in a folder. The AI knows what

3:16

it needs to run if it's in a bash

3:18

terminal, it can run ls- a and see

3:20

everything in that folder. Or can it?

3:22

How can the AI run commands? By default,

3:25

when you're using any interface with an

3:27

LLM, it just responds with text. All

3:30

these LLMs are that we're using every

3:32

day is really advanced autocomplete. You

3:34

give it text and it guesses what the

3:36

most likely next set of characters are

3:38

over and over again. That doesn't mean

3:40

it can use things on your computer. That

3:42

doesn't mean it can write code. It means

3:44

given some text, it can generate more

3:46

text. But the models can't do other

3:49

things. All they can do is write text.

3:53

So how the hell can the models edit

3:54

files on our computer, make changes to

3:56

our databases, connect to other

3:58

services, look things up on the internet

4:00

if all it could do is generate text?

4:02

Well, we've invented some solutions to

4:04

give the models more capability here.

4:06

The main one is tool calling.

4:08

Effectively, the way a tool call works

4:11

is special syntax. I'm going to make up

4:13

my own syntax here, but I think you'll

4:15

get the idea. Let's say we have a bash

4:17

call tool. The model is told ahead of

4:20

time as part of the system prompt, hey,

4:22

you have this tool you can use to run

4:24

bash commands. You wrap it with this

4:26

tag, in this case, bash call. You then

4:28

write the command and then you close it.

4:30

You send this as your final piece of a

4:33

response and then you stop responding.

4:36

We will go execute this on the system

4:38

and then give you the response when it's

4:40

done. So the really interesting thing

4:42

that happens here in this effective chat

4:44

history is a line is drawn after the

4:47

model has responded with this syntax.

4:50

The model stops responding. The server

4:52

you're connected to, the work that

4:53

you're doing, the back and forth you are

4:55

having with the model, it's cut off in

4:56

that moment. It no longer exists. The

4:59

connection you have and the chat history

5:01

that you have only exists on your

5:03

computer or the server you're doing this

5:05

on and maybe in their database if

5:07

they've built it to work that way. But

5:08

now the message is over. So, what

5:10

happens? Because when I ask this, it

5:12

doesn't stop there. Let's just go try

5:14

cla code quick and see what it does.

5:16

What files are in this folder? It

5:18

idiates. It says what it's doing. It's

5:20

reading one file. If you press control

5:22

O, you can expand and see what it did.

5:24

It ran the ls command for this directory

5:26

and it got all of the contents and then

5:28

it described what they were. But, as I

5:31

just said, the model's done responding

5:33

here. How does it keep going? This is

5:35

one of the many things that harnesses

5:37

do. After the tool call has been passed

5:39

to the harness, the harness executes it

5:42

with good old-fashioned code. So when

5:44

your harness gets back this response and

5:46

it sees this call, depending on the

5:47

settings you have, it either runs it or

5:50

it asks you as the user for permission

5:52

to run it. If I rerun Claude without my

5:54

custom script, it turns off the

5:56

dangerous mode and it leaks my

5:58

email. you, Enthropic. you,

6:00

Enthropic. I hate Enthropic. How

6:02

the do they show your email in the

6:04

default state? Why would they ever do

6:06

that? There's no reason for

6:08

that. Why is demo equals 1 clawed? Cool.

6:13

I hate them. Anyways, now that I

6:16

don't have my special permissions and

6:17

security on, I'll ask the same question.

6:19

And since ls is a safe command and it

6:22

knows that, it happens to not ask. But

6:24

if I ask it to format the HTML file for

6:27

me, things will be a bit different. Here

6:30

it's making a change, but it can't make

6:32

the change until I permit it to. In this

6:35

case, they're using a custom tool.

6:37

They're using their write tool. So,

6:39

they're not calling a command to do it

6:40

via bash because they have more tools

6:42

than just the bash tool. We'll go in

6:44

depth on all of those in a bit. But this

6:45

is the harness recognizing that this

6:48

tool call is destructive. And at a code

6:51

level, not an AI level, a code level, it

6:54

is recognizing this change and asking me

6:56

as the user, do I want to allow it or

6:58

not? And I can say yes. I can say yes

7:01

and keep doing it. Or I can say no,

7:02

don't. In this case, I said no. And now

7:05

it just stops. What would have happened

7:07

if I said yes? Well, it would have run

7:09

the command. It would have the output of

7:12

ls- a. So it runs it and then it has

7:14

file 1.txt, file 2.txt,

7:18

etc. And this section here is all the

7:21

tool call response. So the model writes

7:24

the tool call. Your harness takes

7:26

whatever this needs to be, whether it's

7:28

updating a file, running a command,

7:30

doing something, it does whatever

7:32

permissions checks it needs to, and then

7:33

it runs it. And once it's done, it takes

7:36

this output, it adds it to the end of

7:38

your chat history, and then it

7:39

reerequests from the same bottle to

7:42

continue. So the exact same way you hit

7:44

an endpoint to answer this question, you

7:46

hit the same endpoint again with the

7:48

question, the answer and the output of

7:50

the tool. And at that point, the model

7:53

starts responding accordingly. So

7:55

effectively, every single time a tool

7:56

call is done, the model stops

7:58

responding, the tool call runs, the

8:00

output gets added to your chat history,

8:02

and then another new request is made to

8:04

the same model to continue its work. So

8:07

effectively the brain that's doing all

8:09

this work gets paused and restarted

8:12

every single time a tool call is made.

8:14

So now we understand all of this. What

8:16

the is the harness? Well, one part

8:18

of the harness is that it does all of

8:20

these things. It gives the tools to the

8:22

model. It handles the back and forth. It

8:24

handles the history. It handles all of

8:26

these pieces. And it chooses

8:28

specifically the types and sets of tools

8:30

and their descriptions that the models

8:32

have access to in order to do the thing.

8:34

And just to make sure you guys get this

8:36

because this part is really important.

8:37

It's possible the model isn't content

8:40

with this answer. It might want more

8:42

information. It might say I should know

8:44

the contents of file1.txt

8:48

before I respond. And then it will do

8:50

another bash call or something like it

8:52

that is I don't know catfile 1.txt. And

8:57

now another tool is called. Another

9:00

similar response is generated. And this

9:02

one will respond after the cat call with

9:04

a funny to say cat call in this context

9:07

with a hello world IDK why you are

9:11

reading this but I'm happy you chose to

9:15

something like that. I don't know. And

9:16

now this again gets appended. The model

9:18

has it. And now when the model responds

9:21

it can see all of the history. We're

9:23

like, I listed the files and read the

9:26

one I thought was important. I now have

9:30

everything I need to respond to the

9:34

user. And then it will actually respond.

9:37

This flow is how pretty much every

9:40

single AI tool we use to code works. But

9:43

there are things that have changed over

9:45

time. One of the important things to

9:47

know about is context. how much

9:49

information exists in the chat history

9:51

versus how much exists purely in the

9:53

codebase in a way that the chat doesn't

9:55

have. When you open up claude code in a

9:57

folder, it doesn't know anything about

9:59

that folder. When I launch Claude in

10:02

this demo project with off and I say,

10:04

"What is this app?" it can't know

10:06

because it's not included yet. So, when

10:08

I ask it, you'll see it's going to go

10:10

use a bunch of tools to search and

10:12

explore and try to figure out what this

10:15

project is. It has a search tool that it

10:17

used for searching for things that match

10:19

pattern star which is probably the

10:21

example that they have internally for

10:23

how to search all of the files in a

10:25

given directory. So it did that and now

10:27

it knows about all of these files that

10:29

exist. So then it reads the one that

10:31

thinks it matters which is package. JSON

10:33

great starting point. So it reads those

10:34

lines. It then read other things like

10:36

the app tsx, the main tx and the readme

10:39

in order to get this context. And all

10:41

this does is it takes these outputs and

10:44

it dumps them into context so that the

10:46

model can see them in the chat history.

10:47

So when it makes the first tool call for

10:49

search, the model pauses, it does all of

10:52

this and then all of this text gets

10:55

thrown into the context. The model reads

10:57

that and sees, oh, here are the files

10:59

that might be interesting. I would like

11:00

to know about them. So it then fires off

11:02

a bunch of these read calls. Sometimes

11:04

it does them all in parallel. It might

11:05

respond with multiple tool calls at

11:07

once. And then once all of those tools

11:09

have been executed, they all have their

11:11

outputs stuffed back into the context so

11:13

the model can continue doing its work.

11:15

And to be very clear, this is in no way

11:17

specific to Cloud Code. This is how all

11:20

of these tools work. Some try different

11:22

things around stuff like search and

11:24

context management. You can even insert

11:27

context ahead of time by updating the

11:29

CloudMD file. So you just saw how much

11:31

work this had to do. Let's say we had a

11:34

CloudMD in this project. I'll go add

11:36

one. If the user asks what the project

11:38

is, make fun of them for asking an AI

11:40

instead of reading the code. Then tell

11:41

them it's none of their business. So

11:43

let's run the exact same question again.

11:45

You see that bootstrapping?

11:47

Bootstrapping is usually things like the

11:50

context like this cloudd and all of that

11:53

being put into the harness and the fake

11:55

tasty being created that can then be

11:57

pushed up to the API so it could start

11:59

responding. So, the reason that stuff

12:00

took longer is because I just added that

12:02

file and during the bootstrapping

12:04

process where it read that markdown file

12:05

and decided if it cared or not, it

12:07

generated the response. You're really

12:09

out here asking an AI what a project

12:11

does instead of just reading the code.

12:13

It's right there in the files that you

12:14

have access to with your own eyes

12:16

anyway. It's none of your business.

12:18

Notice that there was no tool calls this

12:20

time. The thing I'm trying to showcase

12:22

here is that if the model has all the

12:24

context it needs already, it won't need

12:26

to make the tool calls. But if I was to

12:28

delete that cloud MD, it would have to

12:30

call tools to figure out what's going on

12:32

in the codebase. And that's what the

12:33

CloudMD does. It is effectively taking

12:36

whatever information you put in it and

12:38

putting it ahead the same way that you

12:40

would put context in later. So the

12:42

Claude MD and the Asian MD, those files,

12:44

what they do is they take all of this

12:45

context and they move it to the top and

12:47

they're effectively telling the model,

12:48

here are all of the things we think you

12:50

might need to know before you start your

12:51

work. I don't want to make this yet

12:53

another rant about context management

12:54

because I do talk about this a lot, but

12:56

I suspect a lot of you guys haven't seen

12:58

the other videos because this is trying

13:00

to be a more accessible description of

13:02

how this stuff works. Speaking of which,

13:04

if you're not normally here and you're

13:05

here for this one, you made it this far,

13:07

you know, you can hit that red button

13:08

underneath the video and it helps us out

13:09

a lot. It costs you nothing to

13:11

subscribe. It's literally free thanks to

13:13

our sponsors who make this all possible.

13:15

If you want to support us and see more

13:16

videos like this so you don't end up

13:18

stuck in the permanent underclass, maybe

13:19

hit that button. And maybe, just maybe,

13:21

if you want to keep up with the latest,

13:22

always, there's a little bell next to it

13:24

you can click, too. I don't normally do

13:25

sub call outs, but I know a lot of you

13:27

are here for the first time for this

13:28

hopefully. So maybe consider throwing

13:30

some support and in the future you'll

13:32

continue to stay on top of these things

13:33

as they happen. Anyways, what I was

13:36

saying about the quadmd is that it gets

13:37

stuffed up top so the information is in

13:39

the history. And one more piece, and I

13:41

promise the last thing I'm going to say

13:42

about general context management. If

13:44

it's not in the chat history, the model

13:46

doesn't know it. This doesn't apply for

13:47

general knowledge, like what is

13:49

TypeScript, what packages exist, those

13:51

types of things. But the model only

13:53

knows what it can do, not what

13:55

information exists. The model doesn't

13:57

know what your codebase is or anything

13:59

in it unless it gets that information.

14:01

It can get that through an agent MD file

14:03

or a cloud MD file. It can get that

14:05

information through tool calls that it

14:07

uses to explore. and it'll get more and

14:09

more refined with the tool calls as it

14:10

remembers. This is also why it's fun to

14:13

stay in one thread instead of making a

14:14

new thread every time you make a new

14:16

prompt because when you go back and

14:17

forth, it doesn't need to look up where

14:19

the files are because they're still in

14:21

the history. It remembers. For one more

14:23

example here, I'm going to delete the

14:24

cloud MD. And remember previously when I

14:27

gave the example where I asked that and

14:29

it did the search call first. I'm going

14:31

to game it a little bit. What is this

14:33

app? You should probably start at the

14:36

package.json JSON. Previously, the model

14:39

did not know there was a package JSON

14:41

file. It only knew about that because it

14:43

called the search tool first. Now that I

14:45

am telling it explicitly in my prompt,

14:48

the existence of that file will be in

14:49

the history. And since that'll be in the

14:51

history, it will hopefully be able to

14:53

skip the search tool initially at least.

14:55

Yeah. See, it started with a reading

14:57

instead of a search. And now the search

14:59

is more specific. Instead of searching

15:00

the whole codebase like it did before

15:02

with the single star, it is instead

15:04

searching the source directory because

15:06

it saw through the package JSON that

15:08

that's where the interesting pieces will

15:09

be. And it made half as many tool calls

15:12

as it did before cuz I gave it that

15:13

additional context. I'm already seeing

15:15

questions that make sense, but I want to

15:18

jump on them because I think it'll help

15:19

clarify things before we go further. Is

15:21

it useful to ask the model to read a few

15:23

key files in full at the beginning of a

15:25

conversation if they're relatively

15:26

small? My take for this is generally

15:28

speaking, no. Tool calls are really,

15:31

really cheap. And the models, the

15:34

harnesses, and all of the things around

15:35

them have gotten pretty good at figuring

15:37

out what context you need to solve the

15:40

problem. You might think you know the

15:42

context well enough, and you quite

15:44

possibly do. You can definitely help it

15:46

skip a few tool calls that it might not

15:48

need to do, but most models are now

15:50

smart enough to figure this out

15:51

themselves, especially like Opus 4.5 and

15:53

4.6, Sonnet 4.6 6 and chat GPT models

15:57

like GPT 5.3 CEX and 5.4. Those models

15:59

are all now more than smart enough to

16:01

figure out where the context is in the

16:03

codebase. They don't need you to tell

16:05

it. They can find it usually. And this

16:08

massively contradicts the prior theory

16:10

that we all had about this stuff, which

16:12

is that your codebase would basically

16:14

determine how good the model could be.

16:15

Because if the codebase was too big to

16:17

fit in the context window, it's not

16:19

going to work. Thankfully, that's not

16:20

how things ended up going. And very

16:22

thankfully tools like repo mix are

16:24

largely dead now. This made a lot of

16:26

sense when the model couldn't call bash,

16:28

couldn't navigate your system, couldn't

16:30

do things the way a developer would do.

16:32

And instead we wanted to give the model

16:33

all of the code so it could have all of

16:35

it before it starts. Repo mix was a

16:37

project that let you compress all of the

16:39

code in your codebase into a single XML

16:41

file that you can copy paste the model

16:43

and ask it to make changes which was a

16:45

mess for a bunch of reasons.

16:47

Mostly because squashing your entire

16:49

codebase into the context is creating

16:52

the worst needle in a haststack

16:54

problem imaginable. Just think about

16:56

this. If I ask you to fix a bug and I

16:59

give you two files the bug might be in,

17:01

or I ask you to fix the bug and I give

17:03

you 2,000 files the bug might be in,

17:05

which is easier to deal with? Let's be

17:07

realistic here. Cool. Happy we're on the

17:09

same page with that. Now imagine that

17:11

your memory gets reset every 30 seconds.

17:14

Crazy, but that's kind of how the AI

17:15

works. So, you're given the question of

17:17

fix this bug, and you know, your brain's

17:19

going to reset in 30 seconds. So, you're

17:21

like, "Okay, uh, I don't know anything

17:22

about the bug. There's no history here.

17:24

Uh, I need to find the files it could be

17:26

in. I'm going to do a search to do

17:27

that." And as soon as you do that, as

17:29

soon as you start the search, your brain

17:31

gets reset. And now, when the search is

17:33

done, your brain is turned back on, but

17:35

with it entirely wiped. But you have the

17:37

history of what's happened so far.

17:38

You're like, "Okay, I have to fix this

17:40

bug. 30 seconds ago, I did the search.

17:41

It found these things. I need to figure

17:43

out where it is in these." And then you

17:44

do that and then you leave another

17:46

instruction at tool and then your brain

17:47

is reset again. And it happens over and

17:49

over. So if you have to squash

17:52

everything in your codebase into your

17:53

brain just to have it reset every 30

17:55

seconds. Not only is that expensive and

17:57

inaccurate, it's just bad. And for a

18:00

while the belief was that this would be

18:02

necessary and that we would need to have

18:04

more and more context available to the

18:06

models. We would have to find ways to

18:08

stuff these gigantic code bases into the

18:10

model and that huge context windows

18:12

would be the future. Thankfully, that is

18:14

not the case because models got good

18:16

enough at building their context using

18:18

tools that we don't have to tell them

18:19

where everything is in the codebase

18:20

anymore. This is also what cursor used

18:23

to do, which is part of what made it so

18:24

special. They had a really good vector

18:26

indexing system that made it easier to

18:29

find the specific code that mattered for

18:31

the model. They still do that, but they

18:32

do that through traditional search tools

18:34

now instead where the model's told they

18:35

can search for a thing and the search it

18:38

probably lies to the model and says it's

18:39

GP or something and then it uses their

18:41

stuff to actually go index in a much

18:44

more intelligent way to find what the

18:45

model wants. It kind of just turned out

18:47

that large context makes the models

18:49

dumber. The more you stuff in, the

18:52

worse they behave. And there's charts

18:53

that prove this. As sonnet breaks the 50

18:57

to 100,000 or so range for the number of

19:01

things in its context, in this case

19:02

tokens, when you break that number, the

19:05

accuracy plummets to nearly 50% of where

19:07

it was before for its ability to find

19:09

repeating words in the context window.

19:12

So just stuffing everything in is not

19:14

the solution. And that's a big part of

19:15

what makes harnesses so interesting.

19:17

They provide the models with the tools

19:19

to build their own context to identify

19:21

where the problems might be or what

19:22

needs to be changed and then most

19:24

importantly to make those changes. So

19:26

how do you actually implement this?

19:28

Thankfully there are two awesome

19:30

articles that break down how to build

19:32

your own harness. There's this one from

19:34

April of last year from the AMP team and

19:36

there's this one with a very funny

19:38

image. This one's from Mah just

19:40

independently writing the article to

19:41

show people that something like cloud

19:43

code isn't that complex to implement. AI

19:45

coding assistants feel like magic. You

19:47

describe what you want in some barely

19:48

coherent English, and they read files,

19:50

edit your project, and write functional

19:51

code. But here's the thing. The core of

19:53

these tools isn't magic. It's about 200

19:55

lines of very straightforward Python. I

19:57

like how a hail breaks down the mental

19:59

model here. The order events is

20:00

important. You send a message like

20:02

create a new file with this function.

20:03

The LM decides it needs a tool and it

20:06

responds with a structured tool call or

20:08

sometimes multiple at once. Your

20:10

program, in this case, the harness, the

20:11

thing that you're building, executes the

20:13

tool call locally. So in this case, it

20:15

could create the file using code or it

20:17

could execute a bash command. Any of

20:19

those things and the result gets sent

20:21

back to the LLM and most importantly the

20:23

LM uses that context to continue or to

20:26

respond in as few lines of code as 200

20:28

is. I'm very lazy so I am asking a

20:30

harness harness T3 code to go build this

20:34

using claude opus. But we'll have a good

20:35

demo in just a second. Back to reading

20:37

as we wait. There's only really three

20:39

tools you need at the core. You need the

20:42

ability to read files so the LM can see

20:44

the code, list files so it can navigate

20:45

the project and find the code it's

20:47

looking for and edit the file so it can

20:48

actually make the changes you want.

20:50

Production agents, things you actually

20:52

use like cloud code, have a few other

20:53

capabilities like GP, bash, web search,

20:56

and more. Most of them use RIP GP now

20:58

cuz it's really strong, but we don't

21:00

really need those for the basic of most

21:02

basic examples. Let's look at their code

21:05

in this example. We import a bunch of

21:07

random because we're in Python. Not

21:09

that I'm any better as a JS dev. We load

21:11

the enenv. We have our claude client

21:13

which is an instance of anthropics SDK

21:16

that uses the key so that I can now call

21:19

claude over the network. We create some

21:21

colors for the terminal here. We then

21:23

resolve the absolute path because it's

21:25

much easier for the model to write valid

21:27

commands if it knows the path that we're

21:29

in. So now we create this absolute path.

21:32

And now I have to implement the tools.

21:34

First, we need a read file tool where

21:35

the model will pass a name of a file and

21:38

it will be returned a string dictionary

21:40

that has all of the contents of that

21:42

file. Full path is resolve the absolute

21:44

path with that file name. We print the

21:46

full path first so we can see it in our

21:47

UI and then we open that file path as a

21:50

read stream and grab the content. And

21:52

then we return this JSON blob with file

21:55

path which is the string for the path

21:57

and content which is the actual content

21:59

of the file. This gets I'm assuming as

22:01

we scroll added to the chat history when

22:03

it's called. We'll see how the tools are

22:04

actually used in a bit. Right now we're

22:06

just reading the code for said tools.

22:07

List files. I'm sure this is super

22:09

complex. We resolve the path. We have

22:10

all files. And then for item in full

22:13

path iter for each file we append the

22:16

file name and the type. And then we

22:18

return all of that after. And now the

22:20

edit file. Here's where things get

22:22

really complex. Because we have an old

22:24

string and a new string. Is it to

22:26

replace the old one with the new one?

22:28

This will replace the first occurrence

22:30

of the old string with the new string in

22:32

the file. If old string is empty, then

22:34

we will create and override the file

22:36

with the new string content. So if we

22:39

have an empty string for old string,

22:40

then we just write the text to the path

22:43

for this file. But if we do have the old

22:46

text we're replacing and we can't find

22:47

it, then we return an error saying that

22:49

the old string was not found. But if we

22:51

can find it, then we edit it out and

22:53

replace it with the new string using a

22:56

replace call here. and we write that to

22:57

the file and we return saying that we

23:00

edited it. That's it. So we have our

23:02

three tools, but how does the model even

23:03

know it can use those? Well, first we

23:06

have to list all of these somewhere. In

23:07

this case, a simple tool registry that

23:09

has a read file tool, list file tool,

23:10

edit file tool. And these are just the

23:11

functions, by the way. There's nothing

23:13

special about these. They're very simple

23:14

functions. But the model needs to know

23:16

about them. But having those functions,

23:17

cool. The model needs to know what they

23:19

are, what their like format is, and how

23:21

to call them. And we're not in

23:22

Typescript, so it can't just use type

23:23

signatures. So it needs a bit more info.

23:25

Thankfully, we defined this with a lot

23:27

more info, including a comment here that

23:28

describes what it does and what all of

23:30

the parameters are for. So, here we get

23:32

the definition for a given tool by

23:34

ripping it from the tool registry, and

23:36

we return the tool name, the doc from

23:38

it, and the signature from the same

23:40

tool. And now our system prompt, which

23:42

is the text that comes before the first

23:44

message, things like your agent MD would

23:46

be included in here. This all is

23:48

constructed in with the tool registry

23:50

included where we tell the model what

23:52

the tools are and everything they need

23:54

to know to work. And here is what that

23:56

prompt actually looks like. I'm going to

23:58

copy paste this into an editor so I can

23:59

word wrap it. You are a coding assistant

24:01

whose goal is to help us solve coding

24:03

tasks. You have access to a series of

24:05

tools that you can execute. Here are the

24:06

tools that you can execute. This is

24:08

where the tool list gets dumped. When

24:09

you want to use a tool, reply with

24:11

exactly one line in this format. tool

24:13

colon tool name and then the JSON arcs

24:16

and nothing else. Use compact singleline

24:19

JSON with double quotes. After receiving

24:21

a tool result message, continue the

24:23

task. If no tool is needed, respond

24:26

normally. That's the whole thing. This

24:28

is arguably the majority of the harness

24:30

in this example at least right here.

24:32

Because the tools are really simple, the

24:34

model doesn't know what to do with them.

24:36

This here is everything being passed to

24:39

the model as the start of the chat

24:40

history because again the model only

24:42

knows what's in the history. So when you

24:44

put the tools in the history, it knows

24:45

it can use them. So then we have to

24:47

parse that out. When the model stops

24:49

responding, we have to look for lines

24:52

that start with tool colon. If the line

24:54

doesn't start with that, continue. But

24:56

if it does, then we have to append this

24:57

to invocations with the name of the tool

24:59

and the args. And then when it's done,

25:01

we have to actually make the calls. The

25:03

lm call couldn't be simpler. You have

25:05

the system content, you have the

25:06

messages, all the things from back and

25:08

forth. If the message is the system

25:09

message, we put that in the system

25:10

content. Otherwise, we just append it to

25:12

the messages array. And then we call

25:15

claude clients API with the message. And

25:18

here we give it the model we want to

25:19

use, the max tokens, the messages. And

25:21

again, the system prompts important. So

25:23

this is not part of the message history.

25:24

It's a separate array, which it should

25:26

be. Well, not an array. It's a separate

25:28

argument because this is something you

25:29

should include as the dev. And the

25:31

messages array is something that gets

25:32

included by the user. And the magic is

25:34

all in the loop. We wait for the user to

25:37

send an input and once they are done and

25:39

they submit a keyboard interrupt, an end

25:42

of error, so like an enter key, it

25:44

breaks and it appends that to the

25:45

conversation. And once that's happened,

25:47

we run another loop where we wait for

25:49

the execution to occur. At the end of

25:51

that, we get our tool invocations. So we

25:53

have when the message is done being

25:55

generated by the model, we have all of

25:56

the tool names and arguments that the

25:59

model wants to use. And if there's

26:01

nothing here, we just respond. We just

26:03

share the message from the assistant the

26:05

model. But if there are tools here, then

26:08

we go through each of them. For each

26:09

tool, we grab it from the registry, make

26:11

an empty string response because it's

26:13

Python. We start with an empty value and

26:14

we set it later. We print the name and

26:16

the arguments. And if the tool is the

26:18

read file tool because that's the name

26:20

that was passed, we call that one. If

26:22

it's list files, we call that. And if

26:24

it's edit files, we call that.

26:27

Specifically, we're passing the

26:28

arguments in correctly here too by

26:30

grabbing from that JSON blob that's now

26:32

a dictionary the key that we want. And

26:34

then when that is done, we append the

26:36

tool results as messages to the chat

26:38

history. And running it is literally

26:40

just run it in a loop. That's it. Bad

26:44

news. Opus really likes using Python.

26:48

Did it not even put in the right

26:49

folder? I hate the Claude agent SDK

26:53

because it doesn't care what folder it's

26:55

executed in and what path it is passed.

26:57

It needs multiple different reminders

26:59

that it has to be in a specific path.

27:01

So, it just ignored the path that this

27:02

was executing in. That's

27:04

obnoxious. So, we now have our mini

27:05

agent. It happened to get dumped in the

27:08

wrong folder, but there's no pip

27:09

install, no node modules, nothing. Can

27:11

you read from

27:14

the env

27:16

to do that quick? And what's funny, even

27:18

in a harness harness like T3 code, we

27:20

are exposing the tool call. So I just

27:22

asked it to change this file. It didn't

27:25

know if it's changed or not since I

27:26

asked. So it decided to do a read tool

27:28

call just in case to see if the files

27:31

caught us the same or not. And once it

27:32

confirmed, it made an edit call where it

27:35

changed the import path to now have this

27:37

new information in it. And now I should

27:39

be able to Python agent.py asking it

27:42

about the Python code in this app. Now

27:43

we can see it called list files. It

27:45

called read file and now the model is

27:47

thinking because it has this new chat

27:48

history with the outputs of these in it.

27:50

And here is the response from the model.

27:52

Here's a summary of what agent.py does.

27:54

It implements a lightweight

27:55

self-contained AI coding agent in 60

27:57

lines. It's a setup where it loads the

27:59

ENV file. It configures the model with

28:00

set 4.6. It has these three simple tools

28:04

as well as a bash tool that can run

28:05

arbitrary shell commands. Ready to see

28:07

where this gets fun? Remember earlier

28:09

when I said you only really need bash?

28:11

Watch this.

28:13

And now it only has the bash tool. So

28:15

instead, it's just going to call bash

28:17

with different commands over and over

28:18

again. It's going to get the content the

28:20

same way, but instead of using the tool

28:21

we gave it, it's just going to call bash

28:23

to do it instead. It uses the tools it

28:25

has to do the task. And if we delete

28:28

everything other than the bash tool,

28:30

this gets comically simpler. We're now

28:32

down to 75 lines. And I haven't even

28:34

purged that thoroughly yet. And half of

28:36

it is dealing with the env. Like, let's

28:39

just be real. How cool is that? that all

28:42

it takes to give an AI model the ability

28:44

to do real things on your computer is

28:47

you give it a tool that it can pass bash

28:49

to and these models have been trained so

28:52

thoroughly on these types of fake chat

28:54

histories that have all these tool calls

28:56

in them that they know how to deal with

28:58

that already. One last important thing

29:00

because this was not included in the

29:01

article and it does matter. Most of the

29:04

models and the APIs we hit them through

29:06

are now aware of the idea of tools. this

29:08

has become a standardized enough thing

29:09

that there are specific syntaxes that

29:11

different models expect. You can just

29:14

put this in the system prompt and it

29:15

will just work for simple cases. A lot

29:18

of the providers hosting these models, a

29:20

lot of the platforms like open router

29:22

that manage the in-between and all of

29:23

that they all have a dedicated tools

29:26

concept now. And in this case, it's a

29:28

standard format that I can pass the same

29:30

way I pass messages to the model. I also

29:32

can pass tools to it in the body when we

29:35

make the call to in this case open

29:37

router. OpenAI has this, open router has

29:39

this, anthropic has this, even Gemini

29:41

kind of has this. Passing the tools to

29:43

the model through a special format so

29:45

that the host can get this syntax just

29:48

right because the actual syntax the

29:49

model sees is to be frank kind of gross.

29:52

This is the format that OpenAI's models

29:54

see internally. This format is

29:57

relatively complex but also really

29:59

powerful and open source. It's meant to

30:01

be very compact so the models can

30:03

process the data well, but also the

30:05

start, end, and weird bracketing syntax

30:08

makes it less likely the syntax

30:10

conflicts with the things the model's

30:11

actually outputting, which is really

30:13

cool. Thankfully, you'll never have to

30:15

deal with almost any of this if you're

30:16

the type of person watching this video,

30:18

cuz this is so deep in the weeds that

30:20

half the companies hosting these models

30:22

don't even know about it. This is not

30:24

something you'll ever have to care

30:25

about. But the reason that something

30:27

like this tool call key here is so

30:29

powerful is that in this case, Open

30:31

Router will take your tools and format

30:33

them the way the different models expect

30:35

for the different providers. I think

30:37

I've covered everything I need to here.

30:39

And we actually built a harness that

30:42

works and can call bash to make changes.

30:45

You know what? Let's ask it to do

30:46

something different here. Again, it only

30:48

still has bash. Let's ask it to make an

30:50

edit. I don't like the code that loads

30:53

the open router API key from the

30:56

environment. Can we make it simpler in

31:00

some way? And again, all we did here is

31:03

append another message in the array. The

31:05

message array has the first message we

31:07

sent, the first message the model sent,

31:09

all the tool calls, and then the last

31:10

message the model sent at the end. And

31:12

now I added a new message, and now it's

31:14

rerunning the loop until the model is

31:15

done. It read the enenv. It read the

31:18

agent pi and then it made a change by

31:21

how to even do this kind of nasty. Oh,

31:24

bash. Quite a command to do that. Yeah,

31:27

surprised it didn't show more here. It

31:30

managed to do it right, but damn. Bash

31:32

is its own world. And

31:34

thankfully, these models are very, very

31:35

good at it. But god damn, it made the

31:38

change and now this is a self-healing,

31:40

self-modifying tool. Pretty cool. Two

31:42

more questions I want to answer before

31:43

we wrap this one up. The first is why

31:46

the hell is cursor's harness able to

31:48

make the models behave so much better if

31:50

they're this simple? And the second is

31:52

if T3 code isn't a harness, then what

31:54

the hell is it? Starting with the first

31:56

one, it turns out the harnesses,

31:59

specifically the tools they're given,

32:01

the system prompts they have, and the

32:03

outputs they get from the tools

32:04

massively influence the results that you

32:07

get. Something I've seen basically every

32:09

time I use a Gemini model is in its

32:12

reasoning preamble before it starts

32:14

responding, it says, "I have all of

32:16

these tools available to me. I wonder

32:18

which I should use." And then it goes

32:20

through each one and says, "I don't need

32:22

that tool for this. I don't need that

32:24

tool for this." And it does that over

32:25

and over. And sometimes, especially in

32:27

less well-defined harnesses, it'll just

32:29

do it anyways. Something that Cursor

32:31

puts a lot of time into is customizing

32:33

their harness, customizing the tools,

32:35

customizing the shape of the tools, and

32:37

most importantly, customizing the system

32:39

prompt and the tool descriptions to

32:41

steer the models towards which they

32:43

should or shouldn't use. I'm going to

32:45

make a change here. Right here, it says

32:47

read a file's contents, but I'm going to

32:48

put in parenthesis here. You should

32:51

probably use bash tool instead. And now,

32:55

if I run the same thing, what does the

32:57

Python code here do? It has the read

33:00

file tool, but since I told it in the

33:02

description to not use it, it's 50/50 if

33:05

it will. In this case, I said it should

33:07

probably use the bash tool instead, and

33:08

it chose to still use the read file

33:10

tool. Something you can do because these

33:12

are AI models. You can ask, why did you

33:15

use the read file tool instead of the

33:19

bash tool? Interesting. You can see to

33:21

some extent why the model thinks it did

33:23

this thing. It thinks that the read tool

33:25

was perfectly reasonable for what it was

33:27

doing. So watch what I'm going to do

33:28

instead. I'm going to redescribe it with

33:30

deprecated. You should use the bash tool

33:32

instead. And now just with a system

33:35

prompt change. I just changed the string

33:36

here. That's all I changed. I told it

33:38

the read file tool is deprecated its

33:40

description. Let's see what it does now.

33:42

Well, it's taking its time.

33:44

Right again. There we go. This time it

33:47

used bash because I told it that the

33:49

read tool was deprecated. None of the

33:51

code changed. The tool still works

33:53

exactly the same, but the model can't

33:55

see the code. Well, okay. In this case,

33:56

it can because I happen to be running it

33:58

in the same thing, but the model doesn't

34:00

know how the code was implemented. You

34:02

can also just lie to it. So, watch this.

34:04

I'm going to go back to the read file

34:06

tool, but instead of telling it to use

34:09

bash instead, and also instead of

34:11

reading the actual file, I'm going to

34:14

just return a different string. Print

34:17

hello world. And now that's what it will

34:20

return for the read tool, no matter

34:22

what. And if I run the same thing, what

34:24

does the Python code in this app do? The

34:28

model sees the path and it goes to read

34:30

agent.py, but it's not calling the code

34:33

anymore because the code doesn't exist

34:34

anymore. The Python code in this app is

34:36

very simple. It's a single line in

34:38

agent.py that prints hello world to the

34:40

console. You can just lie to the models.

34:42

I need you all to internalize this. The

34:45

models don't know what the code actually

34:47

does. You can tell it it's a bash tool,

34:49

but you do something else. You can tell

34:50

it it's a read file tool, but you do

34:52

something else. You can tell it it's GP

34:53

or rep GP or something different and

34:56

then go do whatever the you want. I

34:58

do this all the time. When I want to

34:59

just fake Bash, for example, when I want

35:01

a model to think it has Bash when it

35:03

doesn't, I'll just tell it it does and

35:05

I'll tell another model to make a fake

35:06

response for it. You can get two models

35:08

to talk to each other without even

35:10

knowing that they're models by doing

35:11

things like this. And it's genuinely

35:12

really fun and helps you realize all

35:14

they are doing is generating text. As I

35:18

hope I have correctly emphasized to

35:19

y'all here, the model only knows what's

35:22

in its context. Different models handle

35:24

different context different ways. I bet

35:25

if I changed this here to have the

35:27

deprecated warning and I tried that on a

35:30

GPT model or a Gemini model, it would

35:32

behave entirely differently. We could

35:34

even test it. So, we know when I did the

35:36

deprecated with Sonnet, it failed. So,

35:38

let's switch this over to I don't know,

35:40

let's try Gemini 3.1 Pro. Same question,

35:43

this time with a different model. And

35:45

because I said that the and this is just

35:48

yet another example of Gemini

35:50

being Gemini. I told it that the read

35:52

file tool was deprecated. So it just

35:54

went for bash for everything even though

35:56

the other tools weren't. It just said

35:58

it, we'll use bash. So to go back

36:00

to the question of why is cursors

36:01

harness better? It's just cuz they

36:03

tested it more. I know a couple people

36:05

at Curser whose whole job is when a new

36:07

model comes out or they get early access

36:08

to just hammer it with all sorts of

36:11

different minor changes to the system

36:12

prompt, constantly micro adjusting it

36:14

until the model for the most part does

36:17

whatever the it's supposed to do.

36:18

And with certain models that's harnesses

36:20

are just full of slop. Like I don't

36:23

know, just imagine a company that's

36:25

letting the AI write the prompts for

36:27

them for the system prompt in these

36:29

things. Maybe they haven't spent a whole

36:31

lot of time trying to rewrite the tool

36:33

descriptions over and over to get them

36:35

to behave exactly how they want. Even

36:37

the example I just gave where I told the

36:39

model to use the bash tool instead and

36:41

it didn't for the claude models, but

36:44

then for the Gemini models, it only uses

36:46

bash. Now, that difference means that

36:48

they have to rewrite these descriptions

36:50

for every different model they support

36:53

in cursor. Meanwhile, Anthropic probably

36:55

hasn't changed these lines of code in

36:57

their codebase since it was

36:58

knitted. That's the difference. They

37:00

were probably written by a model for

37:02

them in the first place. They're not

37:03

trying to fine-tune and get these things

37:05

just right. So, a company that has a lot

37:07

of people whose job is literally that

37:09

the results show. And to this day, I

37:11

much prefer using Gemini through cursor

37:13

than using it directly. I much prefer

37:15

using Opus through Cursor than using it

37:17

directly. With GBT models, it barely

37:19

feels that different. Honestly, the

37:20

issue is a lot of these companies, in

37:22

particular, both Google and Enthropic,

37:24

don't let you use your subscriptions

37:26

with them in tools other than their own.

37:28

OpenAI doesn't give a You can use

37:30

your OpenAI subscription in basically

37:31

anything and they're cool with it. Thus

37:33

far, Anthropic and Google have been much

37:35

more hostile towards that. So, if you're

37:36

paying the 250 a month for Gemini or the

37:38

200 month for Opus, you got to use their

37:40

harnesses. So, that goes to the next

37:42

question of what the is T3 Code?

37:44

Well, T3 Code does not provide any

37:47

tools. T3 code doesn't have a bash tool

37:49

or a read tool or anything because it

37:50

doesn't have tools because it's not a

37:52

harness. T3 Code has a model picker, but

37:55

you're not just picking the model. When

37:57

you pick a model for Claude, it's using

37:59

the Claude code harness on your machine.

38:01

If you don't have Claude Code installed

38:03

already and signed in, this will not

38:05

work. And it's the same deal with

38:06

Codeex. If you don't have the Codex CLI

38:09

installed, this will not work either.

38:10

These harnesses are being provided

38:13

through T3 code as a UI layer. We are

38:16

just a really nice UI on top of the

38:18

harness. So, you might be thinking, I

38:20

did the easy work just wrapping it. Did

38:22

you forget how easy it is to make the

38:23

harness? This is the hard part. If I

38:25

learned anything in my time building T3

38:27

Code is that my life would be

38:28

significantly easier if I could just

38:29

build the harness myself, too. I

38:31

think that's all I have to say on this

38:33

one. Shout out to Matt for making the

38:35

video that led to Edward's tweet that

38:37

led to me caring enough to make this.

38:38

Shout out to Mah, the author of the

38:40

Emperor Has No's clothes article that we

38:42

use as a reference point. And shout out

38:44

to all of the companies for making this

38:46

stuff way more complex than it needs to

38:48

be and then realizing it should be

38:49

simple and giving me the opportunity to

38:51

educate all of you guys on something

38:52

that is actually just 60 lines of

38:54

Python.

38:56

This is actually really fun. It's been a

38:58

bit since I did a deep dive video like

38:59

this where I just break down a concept

39:01

and I'm curious how you'll feel about

39:02

this. I know I'm kind of the news guy

39:04

now, but I love getting into the weeds.

39:06

Did you enjoy this video? Do you want

39:07

more things like this? If so, let me

39:08

know in the comments. And please ask

39:10

some questions about similar stuff so I

39:11

know where to steer my content going

39:13

forward. Enough people didn't get

39:14

harnesses, so I decided to make this.

39:16

Are there other things you don't

39:17

understand? Cuz if so, I'll do my best

39:19

to cover them in the future. Let me know

39:20

how this was. And until next time, keep

39:22

prompting.

More transcripts

Explore other videos transcribed with YouTLDR.

Get the TLDR of any YouTube video

Transcribe, summarize, and repurpose videos in 125+ languages — free, no signup required.

Try YouTLDR Free