0:14
Hey there, I'm Mario. I built Pi in a
0:17
world of slop and this is a strategy, a
0:19
tragedy in three acts. Just to talk
0:22
about this real quick, bunch of people
0:23
on the internet gave me money for ad
0:25
space on my torso and all of that goes
0:26
to a charity. So yeah, thanks guys.
0:29
So, act one: building Pi. In the beginning
0:32
there was Claude Code, and it was good, right?
0:34
We all got basically catnipped by that
0:37
thing and stopped sleeping. There was a bunch of
0:41
stuff before that, but Claude Code
0:42
was the one thing that kind of clicked
0:44
with me the most. And to preface all of
0:46
this: I love the Claude Code team. They
0:48
are brilliant people, talented, super high
0:50
velocity. They also created the
0:53
entire game, so major props to them. So this
0:56
is not a roast. This is just me, an old
0:58
man, telling you why I stopped using
0:59
Claude Code and built my own thing. In
1:03
2025 I started using Claude Code, in about
1:05
April I think, thanks to Peter, because
1:08
he told us the agents are working now
1:12
and back then it was simple and
1:13
predictable and fit my workflow but
1:17
the token madness got hold of them I
1:19
think and the team got bigger and they
1:20
started dogfooding that stuff and
1:23
built a lot of features, a lot of
1:24
features I don't need, which is fine; I
1:26
can just ignore them. But with velocity
1:28
and more features come more bugs, and
1:30
that's bad, because I used to work at
1:33
construction sites and if my hammer
1:35
breaks every day I'm getting really mad
1:36
and if my development tools break every
1:38
day I'm also getting mad. So there was
1:41
this, it's just a running gag, and here's
1:43
Thariq telling us that Claude Code is now a
1:44
game engine, and here's Mitchell from
1:46
Ghostty telling us no, it's not. And
1:48
eventually they fixed the flicker, but
1:50
then other stuff broke, and I think
1:51
they're now in the third iteration of a
1:54
TUI renderer. Yeah, but that's just a
1:56
symptom. The real problem is that my
1:58
context wasn't my context. Claude Code is
2:01
the thing that controls my context. And
2:03
behind my back, Claude Code does things
2:05
to the context. So you have the
2:08
system prompt which changes on every
2:09
release, including the tool definitions.
2:11
They would remove tools, modify tools.
2:14
It's not good. They would insert system
2:17
reminders in the most inopportune place in
2:20
your context, telling the model, here's
2:21
some information. It may or may not be
2:24
relevant to what you're doing. It
2:26
actually says it may or may not be
2:27
relevant to what you're doing. And that
2:29
kind of confused the model and that kind
2:30
of broke my workflows.
2:34
On top of all that, there's zero
2:35
observability because that's how the
2:36
tool is constructed and I like knowing
2:39
what my agents are doing. There's zero
2:41
model choice which is obvious. It's the
2:42
native Anthropic harness. So it makes
2:44
sense for them to want you to use Claude,
2:46
right? And there's almost zero
2:48
extensibility. Some of you might have
2:50
written some hooks for Claude Code, but
2:51
I'm telling you the number of hooks and
2:54
the depth of those hooks is very
2:55
shallow. And every time a hook
2:58
triggers, what actually happens is a new
3:00
process gets spawned. Basically, the
3:01
command you specified for the hook to be
3:03
executed. And I don't find that
3:05
particularly efficient. So I took a
3:08
step back and looked around for
3:09
alternatives. And I'd like to especially
3:11
call out Amp and Factory Droid, the
3:14
Porsche and Lamborghini of coding agent
3:16
harnesses. So, if you can afford them,
3:17
please use them. They're at the
3:18
frontier. They're really good, and the
3:20
teams are fantastic. And there's a bunch
3:22
of other options. And I have history in
3:23
OSS. So naturally I kind of gravitated
3:26
towards OpenCode, and again: brilliant
3:28
team, super high execution velocity, and
3:31
they don't sell you hype, they sell you
3:33
tools that work for the most part. I
3:36
started looking under the hood of
3:38
OpenCode with respect to context handling
3:40
as well because that's the most
3:41
important part for me and I found a
3:43
bunch of things. Given some
3:45
conditions, OpenCode would just prune
3:49
tool output after a specific minimum
3:52
amount of tokens, and that basically
3:54
lobotomizes the model. There's also LSP
3:57
server support which means every time
3:59
your model is calling the edit tool,
4:02
OpenCode goes to the LSP server that's
4:03
connected, asks "are there any errors?", and
4:06
if so injects that as part of the edit
4:08
tool result, which is bad, because
4:11
think about how you edit code:
4:13
you're not writing a line of code,
4:15
checking the errors, writing the next
4:16
line, checking the errors. You don't do
4:18
that. You finish your work and then you
4:20
check the errors. This confuses the model.
4:23
There's a bunch of other things, like
4:24
storing individual messages of a session
4:26
in a JSON file. Each message is a
4:29
JSON file on disk. There was this, and
4:31
this happens to all of us, no blame
4:33
there, but it's not great if by default
4:36
a server spins up and CORS headers are
4:38
set in such a way that any website you
4:39
open in your browser can now access your
4:41
OpenCode server. That's, yeah. And
4:44
entirely unrelated to all of this, I
4:46
started looking into benchmarks for
4:47
coding agent harnesses and found
4:49
Terminal-Bench, which is a pretty good
4:52
benchmark all things considered. And the
4:54
funny part about it is that it's the
4:56
most minimal kind of thing you can think
4:58
of. All it gives the model is a tool to
5:01
send keystrokes to a tmux session
5:03
and read the output of that tmux
5:05
session. There's no file tools, no
5:07
subagents, none of that stuff. And it's one
5:11
of the best performing harnesses in the
5:12
leaderboard. Here's the leaderboard from
5:14
December 2025. Irrespective of model
5:17
family, Terminus scores high, mostly
5:20
even higher than the native harness
5:22
of that model. So what does that tell
5:25
us? I formed two theses. The first is: we are in the
5:28
[ __ ] around and find out phase of coding
5:30
agents, and their current form is not
5:31
their final form, right? The second thesis
5:35
is: we need better ways to [ __ ] around,
5:37
and for me that means self-modifying,
5:40
malleable agents things that the agent
5:42
itself can modify and I can modify
5:45
depending on my workflow. So I stripped
5:47
away all the things, built a minimal core,
5:49
but made it super extensible and made it
5:52
so that the agent can modify itself
5:55
with some creature comforts. It's not
5:56
entirely bare-bones. So that's Pi.
5:59
It's an agent that adapts to your
6:00
workflow instead of the other way
6:01
around. It comes with four packages:
6:04
an AI package that's basically just an
6:06
abstraction across providers and context
6:08
handoff between providers; an agent core,
6:11
which is just a while loop and the
6:12
tool calling; a bespoke TUI framework (I
6:15
come out of game development, so I built
6:17
a thing that actually doesn't flicker
6:18
too much); and the coding agent itself.
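To make "just a while loop and the tool calling" concrete, here is a minimal sketch of such a loop. It is not Pi's actual code: the types `Message`, `Reply`, `Model`, and `Tool`, the function `runAgent`, and the `maxTurns` limit are all hypothetical names made up for illustration, and the model is a scripted stand-in rather than a real provider call. It is also synchronous for brevity, where a real harness would await a network request.

```typescript
// Minimal agent loop sketch. All names here are hypothetical, not Pi's API.
type ToolCall = { id: string; name: string; args: Record<string, string> };
type Reply = { content: string; toolCalls: ToolCall[] };
type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCalls: ToolCall[] }
  | { role: "tool"; callId: string; content: string };

// Synchronous for brevity; a real harness would await a provider API here.
type Model = (context: readonly Message[]) => Reply;
type Tool = (args: Record<string, string>) => string;

function runAgent(model: Model, tools: Map<string, Tool>, prompt: string, maxTurns = 20): Message[] {
  const context: Message[] = [{ role: "user", content: prompt }];
  let turns = 0;
  while (turns < maxTurns) {
    turns += 1;
    const reply = model(context);
    context.push({ role: "assistant", content: reply.content, toolCalls: reply.toolCalls });
    if (reply.toolCalls.length === 0) break; // no tool calls left: the model is done
    for (const call of reply.toolCalls) {
      const tool = tools.get(call.name);
      let output: string;
      try {
        output = tool ? tool(call.args) : `error: unknown tool ${call.name}`;
      } catch (err) {
        output = `error: ${String(err)}`; // report failures back to the model
      }
      context.push({ role: "tool", callId: call.id, content: output });
    }
  }
  return context;
}

// Usage with a scripted stand-in for the model: one tool call, then a final answer.
let step = 0;
const fakeModel: Model = () =>
  step++ === 0
    ? { content: "", toolCalls: [{ id: "1", name: "echo", args: { text: "hi" } }] }
    : { content: "done", toolCalls: [] };
const echoTools = new Map<string, Tool>([["echo", (args) => args.text ?? ""]]);
const transcript = runAgent(fakeModel, echoTools, "say hi");
console.log(transcript.map((m) => m.role).join(" -> "));
// prints "user -> assistant -> tool -> assistant"
```

The point of the sketch is how little there is: the context is a plain array that only the loop appends to, so nothing gets inserted or pruned behind your back.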
6:21
Here's Pi's system prompt.
6:23
That's it. Eventually the industry
6:26
created a new standard called skills
6:28
which is basically just markdown files.
6:30
So we added that as well, and that needs
6:31
to go in the system prompt. So,
6:33
begrudgingly, we had to add a couple more
6:35
lines. And finally, here's the magic
6:37
that makes Pi able to modify itself. We
6:40
ship the documentation which was
6:42
handcrafted by me and an agent, and
6:45
code examples of extensions,
6:48
and all we need to do for the agent to
6:50
modify itself is tell it, here's the
6:52
documentation. Here's some code that
6:54
shows you how to modify yourself by
6:57
It comes with four tools. That's all it
6:59
has: read, write, edit, bash. Here's the
7:01
tool definitions. Don't read the
7:02
text. Just look at the size.
7:05
That's it. Here's what happens when you
7:08
start a new session in one of these
7:11
So the thing is the models are actually
7:13
reinforcement-trained up the wazoo. So
7:15
they know what a coding agent is, because
7:17
a coding agent harness is basically what
7:19
they're being trained in when they are
7:20
post-trained. You don't need 10,000
7:22
tokens to tell them you're a coding
7:24
agent. They know because they are coding
7:26
agents. Now, Pi is also YOLO by default
7:29
because my security needs are different
7:30
than yours. And I don't think a little
7:32
dialog that pops up every
7:35
time you call bash, asking you to approve,
7:38
is a smart security mechanism. So
7:41
instead I give you so much rope that you
7:44
can build anything that's fit for your
7:46
specific security needs. There's also
7:49
stuff that's not built in. I'm a he
7:53
because this is how I do it. But if you
7:56
don't like that then you just ask Pi to
7:57
build you subagent support or plan mode
8:00
or MCP support whatever you need.
8:02
Extensibility comes with a bunch of
8:04
table stakes and then with the
8:06
extensions themselves. Extensions in Pi
8:08
are just TypeScript modules. In the
8:10
simplest case, a TypeScript file on disk.
8:12
You point Pi at that. Here's an
8:14
extension loaded as part of the harness.
8:16
And with that you get basically an
8:19
extension API that lets you hook into
8:21
everything and define stuff for the
8:23
harness to expose to the model.
8:25
And that includes tools, slash commands, and
8:28
shortcuts. You can listen in on any kind
8:29
of event and react and then save state
8:32
in the session that's optionally
8:36
provided to the agent as well or stored
8:38
there for tools that analyze sessions as
8:41
part of your organizational workflows.
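As a rough illustration of that shape (a module that registers a tool, a slash command, and an event listener, and saves state in the session), here is a sketch. Everything in it is an assumption for illustration only: `ExtensionAPI`, `registerTool`, `registerCommand`, `on`, `saveState`, and the `FakeHarness` class are invented names, not Pi's real extension interface, so check the shipped documentation for the actual API.

```typescript
// Hypothetical extension API, invented for illustration. Not Pi's real interface.
type ToolHandler = (args: Record<string, string>) => string;
type HarnessEvent = { type: "session_start" } | { type: "tool_call"; tool: string };
type Listener = (event: HarnessEvent) => void;

interface ExtensionAPI {
  registerTool(toolName: string, description: string, handler: ToolHandler): void;
  registerCommand(commandName: string, run: (input: string) => string): void;
  on(type: HarnessEvent["type"], listener: Listener): void;
  saveState(key: string, value: unknown): void;
}

// An extension is a function the harness calls with its API object.
type Extension = (api: ExtensionAPI) => void;

const todoExtension: Extension = (api) => {
  const todos: string[] = [];
  let toolCalls = 0;
  api.registerTool("todo_add", "Add an item to the todo list", (args) => {
    todos.push(args.text ?? "");
    api.saveState("todos", [...todos]); // stored with the session
    return `added (${todos.length} total)`;
  });
  api.registerCommand("todos", () => todos.join("\n"));
  api.on("tool_call", () => {
    toolCalls += 1;
    api.saveState("toolCalls", toolCalls);
  });
};

// Tiny in-memory stand-in for the harness, just enough to load the extension.
class FakeHarness implements ExtensionAPI {
  readonly tools = new Map<string, { description: string; handler: ToolHandler }>();
  readonly commands = new Map<string, (input: string) => string>();
  readonly state = new Map<string, unknown>();
  private readonly listeners = new Map<string, Listener[]>();

  registerTool(toolName: string, description: string, handler: ToolHandler): void {
    this.tools.set(toolName, { description, handler });
  }
  registerCommand(commandName: string, run: (input: string) => string): void {
    this.commands.set(commandName, run);
  }
  on(type: HarnessEvent["type"], listener: Listener): void {
    this.listeners.set(type, [...(this.listeners.get(type) ?? []), listener]);
  }
  saveState(key: string, value: unknown): void {
    this.state.set(key, value);
  }
  // Notify listeners first, then run the tool, as a harness might.
  callTool(toolName: string, args: Record<string, string>): string {
    for (const listener of this.listeners.get("tool_call") ?? []) {
      listener({ type: "tool_call", tool: toolName });
    }
    const tool = this.tools.get(toolName);
    if (!tool) throw new Error(`unknown tool: ${toolName}`);
    return tool.handler(args);
  }
}

const harness = new FakeHarness();
todoExtension(harness);
const toolResult = harness.callTool("todo_add", { text: "write docs" });
console.log(toolResult); // prints "added (1 total)"
```

Because the extension only talks to the API object, the same function could be loaded, unloaded, and reloaded inside a running session without touching the core.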
8:43
You can do custom compaction, custom
8:45
providers and you have full control over
8:46
the tool. So you can modify everything
8:48
in Pi, and you can then bundle all of
8:50
that up and put it on npm or on GitHub,
8:53
because I think we don't need to
8:55
reinvent another bunch of silos called
8:58
marketplaces. We already have package
9:00
managers. And all of that hot
9:03
reloads. So if you develop an extension
9:06
for Pi, you do so in the session and you
9:09
hot reload the changes and see the
9:12
effects of that immediately, which is
9:14
great. It's also a game
9:15
development thing: in game development
9:17
you want very short iteration
9:20
times, and that's great. So a couple of
9:23
examples. Claude Code, or Anthropic, ships the
9:25
"by the way" slash command, which lets you talk to
9:26
the agent while it goes on its main quest. I
9:29
posted this little prompt on Twitter
9:31
jokingly and somebody built it in five
9:33
minutes with more features, and they
9:35
didn't have to fork or clone Pi. They
9:37
just let the agent write the extension
9:40
based on the prompt. Here's Nico. He's
9:42
one of the most prolific extension
9:44
writers. I don't know what the [ __ ] is
9:46
going on here. It's a chat room for all
9:48
of his Pi agents and they talk with each
9:49
other. I would never use this, but all
9:51
of this is custom including the UI. or
9:53
you can play NES games or you can play
9:58
And there's a bunch of other examples
9:59
I'm not going to talk about. So, how do
10:01
you build a Pi extension? You don't. You
10:03
tell Pi to build it for you based on
10:05
your specifications. And then you just
10:06
iterate with it on that and hot reload
10:08
during the session. I'm going to skip
10:10
that example as well. And if you don't
10:12
like building things yourself, and I
10:14
hope you do like building things
10:15
yourself, but if you don't, you can look
10:17
on npm or our little search interface
10:20
on top of npm to find packages for
10:23
subagents, MCP, and so on. So, does it
10:25
actually work? Well, here's the
10:27
Terminal-Bench leaderboard from October, before Pi
10:29
had compaction. I added that for Peter's
10:31
claw thingy. It scored sixth place.
10:35
Uh, but none of this is actually about
10:36
Pi. I basically
10:39
want you to retake control of your tools
10:40
and workflows. So build your own. And
10:43
if you want to know more about Pi and
10:44
OpenClaw, go to this talk please. Yeah.
10:46
And then eventually Peter happened. He
10:48
put Pi inside of OpenClaw as its agentic
10:51
core, which meant my open source project
10:53
became the target of a lot of OpenClaw
10:55
instances, unbeknownst to their users. So
10:57
this is act two: OSS in the age of
10:59
clankers. Clankers are destroying OSS.
11:01
Here's tldraw. They closed down the
11:03
issue and pull request tracker. Here's
11:05
OpenClaw's trackers. Here's mine.
11:08
Half of that is OpenClaw instances
11:10
that post garbage. So I started to rage
11:12
against the clankers.
11:14
If you send a pull request, it gets
11:16
auto-closed with a comment that asks you
11:18
to please write a nice issue in your
11:21
human voice, no longer than a screen's
11:22
worth of text. And if I see that, I write
11:25
"looks good to me" and your account name
11:26
gets put in a file in the repository and
11:28
the next time you send a pull request,
11:30
it's let through. Clankers don't read
11:33
that comment. They don't go back once
11:34
they've posted a pull request. So that's a
11:36
perfect filter. Mitchell eventually
11:38
turned it into Vouch. Here's a clanker.
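The gate described here (auto-close pull requests from unknown accounts, let vouched accounts through, add an account once the maintainer replies "looks good to me") can be sketched roughly as follows. This is a guess at the logic, not the actual automation or Vouch itself: the file format (one account name per line, `#` for comments), the wording of the close comment, and the function names `parseVouched`, `triagePullRequest`, and `applyMaintainerReply` are all assumptions.

```typescript
// Sketch of the pull request gate. File format and phrases are assumptions.
type Decision = { action: "allow" } | { action: "close"; comment: string };

// Placeholder wording, not the real comment text.
const CLOSE_COMMENT =
  "Please open an issue first, written in your own voice and no longer than one screen of text.";

// Assumed format: one account name per line, '#' starts a comment line.
function parseVouched(fileText: string): Set<string> {
  const names = fileText
    .split("\n")
    .map((line) => line.trim().toLowerCase())
    .filter((line) => line !== "" && !line.startsWith("#"));
  return new Set(names);
}

// Pull requests from accounts that are not in the file get closed automatically.
function triagePullRequest(author: string, vouched: ReadonlySet<string>): Decision {
  return vouched.has(author.toLowerCase())
    ? { action: "allow" }
    : { action: "close", comment: CLOSE_COMMENT };
}

// The maintainer approves an issue author by replying "looks good to me" (or "lgtm").
function applyMaintainerReply(issueAuthor: string, reply: string, vouched: Set<string>): boolean {
  const approved = /^(lgtm|looks good to me)\b/i.test(reply.trim());
  if (approved) vouched.add(issueAuthor.toLowerCase());
  return approved;
}

const vouched = parseVouched("# accounts allowed to open pull requests\nalice\n");
const before = triagePullRequest("bob", vouched).action;
applyMaintainerReply("bob", "Looks good to me", vouched);
const after = triagePullRequest("bob", vouched).action;
console.log(before, after); // prints "close allow"
```

The filter works because it requires a second, human step: an agent that never returns to read the close comment never reaches the approval branch.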
11:40
I also labeled them. If you had
11:42
interactions with OpenClaw, your issues
11:44
get deprioritized. I also built tools
11:47
where I embed issue and pull request
11:49
texts into 3D space, so I see clusters
11:52
of issues. I also invented the OSS
11:54
vacation: I just close the tracker
11:56
whenever I want. So I have my life back.
11:58
So does this work? Yes, sort of.
12:02
Which leads me to act three. Slow the
12:04
[ __ ] down. Everything's broken.
12:09
And then there's people that say, "Our
12:10
product's been 100% built by agents."
12:12
Yes, we know it [ __ ] sucks now.
12:22
And I'm hearing this from my peers and
12:24
this is entirely unhealthy.
12:27
So here's how we should not work
12:28
with agents and why, at least in my
12:30
opinion. I wrote this on my blog a while
12:32
ago, but the basics are this. We're having
12:34
armies of agents, and you're using Beads
12:36
and you don't know that it's
12:38
basically uninstallable malware, and
12:40
Anthropic built a C compiler that kind of
12:41
works but actually doesn't, and we're
12:43
hoping the next generation of models
12:44
will fix it. And here is Cursor building a
12:46
browser, and that's also super [ __ ]
12:48
broken, but the next generation will fix
12:50
it. And SaaS is dead, software is solved in
12:52
six months, and my grandma just built
12:54
herself a Spotify with her OpenClaw.
12:56
Come on, people. So agents are actually
13:00
compounding booboos, which is my word for
13:01
errors, with zero learning, no
13:03
bottlenecks, and delayed pain. The
13:06
delayed pain is for you. Here's your
13:08
codebase on one human, on one agent, and on 10
13:11
agents. How much of the agent code can
13:13
you review? Here's the same codebase but
13:16
expressed in number of booboos per day.
13:19
How many of those booboos do you think
13:21
you'll find? Then you say, "Oh, I have a
13:23
review agent." Let me introduce you to
13:26
the wonderful world of the ouroboros. Doesn't
13:28
work. It catches some issues. The
13:31
problem is that agents are merchants
13:32
of learned complexity. Where did they
13:34
learn that complexity from? From the
13:36
internet. What's on the internet? All
13:37
our old garbage code. There are some
13:39
pearls on the internet, really
13:41
well-designed systems, but 90% of code
13:43
on the internet is our old garbage. And
13:45
that's what the models learn from. And
13:47
every decision of an agent is local,
13:49
especially if the codebase is so big
13:51
that it doesn't fit into its context.
13:52
And if you let it go wild and add
13:55
abstractions everywhere that are
13:57
intertwined. So that leads to lots
14:00
of abstractions and duplication and
14:02
backwards compatibility. Who has seen
14:04
that in the output of their agent? It's
14:06
[ __ ] annoying or defense in depth. So
14:09
yeah, you get enterprise grade
14:11
complexity within two weeks with just
14:13
two humans and 10 agents.
14:16
And then you say, but my detailed spec.
14:19
Yes, sure. You know what we call a
14:21
sufficiently detailed spec? It's a
14:25
So if you leave blanks in your spec,
14:28
what do you think happens? How does the
14:29
model fill in the blanks? And with what
14:31
does it fill that in? It fills it in
14:34
with the garbage that it learned on the
14:35
internet from our old code, which is
14:37
garbage to mediocre. And then you say,
14:39
but humans also, yes, humans are
14:41
horrible, fallible beings, but they
14:44
can learn and they are bottlenecks.
14:46
There's only so many booboos they can add
14:48
to your codebase on a daily basis. And
14:51
humans feel pain, which is a very
14:54
interesting property because humans hate
14:55
pain. And once there's too much pain,
14:57
the human has a bunch of options. They can
15:00
quit their job. They can blame somebody
15:04
else and make them fix it or everybody
15:06
bands together and starts refactoring
15:07
the [ __ ] out of the garbage codebase,
15:10
right? Agents will happily keep [ __ ]
15:16
And no, your AGENTS.md and super complex
15:19
memory systems will not save you. Agents
15:21
don't learn the way we learn.
15:24
Those are my most beloved people: "I
15:26
don't even read the code anymore."
15:28
Congratulations. Something is broken and
15:31
your users are screaming. So, who you
15:32
going to call? Not yourself because you
15:35
haven't read the code. So, you're
15:36
relying on your agents, but they are now
15:38
also overwhelmed because the codebase is
15:40
so humongous that there's absolutely
15:42
zero chance they can get all the context
15:44
they need to fix the issues. And long
15:46
context windows are a hack, as most of
15:49
you will find out this year as
15:50
everybody's switching to 1-million-token
15:52
context windows, and agentic
15:54
search is also failing.
15:57
So the agent patches locally and [ __ ]
15:59
[ __ ] up globally. If you see this in
16:01
your codebase, you're [ __ ]
16:06
So you cannot trust your codebase
16:08
anymore, and also not your tests, because
16:09
your agent wrote your tests. So good
16:11
game. So here's how I think we should
16:13
work. Um there's a bunch of properties
16:15
for good agent tasks. That means scope.
16:18
If you can scope it in such a way that
16:20
the agent is guaranteed to find all the
16:22
things it needs to find to do a good
16:23
job, you're done. That means modularize
16:26
your codebase. If you can give it a
16:28
function to evaluate how well it did the
16:30
job, even better. Hill climbing, auto
16:32
research. Anything
16:34
non-mission-critical, let it vibe. Boring stuff, let
16:36
it vibe. Reproduction cases for user
16:39
issues, which are usually only partial
16:40
in information, perfect. I don't spend
16:43
any mornings anymore doing that. Or if
16:44
you don't have a human near you, rubber
16:46
duck. So, lots of tasks you can use them
16:48
for and save time. At the end of that,
16:51
you evaluate. You take what's
16:53
reasonable. Most of it isn't. And then
16:55
finalize. My final slide, more or less,
16:58
slow the [ __ ] down. Think about what
17:00
you're building and why. And don't just
17:02
build because your agent can do it. Now,
17:03
that's stupid. Learn to say no. This
17:07
is your most valuable capability at
17:10
the moment. Fewer features, but the ones
17:12
that matter. And then use your agents to
17:14
polish the [ __ ] out of that. Delight
17:16
your users, not your token-maxing
17:20
desires. Cap the amount of generated
17:22
code that you need to review.
17:26
And non-critical code, sure, vibe slop
17:28
ahead. Critical code, read every [ __ ]
17:30
line. See the keynote after me for more
17:33
info on that. So, how do you know what's
17:35
critical? Any guesses?
17:38
Well, you read the [ __ ] code. If
17:42
you do anything important, write it by
17:43
hand. You can use a clanker to help you
17:45
with that, but don't let it make the
17:47
decisions for you because we've learned
17:49
all the decisions it makes are learned
17:51
from the internet. And that friction is
17:53
the thing that builds the understanding
17:55
of the system in your head, which is
17:57
important. And it's also where you learn
18:01
new things. And all of this requires
18:03
discipline and agency. And all of this
18:06
still requires humans. Thank you.