0:15
morning. Thanks for having us. Um, today
0:17
I want to talk with Christina about
0:19
friction a little bit. Um
0:23
this is um a social preview that came
0:28
up automatically when someone submitted
0:34
um basically this is a forum
0:36
post that goes with um a security
0:38
incident that was deployed accidentally.
0:40
It was a configuration change that
0:42
caused a problem and the social preview
0:44
post had the marketing tagline of that
0:46
company which said ship without
0:49
Um, and we want to encourage you to add a
0:53
little bit of friction to it. Um, and
0:56
I'll tell you why. So, who are we? Um,
0:59
I've been doing software development for
1:01
20 years, most of it in the open source
1:03
space. Um, I have created Flask, which
1:06
is a Python framework, which ironically
1:08
is so much in the weights that a lot of
1:10
people um are learning about it now
1:13
because the machines are producing it.
1:15
Um, and I left my previous company that
1:18
I worked for, Sentry, in April last
1:19
year, which perfectly coincided with um,
1:22
me having time and then obviously Claude
1:24
Code. And so I fell deep into a hole of
1:27
agentic engineering and I started writing
1:29
on my blog, and a lot of people
1:31
reached out to me over the last year um,
1:33
being all excited about this. Um, and
1:36
then I started with a friend in October,
1:39
a company called Earendil where we are
1:42
trying to make sense of all the AI
1:46
>> yeah, and my name is Christina and I
1:48
work with Armin at this company called
1:50
Earendil. But importantly, I am what I
1:52
like to call a native AI engineer. And
1:55
what that basically means is that these
1:57
tools have been around longer than I
1:59
have. Um, so what this means is like
2:01
they've been super foundational in how
2:03
I've become a software engineer. Not
2:05
just because obviously I use them to
2:06
work, but also because this is the means
2:08
by which I've learned to do what I do.
2:11
And before Earendil I was working at
2:16
>> So we want to share a little bit from
2:18
practice not just theory but um I will
2:20
readily admit that I don't think we have
2:22
all the solutions. So we have been
2:24
building with or on agents for a good 12
2:26
months. Um we had huge leverage and
2:29
great disappointment and we really
2:32
keep running into two types of problems.
2:34
Um I think especially if you listen to
2:37
some earlier talks at this conference
2:39
you will have learned a lot about um
2:41
that you should keep using your brain.
2:43
Um it's for some reason that's really
2:45
really hard. So there's a psychological
2:46
problem and the other one is the
2:48
engineering challenge, which is that they
2:49
seem to be producing worse code for some
2:52
people and better code for some other
2:53
people and like what is it that actually
2:54
makes that work. Um and so this is
2:57
not really a solution so much as our part
2:59
of the journey and how, so far, we think we
3:01
have managed. Um yeah, so problem number
3:06
one is the psychology part which is like
3:08
why is it even though everybody told you
3:10
many times over that you should be using
3:11
your brain, you should be slowing down,
3:13
it's actually incredibly hard. It's just
3:14
one more prompt and we don't sleep
3:16
that much. Like what is it that actually
3:18
makes it so hard? And then would it be
3:21
that hard if the machines would actually
3:22
be writing perfect code and we wouldn't
3:23
have to think quite as much and like
3:25
is there something we can do
3:27
to make this a little bit better?
3:29
So I'll begin by introducing the first
3:31
part of these problems, the psychology
3:33
problem. And what I want to talk first
3:35
about is the shift. So I'm sure a lot of
3:38
us here who have been playing with these
3:39
tools for a while now experienced this
3:41
at some point. We were prompting
3:43
prompting, not so good, and then at some
3:45
point suddenly it clicked and they were
3:47
really really useful for us and it was
3:50
fun in the beginning and they gave us a
3:52
lot of extra time right because not
3:53
everyone was using them. They were
3:55
actually tools that made us more
3:56
productive, that made it more fun to do
3:58
our jobs. But very quickly, because they
4:00
were so useful and they got us so
4:01
hooked, everyone was using them. And so
4:03
this kind of had the opposite effect
4:06
where suddenly the baseline expectation
4:08
was just that everyone is now using them
4:10
and you have to use them. And so this
4:12
fun and free time translated into
4:15
pressure. Now we all have to ship faster
4:17
and produce more code. And it is just
4:20
not sustainable to review and to
4:22
actually have time to think.
4:25
And so this leads us to the trap and I
4:28
actually think there are two parts to this
4:30
problem, to this trap, and one of them a
4:32
lot of engineers have spoken about and
4:33
it's that these tools are super
4:35
addictive. You never know if that next
4:38
prompt is going to be the one that makes
4:40
your product work and you've added a new
4:41
feature or if it's going to be that last
4:43
drop of slop that brings your product
4:45
crashing down. And so it's very
4:48
addictive. We keep doing what we're
4:49
doing. It's not a great solution. But
4:52
also most importantly, and I don't think
4:53
we realize this as much is that because
4:55
we produce a lot of output very fast, we
4:58
are tricked into thinking that we're
4:59
actually being more efficient doing more
5:01
work. And this is quite the opposite
5:03
because now we don't have as much time
5:05
to actually stop and think and design
5:07
what we're doing. Ask ourselves, is this
5:08
the best way in which I can implement
5:10
this or could I be doing something
5:12
better? And when you're in this flow,
5:15
it's very difficult for yourself to stop
5:17
and it's definitely very difficult for
5:18
your agent to stop because it's running
5:20
around and it's reading files that it
5:22
should have never even read. So we are
5:24
the ones that need to actually have the
5:26
agency to be in control here.
5:29
>> And one thing that, if you start
5:32
scaling this from like one person to an
5:34
engineering team that actually took me
5:35
quite a while to realize is that it
5:37
really changes the composition of the
5:39
engineering team. We were really
5:41
supply constrained by creation of code
5:43
and so like the balance between writing
5:45
code and reviewing code in engineering
5:47
teams was usually quite decent. Now
5:49
every engineer has a multiple of
5:52
producing power compared to their
5:54
reviewing power and so obviously we are
5:56
piling up on pull requests but we are
5:58
also slowly starting to expand the total
6:01
amount of humans in an organization that
6:03
are participating in engineering
6:04
process. I talked to a lot of engineers
6:06
over the last year and increasingly the
6:08
one of the things that came up is like
6:10
now I have marketing people shipping
6:11
code. I have um CEOs that
6:16
used to be like engineers now shipping
6:18
code again. And so with the roles that
6:21
those people have in the companies,
6:23
um,
6:26
the responsibility doesn't rest with
6:29
them. The responsibility still rests
6:31
with the engineering team. And so the
6:34
total number of entities, both humans
6:36
and machines that are participating in
6:37
the code creation process outnumbers the
6:39
ones that can carry responsibility.
6:41
We're not there where the machine can be
6:42
responsible for the code changes. And so
6:44
that has led to more and more code
6:46
reviews being skipped or rubber
6:47
stamped. Um, and on the way back to small PRs,
6:51
which we want to see again so that
6:53
this reviewing process works, um, this
6:55
amplification is something that at the
6:57
very least we need to recognize.
6:59
And so when you get this pull request
7:02
that looks really daunting and has 5,000
7:04
lines of code in it, this is actually
7:05
when you should be thinking and that's
7:06
exactly when it's the most overwhelming
7:09
and increasingly we're tapping out
7:13
On the engineering side, what we're
7:15
doing is we are creating larger pull
7:18
requests. We're creating these massive
7:20
changes because it is free now, right?
7:23
And if you think about how the
7:25
agents work, they're really optimized for
7:27
creating code that runs. Like their main
7:29
objective is write some code, run the
7:32
tests, make some progress. The
7:33
reinforcement learning sort of gets this
7:35
in. And so the agents are writing the
7:37
kind of code that you, as a
7:41
human, as a software engineer who
7:43
learned how to write code, wouldn't
7:45
necessarily write. So for instance, you
7:47
see quite a bit of code that tries to
7:49
read a config file and if it can't
7:50
read the config file, it loads some
7:51
defaults. And as an engineer, you know,
7:53
that's actually not great because I
7:54
might not notice that I'm running on
7:56
the default config. And so
7:58
I might only discover that I have a
8:00
massive problem after two hours when I
8:03
already wrote database records with
8:05
wrong data. And so these machines, they
8:08
they optimize towards making progress to
8:10
shipping stuff to like unblocking
8:12
themselves. And as a result, they're
8:14
creating many more failure conditions
8:15
than human written code normally would
8:17
do. In part this is because you as a human
8:19
feel bad when
8:22
you write code like this. There's
8:23
something that sort of builds up
8:24
emotionally in yourself, but the agent
8:26
doesn't have a reason for this. It
8:28
doesn't feel anything. And so if
8:31
you create these services that are sort
8:33
of hobbling along and they're actually
8:34
willing to recover from local
8:36
failures, you actually create very very
8:38
brittle systems. And this also means
8:42
that you're very quickly creating a
8:44
codebase of the size and complexity that
8:45
the agent itself can no longer dig
8:47
itself out from. It's going to start no
8:49
longer reading all the files that it
8:50
should. It's creating code in a new
8:52
file that already exists somewhere
8:54
else. And so this entire machinery
8:58
over time creates much more entropy in the
9:00
source code than you would normally have
9:03
if humans were on it. And a big part
9:05
of this is that humans feel bad and
9:07
agents don't really have any emotions
9:09
that they communicate to you.
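The config-file fallback described above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names, the `ConfigError` class, the JSON format, and the `database_url` key are all assumptions made for the example. The lenient loader is the pattern agents tend to produce; the strict loader fails at startup instead of two hours later.

```python
import json
import tempfile
from pathlib import Path


class ConfigError(Exception):
    """Raised when the config file is missing or invalid (hypothetical name)."""


def load_config_lenient(path: Path) -> dict:
    # The pattern described in the talk: any read or parse failure
    # silently turns into defaults, so nobody notices the problem.
    try:
        return json.loads(path.read_text())
    except (OSError, ValueError):
        return {"database_url": "sqlite:///default.db"}


def load_config_strict(path: Path) -> dict:
    # Fail fast: a missing or broken file stops the program immediately.
    try:
        raw = path.read_text()
    except OSError as exc:
        raise ConfigError(f"cannot read config file {path}: {exc}") from exc
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"invalid JSON in {path}: {exc}") from exc
    if not isinstance(config, dict) or "database_url" not in config:
        raise ConfigError(f"{path} is missing required key 'database_url'")
    return config
```

With a missing file, `load_config_lenient` returns the default database URL without any signal, while `load_config_strict` raises `ConfigError` before any records are written.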
9:11
>> But as Armin likes to say, don't worry,
9:14
not all is lost. We have found some
9:16
correlation between what the agents
9:18
really excel at doing and the types of
9:20
code bases that we actually put them to
9:22
work into. And for example, the main
9:24
example here is libraries versus
9:26
products. What we found is that for
9:28
libraries, they tend to excel a lot
9:30
more. And this makes sense because
9:31
intrinsically when you're building a
9:33
library, you tend to have a very clearly
9:34
defined problem that you're trying to
9:36
solve. And most of the time you can even
9:38
map the set of features that you want to
9:40
build to the API surface and it has very
9:43
tight constraints. And because this is
9:45
something that you probably want to
9:46
build on top of or make accessible to
9:48
other people, it's likely that it's
9:50
going to be a very simple core in which
9:52
you can then plug into. And on the other
9:54
hand, products and perhaps this is a bit
9:56
more unlucky for the rest of us because
9:58
we all probably are more into building
9:59
products. Uh it's much harder because
10:02
there are so many interacting concerns
10:04
and components like for example you have
10:06
your UI, your API response. You have
10:08
different permissions depending on the
10:10
feature flags, the billing and so on.
10:12
And so there's this very heavy
10:14
intertwining between different
10:15
components. And what this means is that
10:17
for the agent itself, it's impossible to
10:19
fit all of this into its context
10:22
window. it has no way to actually
10:24
understand the entire global structure
10:26
and so locally the agent tends to be
10:28
very reasonable but when it gets to the
10:31
global scale it becomes a bit demented.
10:34
So what we're proposing here is that
10:36
just as you would do with any type of
10:38
system design in the past, your codebase
10:40
has now become infrastructure and as
10:43
such you have to design it so
10:45
that it is also legible for the agent
10:47
and it can make the most of it.
10:51
And so this is what we're proposing is
10:53
an agent legible codebase and one of the
10:56
main points that is very clear to all of
10:58
us I'm sure is modularization. So like
11:00
we have different components and this
11:02
makes it easy for the agent to add one
11:04
feature in one spot without corrupting
11:06
everything else. But importantly this
11:07
also means modularizing your code flow
11:09
itself. So for example I've been working
11:12
on some refactoring. We're building
11:13
somewhat of an AI assistant. And for me
11:16
it was super important to understand
11:18
which steps of my code are actually like
11:20
the main points. So say you get a
11:22
user message then I pass the message to
11:24
the agent loop and then I have to deal
11:26
with the output. And this is where these
11:30
points are very clearly defined for me.
11:31
So the code was not as messy. But it
11:34
happens to be that between these points,
11:35
between these steps, that's where the
11:37
agent tends to add the most fuzz. So it
11:39
will be parsing between different types.
11:41
It's adding things to state that
11:43
shouldn't be in state. And so you end up
11:45
with these behaviors that you didn't
11:46
want to support and that are unexpected
11:48
and can be quite dangerous. Another
11:51
point is trying to follow all of the
11:53
known patterns because I think we all
11:55
know by now there's no point in fighting
11:57
the RL the reinforcement learning. The
12:00
more we can lean into it the better that
12:02
our output is going to be and it's also
12:04
more scalable down the line. Then as
12:07
mentioned with libraries like if you
12:08
have a simple core and you push the
12:10
complexity to other abstraction layers
12:12
then it's going to be easier for
12:14
yourself and the agent to be able to
12:15
read your codebase. And no hidden magic.
12:18
So for example here, uh, using React
12:21
Server Actions or using an ORM instead of
12:23
raw SQL, what this does is that it hides
12:26
intent from the agent and if the agent
12:28
can't see something it can surely not
12:32
and so to be more precise these are the
12:35
examples of mechanical enforcement that
12:37
we have been using at the company and
12:40
most of these we actually achieve with
12:42
uh linting rules. So the main example
12:44
would be no bare catch-alls. Great.
12:48
Imagine that there's an example here.
12:50
The agent found the bare catch-all and
12:51
was like, "Oh no, this is bad. Edited
12:54
it." But yeah, so we also try to have
12:58
our SQL uh always in one query interface
13:01
so that the agent doesn't have to go
13:02
hunting around the codebase finding all
13:04
of the different places because if it
13:06
misses one then you can have breaking
13:07
behaviors and again that's dangerous. We
13:10
try to have one primitive components
13:12
library for the UI and not have any raw,
13:14
for example, input boxes, uh, so
13:17
that we always have one type of
13:19
styling. It's very consistent one kind
13:21
of behavior. We don't have any dynamic
13:23
imports. And this may not sound as
13:26
important but actually we enforce unique
13:28
function names. And the reason for this
13:30
is not just more legibility for you and
13:31
the agent, but it's actually also the
13:33
token efficiency. So if your agent is
13:35
grepping for a specific feature or
13:37
something in your codebase, if it only
13:38
gets one output, it's going to be much
13:40
better at continuing with the loop. And
13:43
we've started exploring something
13:45
recently called the erasableSyntaxOnly
13:47
TypeScript mode. And what this does is
13:49
that your code is basically JavaScript
13:51
and it has the type annotations on top.
13:54
And this means that there's no
13:55
transpiling indirection because there's
13:57
one source of truth between your actual
13:59
code and the compiler. And so when the
14:02
agent is looking for errors, it doesn't
14:03
have to have this like confusion of oh
14:06
my god, where am I looking at? It is
14:08
much better at finding them.
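Two of the rules listed above, no bare catch-alls and unique function names, can be checked mechanically. The talk says these are enforced with linting rules but does not show them, so the following is a Python analog built on the standard `ast` module, with hypothetical function names. It only flags a bare `except:` and only counts `def` names; a real rule set would likely be stricter.

```python
import ast
from collections import Counter


def find_bare_excepts(source: str, filename: str = "<string>") -> list[str]:
    # Flag every `except:` clause that names no exception type.
    problems = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            problems.append(f"{filename}:{node.lineno}: bare except")
    return problems


def duplicate_function_names(sources: dict[str, str]) -> list[str]:
    # Count function definitions across all files; a name defined more
    # than once means a grep for it returns several hits instead of one.
    counts = Counter()
    for filename, source in sources.items():
        for node in ast.walk(ast.parse(source, filename=filename)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                counts[node.name] += 1
    return sorted(name for name, n in counts.items() if n > 1)
```

Run as part of CI or a pre-commit hook, checks like these turn a convention into something the agent gets an error for, rather than something it has to remember.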
14:11
And so the goal really is to get into this
14:15
loop somehow like get the agent to
14:17
produce as good code as it can, but you
14:19
really need to find a way to feel the
14:21
pain that the agent doesn't feel and you
14:24
need to be woken up in a way when you
14:27
should be looking at this. And one of
14:28
the things we have been doing is we
14:29
built a Pi extension for our review
14:31
needs where we are separating out the
14:34
kind of input that normally would go
14:36
back to the agent. So this is mechanical
14:38
bugs. It is where it clearly violated
14:41
the AGENTS.md. Um but then we
14:44
specifically call out the kind of
14:45
changes where the human's brain should
14:47
reactivate, right? It's like we don't
14:49
think that the database migration should
14:51
ever go in without the human making a
14:52
judgment call on this because it very
14:54
much depends on the locks, the size of
14:55
the data in production. Um if there are
14:58
permissioning changes, you better think
14:59
about this yourself rather than leaving it to the
15:00
agent because they can be
15:04
Just some examples where we learned if
15:07
we miss it, we regret it. Um and you
15:11
will miss it. But these machines
15:13
can help you find this and then you see
15:15
this and then you actually get a little
15:17
bit of a hit like, oh now now I have to
15:19
kick into gear and do something here. Um
15:22
this is what this looks like in pi. Um
15:25
on the bottom you have
15:27
the human callouts, on the top you have
15:30
what basically, if you were to
15:32
end this review and say like fix the
15:34
issues, the agent would go back and
15:35
automatically act on the first two um
15:38
but this is the moment where I will
15:40
now go and see like is this a dependency
15:41
I actually want to have in this codebase
15:43
like do I like the maintainers,
15:45
does this work for me
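The split described above, findings the agent can fix on its own versus callouts that need a human judgment call, can be sketched as a simple triage step. This is not the Pi extension's actual code or API; the `Finding` type, the category names, and the `triage` function are assumptions chosen to mirror the examples from the talk (database migrations, permission changes, new dependencies).

```python
from dataclasses import dataclass

# Hypothetical categories that should always wake up a human reviewer.
HUMAN_CATEGORIES = {"database_migration", "permission_change", "new_dependency"}


@dataclass(frozen=True)
class Finding:
    category: str
    message: str


def triage(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    # Returns (agent_fixable, human_callouts), preserving input order.
    agent_fixable, human_callouts = [], []
    for finding in findings:
        if finding.category in HUMAN_CATEGORIES:
            human_callouts.append(finding)
        else:
            agent_fixable.append(finding)
    return agent_fixable, human_callouts
```

Only the first list would be handed back to the agent with a "fix the issues" instruction; the second list stays in front of the reviewer until someone has made the call.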
15:48
and we obviously like the speed like
15:51
this is addictive it is great we feel
15:53
there's a lot of productivity
15:54
But it is so devious if you start
15:57
relying on its speed where you really
15:59
shouldn't. And so I can only encourage
16:02
you to find the areas where you have
16:04
this feeling that this is actually net
16:05
positive. For me a lot of this is
16:08
reproduction cases like when a customer
16:10
reports an issue I can have the
16:11
agent reproduce this perfectly and I
16:14
have a really good starting point. Or
16:16
exploring different types of product
16:17
directions, as long as you commit
16:18
yourself to doing this uh with the code
16:20
that it generates. Um all of this is
16:23
great, but on the other hand, system
16:24
architecture, creating reliability in the
16:26
system, they're just not very good at,
16:29
so we really still have to go slow.
16:31
There is so much mess that can
16:33
appear in a codebase in so little time.
16:35
Mario was already talking about this
16:36
earlier but like we forget that we're
16:37
producing months and months of technical
16:39
debt in a time of weeks, in a
16:42
time of days sometimes, and it becomes so
16:45
much harder to actually understand
16:46
what's going on in the codebase. When
16:48
the understanding of your own code
16:50
drops, it is really really hard and it's
16:53
also psychologically hard. I've found
16:55
some code pieces that actually didn't
16:57
work in production and I was kind of
16:59
frustrated learning that I was the one
17:00
that committed it with the agent and
17:02
just didn't really see that. It's a
17:04
very disappointing experience when it
17:06
happens and then you realize that you
17:07
actually were the one that screwed up.
17:09
Um, and so it is psychologically
17:13
incredibly hard to really judge
17:15
objectively the state of the codebase.
17:18
And the only way right now is to really
17:20
slow down a little bit on that front
17:24
and add this friction. I know that
17:26
friction like every engineering team
17:28
I've ever worked at said like we need to
17:29
get rid of the friction in shipping and
17:31
that is true. Like there's a lot of
17:33
stuff that's very very annoying and
17:35
shouldn't be there. But if you have
17:36
worked in large enough engineering orgs,
17:38
SLOs are a great system that is
17:40
intentionally designed to put friction
17:41
into the engineering process to make you
17:43
think, do I need this reliability? Do I
17:45
need this criticality of the service? Am
17:48
I sufficiently staffed to run it? And
17:49
with the agents, we have now gotten this
17:52
idea that we should get rid of all of
17:53
this when in reality we need more of it.
17:56
Um because the friction actually in many
17:59
ways is what's necessary on a physical
18:01
level to steer. like without friction
18:03
there's no steering and and that is
18:05
really necessary. Um so you
18:08
should put a little bit more of a
18:10
positive association to this idea of
18:12
friction. Um because this is really
18:14
where your judgment is. This is where
18:15
your experience is and you should be
18:17
inserting that and start feeling it.