Full Transcript

·YouTLDR

The Never Ending Lore of Harness | Vivek Trivedy (Product Lead, Langchain)

1:33:2619,026 words · ~95 min readEnglishTranscribed Apr 22, 2026
AI Summary

An agent is defined as 'Model + Harness,' where the harness is all code and logic surrounding the LLM to manage context, tools, and verification. True progress in agent performance comes from engineering the harness (context engineering, self-verification, and problem decomposition) rather than just waiting for better frontier models.

It provides a rigorous engineering framework for moving past simple LLM wrappers into robust agentic systems that can handle long-horizon tasks through systematic state management.

Section summaries

0:00-2:21

Intro and Model Gossip

skip

General discussion about recent Claude/GPT releases that is now dated.

2:21-16:27

Vivek's PhD and Career Journey

optional

Personal background on vision research and working at AWS/Lockheed Martin.

16:27-25:51

LangChain Open Source Strategy

watch

Crucial context on how LangChain uses community feedback to drive product development.

25:51-35:15

The Anatomy of a Harness

watch

The core technical framework of the video; explains the 'Model + Harness' definition.

35:15-51:42

File Systems and Continual Learning

watch

Detailed discussion on persistent storage and RL vs. Context Engineering.

51:42-1:12:51

Context Rot and Opinionated Agents

watch

Practical engineering tips for managing long-context agents and avoiding token waste.

1:12:51-1:31:39

Benchmarks and Future Outlook

optional

Discussion on simulation-as-a-service and advice for new grads.

Key points

  • The Agent Equation: Model + Harness — An agent isn't just an LLM; it is the model plus every piece of configuration, tool logic, and context management code around it. The harness is responsible for pushing the right information over the 'computational boundary' of the context window.
  • The File System as the Core Primitive — File systems are the most foundational harness component because they provide a persistent, structured storage layer that both humans and models already understand. They serve as essential 'scratchpads' and collaboration spaces for multi-agent orchestration.
  • Harness Hill Climbing & Continual Learning — Instead of static prompting, engineering should focus on 'harness hill climbing'—using trace data from LangSmith to iteratively refine prompts, tools, and verification steps. This creates a self-improvement loop where agents learn from their own historical failures.
  • Context Rot and Selective Disclosure — Models become significantly less capable as their context window fills up (Context Rot). Effective harnesses combat this through 'tool call offloading' (only showing head/tail of outputs) and 'progressive disclosure' (telling the model where full data lives without injecting it).
If you're not the model, you are a harness. Vivek Trivedy
The context window is like where all the computation actually happens... we need to decide what goes into that context window so it can do useful work for us. Vivek Trivedy

AI-generated from the transcript. May contain errors.

0:00

Hey everyone, welcome back to ground

0:01

zero. This is episode 13. Yeah, we are

0:04

running fast. Today we have ve from

0:06

langchain. So we leads their work on

0:09

open source agents and harnesses the hot

0:12

term right now. He's the person behind

0:15

DP agents the coding agent that went

0:17

from top 30 to top five on terminal

0:20

bench 2.0 by only changing the harness.

0:23

He's been writing some really good stuff

0:25

with lot of signal and alpha on what

0:28

harnesses actually are. Why agents

0:30

should be more opinated the idea of

0:33

harness as a service and um how planning

0:36

agents are really just dynamic workflow

0:38

generators.

0:40

Before Langen, he ran his own startup on

0:42

visual understanding agents and before

0:44

that uh was a scientist at AWS while

0:47

doing his PhD in CS at Temple. Uh, we'll

0:51

cover a lot into this. Uh, there's a lot

0:53

to get into. We welcome.

0:55

>> Thank you for having me. I'm super

0:57

hyped. I'm super I've been following you

0:58

on Twitter a bunch. So, yeah, I'm glad

1:00

we're making this happen.

1:01

>> How are you doing? And would love to

1:02

know your uh initial VIP check on Opus

1:05

4.7.

1:05

>> First of all, doing great. Whenever

1:07

there's a new model release, you know,

1:08

it's always like a good week for all of

1:09

us. It's maybe like an even more fun

1:12

week for like anyone who does like evals

1:14

on all the models. Um, so yeah, dropped

1:17

yesterday. We started like evaling it.

1:19

We have our like set across our

1:21

products. We have like open source evals

1:22

that we use and like also like for some

1:24

of like Lang Smith's products that we

1:26

use. It's a good model. It's a good

1:28

model. I don't think it was like a crazy

1:29

step change for tons of stuff that we're

1:31

doing. But TBD I think like the fun part

1:34

about stuff we'll like jump into which

1:36

is strong belief that every model needs

1:40

its own custom things that you add to

1:42

it. I know like anthropic release is a

1:43

nice skill uh that you can like easily

1:45

convert prompts and stuff but we're in

1:47

the middle of that process for like the

1:49

agents that we're going to use it for.

1:51

So it's a good model not a crazy step

1:53

change but we'll we'll fit it. We'll

1:54

we'll make it good.

1:55

>> I mean it is interesting in a way that I

1:57

have been seeing a lot of mixed opinions

2:00

right now. People have pretty much mixed

2:02

opinions on 4.7. Basically what they

2:04

have doing it with um the kota users as

2:07

well. I mean in just three four prompts

2:09

you are running out of I mean there's a

2:12

lot of good story I mean interesting

2:13

story behind but but yeah I mean the

2:16

kind of piece about these models being

2:18

coming up be open air or anthropic

2:20

anthropic specifically how they have

2:22

been doing good at public perception and

2:24

effective marketing as I say I mean

2:26

working well working working I mean it's

2:28

been rewarding for them

2:29

>> I mean they're great they're great they

2:30

they put out like great models obviously

2:32

they put out great products around the

2:34

models I think there's definitely some

2:37

stuff where

2:39

people are playing a lot more with the

2:42

models and like they're basically like

2:44

picking use cases they use models for.

2:46

So it's like everyone uses cloud code

2:47

like everyone uses codecs and that sort

2:48

of stuff. But like when you build like

2:50

your agents on top of those models, it's

2:52

like I need to actually care about the

2:54

prompts. I need to care about the

2:55

context engineering. I need to like care

2:57

about the tool design. And I think like

2:59

that's where it's really cool to like us

3:03

putting out content like other like

3:04

really cool people putting out content

3:05

which is like like how do I make a model

3:07

good at like my task basically because

3:09

at the end like my customers that's all

3:10

they care about that's all I care about

3:12

and I think like that's like a bunch of

3:13

the harnessge journey basically whether

3:16

you call context whether you call like

3:17

agent edge it's basically like fit some

3:20

sort of system around this model to make

3:21

it like sit at my task and that's like

3:24

what we're all trying to do and like

3:25

anthropic is trying to help us with

3:26

that. Open models are trying to help us

3:28

with that as well.

3:29

>> Totally makes sense. Um let's dive in um

3:31

about your journey. So you went for a

3:34

PhD in CS at Temple and I mean worth to

3:37

mention you did your bachelor's,

3:38

masters, PhD everything at Temple and

3:40

this has been a talk of the town as well

3:42

in past years on Twitter. People were

3:43

talking about it. People have again I

3:45

mean some opinions about Temple being a

3:48

university, good university or not. So

3:50

my question is to being a scientist I

3:52

mean doing a PhD PhD then to being a

3:55

scientist at AWS to running your own

3:56

startup on agents or visual

3:59

understanding to leading open source

4:00

agents at Langen. How has your journey

4:02

been like?

4:02

>> Happy to dive in. Um cool cool I'm so

4:05

I'm from around this area. So I'm from

4:06

like east coast uh Jersey like

4:08

Philadelphia area. I went to school at

4:10

Temple. So I did my undergrad there did

4:12

my masters there like my PhD there. So

4:14

like super early I was like I'm just

4:17

going to be a doctor like most kids

4:19

pressured by their parents like I'm

4:20

going to be a great doctor like quickly

4:22

realized like I don't really want to do

4:23

that most of my undergrad. So I do my

4:24

underground in math and math is like

4:27

really cool. I think there's a lot of

4:29

concepts in math that like translate

4:30

really well to CS and like physics and

4:32

things like that sort of like systems

4:33

thinking.

4:35

>> Math is also like at least for me maybe

4:37

I'm just not amazing at it. It's

4:38

incredibly hard. So like doing something

4:40

really hard does prepare you for other

4:42

things.

4:44

Yeah, dude. Undergrad was like really

4:45

fun. I enjoyed math. I got into like

4:46

some CS stuff. I think like late 2010s

4:50

was when there was a lot of cool stuff

4:53

in different parts of ML. So like I got

4:55

into computer vision stuff, like

4:57

undergrad research. And like I love

4:59

vision. So like I think vision is still

5:01

one of the coolest things out there.

5:03

There's like way less research done on

5:04

vision even today relative to text. Like

5:08

>> OCR is pretty important, right? OCR is

5:11

like now okay just just send just send

5:14

the PDF to Claude basically and like

5:16

obviously a bunch of systems engineering

5:17

around that but yeah man like I I loved

5:20

vision I still love vision vision was

5:22

really cool so like I did undergrad in

5:24

that did like research around that and

5:25

then I just went straight into like

5:27

masters in PhD like right after I

5:29

graduated like early 2020s and then yeah

5:33

my PhD was basically all around like

5:36

vision focused representation learning

5:39

so yeah I can talk a little bit about

5:40

that. So the first like topics that I

5:42

was working on was like graph neural

5:44

networks which are like I don't know how

5:46

hot those are anymore but I do see like

5:48

some really cool people still doing

5:49

research around those. Um basically like

5:51

graph representation learning but it's

5:52

like graph representation learning for

5:54

like vision basically. So it's like if I

5:56

like decompose an image into like

5:58

particular objects and like I make a

6:00

graph of that and then I do like

6:01

representation learning do we get like a

6:03

better end vector for like retrieval

6:05

like classification and then like we did

6:06

this at also like the data set level as

6:09

well. So like what if I have like kind

6:11

of like few shot examples. It's called

6:12

like transductive learning like use

6:14

other information in the data set to

6:16

help you classify the next thing. Dude,

6:18

that was really cool. Like I think

6:19

graphs I'm like bearish on graphs

6:21

overall actually. So maybe hot take but

6:23

like that was a really cool part of

6:24

research and like that was my first like

6:26

dabbling into like computer vision stuff

6:28

like undergrad then my first like PhD

6:30

topic which like it shifted a little bit

6:32

after like the chat PT moment like tons

6:34

of research became around okay like

6:37

let's do VLMs for everything and let's

6:39

do like representation learning on the

6:41

VLMs like what are VLMs like actually

6:43

seeing when they're doing their like

6:45

attention mechanism over images. So

6:48

yeah, dude, it was great. It was great.

6:50

I like really enjoyed my time in PhD. I

6:52

think it's like you get some sort of

6:54

unbounded time with your adviser to just

6:58

pick an interesting problem and just

6:59

like rabbit hole in it. So I did like

7:01

retrieval stuff like representation

7:03

learning stuff. Yeah, dude. It was

7:04

great. I enjoyed it.

7:05

>> Awesome. Um, so I had a chat with

7:08

Tensorcut the other day. He started

7:10

Paradigma. He dropped out of PhD. So my

7:13

question to you is what do you really

7:15

think about the scenario right now the

7:18

linkage between academia and the

7:20

industry and right now if you have been

7:23

like if someone is going for PhD or

7:24

something like that. So what do you

7:25

really think about is is is that is this

7:27

worth it or how far we have come is

7:31

still necessary to go for a PhD to I

7:34

mean it is again very opinionated um

7:36

question but still I mean I want to

7:38

really understand your

7:39

>> yeah absolutely so like it's a great

7:41

question like people ask me this

7:43

question like locally like my friends or

7:44

like younger brothers and stuff.

7:46

>> Yeah. So like maybe my PhD was like

7:49

slightly different because I was doing

7:51

research at Temple but I was also doing

7:53

research and like working on like prod

7:55

projects when I was at AWS and those are

7:57

happening at the same time and I like

8:00

strongly believe that that is like a

8:02

fantastic mix for anyone who wants to do

8:05

like research but then sort of

8:07

understand maybe like how their research

8:10

is going to be applied in like some

8:11

settings. And I think today like the

8:15

point basically of a PhD to me is like

8:17

you pick a topic that you're like really

8:19

deeply interested in and you like poke

8:22

around the edges of that topic to try to

8:24

figure out like how we can make like

8:25

this thing better. And like that doesn't

8:27

like really require a degree to do that.

8:29

There's tons of like sick researchers on

8:31

X who just like post like random blogs

8:33

and like they don't have a PhD. they

8:35

probably don't maybe don't have CS

8:36

background but there's like you just

8:37

pick a topic you like rabbit hole it

8:41

you just like push the boundary of

8:42

what's possible and you do that like in

8:44

a verifiable way so you like write code

8:46

do experiments you try to share like

8:47

open research and if you're able to find

8:51

a company that allows you to do that

8:52

like lang's fantastic at that like I

8:54

think they really cultivate like hey

8:56

like we're going to like pick this topic

8:57

we're just going to like figure out how

8:58

it works and we're going to like publish

9:00

content about it basically

9:02

>> I would say that's great I think it it

9:03

kind of depends like if you find a great

9:05

company, a good great founder that you

9:07

vibe with that lets you do both.

9:08

Industry is like amazing and like

9:10

especially AI research like it's super

9:12

helpful across a lot of companies. You

9:14

can probably make a lot of money and

9:15

like do interesting research at the same

9:17

time. So yeah, kind of like a

9:18

non-answer, but if you do find that

9:20

scenario amazing if you just want to

9:22

like grind on like some sort of topic

9:24

and PhD for like a bunch of years, also

9:27

great. I actually don't think you can go

9:29

wrong like just by being curious and

9:30

just exploring it.

9:31

>> Yep. I can see you have uh you you were

9:35

like working on your startup about

9:36

visual understanding agents. So I want

9:39

to understand your learnings there and

9:41

how do you see the vision space right

9:44

now like how can you correlate between

9:47

uh the time when you started and the

9:48

time we have come so far with the

9:51

current frontier state-of-the-art

9:52

research and products building. Yeah,

9:54

dude. Um, yeah. So, like I started that

9:56

startup after I graduated like my PhD.

9:59

So, that was sort of like mid last year

10:01

with a friend. And basically like the

10:04

main thing that we were working on like

10:06

starts was called Agentify. And like the

10:08

main idea was basically that basically

10:10

vision compared to text like really lags

10:12

behind in frontier models for like

10:14

things like visual reasoning but also

10:16

things like perception just generally.

10:18

So there's like tons of things where

10:19

you'll like show an image or like an

10:22

object like o two overlapping boxes to

10:24

the model, right? And it's like it

10:25

doesn't like fully understand that those

10:26

two things like overlapping and like

10:28

part of this is just a perception

10:30

problem in the visual encoder where it's

10:32

like some of these like fine grain

10:33

details, it's just not able to

10:34

understand them with like the native

10:36

training that it has. But that I think

10:39

is like a fantastic opportunity because

10:41

it's like how much of that gets absorbed

10:43

into the vision encoder backbone versus

10:46

like how much do we augment models with

10:49

like tool calling behavior that they're

10:51

exceptional at and actually use that as

10:54

the mechanism to like take vision

10:56

capabilities and like put them into the

10:57

models. Like that's basically the whole

10:59

like idea that we were working on. So

11:00

like research and like product around

11:02

that which is like what if I just took

11:04

all of the classic vision models that we

11:06

already have and like a lot of this was

11:08

honestly inspired by Meta's work on SAM.

11:11

So I think like SAM and that whole

11:13

series is like incredible like SAM 123.

11:17

It also supports like video segmentation

11:19

which is like insane and you can also

11:20

like fine-tune it. You can do like meds

11:22

SAM and things like that. So it's

11:23

basically like BET was okay models are

11:26

amazing. They're getting very smart, but

11:28

like their vision capabilities are

11:29

lagging behind. But we can augment them

11:31

with tools and like you can basically

11:34

like do the right tool selection in the

11:37

moment to like get that capability. Like

11:39

segmentation is something that it was in

11:41

Gemini Flash across the Gemini series,

11:43

but like compare that to like SAM,

11:44

right? Like SAM was like way better. If

11:46

you just use like Sam as a tool compared

11:48

to like the native segmentation Gemini,

11:49

you would be just like way happier. and

11:51

like all you had to really do was like

11:52

point to the right spot which is like

11:54

way easier than doing like semantic

11:56

segmentation. So that was the idea. I

11:58

still think that that is true in vision

12:00

today. Like even with like Opus 4.7's

12:04

new benches, it's still not as good at

12:08

visual perception as like we need it to

12:10

be. So I still think tool use is like

12:12

really really exciting for yeah just for

12:16

like agentic systems like visual

12:18

basically making a bunch of like vision

12:19

specific tools for your task and like

12:20

augmenting uh yeah augmenting your agent

12:23

with that.

12:23

>> I think there is a lot of scope to do

12:25

alongside UI bench as well. I mean again

12:29

uh it's more about one's taste but uh

12:31

but there are lots of ifs and buts lot

12:34

of nuances where you really need to take

12:37

care of like even if you're cloning a

12:38

website I mean there's lot of sc uh

12:41

scope to play around something so my

12:43

next question is about your work at

12:45

Loheed Martin. So you you you interned

12:48

there. I think that was your first um

12:51

job and honestly a lot of what people

12:53

see about world is kind of sophisticated

12:56

reals on social media about American

12:58

weaponry. So what was the reality like

13:00

from the inside? What what what you were

13:02

working on? How does it feel like to

13:03

work at some defense um kind of defense

13:06

company and what experience lead?

13:08

>> That is like such a throwback. So that

13:10

was like my first internship at like

13:12

tech ever. So, I was like a bio intern

13:15

in undergrad and I was like looking for

13:16

internships and I gave my resume and I

13:19

got an internship at like Loy Martin

13:20

which is amazing because like I don't

13:22

know how good my bio resume was for

13:24

getting like any internships. Yeah, man.

13:26

I wish I say like tons of stuff I did on

13:29

>> What do you mean by bio resume? It was

13:31

like like you were working on some bio

13:33

>> Yeah. So like I went to undergrad as

13:35

like a biochem major because like I

13:37

wanted to be like a doctor.

13:40

>> Amazing.

13:40

>> Yeah. So like then like after freshman

13:42

year I applied to like internships cuz I

13:44

I switched I wanted to do tech after

13:46

that or like at least explore it with

13:47

like a bio resume and they were like

13:50

dude like what like what are what are we

13:52

doing here? And then like I think I

13:53

basically just like talked like to the

13:55

hiring manager and just said like hey

13:57

I'm like really down to like learn this

13:58

thing like which is like data science

14:00

like that time there bunch of these like

14:01

data science courses and things coming

14:02

out so it was still like early and I was

14:04

like hey like I took these like Python

14:05

classes and like I'm super down to learn

14:08

this. And basically it was like yeah I

14:10

mean it sounds great.

14:13

I ended up working on the data science

14:14

team there and it was basically like my

14:17

first introduction into like kind of

14:21

like data analysis sort of stuff. So

14:23

like understanding like it was much like

14:25

stats basically. So like I wouldn't say

14:27

it was like ML but it was like this is

14:29

like intro to like making plots like

14:32

slice this data this way. So it was a

14:33

bunch of just like empathy for like very

14:36

very messy data as like my first

14:38

internship which is actually like very

14:40

valuable today just like insane amounts

14:42

of data which is like does not look very

14:43

clean and yeah man I wish I say more it

14:46

was basically like a great learning

14:47

experience because I was kind of

14:48

learning how to code and like doing like

14:50

data science stuff and then it was also

14:52

like a decent confidence boost because

14:53

I'm like okay maybe I can do like tech

14:56

stuff and yeah I interned there and it

14:58

was like fun and then yeah I didn't

15:00

really go back after that but I started

15:03

getting into more like research stuff at

15:04

school.

15:04

>> Awesome. Um, also recently I was just

15:07

kind of exploring the timeline. I see

15:09

Mike Mill who is a pretty famous, you

15:11

know, internet celebrity was looking for

15:13

an AI guy and you came up through

15:15

Temple. Apparently Mike was surprised

15:18

how many Temple people are in AI and so

15:21

did you end up connecting with him? Did

15:22

you share anything about Langchen and

15:24

stuff?

15:24

>> So Meek Mill is like he's like a rapper

15:26

from from Philadelphia and like I guess

15:29

he lives around Temple like that's where

15:30

he was from and I think everyone was

15:33

like when they saw that tweet they were

15:34

like Meek Mills get into AI so okay let

15:37

me just like reply basically because I

15:38

think like honestly like randomly

15:40

posting on Twitter X is like awesome.

15:42

You can meet so many cool people like

15:44

that and I we'll talk about this but I

15:47

met like Harrison the founder of like

15:48

CEO and the CEO of like W

15:52

And yeah, he did not reply to me. I hope

15:54

his like startup is doing sick, whatever

15:56

he's whatever he's doing. But like I'll

15:58

like repeat it if he does need someone

16:00

for help with like AI. I'm actually like

16:03

seven blocks down. So I could totally

16:06

like just pull up and help him. So no, I

16:09

think that's a good lesson though is

16:10

just like randomly posting maybe like

16:11

I'll just keep doing that and then maybe

16:13

something will happen.

16:14

>> Yep. Awesome.

16:17

So I mean the next question to you is so

16:20

when did you join Langchain and uh what

16:22

actually pulled you there specifically?

16:24

So and since you joined what actually

16:27

has

16:27

>> So this is like this is so much fun. Um

16:30

I was working on my startup like after I

16:32

finished my PhD that didn't work out

16:34

like we basically stopped around the

16:36

fall. At the same time, I was basically

16:39

like doing my first foray into just like

16:41

posting like random stuff on Twitter

16:44

just like my thoughts like basically

16:45

just like open source stuff like hacking

16:46

on random stuff and

16:49

from a bunch of the stuff I was posting

16:50

around like so like last year I also

16:53

like sort of believe that like we have

16:54

amazing models but like because we did a

16:56

bunch of stuff in this like visual

16:58

understanding space with like agents and

17:00

stuff. I was like very very confident

17:02

that models need like some stuff around

17:04

them to like help them do these tasks

17:06

because like they just suck at them out

17:08

of the box and like we basically saw

17:09

this every day. So that's basically when

17:12

a lot of maybe the ideas that were

17:14

brewing around harnessge like started to

17:17

maybe get more like crystallized and I

17:19

just started like posting about that

17:20

online. It's like, hey, like this is

17:23

maybe like what harnesses look like.

17:25

Like harnesses are like supposed to like

17:26

wrap models and like if we're trying to

17:28

do like vertical tasks. It like really

17:30

helps to have some sort of like

17:31

opinionated like prompts, context

17:33

engineering, like tool call structure

17:35

like all this sort of stuff. And I think

17:38

I just like started DMing Harrison like

17:41

the CEO from that which is like super

17:42

sick. He is also always thinking about

17:46

like the frontier of like AI systems

17:49

which is awesome. And then we started

17:51

chatting maybe like late last year just

17:54

like yeah like what would it look like

17:55

to build open-source infrastructure

17:58

around like agent engineering and like

18:02

maybe the best way to facilitate that is

18:04

by helping people build good harnesses

18:07

like whatever good means like let's

18:08

discover like what good means and make

18:10

open source software about that. So, it

18:12

was basically like, okay, that sounds

18:14

sick. And then I was like, I don't

18:16

exactly know what I'm going to do. Like,

18:17

maybe I'll continue like working on the

18:18

startup or like, but I would love to

18:20

join something that like really aligns.

18:21

So, then I started working with like

18:22

their open source team late like last

18:25

year on what ended up becoming like what

18:28

was deep agents, but ended up becoming

18:30

like a lot bigger. Um, so yeah, we were

18:32

working on like the very very early

18:34

versions of like deep agents last year,

18:36

which is like one of our libraries at

18:38

Langchain that we that we have. It's

18:40

like our library to help people build

18:42

harnesses. Um, or at least it's one of

18:44

the ways that people can build harnesses

18:45

using using Wangchain. And yeah, I loved

18:48

it. I love the team. Uh, amazing people

18:51

doing open source. And then I decided to

18:53

join like full-time in in December.

18:54

>> Amazing. Um, and and I mean, the

18:57

adoption is just crazy, dude. I mean, so

18:59

I want to understand about the growth

19:01

here. So, so again I mean right now

19:03

Twitter is full of people flaming

19:05

millions in ARR every month and but like

19:07

a feels like one of the most you know

19:10

stressed metrics out there. So my

19:12

question is how has lang approached

19:14

growth in real terms be it opensource be

19:17

it community adoption be it enterprise

19:19

or

19:19

>> yeah dude it's a great question. So I I

19:21

think about this a bunch because like I

19:23

think the best way to maybe think about

19:24

it is like basically like work backwards

19:26

from you want to like help people build

19:30

stuff using like the tools you're you're

19:32

putting out there, right? And like the

19:34

goal is basically just like help people

19:36

build like really cool things and like

19:39

make that process of building as easy as

19:40

possible. I think in like open source

19:42

that comes through like very clearly

19:44

because in open source I think you get a

19:46

lot of like empathy for the end user

19:48

because they're like directly using your

19:50

product like all the code is like fully

19:52

visible like go inspect it also like put

19:55

your opinions in like our GitHub issues

19:58

and tell us like what's good what's bad

20:00

like what should we fix like what should

20:02

we add also like it's totally cool to

20:05

like disagree in open source because

20:07

like the maintainers sort of have

20:09

limited bandwidth to address like all of

20:12

the things, but we want to make sure

20:13

that the most impactful things that are

20:15

going to help like the most users build

20:17

like the coolest stuff like we like

20:18

prioritize those. So, I think there's a

20:21

there's a big part of growth which is

20:23

why I like really like X um and like

20:27

these direct feedback channels or like

20:28

Slack for example or just like messaging

20:31

builders and customers because you

20:34

basically get to see exactly what

20:35

they're doing. you build like a lot of

20:36

empathy for shoot like this thing that

20:39

we built like it's a little broken in

20:41

this way or like it doesn't exactly like

20:42

fit the use case and then you hear a

20:44

bunch of those stories and you sort of

20:45

like work backwards to say okay like we

20:47

need to improve like this part of our

20:49

library or like we need to like make it

20:51

possible for others to improve our

20:53

library as well. That's like an amazing

20:54

part of open source that we get tons of

20:56

like amazing feedback, tons of like user

20:59

contributions which is great because you

21:02

sort of like grow with your community

21:04

and I think like that's a really big

21:06

part of open source and related to that

21:08

which I really really like about

21:09

Langchain like one of the reasons why I

21:12

joined and like I really enjoy working

21:13

here is there's a lot of like learnings

21:16

that we get from all the research that I

21:19

do in like open source and like putting

21:21

stuff out there and getting feedback

21:23

that slowly like make their way into our

21:25

products as well because it's like for

21:27

example a lot of stuff in like Lang

21:29

Smith for example which is like okay

21:31

like how do you build good evals like

21:33

how do you how do you actually enable

21:35

agents and users to build like really

21:37

good evals like how do you like

21:38

understand what's happening in traces

21:40

like mind signals from

21:42

>> like a lot of that we put out just in

21:44

the open like I did a bunch of blogs on

21:46

that stuff there's other people who are

21:47

like hacking on that stuff as well and a

21:49

lot of the stuff in open source you sort

21:51

of see how the community interacts with

21:53

it. You also just see the raw numbers

21:55

and you put it out there and it's like

21:56

hey like I would love this or like I'm

21:59

using this and it's like oh we should

22:01

make that as easy as possible. Put it

22:04

into a product and like if people love

22:06

the product then like the rest of it

22:08

sort of takes care of itself. It's like

22:09

yes you will make money you know your

22:13

customers will be really happy and then

22:14

like just continue the loop like just

22:15

keep making it better basically. So I

22:17

think like yeah dude customer feedback

22:19

is amazing like community feedback is

22:21

amazing. So it's like a really really

22:22

big part of I think lang chain a really

22:25

big part of like a a lot of the open

22:26

source stuff that we do

22:27

>> I can imagine of course and more

22:29

specifically here so you are leading the

22:32

open source egen and harnesses work

22:34

right now so what does a typical um week

22:37

looks like for you it's more about

22:39

research engineering or product

22:42

>> yeah dude whatever

22:44

>> I think the fun part is like it's it is

22:46

actually like a mix of a ton of stuff

22:48

and I like really really like that so

22:50

it's like the goal is bas basically pick

22:52

the most important thing to work on at

22:55

this time and then like we'll like we'll

22:57

chat about it maybe over the weekend or

22:59

like the week before like Harrison

23:00

jumped in with with us like we'll DM and

23:03

let's just like sprint towards that and

23:05

build it basically and like maybe what

23:08

that looks like lately

23:11

like lately like a ton of my work has

23:13

been on like eval continual learning

23:17

essentially like methods for using like

23:19

evals and continual learning to make

23:21

like agents and like their harness

23:22

better. So that's like basically like

23:24

the research direction and I would say

23:26

maybe like 50% of the week goes into

23:30

okay let's like pick a research

23:31

hypothesis let's like figure out what

23:33

the experiment design around that might

23:35

be. Like for example, last week we were

23:37

doing a bunch on can you like just in

23:40

time generate evals uh like for any

23:42

given task like what does that look

23:44

like? Like are you overfitting to them

23:45

and like what is your like fitting

23:47

algorithm? There's like tons of stuff

23:48

that we put out. There's like a lot of

23:50

good content on like harness hill

23:52

climbing basically. But yeah,

23:54

essentially it's like research. Let's

23:55

pick that task. Um kind of like a PhD.

23:58

We're going to make a hypothesis. We're

24:00

going to like run the experiments on it.

24:02

We're going to get like get metrics and

24:03

we're going to post them on Slack and

24:04

we're going to like review them and like

24:07

argue our takes about them essentially.

24:11

Yeah. Then the other maybe bunch of

24:12

percentage like 50% is like talking to

24:15

customers like talking to people like on

24:17

Twitter getting a bunch of feedback from

24:18

them on like the open source stuff like

24:20

how can we improve our libraries whether

24:23

that's like lang chain lang graph like

24:24

deep agents anything in like lang and

24:27

then a bunch of that is talking with

24:29

like product teams as well. So there's

24:31

like tons of great teams at Lang Chain

24:34

that do a bunch of good work on like all

24:36

the products that we have. So there's

24:37

tons of learnings that I think come from

24:38

open source that we can like port back

24:41

into the products that we're going to

24:42

build and yeah just keeping that

24:44

feedback loop is good. So I would say

24:46

like it's a mix bunch of like research

24:48

and then engineering stuff and then a

24:51

bunch of like I don't know like what the

24:53

term today is but like devril like

24:56

devril devx which is just like if

24:58

someone asks a question on Twitter like

24:59

we should respond to them and we should

25:01

like put our ideas out there and we

25:02

should like be willing to engage with

25:04

other people's ideas and yeah just hear

25:06

what people are saying. So it's like a

25:07

mix yeah it's a mix of those things.

25:09

what percentage of your article source

25:12

like article is coming from this

25:14

research source I can imagine a certain

25:16

percentage but because dude I mean I

25:20

mean let's just come to harnesses like

25:22

what this what is all about the load

25:24

behind harnesses right you know

25:26

>> so you mentioned that the definition of

25:28

agent is basically model plus harness

25:30

right

25:31

>> so I mean this is something like I mean

25:33

it is being in like people know this

25:35

from quite some time like this is this

25:37

is a fact but I think this is the

25:39

cleanest framing anyone any anyone have

25:42

seen at least on Twitter. So if you're

25:44

not the model, you are a harness, right?

25:46

And and a harness is every piece of

25:49

code, configuration or execution logic

25:51

that isn't the model itself.

25:53

>> So can you walk me through how you

25:56

arrive at the definition?

25:58

>> Yeah. Yeah. Yeah, dude. I think like it

26:00

is it is definitely like a cleanish sort

26:04

of specification of like what is this

26:06

thing that we're talking about and I

26:09

think like maybe the definition doesn't

26:11

really matter like as much like what the

26:13

exact equation is but like there is one

26:16

thing that's helpful which is like when

26:18

you're communicating with someone about

26:20

like how we're going to make this agent

26:21

better we need like some shared language

26:24

so we can talk about like what is the

26:26

thing that we're going to optimize

26:27

basically right so it's like

26:29

like working backward from model

26:32

capabilities because like that's sort of

26:35

the thing that we need to wrap

26:37

intelligence like wrap systems around to

26:40

like amplify the intelligence of the

26:42

model. So it's like I basically view it

26:44

as there's some sort of computation

26:46

happening inside the LLM and like where

26:49

that's happening is over this like

26:51

context window boundary. So like all the

26:54

compute happens when I basically like

26:55

take context from like my system and I

26:59

push it over the boundary and I put it

27:00

into the context window like for the

27:03

model to do computation on and then

27:05

produce tokens basically. And like some

27:07

of those tokens correspond to like tool

27:09

calls and then I go and execute those

27:10

tool calls and like I return the context

27:12

back. And like the reason why I like

27:14

that is because like models by

27:16

themselves they're basically just like

27:18

>> token input machines and like token

27:20

generators basically. But like we need

27:22

to put a system around the model so it

27:25

can do useful things. And I really like

27:29

maybe like working backwards from what

27:31

should the agent do and like maybe even

27:34

like what does my customer want the

27:36

agent to do and then like figure out if

27:38

I just like give it like a really really

27:40

simple model like maybe like really

27:41

really simple harness. Can the agent can

27:43

the model and like the agent can the

27:44

agent basically just do that? And like

27:46

if the agent can just do that with like

27:48

a really simple harness, then that's

27:50

like amazing because then we can just

27:52

like give that to the user essentially.

27:55

Where things maybe get like more

27:57

interesting is like where like a really

27:59

simple harness just like can't do that

28:01

today. And that might just be because

28:02

like it doesn't have the right tools or

28:04

maybe like the model isn't intelligent

28:06

enough to like orchestrate those tools

28:07

in order to do that. Or maybe it's like

28:10

some of our context engineering opinions

28:13

in the harness aren't good enough and

28:14

it's like hey like you're you're putting

28:17

a bunch of like really big tool call

28:19

outputs like into the context window and

28:21

it's like confusing the model. We should

28:24

find out ways to not do that. But these

28:26

are all basically like harness level

28:28

configurations that we're doing and

28:30

they're external to the model. Like the

28:32

model is basically just like a

28:34

computation unit and it computes things

28:36

over its context window and like we need

28:38

to decide what goes into that context

28:40

window so it can do like useful work for

28:42

us.

28:42

>> If I have to ask you some like three uh

28:45

three bullet points what really makes a

28:49

good hardness according to you what are

28:51

they?

28:51

>> Yeah. So there's a bunch, but if if I

28:54

had to pick like three right now, I

28:55

would say

28:57

basically prompting and like very very

29:01

clear instructions

29:03

for better or worse. Like there was this

29:04

whole thing like prompting is dead. Like

29:06

prompting is like totally not dead. It

29:08

is like so useful, so helpful. And like

29:10

I I don't just mean like prompting in

29:12

terms of just a system prompt. Like

29:14

prompting also applies to like the tool

29:17

descriptions as well that get like

29:19

autoloaded into context. It also applies

29:21

to how well your like skills front

29:25

matter explains like how to use these

29:27

skills or like how to use like other

29:28

skills. It it also applies to like if

29:31

you have sub agents, does like the sub

29:33

agent front matter specify like when

29:35

this should be used or like how to use

29:37

it basically. So it's just like

29:38

basically prompting that encodes really

29:41

really good instructions from the user

29:44

or on behalf of the user for like how to

29:47

use this agent to do useful work. That's

29:49

like super important. I think like

29:50

prompting is honestly more important

29:52

today than it ever was before because

29:54

our like the systems we have are way

29:56

more intelligent. So we're able to guide

29:58

them towards doing useful work more

30:00

easily with good prompts. That's one. I

30:03

think the other one that we're spending

30:04

a bunch of time on right now is

30:07

basically verification. So we did like

30:10

some blogs around this on like making

30:12

coding agents better. But there's sort

30:15

of like maybe two things in

30:17

verification. like first is prompting,

30:18

second is like verification. So there's

30:20

like a built-in verification that you

30:23

might inject like into into the harness

30:26

itself. So like that can be like a hook

30:29

basically. So like before the model

30:31

tries to go and exit like force it to

30:33

like recheck the work or like make sure

30:37

>> really

30:37

>> verification is basically like if if I

30:39

give so for example if we just use like

30:42

all the terminal bench tasks, right? So

30:44

like terminal bench task comes with like

30:46

an environment. It comes with like a

30:48

task and then it comes with like a

30:49

verifier that will run after the agent

30:52

thinks it's done, right? But like

30:54

obviously we can't use that verifier

30:55

information. So like what the agent

30:57

needs to do is like it needs to like

30:59

self-verify its work before that

31:01

verifier runs to like be like very very

31:04

sure that the code that it developed

31:07

solves the task that we're that we're

31:08

like trying to solve. Maybe there's two

31:10

parts of that. One part is we need to

31:13

like teach agents what the useful

31:16

primitives are for verifying their work.

31:18

I think like one immediate one if like

31:20

anyone uses like the claude model or

31:23

like even like GPT 5.4 is like agents

31:26

are very susceptible towards like

31:28

picking the easy way out in verification

31:30

which is like they test like trivial

31:32

cases or like not not like very

31:34

difficult cases. Obviously, that fails

31:37

in the verifier because it's just like,

31:38

hey, like I checked like these three

31:39

cases are really easy, so like I'm good

31:41

essentially and like that's bad. Like we

31:44

should teach agents to be much more

31:46

thorough when they're like generating

31:49

verification for themselves. That's like

31:50

one part of it. The other part of it is

31:52

like like this is all code. So like we

31:55

have in our repos tons of like unit

31:58

tests and like tons of like evals that

32:00

we already use. Like that is great

32:03

context that we should give to the

32:04

agent. so that it can like run that eval

32:07

suite and that might be run with a hook

32:08

for example like I don't want like maybe

32:11

the agent won't run it by itself but

32:12

like when it tries to exit that should

32:14

just maybe run my eval suite or a subset

32:16

of it and it should inject the context

32:18

or like the results back to the agent so

32:21

the agent can see like what failed like

32:24

what what passed basically because like

32:26

we need some sort of signal to give back

32:29

to the agent so we can like fix the

32:31

thing that it generated so it's like

32:33

self-verify or like use external signals

32:36

from like existing evals so you can like

32:38

fix the things that are going wrong. And

32:39

I think that's like a really really big

32:41

part of it. And like maybe the last part

32:43

that we're focusing a ton on is

32:47

high level. It's kind of like

32:49

orchestration basically but for doing

32:52

things that are more long horizon

32:55

basically like it's problem

32:57

decomposition and like making sure that

32:59

like when we use like sub agents to do

33:02

problem decomposition like two things

33:03

are true. So one is we're picking the

33:06

right model like agent for the job

33:08

because like every model is like good at

33:11

different things and also that um this

33:14

is a lot of context engineering. We're

33:16

basically like bounding the sub problem

33:19

that the agent needs to do in like a

33:20

decent enough window that it can like

33:22

manage it. Basically what I mean by that

33:24

is um I wanted to like do things in like

33:28

a 50k to like a 150k token range roughly

33:32

or like 200k. sort of it depends on the

33:35

model but like I don't want to give a

33:36

subtask to like a sub agent if it's if

33:40

it's so big that it's like okay it's

33:43

going to start getting into like really

33:45

really high context zones like dumb zone

33:47

which like Dex calls it um from human

33:49

layer which I love and yeah so it's like

33:52

efficiently being able to take a problem

33:54

decompose it and then use like sub

33:56

agents as like compute sources to like

33:58

do those problems and like filter stuff

34:00

back to the main agent and like some of

34:02

it is just good model choice like for

34:04

example like we find that maybe the GPT

34:08

series like 5.4 for is exceptional at

34:10

like planning uh which is amazing and

34:13

like Gemini like I find is like really

34:16

really good at like multimodal stuff and

34:18

so actually so is they all are but like

34:20

Gemini is like really good at it and

34:21

like Flash is actually amazing bang for

34:24

a buck for like speed cost and

34:26

multimodal stuff like a lot of this is

34:28

just informed by like dog fooding and

34:29

evals like hey like we need to like test

34:31

these models and figure out what are

34:33

they good at so yeah I think I think

34:34

those are the three maybe roughly and

34:36

there's like way more obviously so it's

34:37

like like prompting

34:38

like systems around like verification

34:41

like self-improvement uh like via traces

34:44

or like via evals and then the last

34:46

thing is like kind of like orchestration

34:47

but maybe it's like context engineering

34:50

around problem decomposition

34:52

>> makes sense um you just mentioned about

34:54

uh 5.4 for for uh planning. So uh so uh

35:00

pretty much I think it uh not just a

35:03

black box but it is kind of a reasoning

35:06

sandwich where where I mean you

35:08

mentioned as well x high for planning

35:10

high for execution x high for

35:12

verification um like running only at x

35:15

high scored 53.9%

35:18

due to timeouts versus 63.6% at high. So

35:22

I mean that's counterative right? I mean

35:25

does more reasoning made it worse?

35:27

>> Yeah. So I think I think this is

35:29

basically touching on like the point

35:30

that I think about a bunch which is like

35:33

we need to like what we try to do is

35:35

basically like we're trying to design

35:36

like an agent system around like a task

35:39

that we need to solve right and like

35:40

that task has maybe like a bunch of

35:42

constraints like I think the one you're

35:44

talking about is maybe like the the some

35:45

of the terminal bench work that we were

35:47

doing and just trying to publish. So

35:49

yeah like for that use case we we had

35:51

like an artificial constraint which was

35:53

like we have a like a timebounded run

35:57

essentially like after this amount of

35:58

time like the sandbox just like exits

36:00

and like the run doesn't get scored or

36:02

like the run gets scored like wherever

36:04

we left the state of the sandbox and

36:06

yeah so I think maybe the takeaway from

36:08

that is less that like maybe like x high

36:10

reasoning all the way through like

36:12

wouldn't have been better. It actually

36:14

like does a great job. It just takes

36:16

like a really long time. So then it like

36:18

runs out of time to like complete the

36:20

task. But also it's like not compute

36:23

efficient and it's not like cost

36:24

efficient. Like it's awesome to like run

36:26

X high at everything all the time and

36:28

spend a bunch of token on like every

36:29

single problem. Like practically

36:32

speaking um you have to pay for the

36:35

tokens and like also like practically

36:37

speaking from like a user experience

36:38

like am I just going to wait for GPT 5.4

36:41

afford to just like think super hard all

36:43

the time or like can I use a smaller

36:46

model or like a cheaper model that I

36:48

like write really good instructions for

36:50

and it can just go do that task like

36:52

immediately then my user just like sort

36:53

of gets like a more you know like

36:56

latency reduced interaction. So it's

36:59

like yeah I think main takeaway is like

37:01

XH high actually for me is amazing and I

37:03

do a bunch of like planning in X high

37:04

when I'm like just coding but because

37:06

like when I'm in the loop I want like

37:08

feedback because like it's annoying if

37:10

I'm just like staring at a blank screen.

37:12

I use like high for a bunch of like in

37:14

the loop coding. So like X high planning

37:17

and then like high for execution. So but

37:19

yeah it just depends. It like totally

37:20

depends on like the work that we're

37:21

doing. I think that's like the main

37:24

thread that I think about.

37:24

>> Awesome. Okay. I mean yeah that makes

37:27

sense. saw and and I have seen that

37:29

people are using people are preferring

37:32

5.4 xi codeex over opus 4.6 six I mean

37:36

now seven has like mixed opinions I mean

37:39

anyways um so uh again like you said

37:42

about what about hardnesses and

37:44

everything and there was a potential a

37:46

lot of news about file system as well

37:48

like I can't give a count the number of

37:51

blogs I have number of Twitter articles

37:53

I have read about file system right and

37:56

even like in your anatomy post you said

37:58

that the file system is arguably the

38:01

most foundational harness primitive so I

38:04

mean it's a it's It's it's a strong

38:06

claim and um and previously obsidian co

38:09

also mentioned about everything just

38:11

about file system. So why the file

38:13

system and how does it kind of make it

38:16

really influential in in this harness

38:19

design and things around agent

38:21

engineering. What other tools?

38:23

>> I mean I'm like incredibly bullish on

38:25

file systems. I think like a ton of

38:27

people internally also are and like a

38:30

ton of people across industry like very

38:31

bullish on file systems. Like one of the

38:33

early decisions in like DB agent when we

38:35

were building it last year was basically

38:37

like using the file system and that was

38:40

more because we saw like two things. one

38:43

like how useful it actually is for

38:45

context management and like two agents

38:49

are just exceptional at using file

38:50

systems already right so it's like it's

38:52

kind of two things like the model is

38:54

already very very good at using this

38:55

tool so I don't have to coersse it a

38:58

bunch to get good at like using these

39:00

sort of like patterns and like now like

39:02

with newer models is probably even like

39:03

post trainer even more on getting good

39:05

at file system stuff so that's like

39:06

amazing the the other thing that's like

39:08

really amazing about file systems or

39:10

like basically the concept of a file

39:13

system. I I'll I'll maybe like

39:14

generalize it a little bit, which is

39:15

like I need some sort of like persistent

39:18

storage that my agent can use to both

39:21

like access information and then like

39:24

offload information. And like that's

39:26

maybe the higher level primitive like a

39:28

file system ends up being like a really

39:29

really easy way to do that. But like the

39:31

primitive is like the LLM the model

39:35

basically has like this computational

39:37

boundary that I put stuff into and like

39:39

I can take stuff out of essentially,

39:41

right? And like all the comput happens

39:43

here and the decision for like where to

39:47

store stuff and like how to access it

39:48

like file systems end up being fantastic

39:51

storage primitives to do that and like

39:53

the reason why I say like the concept of

39:55

a file system is like in in like lang

39:57

chain like in our libraries we have this

39:59

concept like virtual file systems where

40:01

it's like you expose file system like

40:04

storage essentially right so like the

40:07

operations that you would do on a file

40:09

system for example like ls for example

40:11

right or like you're like grapping over

40:13

that. It depends like what your

40:15

underlying storage system is. But can

40:17

you like use existing storage like for

40:19

example like S3 for example or like

40:22

Postgress, right? And then like what

40:23

does it look like to use that as storage

40:25

and then like put it over the

40:27

computational boundary so like the agent

40:28

can like search over this stuff and like

40:30

pull it into context.

40:32

Like agents are exceptional at doing

40:34

that. And the other thing is like

40:36

context management is so important

40:38

because like the context window is like

40:40

where all the computation actually

40:41

happens that we need some mechanism of

40:43

achieving that which is like why I'm so

40:45

bullish on file systems. It's both like

40:47

and then and then actually like maybe

40:48

one more thing I'll add is

40:51

>> now that we're doing a bunch more stuff

40:53

on multi- aent orchestration and like

40:56

multi- aent like collaboration sort of

40:57

stuff. So I think I said like a little

40:59

bit about decomposing like really big

41:00

problems into like sub problems, right?

41:03

But like where should all of that work

41:05

get stored for all of like the

41:07

decomposition that the sub agents do? So

41:09

like file systems actually also become

41:12

excellent like collaborations places. So

41:16

like sub agents can like write to

41:17

particular files and like main agent can

41:19

like read from there and like it doesn't

41:20

pollute like the main agent context

41:22

window a bunch. So it becomes like a

41:24

place where you just like write files

41:26

and like files are basically excellent

41:28

scratch pads or excellent like like

41:30

planning places or excellent like

41:32

persistent storage places like an agent

41:34

needs to come back to something and this

41:36

sort of like primitive that files encode

41:39

information really well like file

41:41

systems

41:42

offer like interfaces to like external

41:45

storage that already exists and like it

41:48

really helps with context management.

41:50

Like all of those things together I

41:52

think make it really really good for for

41:55

as like a harness tool for like an

41:57

agent. And I think a lot of harnesses

41:59

like like basically I think everyone is

42:01

like settled around file systems like

42:03

like it's uh it's not like too

42:04

controversial to say like I'm going to

42:06

give my agent a file system and like

42:08

that's a part of my harness you know

42:09

like people just sort of like oh yeah

42:10

that that makes sense. It's interesting

42:12

to know right I mean this is something

42:14

so basic something so fundamental is

42:17

kind of changed the whole trajectory of

42:19

the space in like 6 months and everyone

42:22

is kind of getting adapted to this thing

42:24

and on the same note you have uh you

42:27

have also mentioned about memory via

42:29

agents.mmd and and this is something you

42:31

kind of connect with you know like

42:33

injecting and start and you also call

42:36

this continual learning so I'm very

42:38

interesting to know about why do you

42:40

think So, and like is it really or is it

42:43

more like a persistent or consistent

42:44

notepad? So, what you really think about

42:47

this could be aligned to

42:49

>> I think like a a ton of a ton of like my

42:52

work recently has been around like this

42:54

just general idea of continual learning

42:57

basically. So like h how do I help my

43:00

agents which are producing a bunch of

43:02

data over time like I'm using let's

43:05

let's just take like my personal agent

43:06

like I'm using this one agent a ton over

43:08

time

43:09

>> and it's producing a ton of data which

43:12

is like traces essentially right and

43:14

then like all those traces like I'm

43:15

storing somewhere like we store them in

43:16

length you can put all your traces in

43:19

one place and how do I update the

43:22

definition of the agent in order to

43:25

learn from all of the data that it's

43:27

producing Right. So there's like maybe

43:30

two ways to really do that. And memory

43:33

is sort of a subpiece of continual

43:35

learning. Like continual learning like

43:36

overall to me is as I'm acting in the

43:39

world and as I'm like sort of like

43:41

producing data kind of like how we

43:42

humans do. Like I'm doing stuff in the

43:44

world and I'm like learning from the

43:46

feedback that I'm getting, right? Like I

43:48

ran and I tripped and I fell when I was

43:49

a kid and like this is a great trace

43:51

stored in my brain to say like please

43:53

like don't do that. Same thing for

43:55

agents. But the way that we actually

43:58

like update the like the agent knowledge

44:01

is like really different probably

44:03

because like we don't understand exactly

44:05

how like experiential memory that humans

44:09

experience like how does like my

44:10

experiential memory as a human get

44:12

encoded into my brain like I don't

44:14

exactly know how that process works and

44:17

we need to do that process essentially

44:21

for agents and like the agents

44:24

computation boundary is just it's

44:26

context window basically. So I need to

44:28

be able to like take learnings from the

44:30

past and I need to be able to like do

44:32

two things. One is um inject them into

44:37

the context window at the appropriate

44:40

time

44:41

>> so that when that scenario comes up, it

44:44

can like use that prior information to

44:46

like fix the thing. Like for example,

44:48

maybe this comes up in like user memory

44:50

for coding, right? It's like you're

44:52

doing a bunch of like coding with your

44:54

coding agent and then like you give it

44:57

it has that trace and like maybe you

44:58

like annotate that trace with human

45:00

feedback saying like hey like the way

45:02

that you did this or like you use this

45:04

library but like we never use that

45:06

library so like please like always use

45:07

this other library right and it's like

45:09

okay like great should that piece of

45:12

feedback and like context should that

45:13

always be in like my always on memory

45:16

right is that like just in my agents.mmd

45:18

that always gets like loaded in or is

45:21

this something that gets injected like

45:22

in real time into the agent like

45:25

contextually. This is like why I'm super

45:27

interested also in like search as a way

45:29

of doing this because like we're I think

45:33

it's like almost like unfathomable the

45:35

data scale that we're going to start

45:36

producing with agents. So like agents

45:38

run like all the time non-stop. they

45:41

produce like millions of tokens like

45:43

every few minutes and like that's a ton

45:45

of information that we need to like sift

45:47

through to figure out what's useful from

45:49

that and like what's not useful from

45:50

that. So like search is like a really

45:53

really big part of distilling a bunch of

45:55

trace knowledge into like nuggets or

45:58

like memories that I can actually

46:00

retrieve that are useful because like

46:02

tons of that trace will actually be

46:04

noise. So it's sort of this process of

46:05

like distilling

46:07

great data which is like trace data but

46:10

into nuggets that I can actually like

46:12

bring into context when I need to.

46:13

That's like one. And then the other one

46:15

is like really interesting for us is

46:17

instead of just selectively and

46:19

contextually pulling the right thing

46:21

over the like the context window

46:23

boundary for like computation to happen

46:25

over it. So like context engineering

46:26

like you can also just touch the

46:28

weights. So like we like lean in a bunch

46:30

into like open models and like I love

46:32

open models. I use like GLM5 a bunch

46:35

like a ton of the team does as well. And

46:37

that's like amazing as well. That's like

46:39

continual learning by using feedback

46:41

from traces and like distilling that

46:43

into data that you can do like RL on

46:47

essentially and like making that process

46:48

a lot easier. And both are really

46:52

interesting like we're leaning into both

46:54

and I think both will happen. So it's

46:56

actually not going to be like an or like

46:58

everything will be RL or like everything

47:00

will be like context entry. you totally

47:01

need both because there's like tons of

47:04

things that you don't want to RL or like

47:06

it just doesn't make sense to like

47:07

fact-based retrieval like you can like

47:10

include that data in there but it makes

47:13

more sense to do search in order to

47:15

retrieve some of that stuff. So it's

47:16

like yeah those are maybe the

47:18

interesting bits that we're sort of

47:20

leaning into like sort of

47:21

>> you just mentioned there are tons of

47:23

things which you don't want to RL so can

47:27

you mention what kind of arenas do you

47:29

think we should go for RL or we should

47:32

not like where there is like it is

47:35

constrained by compute resources or

47:37

anything

47:38

>> I'm like super bullish on if you're like

47:42

if you're a builder or a company

47:44

producing some sort of like data in

47:46

vertical and you want to like do two

47:50

things. One, make your model way better

47:52

at that task and like basically like fit

47:54

to your data, fit to your use case, then

47:56

also like make it like way faster and

47:57

like way cheaper. Like RL is something

47:59

like definitely like worth exploring

48:00

because fine-tuning has gotten like way

48:03

easier in the last whatever year. Like

48:06

there's actually like amazing companies

48:07

that will help you fine-tune if you like

48:09

bring the data, if you massage it

48:10

properly, like you store all your data

48:12

like Langmith and you can like pull it

48:13

down to do RL over it. Um,

48:16

in terms of things that you like should

48:18

RL on or you shouldn't RL on, I think

48:21

it's really really great if you have

48:23

some sort of like vertical that you want

48:24

to like make your model like really

48:26

really good at. I think we see a lot of

48:27

companies that have started, okay, like

48:30

I'm building this like model and it's

48:33

going to be really really good at search

48:35

and I'm going to expose that as like a

48:37

sub agent to like my main agent and like

48:39

this sub agent is going to rock at that

48:41

or it's like this this model we like

48:44

fine-tune on a bunch of our like

48:46

customer service data and like it's

48:48

really really good at that use case or

48:50

like finance data for example or like

48:51

even even yesterday um like OpenAI

48:53

released Rosalind right which is like

48:55

all about bio

48:57

That's like amazing, right? And that

48:58

also like sort of it it butts heads with

49:01

this whole idea that the general purpose

49:05

everything is just going to like kind of

49:07

like subsume everything, right? It's

49:09

like I'm going to have like one general

49:10

agent that's just going to like it's

49:12

going to be so good. It's just going to

49:13

get exactly what I'm saying. It's going

49:14

like solve the task. Like maybe in the

49:16

limit that is definitely maybe going to

49:18

be true, but to like today like we have

49:20

to build for today, you know? So like

49:22

today it's super helpful actually to

49:24

take the opposite view like curate a ton

49:27

of data and like pick a niche that you

49:30

really care about or like that your

49:31

customers care about and like build the

49:33

best data for that like build the best

49:35

harness for your model around that and

49:38

just like sort of rock at that task. And

49:39

I think like RL is amazing for imbuing

49:42

sort of like vertical specific skills

49:44

into an open model and you get it like

49:48

way cheaper like way faster and like

49:50

depending on the original like training

49:52

distribution of that task in like the

49:55

frontier labs like data mixture like

49:58

you're it's very likely that your

50:00

fine-tune model will be better than that

50:02

open model or sorry than that closed

50:04

model at that task as well because like

50:05

you have the data and you like

50:06

fine-tuned it and like maybe like where

50:08

you don't want to use RL4 is like I I

50:10

honestly think it's a really good idea

50:12

just to start with harness engineering

50:13

like or like just really good context

50:15

engineering

50:17

because it's so easy actually like

50:20

relative to RL that just like pick your

50:22

model like design like a really really

50:24

simple harness around it first like for

50:26

example we have like this abstraction

50:27

and lang chain called like create agent

50:30

which is just a react loop and then you

50:31

can like build a bunch of stuff on top

50:33

of that until like you don't need to

50:34

anymore or you can use like deep agents

50:36

out of the box if you want to and Yeah,

50:38

just like go and build and do maybe

50:40

start with harness engineering and like

50:41

maybe the other point was like

50:44

there's things that like things like

50:46

factbased retrieval like fact-based

50:48

retrieval is just it's just like maybe

50:50

more of a search problem like I just

50:52

want to find the thing and I want to

50:53

inject it into my context essentially.

50:57

So it's like yeah that might be like one

50:58

example where it's like hey like you can

50:59

RL this thing and maybe RL on the domain

51:01

but like the way that you put it over

51:03

the like boundary for computation the

51:06

context window is just find it

51:08

essentially via some search mechanism

51:10

>> you mentioned about search previously

51:11

right like you will be going for search

51:14

like essentially so uh there comes this

51:18

concept of context context ro so you

51:21

site the chroma research on how models

51:24

get words on on as context fills up

51:27

maybe compaction tool call offloading

51:30

skills as you know progressive

51:31

disclosure. So which of these has the

51:33

biggest impact in practice when when it

51:36

comes to context fraud and and what are

51:39

the what are the kind of potential um

51:41

practices you specifically use to avoid

51:43

these?

51:45

>> They they all matter actually and I I

51:48

think it it sort of depends on like the

51:51

design that you're going for out of the

51:53

box, right? So I think like maybe maybe

51:55

like a good recipe essentially is that

51:57

like we start building the agent with

52:00

like a goal in mind like I want the

52:02

agent to do this thing but like really

52:03

really focus on like context rot because

52:06

after you pass like some sort of like

52:07

context threshold it gets like just like

52:09

really dumb and like like you said we

52:11

have like levers to fight against that

52:13

which is I can use like sub agents to

52:16

decompose the problem into like more

52:17

manageable chunks so I don't pollute my

52:19

main context window like that's like

52:21

amazing but like basically what what

52:24

like predicates that is that I can

52:25

actually efficiently decompose the

52:27

problem, right? So it's like maybe

52:29

that's like instructions that I give in

52:31

the system prompt to the agent of saying

52:33

like this is how you like decompose a

52:35

problem into like these tasks and like

52:38

if it's like a task specific agent then

52:40

you probably already have a bunch of

52:41

like human priors for how to go tackle

52:43

the problem. Like for example, for

52:44

coding agents, the way that we decompose

52:46

a problem is like you have agents that

52:49

do like sub agents do like codebased

52:50

search essentially and they like do that

52:53

separately and they pull in the

52:55

important information into like the main

52:57

agent to do some of that stuff and like

52:59

maybe there's like a web search agent as

53:00

well like

53:01

>> has to go like pull external information

53:02

and like find that and prepare it for

53:04

the agent. So it's like yeah basically

53:06

like working backwards from like I need

53:08

to avoid context rot like one way to do

53:10

that is like sub aents like is my

53:12

problem amendable to like sub aents like

53:14

if it is fantastic another way to do

53:16

that is like and these are often in

53:18

conjunction like we we like lang like

53:21

our docs we publish like a bunch of

53:22

stuff on like multi- aent docs as well

53:25

and skills are kind of related to that

53:27

which is skills to me they basically

53:31

kind of like encode knowledge and

53:33

workflows like skills are awesome

53:35

because everyone before skills like

53:40

hated writing good docs if that makes

53:42

sense. Like everyone was just like so

53:44

lazy

53:45

>> and they were like I'm just going to

53:47

like tell the model like some sort of

53:48

like random stuff like kind of like hand

53:50

wavy and it'll just get it. But like for

53:52

some reason like skills came out and

53:54

like because maybe skills are like

53:55

sharable and like other people like see

53:57

the skills like everyone writes like

53:59

very very good like workflow

54:01

descriptions in skills and like the

54:03

agent sort of like sees the skill

54:04

content and then it executes the

54:06

workflow and that's like amazing because

54:10

>> I basically get like a very small

54:12

snippet of like when to use this skill

54:14

and like I avoid all the context rot and

54:17

like when necessary we like pull in the

54:20

right context from the skill into there.

54:22

Like the the tricky thing with skills is

54:24

always like

54:26

basically knowing when to trigger them.

54:27

And that again comes down to like

54:29

instruction following which is like we

54:31

have some skills evals as well where

54:32

like we'll have scenarios and then we'll

54:35

sort of have like the skills that we

54:36

want to have like triggered basically

54:40

and then like we we have evals where

54:42

it's like we we only want that skill to

54:44

be triggered because like like let's say

54:46

it like triggers the wrong skill first

54:48

and then like eventually it does a bunch

54:49

of like stuff and then it figures out

54:51

like oh actually I need to do this skill

54:52

like that's bad because you wasted a

54:54

bunch of tokens essentially. So I think

54:56

eval help a bunch with context rot which

54:59

is one does the problem succeed at the

55:02

end like that's a really big part of

55:04

evals and the other one is sort of like

55:06

fine grained metrics on the evals which

55:09

is like how long did it take like how

55:11

many tokens did it take what was the

55:13

overall cost right and then like reading

55:14

the trajectory and then seeing like in

55:17

my effort to reduce like context rot by

55:19

doing like sub agent routing or like

55:21

triggering the right skills is that

55:23

working and then there's like maybe also

55:25

like determine ministic stuff which is

55:26

good. So like tool call offloading. So

55:29

like this happens a bunch with like bash

55:31

calls like you have you you you run the

55:33

shell and uh it's just like a mess. So

55:36

you get this gigantic like tool like

55:38

this output string and you can just pipe

55:40

that into context or you can just take

55:43

like the head and the tail and you pipe

55:45

that into context because that's usually

55:47

the important bits and then you tell the

55:49

model that like the rest of this string

55:51

lives in this file over here if you can

55:53

if you want to access it and then you

55:55

can go and do that. So, it's basically

55:56

like doing a bunch of stuff on the

55:59

model's behalf to really protect that

56:02

like incredibly precious artifact, which

56:04

is our context window. And like we I we

56:08

just think like really hard about like

56:10

if something doesn't need to go in here,

56:12

like really like don't put it in there.

56:13

But if something does need to go in

56:15

here, like do our very best to like

56:18

spend compute on like search or like

56:19

really good instructions to make sure it

56:21

gets in there.

56:21

>> Makes sense.

56:23

Interesting. I mean that that actually

56:25

makes a lot of sense. Um I'm curious

56:27

about so for people who may who may not

56:31

know the space well. So there's been

56:32

like open claw boom. I mean I I just saw

56:36

on Twitter it is kind of declining as

56:38

well. So Hermes has been getting as much

56:40

attention as open claw right which which

56:42

is coming out of news research. So how

56:45

does

56:46

deep agents differ from both of them? It

56:49

would be useful to explain this from

56:51

first principles for both technical and

56:53

nontechnical listeners since we are

56:55

going to spend a lot of time talk about

56:57

hardness in this conversation.

56:58

>> Both amazingly sick projects like

57:02

openclaw amazing like what Peter did

57:04

there and then also like what the new

57:05

guys are doing with Hermes is like so

57:07

cool. Yeah. I think like the main way

57:09

that I think about it is like you have

57:11

this like claw architecture right looks

57:13

like a little bit different from like

57:15

claw to claw but like overarching

57:17

architecture of like I deploy this

57:20

somewhere there's some sort of like

57:21

messaging

57:24

>> it's like live talk to it back and forth

57:27

there's like a heartbeat that triggers

57:29

like over and over again that has like

57:31

some sort of like memory primitives in

57:33

there so it's like it's basically like a

57:36

very opinionated

57:38

harness for the use case that is like my

57:42

personal agent. So I think like a claw

57:44

is like really the first it's the first

57:47

like really mainstream personal agent

57:50

like maybe like besides chatbt like

57:52

chachi is like it didn't like really

57:54

feel like a personal agent like it had

57:56

like memory and stuff like people like

57:57

message their claws like all day like

57:59

maybe people do that with chatbt too but

58:00

it's like the architecture of the

58:02

harness behind like the claw that makes

58:05

it like feel really personal because of

58:07

all the things they put behind it around

58:08

like the integrations like like what's

58:10

happening and telegram and those types

58:12

of things and like the memory that gets

58:13

updated. Like the big thing is honestly

58:14

like I like the heartbeat thing a lot. I

58:16

think like that doesn't get enough hype.

58:17

It's like very ingenious to like wake it

58:20

up on some cadence to like do things for

58:22

example crrons and things like that. So

58:24

the way I think about it like high level

58:26

is a claw is an amazing choice of an

58:31

opinionated harness for like a personal

58:34

agent essentially. And that's like an

58:38

awesome choice that like they make. And

58:41

I think we should have like a lot more

58:43

of these like people should like build

58:45

their own or like people should use them

58:46

more and see if they like them. And then

58:50

maybe like going back to like the

58:51

primitives, I think like you can build

58:54

tons of agents that are not claws that

58:57

like completely solve like your task

59:00

like really really well. And that's

59:03

basically like how I view like maybe

59:04

like Langchain's create agent or like

59:07

deep agents or like all of the other

59:09

great companies that are like building

59:11

harness primitives which is your

59:14

probably your task does not require like

59:16

a claw like most most likely like it's

59:19

awesome like you should have a claw in

59:20

your life but like if you're doing

59:23

something else like you don't need a

59:24

claw. So like actually what you need is

59:27

amazing instructions, amazing context

59:30

engineering, like amazing choice of like

59:32

what models you're going to use to hit

59:34

like the paro frontier of like Perf cost

59:38

and latency and like you can start from

59:40

like a simple harness and you can

59:42

assemble a harness around like that

59:44

model or like models to like build that

59:47

thing essentially. And I think like claw

59:49

is like one instantiation of an

59:51

opinionated harness for like personal

59:53

agents basically and it's like awesome.

59:55

And I know like people use claws for

59:57

like other things as well. So like I

59:58

think claws if you like edit the harness

1:00:01

around them like the base harness and

1:00:02

you like make them I don't know if you

1:00:04

like change them for like another task

1:00:06

of research that's like awesome too. But

1:00:08

I think like that whole process of like

1:00:10

taking a task and like you have a

1:00:13

harness that like wraps a model or like

1:00:14

models and like you sort of like direct

1:00:16

it towards a goal. That's basically I

1:00:19

think the goal of like laying chains

1:00:21

like create agent and like deep agents

1:00:23

which is we have like some opinions in

1:00:24

there to get you started but like really

1:00:27

we want to help you build the best agent

1:00:30

like for your tasks. Like that might be

1:00:33

us giving you all the tooling. That

1:00:34

might be like me and like the rest of

1:00:37

the team like blogging about like actual

1:00:38

use cases and like sharing our evals and

1:00:41

like just publishing results. But yeah,

1:00:44

basically like customize a base harness

1:00:46

to make it like really really good at a

1:00:47

task and like a claw is a like

1:00:49

phenomenal example of

1:00:51

>> makes sense.

1:00:53

What what you really see the future of

1:00:55

it? I mean I mean let's say down the

1:00:58

line what the next um what what it would

1:01:02

look like after let's say five

1:01:04

iterations of it or what you really see

1:01:06

the future of in in a year or so let's

1:01:10

say

1:01:11

>> I mean honestly in a year and six months

1:01:12

like everything's going to change

1:01:13

obviously no I'm just kidding like it's

1:01:15

it's hard to say like I'm like in the

1:01:18

short term very very bullish on

1:01:21

basically helping people build like open

1:01:25

or like us providing open infrastructure

1:01:28

to help other people build agents that

1:01:31

are like amazing for their task. And I

1:01:32

think that is not going away in the near

1:01:36

to medium term at all. In fact, I think

1:01:39

it's going to go in the complete

1:01:40

opposite direction, which is like

1:01:41

everyone is going to start basically

1:01:43

taking their tasks and they're going to

1:01:45

either do like harness engineering

1:01:47

around those tasks, which is largely

1:01:49

like very good context engineering, like

1:01:50

very good like prompts and very good

1:01:52

tools and like very good skills. They're

1:01:54

going to do all of that around like some

1:01:57

sort of task they care about basically.

1:01:59

And I think open harness engineering is

1:02:01

a big part of that. And I think like

1:02:02

open models are also like a really big

1:02:04

part of that which is we're going to see

1:02:06

like a big growth of I'm going to take

1:02:08

like Kimmy, I'm going to take like GLM5

1:02:10

and I have this data and like a big

1:02:12

future is I'm just going to like

1:02:14

fine-tune that model on my data and I'm

1:02:17

going to make it really good. I'm going

1:02:18

to just keep doing that over and over

1:02:19

again and I'm going to compare how that

1:02:21

does to like a Frontier model and I'm

1:02:23

going to make the trade-off between like

1:02:25

is it better, is it like just as good,

1:02:27

what's like the cost, what's like the

1:02:28

latency tradeoff and then like maybe

1:02:31

like a little bit more like longish term

1:02:35

from that which is like it would be

1:02:38

awesome if we got like some sort of like

1:02:43

AGI model that just did everything. I

1:02:46

would love that. Like so then I can like

1:02:48

totally stop talking about like

1:02:50

harnesses and like evals and I can just

1:02:52

like enjoy the model. But it still

1:02:55

really does help to specify like the

1:02:58

intelligence that we want that model to

1:03:01

like act on in a particular situation.

1:03:03

And like I still think even in like the

1:03:05

medium to long term, it's going to be

1:03:06

super helpful for humans to get really

1:03:09

really good at both describing the thing

1:03:13

that they want and not just like hand

1:03:15

wavy like writing like kind of how we

1:03:17

like write really detailed prompts like

1:03:20

getting comfortable with taking the

1:03:22

thing I want and like putting it into

1:03:24

like language basically. And then the

1:03:25

other thing is we're still going to want

1:03:27

to like verify the work that agents are

1:03:30

doing in in some way. I hope like

1:03:32

autonomous verification systems get a

1:03:34

lot better, but like they're not going

1:03:37

to be perfect and we're still going to

1:03:38

want to be able to say like when an

1:03:40

agent is doing good versus like when an

1:03:42

agent is doing bad and that can become

1:03:43

like part of the feedback signal and I

1:03:46

still think that's going to exist like

1:03:47

for for a little bit and like that's

1:03:48

that's like not a bad thing. That's like

1:03:50

totally fine like we can still work on

1:03:52

that.

1:03:53

>> Makes sense. Um you know dude there are

1:03:56

a lot of I mean bunch of companies I

1:03:57

mean interestingly everyone who is

1:03:59

working on frontier are coming out of

1:04:00

their own harness own their own agent.

1:04:03

So so recently RAM basically built their

1:04:05

own harness right. So have you seen what

1:04:08

they put out? So and my other question

1:04:11

is like you yourself um have used open

1:04:13

code. So it seems like um enterprises

1:04:16

building custom harnesses puts real

1:04:19

pressure on competitors. So I'm guessing

1:04:21

a um release like that forces companies

1:04:24

like SLA which also got recently big fun

1:04:27

who competed to RAM and others to build

1:04:29

their own thing too. So what do you make

1:04:32

of this trend?

1:04:33

>> Yeah, I mean like the ramp is amazing

1:04:35

obviously like they put out such fire

1:04:37

blogs. I like ramp lab stuff. I think

1:04:39

like the the overall trend of like

1:04:42

building your own harness or like or

1:04:44

basically like building your own agent

1:04:46

that's custom for your task is like

1:04:48

fantastic. Like I I think more teams

1:04:50

should basically devote time towards

1:04:53

like investing maybe like in the process

1:04:56

of both like helping their teams build

1:04:59

agents, right? That doesn't just mean

1:05:00

like coders. That means like everyone

1:05:02

like the people who are doing go to

1:05:03

market like marketing, sales, all those

1:05:06

people can like benefit in some way from

1:05:08

agents. They just need like help doing

1:05:09

that basically. And I think it's great

1:05:12

that a company basically like picks a

1:05:14

problem and they're like we're going to

1:05:18

solve that by building the best harness

1:05:20

and that means like the best context

1:05:22

edge like the best verification the best

1:05:23

tool stack also a big part of it and

1:05:27

like we work on a lot of the stuff at

1:05:29

like lang is building the correct or

1:05:32

building like really easy to use systems

1:05:35

for taking the trace data and then like

1:05:38

improving the agent because I think

1:05:39

there's a lot of stuff around

1:05:40

improvement loops which is our first

1:05:43

pass at the agent isn't amazing. So like

1:05:45

this comp these companies are like okay

1:05:47

I'm going to pick a task I care about

1:05:49

and like my first version is going to be

1:05:51

like kind of mid totally fine but then

1:05:53

I'm going to get the data from somewhere

1:05:56

and I'm going to like make it better

1:05:57

over time by just like spending a ton of

1:05:59

time on it or like maybe spending a

1:06:01

bunch of like compute on it to like

1:06:02

understand the data and like improve the

1:06:05

prompts like fix the edge cases like

1:06:07

improve all the errors right and it's

1:06:09

just like I still think like we we will

1:06:11

have tons of vertical companies because

1:06:13

today like someone has to do the work

1:06:16

like someone has to like invest in doing

1:06:18

that like someone has to do like sales

1:06:19

around that right it's not just going to

1:06:20

like happen by itself and I think like

1:06:23

more tooling around that and like more

1:06:25

yeah just more like research that helps

1:06:27

people do that that's like a good thing

1:06:28

like doing the open is like an even

1:06:30

better thing so it's like more

1:06:33

>> I think it's also very ambitious to do

1:06:34

you know you you're you're already at

1:06:37

Frontier and why to depend on someone

1:06:39

like let I mean if you want to be at

1:06:41

frontier you have to build something

1:06:42

like what what other people at are

1:06:44

working on. Interesting. So, um beforeh

1:06:47

going to the other segment of the

1:06:48

podcast, let's have some quickfire

1:06:50

chats. Uh so, so there is a meme you

1:06:53

liked from Mintly Fly Slack. Would be

1:06:56

awesome if you can share screen and

1:06:58

share that. Yeah, please. Then I'll go

1:07:00

ahead.

1:07:00

>> Let me

1:07:02

let me get that off. Um

1:07:05

I love this guy. Let me share. Dude,

1:07:07

this guy is so funny. A dude, I love

1:07:09

this guy so much. Dude, this guy this I

1:07:11

don't know like where this came from or

1:07:13

like who can I turn on volume?

1:07:16

>> So then I mean you like this like this

1:07:19

from mental slack channel that

1:07:21

apparently went viral across I think

1:07:23

startups. So what is it and what it made

1:07:25

it so hard? What's

1:07:26

>> so I have I have no idea. I think it's

1:07:28

Nick. It's Nick from Mintify like

1:07:30

tweeted it one day who's funny like I

1:07:33

was just like this is amazing. So I I

1:07:35

just sent it to all my friends um just

1:07:37

like randomly. I think it was our like

1:07:38

uh like soccer chat. I'm like something

1:07:41

happened with like Arsenal or something.

1:07:42

I I sent this to like my friends because

1:07:44

they I think they lost. Yeah. And like

1:07:45

now we have this in our Slack as well.

1:07:48

Like someone made it into like a gift

1:07:50

and like whenever maybe something goes

1:07:51

wrong like we just sort of throw this

1:07:53

guy like I don't know what it is or who

1:07:55

made it, but like I love this guy. I use

1:07:57

it all the time.

1:07:58

>> Awesome. Um my next question is um most

1:08:03

underrated harness feature that nobody

1:08:06

talks about. most underrated.

1:08:09

It's a good question because like I feel

1:08:10

like if it's underrated, we should be

1:08:11

talking about it.

1:08:14

>> Yeah, exactly.

1:08:16

>> Okay. Okay. Okay. I think like one thing

1:08:19

that we use a bunch is like this idea of

1:08:23

like we call it middleware but like

1:08:24

hooks just generally. So like for for a

1:08:28

lot of teams it's like super useful to

1:08:31

inject sort of like deterministic

1:08:33

actions like basically like do

1:08:35

deterministic code execution like

1:08:37

somewhere in the harness. And I think

1:08:39

that's like super underrated maybe

1:08:40

because it requires like sort of like

1:08:41

custom logic. It's not just like you

1:08:43

think of a tool and you just sort of

1:08:44

like add it. But yeah, I think hooks

1:08:48

that sort of like control bad model

1:08:51

behavior are like really really helpful

1:08:53

or like not just bad model behavior like

1:08:54

help the model like do things. So for

1:08:55

example like triggering excuse me

1:08:58

triggering like self-verification and I

1:08:59

think people should like build more

1:09:01

hooks to control their models. Makes

1:09:03

sense. Um interesting. So we have

1:09:06

something which people should talk

1:09:08

about. Interesting. The model that

1:09:10

surprised you most in agent workloads

1:09:13

this year

1:09:14

>> in both I mean we can go in both ways

1:09:18

which was like something which you were

1:09:20

not expecting and it comes out really

1:09:22

better and something which you kind of

1:09:24

not expecting and it was like it comes

1:09:26

out.

1:09:26

>> Yeah. So, I'm like so impressed by open

1:09:30

models generally as like actually ways

1:09:33

that I get work done. And like I think

1:09:36

like it's it always like feels really

1:09:38

good to talk about open models, but like

1:09:40

you sort of like love the idea of open

1:09:42

models, but then like you don't use

1:09:43

them. Like that's like that's not good.

1:09:46

But actually like the open models that

1:09:48

have come out this year are like

1:09:50

amazing. So like the GLM series is like

1:09:53

fantastic and like it is actually a good

1:09:57

agentic coding partner. It's like very

1:09:59

fast and it does amazing work. So like

1:10:02

maybe at the start of this year like

1:10:04

last year I don't think I would have

1:10:05

expected my GLM f like my GLM use to be

1:10:09

so high and there's other models too

1:10:11

like um like the Ry team who you had on

1:10:13

like they're they're amazing. Miniax is

1:10:15

one that we actually like eval on a

1:10:16

bunch and like these are all amazing.

1:10:18

So, like open models have surprised me.

1:10:21

Like I was hoping it would happen, but

1:10:23

it did happen and that's awesome and

1:10:25

like we should invest a bunch more in

1:10:27

that and like I hope like I hope like

1:10:30

teams actually like think about using

1:10:31

them in like their actual workloads

1:10:32

because they're amazing. Yeah, that

1:10:34

surprised me but in a in a good way. I

1:10:36

was like super happy and like it's only

1:10:38

going to get better and that's like

1:10:39

really really good and it's like way

1:10:41

cheaper and faster.

1:10:42

>> Awesome. Um okay, this one is lost. one

1:10:44

thing you would change about how the

1:10:46

industry builds agents right now. It can

1:10:49

be any common practice or something like

1:10:50

that.

1:10:52

>> How should they change? I think

1:10:54

basically like this whole thing that

1:10:56

we've been talking about right now is

1:10:57

like I would love if like that was like

1:11:00

easier for people to do or like more

1:11:01

people like did it which is basically

1:11:03

like maybe like work backwards from a

1:11:06

task and like a goal that you really

1:11:07

want and then like the whole point to me

1:11:11

is just like build a system like for

1:11:13

your team or like for yourself and like

1:11:15

for your agent to like make it better

1:11:17

over time. Like maybe like I'm saying

1:11:18

that because I'm like we're thinking a

1:11:20

lot about continual learning. So this is

1:11:22

like both the agent design which is like

1:11:24

prompts tools like the whole harness

1:11:26

thing like the verification loops like

1:11:27

all this sort of stuff and then also

1:11:28

it's like sort of the infrastructure

1:11:31

around it for doing like

1:11:32

self-improvement. So this is like the

1:11:34

unsexy stuff, but I think the stuff that

1:11:36

like really matters, which is okay like

1:11:38

are you like is tracing on? Like are you

1:11:41

like putting your traces somewhere

1:11:43

basically like are you using your traces

1:11:47

to like mine errors like via monitoring

1:11:50

basically? like lang supports that and

1:11:51

like we think about that a bunch which

1:11:52

is like trace came in like how do I

1:11:54

figure out if something happened and

1:11:55

like am I making eval from that right

1:11:57

and then like am I am I like reading the

1:12:00

evals basically so it's sort of like the

1:12:01

systems approach around like building an

1:12:03

agent and like making it better I think

1:12:06

teams are doing that that's amazing but

1:12:08

like it's awesome and I think teams

1:12:10

should team should try to do that

1:12:11

>> makes sense um also on the same note

1:12:14

there was this um recent paper called

1:12:16

meta harness and DDR also posted about

1:12:19

it a lot of people are working on auto

1:12:21

research and this field adjacent if not

1:12:24

a version of auto research itself then

1:12:26

you have also things like you know post

1:12:28

train bench where a hardness is used to

1:12:31

post train models so if those two

1:12:34

directions start merging so better

1:12:37

harnesses improving post training and

1:12:39

meta harness improving the hardness loop

1:12:40

itself I mean that feels pretty

1:12:42

explosive pretty interesting how do you

1:12:44

think about that convergence

1:12:46

>> I I think it's super exciting like I

1:12:48

love teams that are like

1:12:49

productionalized ing like auto research

1:12:51

and like doing so like we we have I

1:12:53

think we did like something around like

1:12:55

harness opt maybe like a couple months

1:12:58

ago and there were definitely like some

1:13:00

issues that I saw back then and I think

1:13:03

we still have like some of the issues

1:13:04

but like now like a lot more teams like

1:13:06

putting a lot more effort into it. So

1:13:07

it's like I think this is amazing that

1:13:09

and also maybe I'll just like this is my

1:13:11

take as well. So, like I have like we

1:13:13

put up a bunch of blogs and I think like

1:13:16

there's there's like algorithms that we

1:13:18

need to discover to make like agents and

1:13:20

harnesses better using some sort of like

1:13:22

grounding signal. And that's basically

1:13:24

like auto research is like I have a

1:13:25

grounding signal and I hill climb that

1:13:27

signal and like I update my harness and

1:13:28

like meta harnesses that like we have

1:13:30

one like better harness and like we tons

1:13:32

of good people have like work around

1:13:33

this which is amazing. And basically

1:13:35

like I view like eval as such an

1:13:38

important part of this like feedback

1:13:41

loop because like eval are basically how

1:13:42

we like ground our like auto research

1:13:45

loops over time and it's not just like

1:13:46

ground like in the moment like if I run

1:13:48

auto research like later like a two

1:13:50

weeks later I still have that same like

1:13:52

grounding mechanism and maybe hot take

1:13:55

but I think like you can almost try to

1:13:59

define an agent via a set of like evals

1:14:02

that sort of serve as not just a spec,

1:14:04

but a spec that you can like verify and

1:14:06

like ground. You can do it via fitting

1:14:08

to like eval.

1:14:11

And then you basically have a fitting

1:14:13

algorithm, right? And the fitting

1:14:14

algorithm can be like meta meta harness

1:14:16

or like better harness or like any of

1:14:18

these. And that fitting algorithm is

1:14:20

basically run on evals like reflect on

1:14:23

traces and like update harness or like

1:14:26

prepare data to do like RL on it. And I

1:14:28

think like we're in such early early

1:14:30

innings of this self-improvement loop

1:14:34

basically and I'm I'm super excited

1:14:35

about it. I think it's like really

1:14:36

really cool. There's like stuff to work

1:14:38

out around overfitting and stuff but

1:14:41

like that will happen and like people

1:14:42

will use this a bunch more.

1:14:43

>> Awesome. Makes sense. Um

1:14:45

>> I'm pretty much looking forward to as

1:14:47

we're approaching to the next um I mean

1:14:49

last segment of the podcast we have some

1:14:50

questions around environments harnesses.

1:14:54

pretty much harnesses we have covered

1:14:56

but eval and benchmarks around

1:14:58

benchmarks. So, so how do harness fit

1:15:02

into this broader idea of simulation as

1:15:05

a service? So there are companies whose

1:15:07

whole business is simulating work

1:15:09

categories, decisions, operating

1:15:11

environments. If better harnesses lead

1:15:14

to better simulations, so where does the

1:15:16

open-source side go and do you think

1:15:18

langen will eventually release an open

1:15:20

source simulation?

1:15:20

>> Yeah. So like I think these these two

1:15:22

things are super related. So like thing

1:15:25

that we wrote about before there's like

1:15:28

like evals and like environments like

1:15:30

they're not the same but they sort of

1:15:32

like rhyme with harnesses as well. So

1:15:34

it's like basically like the like the

1:15:37

main idea is like I need like some place

1:15:40

for my agent to do work that sort of

1:15:42

like reflects actual work that's going

1:15:44

to be doing like in the real world

1:15:46

basically right so it's like I'm going

1:15:47

to build an environment like there's

1:15:48

tons of like awesome environment

1:15:49

startups that are doing that and like

1:15:51

running the agent in them so it can

1:15:53

produce like a good feedback signal so I

1:15:54

can like train on basically that's like

1:15:56

amazing. I think like even a big part of

1:16:00

like evals are going to start looking

1:16:02

like environments because like when when

1:16:05

we first started trying to like eval

1:16:08

it was really simple. It was like chat

1:16:10

completions evals, right? It was like

1:16:12

I'm going to give you like a really

1:16:13

simple like input prompt and I'm going

1:16:15

to have like a number or like a

1:16:17

structured output at the end of it. I'm

1:16:18

just going to like map the keys, right?

1:16:19

I'm going to be like, "Hey, did you like

1:16:21

did you get them all right?" But as

1:16:23

agents are doing much more like

1:16:25

complicated work and like much more like

1:16:27

long horizon work actually like the

1:16:29

thing I want to eval is like a task

1:16:31

essentially and like the the best way to

1:16:34

maybe do that is to just like build an

1:16:36

environment and just like drop my agent

1:16:38

into the environment and like maybe like

1:16:42

what we do right because we actually do

1:16:43

this like we basically use Harbor right

1:16:45

and like Harbor those guys are awesome

1:16:46

like the the terminal bench guys

1:16:49

so like we'll pick our eval like it maps

1:16:52

to some sort of like hardware config and

1:16:54

then like we run the eval in the

1:16:56

environment that we built. Then like all

1:16:58

of the traces like go into Lenmith and

1:17:00

then we like read them and we look at we

1:17:02

like segment them based on like the

1:17:04

rubric like how much did it pass, how

1:17:05

much did it fail like how long did it

1:17:06

take and then we try to like improve the

1:17:08

agent and I think that process of like

1:17:11

building the environment and like you

1:17:13

asked about simulations like we think I

1:17:14

think about this a bunch which is like

1:17:16

what I really want to happen is the like

1:17:19

the company that we're building or like

1:17:21

the app or product that I'm building

1:17:23

like I want my agent to be able to like

1:17:25

test itself in that exact environment.

1:17:28

So it can figure out like when stuff

1:17:30

goes wrong essentially and then I can

1:17:32

like fix it, right? And like that's

1:17:34

basically the whole point of eval which

1:17:35

is like

1:17:36

>> they're sort of like a proxy for what

1:17:39

happens in production and like as I fit

1:17:41

to my evals I'm kind of imbuing like

1:17:45

behavior into the agent to make it pass.

1:17:48

The whole goal of evals is like to make

1:17:49

them pass, right? Like and like a lot of

1:17:51

our evals fail because like maybe the

1:17:53

models just aren't smart enough yet.

1:17:55

like eventually they will pass and like

1:17:58

then what I've done is like I've taken

1:17:59

that information from that eval and I've

1:18:01

sort of like transfer learned it into

1:18:03

like some sort of agent whether it's

1:18:04

like the weights or like the harness or

1:18:06

something and yeah I'm bullish on both

1:18:08

I'm bullish on like eval as a mechanism

1:18:10

of doing like agent improvement and also

1:18:12

bullish on like more eval looking like

1:18:16

environments basically instead of like

1:18:18

just like input output

1:18:19

>> pretty interesting you know this uh

1:18:21

terminal bench 2.0 Sweet bench pinch

1:18:23

bench. So the benchmark landscape of for

1:18:26

agents is growing fast but you um

1:18:29

explicitly say in your opinionated

1:18:31

agents post where I mean you said test

1:18:33

on real world users for your product

1:18:35

don't trust benchmarks your user has

1:18:37

never heard of terminal benchtop please

1:18:39

don't introduce it to them right

1:18:41

>> so so so so which benchmarks do you

1:18:44

actually trust and which ones are most

1:18:45

performance theater I mean I mean what

1:18:48

what's the general landscape where one

1:18:51

should actually think

1:18:52

>> I was definitely a bit hyperbolic like

1:18:53

like don't introduce anyone to Turbo

1:18:55

Veg. Like I love Turbo Veg. Like those

1:18:57

guys are awesome. But I think I think

1:18:58

the general point actually does stand

1:19:00

though, which is

1:19:02

like like eval to me are basically like

1:19:04

again they're like a mechanism of like

1:19:07

evals and benchmarks. They're basically

1:19:09

like a mechanism that like proxies

1:19:10

behavior that I want my agent to

1:19:12

actually have like via this like thing

1:19:16

which like roughly measures that, right?

1:19:18

So it's like I'm trying to measure like

1:19:21

long horizon like problem solving. Like

1:19:26

can I do that with like a really hard

1:19:27

terminal bench task? Like kind of maybe.

1:19:31

But like if my actual like app has

1:19:34

nothing to do with that, then like me

1:19:36

passing that like terminal bench task

1:19:38

doesn't map well into like my like long

1:19:40

horizon problem solving like the bio

1:19:42

domain, right? So it's like there's sort

1:19:44

of like rough proxy signals that measure

1:19:47

like so like in at like Langshin we have

1:19:50

like axes that we try to measure on. So

1:19:52

like every eval we tag to like an

1:19:54

access. So it's like retrieval, it's

1:19:55

like problem solving, it's like

1:19:56

planning, it's like tool use for example

1:19:58

like we we like try to tag every eval

1:20:02

we do that we tag every eval like one or

1:20:04

multiple axes, right? But I think it's

1:20:08

useful to use benchmarks as like a

1:20:11

general like vibe like a guidance. Like

1:20:14

you should definitely read the traces

1:20:15

from benchmarks. We like I like spend

1:20:17

tons of time every day just like reading

1:20:18

the traces from pre-built benchmarks.

1:20:21

But I think really the thing that helps

1:20:23

teams is

1:20:26

using their trace data to build evals

1:20:29

for themselves that actually like map

1:20:31

onto their customer use case that maybe

1:20:33

like no existing benchmark like does a

1:20:36

really good job of. And I think like

1:20:37

it's kind of like a moat if you want to

1:20:40

call it but it's like it's just a really

1:20:41

good way of like building a better agent

1:20:43

product which is there's awesome people

1:20:46

building awesome benchmarks. None of

1:20:47

those benchmarks map exactly onto like

1:20:49

what my agent needs to do. So like I can

1:20:52

like use those to roughly measure

1:20:54

problem solving ability, but like really

1:20:56

the best way to measure problem solving

1:20:57

ability is just to like get a

1:20:59

representative set of like evals and

1:21:01

tasks like my own bench and just like

1:21:04

use those and like that's going to vary

1:21:05

from like person to person like product

1:21:06

to product like feature to feature. So

1:21:09

>> makes sense.

1:21:09

>> Yeah. Yeah. Yeah.

1:21:11

>> Yeah. What's your opinion on computer

1:21:13

use stuff? Because this is u this is

1:21:15

something very subject to people like

1:21:16

the current approach is not good. You

1:21:18

can't really go in the screenshot way.

1:21:20

You really can't use MPI MCP or API way.

1:21:23

You have to bullish. You have to scale

1:21:24

GUI stuff. So what do you think about it

1:21:27

and about it scaling part of

1:21:29

>> dude? We were just talking about this

1:21:30

today actually like how much should we

1:21:32

do like more examples on computer use?

1:21:33

Like I'm like very fascinated by

1:21:35

computer use. I think it's like super

1:21:36

interesting. I think like there's maybe

1:21:38

like two things. One is there is still

1:21:41

definitely a visual perception problem

1:21:42

like that we like we've known for a

1:21:45

while like like fine grain details is

1:21:47

not it's not like amazing at that maybe

1:21:48

it's like less of a limitation now like

1:21:50

some of these models are like better at

1:21:52

computer use I don't know I don't have

1:21:55

like a great opinion on which way of

1:21:58

doing computer use is going to win like

1:21:59

the hybrid like pulling down like the

1:22:01

actual like webpage content and like

1:22:03

clicking versus like how much do you use

1:22:05

screenshots I would be like very happy

1:22:08

If like everything was just like just

1:22:11

worked with vision because that would

1:22:12

mean that we did we have made like a

1:22:14

step change in like visual perception

1:22:16

and like visual reasoning over

1:22:17

screenshots and like doing or sorry like

1:22:18

yeah visual reasoning over like

1:22:20

>> images basically.

1:22:22

>> Yeah, I don't know if it'll happen. Um

1:22:24

but I'm like for the applications of

1:22:26

computer use I think they're awesome and

1:22:28

like we should do we should do like more

1:22:30

stuff around those but I don't know. I

1:22:32

just haven't played with it as much.

1:22:33

>> I mean yeah awesome. Makes sense. So I

1:22:36

mean do you think there is some secret

1:22:38

sauce something which can be scaled to

1:22:42

scale more long horizon task about in

1:22:44

your experiment experience what what is

1:22:47

something which is blocking

1:22:49

uh because I think a lot of companies

1:22:52

are being forming up involvements for

1:22:53

long horizon task these days and been

1:22:56

selling to enterprises and frontier labs

1:22:57

now I mean what do you think of the

1:22:59

space about scaling

1:23:01

>> I think

1:23:02

>> I think like there's a lot of like good

1:23:05

work that a I think a lot of companies

1:23:07

have like good agents in like medium

1:23:10

horizon tasks like for example like we

1:23:12

like we have like a background coding

1:23:13

agent that can go like do things over

1:23:15

like hours like a few hours basically

1:23:17

right it's like they're all coding

1:23:18

related tasks like it's easy to like

1:23:20

pick those and like scale those and I

1:23:22

think yesterday there was like really

1:23:23

good work from like the proximal team

1:23:25

for like frontier suite which are like

1:23:28

hey like these are like 20our tasks

1:23:30

basically and like we're going to go and

1:23:33

run on

1:23:34

I think like one thing that is still

1:23:37

like really really tricky for models and

1:23:40

I think like in the near term what's

1:23:42

will happen is like hopefully like

1:23:44

models get like post-trained better on

1:23:45

this but like we will still have to

1:23:46

build a bunch of like harness

1:23:47

infrastructure around it which like

1:23:48

hopefully falls away is one like

1:23:51

decomposing a really difficult problem

1:23:54

into like subpieces

1:23:56

and then doing like verification of the

1:23:58

intermediate steps. I think like that is

1:24:00

like a really really good general

1:24:03

purpose recipe that we can use to like

1:24:07

keep doing like longer and longer

1:24:08

horizon tasks because like basically

1:24:11

like all a long horizon task really is

1:24:14

is like I'm going to do I'm going to get

1:24:16

this like really hard task. I'm just

1:24:18

going to do like a bunch of like little

1:24:20

sub pieces like over and over and over

1:24:21

again. And I need to make sure that like

1:24:23

I don't mess up any of the sub pieces or

1:24:24

like if I do mess up I need to like go

1:24:26

back and fix those

1:24:29

like the key thing is like figuring out

1:24:30

like when you messed up that's hard. So

1:24:33

so we we need better like

1:24:34

self-verification systems there that

1:24:36

might be like self bootstrapping like

1:24:38

testing for example

1:24:39

>> and like the other thing we need to do

1:24:40

is like teach systems how to like

1:24:44

decompose problems into like sub agents.

1:24:46

I think like there's really cool stuff

1:24:47

around RLMs around this.

1:24:52

They're like I still find them like a

1:24:53

little bit tricky to get working, but

1:24:54

like the ideas behind them like amazing

1:24:56

basically like externalize context as

1:24:59

like an object and then like sort of

1:25:00

like search over that and like decompose

1:25:02

problems like that for like really

1:25:04

really long horizon tasks. I don't know

1:25:06

it doesn't work amazing right now but

1:25:07

like the general strategy of like verify

1:25:09

and then like decompose like iteratively

1:25:12

I think that's like a good path forward.

1:25:15

like we're we're spending time there as

1:25:16

well. I'm sure like a bunch of other

1:25:17

people are well.

1:25:20

>> Awesome.

1:25:22

Great. Um I think we are pretty much uh

1:25:23

to the end of the part and um so what is

1:25:26

uh what is something which you are most

1:25:29

excited about to happen in let's say

1:25:31

again in 6 months or year because again

1:25:33

like we pretty much don't know but you

1:25:35

really want to see to yeah to happen.

1:25:38

>> Two things like one I'm super excited

1:25:40

for the World Cup. So like World Cup is

1:25:41

happening like here it's happening in

1:25:44

Philly. So, I'm like super stoked for

1:25:45

that. But besides that, I think like the

1:25:48

thing I'm like super stoked about is

1:25:51

we're we're like just starting to get

1:25:53

the first sparks of these like

1:25:56

self-improvement loops from data that's

1:25:59

generated from agents. And like we're

1:26:01

pushing like a ton on this like in the

1:26:03

last like couple months like we put our

1:26:05

like first like research around this.

1:26:06

There's other good teams doing this. But

1:26:07

I think like this is like such an

1:26:09

amazing on-ramp for us for like all

1:26:11

teams to self-improve like all of their

1:26:14

systems by doing like very very good

1:26:17

like data engineering like looking at

1:26:19

all of their trace data like mining it

1:26:21

for errors and like bootstrapping self

1:26:24

like probably to start they're going to

1:26:26

be like semi-autonomous self-improvement

1:26:28

loops like like humans will need to be

1:26:30

in it but the systems will get better

1:26:32

and better and I think the the flow of

1:26:36

build agent

1:26:37

use an environment, generate data from

1:26:40

it, and then like mine the data, point a

1:26:44

lot of compute at the trace data to

1:26:46

derive like eval and to derive training

1:26:48

data and then like use that to like

1:26:50

improve the agent. Just keep doing that

1:26:52

loop. That is like super exciting to me.

1:26:54

And it it's it already works actually

1:26:56

like every like people like we're doing

1:26:58

it like people are already doing it. It

1:26:59

like works. Customers are doing it. It's

1:27:01

awesome.

1:27:02

But it will only get better, I think,

1:27:04

with like better models and like we're

1:27:05

going to build everyone's going to build

1:27:06

better systems around some of this

1:27:08

stuff. So, yeah, I'm stoked in six

1:27:10

months. Like, I can't even imagine like

1:27:11

how good this loop is going to be. It's

1:27:12

going to be amazing.

1:27:15

>> Likewise. Totally. What's the next blog

1:27:18

coming?

1:27:18

>> Next blog. Um, okay. I'm supposed to

1:27:20

write one over this like weekend. Yeah.

1:27:22

Hopefully like next week. Yeah. Oh,

1:27:23

maybe like one thing that's cool I like

1:27:25

lang chain is like because we talked in

1:27:27

the beginning I actually think blogs are

1:27:30

like fantastic like artifacts like work

1:27:32

backwards from so it's like but your

1:27:34

team does a bunch of like amazing work

1:27:36

and like you should like totally share

1:27:37

that work so you can like kind of like

1:27:38

pick like a blog it's like I want to

1:27:40

write a blog about this and it's like

1:27:42

okay like what's all the work I have to

1:27:43

do to make sure that that blog doesn't

1:27:45

like suck basically. Yeah. Yeah.

1:27:47

>> That's great.

1:27:48

>> Yeah. Yeah. I think there's one I'm

1:27:50

thinking about a bunch which is like um

1:27:52

it's less like like agent engineering

1:27:54

stuff but more just like how much we've

1:27:56

like unbundled like agents. I think

1:27:58

there's been like a huge like unbundling

1:28:00

of agents uh into like pick a base

1:28:04

harness and like pick your skills like

1:28:05

pick your tools like um design your

1:28:08

agent like design the models and it's

1:28:10

not just like one monolithic system like

1:28:12

you totally don't have to get locked

1:28:13

into anything like you have the choice

1:28:15

to build like bespoke tooling for

1:28:17

yourself like for your company and like

1:28:19

the unbundling is awesome and like I

1:28:20

think like people are doing cool stuff

1:28:21

around that so hopefully like I'll like

1:28:23

riff on something about that or

1:28:24

something or just whatever I don't

1:28:29

Great. Okay. We um last question to you.

1:28:31

Um so imagine so um so the world is the

1:28:35

technology is changing

1:28:38

by an order of magnitude every week. we

1:28:40

all can see uh what advice would you

1:28:43

give to someone who is just starting out

1:28:45

of college who is someone 20 20 21 year

1:28:48

old because because things are not same

1:28:50

as it has been like I can say it's been

1:28:54

like like couple of years ago it's not

1:28:56

the same the world is changing so fast

1:28:58

and and it's sad to see that lot of

1:29:01

people are actually I mean don't even

1:29:04

care about what is really happening

1:29:05

right so even like if someone is

1:29:08

starting out college so what should they

1:29:10

really look forward to? I mean to to be

1:29:12

at frontier and to actually scale on

1:29:15

things to actually learn and be at good

1:29:17

places. So what's your opinion over

1:29:20

that?

1:29:22

>> I don't know how amazing advice I can

1:29:23

give on this honestly but like maybe

1:29:25

like some like general thoughts of like

1:29:28

what I was thinking when I was like

1:29:30

finishing like PhD and stuff and also

1:29:32

like there's like so many sick like kids

1:29:34

who are just like graduating undergrad

1:29:35

already like that I see on Twitter doing

1:29:37

great work. I think like there's there's

1:29:38

a couple like common threads which are

1:29:40

really cool which is basically like you

1:29:42

just sort of like pick something you're

1:29:44

like kind of interested in and you just

1:29:46

use like AI to help you learn that and

1:29:50

you just like kind of like rabbit hole

1:29:52

like really deep into that one thing.

1:29:54

And I think like that's probably like

1:29:57

really really useful because you can

1:29:58

kind of maybe use AI to become like top

1:30:02

maybe like 10% or like 5% of the world

1:30:04

if you like care enough and like the

1:30:06

problem is not like super crazy. And I

1:30:09

think like that's like really good. And

1:30:10

the other thing is like I think like

1:30:12

it's awesome when people just like post

1:30:14

their thoughts like online. And um I was

1:30:17

saying like it helped me like meet a lot

1:30:19

of like cool people. I see like awesome

1:30:22

like posts on X and like I love

1:30:23

interacting with them, but I think it's

1:30:25

basically just like it's kind of scary

1:30:27

to maybe like put your ideas like online

1:30:29

like dude I'm gonna get like roasted

1:30:30

like first I'm going to get roasted like

1:30:32

by my friends who are like dude why is

1:30:34

he posting so much on like Twitter about

1:30:35

like AI but it's totally fine like you

1:30:37

kind of like get over it but like it's

1:30:40

just like really good to like sort of

1:30:41

share your ideas because it helps you

1:30:43

like other people like challenge you and

1:30:44

then like you realize like oh like that

1:30:46

idea was dumb or maybe that idea was

1:30:47

like really good like resonates with

1:30:48

people and like the only way maybe for

1:30:50

like other people to like really help

1:30:52

you is if they like see your work or

1:30:54

they see your thoughts and then like I

1:30:55

think there's so many people who are

1:30:56

like willing to help. So just like maybe

1:30:58

like pick something just like grind on

1:31:00

it just like post about it basically.

1:31:01

And I feel like if you do that enough

1:31:03

times then something good will hopefully

1:31:06

happen or like you'll have learned

1:31:08

something which is also like really good

1:31:12

>> dude. Um this is so honest and I can

1:31:14

totally relate with both of your points

1:31:16

and basically this is something which I

1:31:17

have experienced again like because

1:31:19

there are so many trajectories so many

1:31:21

arenas opening as AI is evolving to

1:31:24

learn to to actually u make your make

1:31:27

you context aware about things I mean it

1:31:29

can be anything it can be posting side

1:31:31

of things pre-training inference

1:31:32

engineering environments data a lot I

1:31:35

mean you can't really keep it up about

1:31:37

things so again as you said use AI use

1:31:40

your knowledge sources and like read

1:31:42

good blogs, references, hack on,

1:31:44

experiment on and this is something and

1:31:46

that is the reason I mean even good

1:31:48

professors lot of colleges are not

1:31:50

actually wor about things. So I think

1:31:52

this is the best time to learn and

1:31:55

actually dig on things and and I think

1:31:58

there are wide arenas where one can

1:31:59

master one thing right because everyone

1:32:02

needs master of something and get into

1:32:04

places and let's I mean it can be

1:32:06

anything it can be even hiring as well

1:32:08

if you're really good at it so you can

1:32:09

make it to the places of course and as

1:32:11

you said about posting about stuff dude

1:32:14

I mean this is so underrated I mean if

1:32:16

you are really good poster if if you if

1:32:18

you can really uh kind of um convey your

1:32:22

thoughts well

1:32:24

amazing opportunities can open up and

1:32:26

this has been happening for me and I

1:32:27

have seen a lot of amazing people been

1:32:29

to places just by like I've I've

1:32:32

interviewed bunch of people I can give

1:32:34

example of kalome he's he's 19 he just

1:32:37

did ready to wait he went to meet

1:32:40

Shopify CEO then he got hired at prime

1:32:42

so I mean there's so many people who

1:32:44

have just gone to the same trajectory

1:32:46

just by posting their thoughts online

1:32:48

and it is and it is a fascinatingly

1:32:50

>> rewarding

1:32:51

It is actually rewarding. Totally. Um

1:32:55

awesome. I think um we at a wrap. So

1:33:00

thanks Viv. Uh for everyone listening,

1:33:02

deep agent is open source. Everything is

1:33:03

on GitHub and absolutely you read a

1:33:06

web's blog coming on um Twitter. It's

1:33:10

it's just amazing and that is something

1:33:12

which has led to this conversation. So I

1:33:14

hope more more and more of them coming

1:33:17

and follow him at um with Tan on

1:33:21

Twitter.

1:33:21

>> Yeah, dude. This was so fun. Oh, I had a

1:33:23

blast.

More transcripts

Explore other videos transcribed with YouTLDR.

Get the TLDR of any YouTube video

Transcribe, summarize, and repurpose videos in 125+ languages — free, no signup required.

Try YouTLDR Free