Full Transcript

From Chaos to Choreography: Multi-Agent Orchestration Patterns That Actually Work — Sandipan Bhaumik

26:30 · 4,072 words · ~20 min read · English · Transcribed Apr 19, 2026
AI Summary

Multi-agent systems are distributed systems in disguise; moving from one agent to five increases complexity 25x, requiring classic distributed systems patterns like orchestration, immutable state, and circuit breakers to survive production.

As AI development shifts from individual LLM prompts to complex 'agentic' workflows, the primary failure points shift from model quality to architectural race conditions and state management.

Section summaries

0:00-1:00

Introduction

optional

Speaker background and high-level agenda.

1:00-5:00

The Complexity Problem

watch

Essential war story about how race conditions kill agentic workflows.

5:00-11:00

Orchestration vs. Choreography

watch

The fundamental decision framework for multi-agent coordination.

11:00-16:00

State Management & Contracts

watch

Explains how to use immutable state to prevent data corruption.

16:00-21:00

Failure Recovery (Saga/Circuit Breakers)

watch

Detailed implementation logic for making systems resilient.

21:00-25:00

Production Architecture on Databricks

optional

Specific implementation details using Mosaic AI and Unity Catalog; skip if not using Databricks.

Key points

  • Orchestration vs. Choreography — Choreography uses event-driven message buses where agents are autonomous but hard to debug, whereas Orchestration uses a central coordinator (like LangGraph) to manage the execution graph and retries.
  • Immutable State Snapshots — Instead of shared mutable databases where agents overwrite each other, state should be an append-only log of immutable versions (v1, v2, v3) handed off between agents.
  • Circuit Breakers and Compensation Patterns — Circuit breakers prevent cascading failures by failing fast when an agent is down, while the Compensation (Saga) pattern provides a way to 'undo' previous steps if a later stage fails.
"This is no longer an AI problem. This is a distributed system problem." (Sandipan Bhaumik)
"The problem wasn't with the prompts. The problem was we built a distributed system without distributed system thinking." (Sandipan Bhaumik)

AI-generated from the transcript. May contain errors.

0:00

Hi everyone, I'm Sandy. I have spent 18

0:02

years building data systems, a major

0:04

part of it focusing on building and

0:06

scaling distributed data systems in the

0:08

cloud. I've done it for multi-tenant

0:11

systems for software and SaaS companies,

0:13

and then for scaling data and AI

0:15

platforms in regulated industries like

0:17

financial services and healthcare. I've

0:19

learned a great deal about production

0:21

grade distributed systems while I have

0:23

been working at AWS and now at

0:25

Databricks. For the last 2 years, I've

0:27

been deploying multi-agent AI systems in

0:29

production. And I have watched brilliant

0:32

engineers make the same mistakes over

0:35

and over. They think adding more agents

0:38

is just like adding more features. It's

0:40

not. It's building a distributed system.

0:43

And today, I'm going to show you the

0:44

patterns that actually work when you

0:47

make the transition. These are lessons

0:49

that I have learned working in the

0:51

trenches, and today I'm here to share it

0:53

with you. Here's what we're covering

0:54

today. First, the problem. I'll share

0:57

you a very basic production war story

1:00

about race conditions and why complexity

1:03

explodes when you go from one agent to

1:05

five agents.

1:07

Um I'll I'll talk about the patterns,

1:09

choreography and orchestration patterns

1:11

for coordination of agents. I'll talk

1:13

about state management, uh talk about

1:16

failure recovery and how we can um

1:19

design for failure in production

1:21

systems. And then I'll I'll share how a

1:23

production grade architecture will look

1:24

like in as simple a way as possible. And

1:28

I'll also show you an example on how we

1:30

build this on Databricks. So, let's dive

1:32

into it. You see, one agent works

1:34

beautifully. You have got your LLM, some

1:37

prompts, maybe a retrieval augmented

1:40

generation pipeline, maybe some tool

1:43

calls. It demos great. Leadership loves

1:45

it. You feel happy and your team is

1:48

happy. And then, product comes back with

1:51

a request that changes everything. They

1:54

want five more agents. And here's what

1:56

happens. You think, "Okay, I know how to

1:58

build agents, and I will add five more."

2:01

Except, now you have coordination

2:03

problems. Agent A produces data that

2:07

Agent B needs. Agent C is waiting on

2:09

both Agent A and Agent B. Agent D just

2:12

updated the shared state that Agent B

2:14

was reading, and Agent E just crashed

2:18

and took down the entire workflow.

2:20

This is no longer an AI problem. This is

2:24

a distributed system problem. And most

2:26

of you didn't sign up to be distributed

2:29

systems engineers. Let me tell you about

2:31

a production deployment where this went

2:33

very wrong. We built a credit

2:35

decisioning system for a financial

2:36

services company. The first agent,

2:38

credit score calculation, worked

2:40

perfectly. It worked great in demos, 2

2:42

weeks in production, zero issues. Then

2:45

we added four more agents, income

2:46

verification, risk assessment, fraud

2:49

detection, and final approval.

2:51

Uh we deployed all five. In 3 days'

2:54

time, we started seeing weird approvals.

2:57

Uh 20% of the decisions had incorrect

3:00

risk ratings. Customers who should have

3:02

been flagged were getting approved. The

3:04

business team was panicking. It took us

3:06

2 days to find out what was happening.

3:09

Credit score agent calculated a score of

3:11

750 and wrote to the database. The risk

3:14

assessment agent, on the other hand,

3:16

read from the database 500 milliseconds

3:19

later and got a score of 680 for the

3:22

same customer. Why did it happen?

3:25

Because we had a caching layer for

3:26

customer records. The write to

3:28

PostgreSQL succeeded, but the cache

3:31

was not invalidated. The risk agent read

3:35

from the cache, and it got stale data.

3:40

It used the wrong score and made the

3:43

wrong decision. This is a classic

3:45

distributed systems problem. We had

3:47

a caching layer between the agents and the

3:50

database. Cache invalidation failed, and

3:53

the agent was reading stale values. The

3:56

race condition wasn't in the database,

3:58

it was in the architecture. Multiple

4:01

agents, shared cache, no coordination on

4:04

cache invalidation. This took us quite a

4:07

while to find the pattern. It created

4:09

delays in delivery and led to wrong

4:13

decisions. And here's the lesson we

4:15

learned. The problem was, of course, not

4:17

with the model. The problem wasn't with

4:19

the prompts. The problem was we built a

4:22

distributed system without distributed

4:24

system thinking. And that's what kills

4:27

multi-agent projects, not bad AI, but

4:30

bad architecture. Now, I will show you

4:33

the architecture that works. We will

4:34

also look into a production grade

4:36

architecture. But first, let's

4:39

understand why this complexity explodes

4:42

so quickly. Now, when you move from a

4:44

one agent system to a multi-agent, let's

4:46

say five agent systems, it doesn't get

4:49

just five times harder. It gets 25 times

4:52

more complex. Coordination complexity

4:55

grows exponentially. One agent has got

4:57

zero coordination problems. Two agents

4:59

have got at least one connection. Five

5:02

agents have got at least 10 potential

5:04

connections and coordination. Each

5:06

connection is a failure point, a race

5:09

condition, a state synchronization

5:11

problem. You are not just building five

5:13

agents, you are building a coordination

5:16

problem across multiple relationships

5:19

and with the possibility of having

5:22

multiple failure modes. And that's why

5:25

the complexity increases very, very

5:27

quickly. Now, I'm going to show you two

5:29

critical patterns. First pattern is

5:32

about how to coordinate multiple agents.

5:35

Then we will talk about how you can

5:37

manage state. And then we'll talk about

5:38

how you can recover and design for

5:40

failure. Now, these patterns come from

5:42

multiple years of distributed systems

5:44

work, and I can directly apply them on

5:46

multi-agent AI system. Once you get the

5:48

basics, it's really hard to miss these

5:51

patterns when you build multi-agent AI

5:53

architecture. The first decision you

5:55

need to make is about choreography or

5:58

orchestration. These are the two

5:59

fundamental patterns for distributed

6:01

coordination. Choreography means agents

6:04

coordinate through events. They are

6:06

decentralized, they are autonomous.

6:08

Orchestration means a central

6:10

coordinator manages the workflow. This

6:13

is centralized and controlled. Most

6:15

teams pick one instinctively and regret

6:18

it. Let me show you when to use each.

6:21

Let's start with choreography.

6:23

Choreography is event-driven.

6:25

Um the research agent finishes uh

6:28

research and publishes a research

6:30

completed event to a message bus. Agent

6:33

B subscribes to that message bus and

6:36

listens for the event type it is

6:38

interested in. The analysis agent

6:40

subscribes to that event type, picks it

6:43

up, does analysis, and publishes

6:45

analysis ready. Then the report agent

6:48

picks that analysis ready event,

6:50

generates the report. There is no

6:52

central coordinator here. Each agent is

6:55

autonomous, listening for events it

6:57

cares about, publishing when it is done.

7:00

This is the beauty of choreography.

7:02

Agents are loosely coupled.

7:04

It's easy to add new agents and make

7:07

them subscribe to the events that

7:08

they're interested in. This drives high

7:11

autonomy and scales really well.
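
The choreography flow described here can be sketched in a few lines. This is only an illustration, assuming a toy in-memory `EventBus` class (hypothetical, standing in for a real message bus) and snake_case versions of the event names used in the talk.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Toy in-memory bus; a real system would use a durable message broker."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[str] = []  # every published event type, kept for tracing

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.log.append(event_type)
        for handler in list(self._subscribers[event_type]):
            handler(payload)


bus = EventBus()
reports: list[str] = []


def analysis_agent(event: dict) -> None:
    # Reacts to research results and announces its own completion.
    bus.publish("analysis_ready", {"summary": event["findings"].upper()})


def report_agent(event: dict) -> None:
    reports.append(f"Report: {event['summary']}")


bus.subscribe("research_completed", analysis_agent)
bus.subscribe("analysis_ready", report_agent)

# The research agent finishes and publishes; no agent calls another directly.
bus.publish("research_completed", {"findings": "rates are rising"})
```

Note that the only record of what happened is `bus.log`. Without a trace like that, nothing in the code tells you which agent ran or in what order.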

7:13

However, the nightmare of choreography

7:15

is debugging. When something fails,

7:17

you're playing detective with no real

7:20

clue. Which agent failed to publish? Did

7:23

the event get consumed? Did the event

7:25

get consumed twice? You need bulletproof

7:28

observability to make choreography work.

7:31

Even with the event propagation, you

7:33

need strong uh guarantees across

7:36

delivery of these events. Without it,

7:39

debugging is really hard. So, when

7:40

should you use choreography? You use

7:43

choreography when your workflow is

7:45

naturally event-driven, when agents need

7:47

to operate independently, when you are

7:50

adding agents frequently and don't want

7:52

to update a central coordinator. But it

7:55

is important to understand

7:57

it is possible only if you have strong

8:00

observability. If you can't trace events

8:03

through your system, choreography will

8:04

destroy you. I have seen teams choose

8:06

choreography because it feels more

8:09

agentic, more autonomous. Then they

8:11

spend months firefighting because they

8:14

can't debug distributed event flows.

8:16

Don't make that mistake. Now, let's look

8:18

at the alternative, orchestration.

8:21

Orchestration is centralized. You have a

8:23

workflow orchestrator that calls each

8:25

agent directly. Agent A runs first. The

8:28

orchestrator calls Agent A, waits for

8:31

the result, gets the result back. Then

8:34

the orchestrator calls Agent B and C in

8:37

parallel if they are agents that need to

8:38

run in parallel. The orchestrator

8:40

manages the parallelism, not the agents.

8:43

B and C return their results to the

8:44

orchestrator. Then the orchestrator

8:46

calls Agent D with the combined results

8:49

from B and C. Every call goes through

8:51

the orchestrator. Agents never call each

8:53

other. The orchestrator is the single

8:55

source of truth. It knows the entire

8:58

execution graph. It manages state. It

9:01

handles retries. It logs every step.

9:04

Agents are dumb. They just take the

9:07

input, they do the work, they return the

9:09

output. The orchestrator does all the

9:12

smart coordination. In Databricks, one

9:14

way to implement this pattern would be

9:16

with LangGraph wired into Mosaic AI Agent

9:19

Framework as the orchestrator. But any

9:22

workflow that gives you

9:24

DAGs, directed acyclic graphs, and

9:27

proper retry mechanisms would fit in

9:30

this kind of orchestrator pattern. You

9:32

use orchestration when you have complex

9:34

dependencies that need central

9:37

management, when you need to roll back,

9:39

compensate for failures, when you want

9:41

one dashboard showing the entire system

9:44

state, when your workflow is relatively

9:46

stable. In financial services, for

9:48

example, we use orchestration almost

9:51

exclusively. Why? Because it provides

9:54

easy debugging and the ability to roll

9:56

back, and that matters more than

9:58

autonomy in these kind of industries.
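
The orchestrated flow described earlier (A first, then B and C in parallel, then D with the combined results) might look roughly like this. It is a sketch using plain `asyncio` rather than LangGraph, and the agent functions and their payloads are hypothetical placeholders.

```python
import asyncio


async def agent_a(request: str) -> dict:
    return {"a": f"researched {request}"}


async def agent_b(state: dict) -> dict:
    return {"b": f"scored {state['a']}"}


async def agent_c(state: dict) -> dict:
    return {"c": f"checked {state['a']}"}


async def agent_d(state: dict) -> dict:
    return {"d": f"decision from ({state['b']}) and ({state['c']})"}


async def orchestrate(request: str) -> tuple[dict, list[str]]:
    """Central coordinator: owns ordering, parallelism, and the step log."""
    steps: list[str] = []
    result_a = await agent_a(request)
    steps.append("A")
    # B and C depend only on A, so the orchestrator fans them out together.
    result_b, result_c = await asyncio.gather(agent_b(result_a), agent_c(result_a))
    steps.append("B+C")
    # D receives the combined results; agents never call each other.
    result_d = await agent_d({**result_b, **result_c})
    steps.append("D")
    return result_d, steps


final, steps = asyncio.run(orchestrate("loan 42"))
```

Because every call passes through `orchestrate`, the `steps` list is a complete record of which agent ran and in what order.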

10:01

When something goes wrong with a credit

10:03

decision, for example, we need to know

10:05

exactly which agent made that call, in

10:08

what order, and with what data.

10:10

Orchestration gives us that.

10:12

Choreography doesn't. So, how do you

10:14

choose? Here's your decision framework.

10:16

Two axes.

10:18

Workflow complexity, simple to complex.

10:21

Autonomy requirements, low to high.

10:24

Simple workflow, high autonomy, you go

10:26

with choreography. You need complex

10:28

workflow with low autonomy tolerance,

10:31

you go with orchestration. The

10:32

interesting quadrant is the top right,

10:35

where you need complex workflow, but

10:36

agents need autonomy. This is where you

10:39

use hybrid patterns. Choreography with

10:42

saga patterns for compensation. I'll

10:44

talk about this pattern later in this uh

10:47

session as well.

10:49

Uh tools like Agent Bricks on Databricks

10:52

are starting to package these

10:54

orchestration patterns for common

10:56

multi-agent use cases. So, you don't

10:59

need to rebuild them every time. It

11:01

makes

11:02

building these patterns really easy in

11:04

production environments. Now, I use the

11:06

decision matrix every time to make

11:09

decisions with customers based on their

11:10

use cases. Uh

11:12

it's worth taking a screenshot. I'm

11:14

sure you'll reference it. Let me show

11:16

you what a production orchestration

11:18

actually looks like at the tail end of

11:20

the session. All right. Now, we have

11:21

chosen a coordination pattern.

11:24

Now, let's talk about the thing that

11:25

actually breaks when you scale. State. How do

11:29

agents share data without race

11:31

conditions? Without stale reads? Without

11:34

mystery bugs? Here's what most people do

11:36

first, and it's wrong. Shared mutable

11:39

state. Multiple agents writing to the

11:42

same database records at the same time.

11:44

Agent A reads credit score, calculates

11:47

the value, writes it back. Agent B does

11:49

the same thing at the same time. Both

11:51

read 680. Agent A writes

11:55

750. Agent B writes 720. Last write

12:00

wins. Agent A's update disappears. Lost

12:03

update. Uh I understand, yes, modern

12:06

databases have protections in place, row

12:08

locks, isolation levels, etc. But, you

12:11

have to use them correctly. Explicit

12:13

transactions um you have to build uh

12:16

serializable isolation. Uh you have to

12:19

make sure that you use SELECT FOR UPDATE. Uh

12:22

and and many teams don't.
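
The lost update described above can be reproduced without a database. This sketch assumes a hypothetical in-memory `Record` with a version counter. The version check is an optimistic-concurrency stand-in for what row locks or serializable isolation would give you in a real database, not the approach used in the talk.

```python
from dataclasses import dataclass


@dataclass
class Record:
    score: int
    version: int = 0


def blind_write(record: Record, new_score: int) -> None:
    # No check at all: whoever writes last silently wins.
    record.score = new_score
    record.version += 1


def checked_write(record: Record, read_version: int, new_score: int) -> bool:
    # Optimistic concurrency: reject the write if someone wrote since we read.
    if record.version != read_version:
        return False
    record.score = new_score
    record.version += 1
    return True


# Unsafe: both agents read 680, then both write.
unsafe = Record(score=680)
blind_write(unsafe, 750)  # Agent A
blind_write(unsafe, 720)  # Agent B overwrites Agent A's update

# Safer: both agents remember the version they read.
safe = Record(score=680)
seen_by_a = seen_by_b = safe.version
a_ok = checked_write(safe, seen_by_a, 750)
b_ok = checked_write(safe, seen_by_b, 720)  # the stale read is detected
```

In the unsafe case Agent A's 750 disappears without any error. In the checked case Agent B's write is refused and it has to re-read before retrying.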

12:24

Uh they use default isolation. They

12:27

don't use explicit locks, and they ship

12:30

race conditions to production. We did it.

12:32

We made that mistake, and that resulted

12:34

in delayed value to the business. We

12:36

just assumed that the database would

12:37

handle these conditions, but it doesn't.

12:39

When it gets really complex, you have to

12:42

handle them explicitly in the code. Now,

12:44

here's what works. Immutable state

12:46

snapshots with versioning. Agent A

12:48

produces a state version, let's say

12:51

version one. It's sealed. It's

12:53

immutable. Nobody can modify it. State

12:56

is stored in the orchestrator database

12:58

as an append-only log. These are insert

13:01

operations, not updates. Agent A

13:04

hands state version one to agent B.

13:07

Agent B validates the schema, checks

13:09

that the data contract matches with its

13:11

expectations. It processes it, produces

13:14

state version two. Also immutable. Agent

13:17

B inserts version two as the new row. It

13:19

doesn't update version one. And then

13:22

hands it to agent C. Same thing. Schema

13:24

validation, version tracking,

13:26

immutability guarantee at each handoff.

13:29

Agent C fails. Now, if agent C fails,

13:32

you roll back to version two. If you

13:34

need to debug, you replay state

13:36

evolution

13:38

uh from version one through version N.

13:40

You can see exactly what each agent

13:43

received and produced. This eliminates

13:45

race conditions. No concurrent

13:48

modification to the same record. Each

13:50

agent appends a new version instead of

13:53

updating the shared state. Now, of

13:56

course, if you want to

13:58

uh save these state snapshots, they can

14:00

be logged

14:01

uh in any sort of append-only storage

14:04

for audit replay, but they are never

14:06

shared for read or write. Now, here's

14:08

what it looks like in code. The AgentState

14:10

class is frozen, which means immutable in

14:12

Python. It has a version number, the

14:14

data payload, and who created it. The

14:17

handoff function does three things.

14:19

First, it validates the schema.

14:21

Uh this is the contract enforcement. We

14:24

are checking that agent A's output

14:26

matches agent B's input contract. This

14:29

is critical, and we will come back to

14:31

this. Second, increment version. Create

14:34

a new immutable state object with

14:37

version N plus one. Third, execute the

14:40

next agent with that immutable state.

14:43

The agent can't modify the input state.

14:45

It can only produce a new state. This

14:48

prevents an entire class of bugs. It

14:51

prevents race conditions on shared

14:54

state. No stale reads. It provides a

14:56

clear lineage. Every state has a

14:58

version, and you know who has created

15:00

it. When something goes wrong, you can

15:02

trace back through state evolution.

15:04

Version seven produced bad output, look

15:07

into version six that went into the

15:08

agent. Look at version five before that.

15:11

You can binary search through your state

15:14

history to find where things went wrong.
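
The slide code is not in the transcript, so what follows is a reconstruction under assumptions: a frozen dataclass named `AgentState` with the three fields mentioned (version, data payload, creator), and a `handoff` helper that validates, runs the next agent on the read-only input, and seals the result as version N plus one. The key-based validation and the `risk_agent` function are hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass(frozen=True)
class AgentState:
    version: int
    data: Mapping[str, object]  # wrapped in MappingProxyType so it is read-only
    created_by: str


def handoff(
    state: AgentState,
    required_keys: set[str],
    agent_name: str,
    agent: Callable[[AgentState], dict],
) -> AgentState:
    # 1. Contract check: the incoming state must carry what the agent needs.
    missing = required_keys - set(state.data)
    if missing:
        raise ValueError(f"{agent_name} rejected v{state.version}: missing {sorted(missing)}")
    # 2. Run the agent on the immutable input; it can only return new data.
    new_data = agent(state)
    # 3. Seal the result as the next version instead of mutating the old one.
    return AgentState(state.version + 1, MappingProxyType(dict(new_data)), agent_name)


def risk_agent(state: AgentState) -> dict:
    score = state.data["credit_score"]
    return {**state.data, "risk": "low" if score >= 700 else "high"}


log: list[AgentState] = []  # append-only history: inserts, never updates
v1 = AgentState(1, MappingProxyType({"credit_score": 750}), "credit_agent")
log.append(v1)
v2 = handoff(v1, {"credit_score"}, "risk_agent", risk_agent)
log.append(v2)
```

Attempting `v1.version = 2` raises `dataclasses.FrozenInstanceError`, and `log` keeps every version, so a bad output at version N can be traced back through N-1, N-2, and so on.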

15:17

And this becomes really, really

15:18

powerful. Now, state management is half

15:20

the battle. Data contracts are the other

15:23

half. Agent A can't just throw

15:26

arbitrary data at agent B and hope it

15:29

works. This doesn't work that way. They

15:31

need a contract in place. In this

15:33

example, research agent promises to

15:37

output findings, confidence score,

15:39

sources, timestamp, etc. Analysis agent

15:42

declares it requires research agent

15:45

output with type enforced.

15:47

Uh and it validates. If confidence is

15:50

below 0.7, it will reject the handoff.

15:55

This is the contract. If the research

15:57

agent tries to

15:59

hand off low-quality data, the contract

16:03

catches it at the boundary. You find out

16:05

immediately, not three agents downstream

16:08

when it produces a report in garbage.
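
A boundary check of this kind could be sketched as follows. The field names follow the ones listed in the talk and the 0.7 threshold is the one quoted. The `ResearchOutput` and `ContractViolation` names and the exact field types are assumptions.

```python
from dataclasses import dataclass


class ContractViolation(Exception):
    """Raised at the handoff boundary when the producer breaks the contract."""


@dataclass(frozen=True)
class ResearchOutput:
    findings: str
    confidence_score: float
    sources: tuple[str, ...]
    timestamp: str


MIN_CONFIDENCE = 0.7  # threshold quoted in the talk


def accept_handoff(output: object) -> ResearchOutput:
    """Check run by the analysis agent before it does any work."""
    if not isinstance(output, ResearchOutput):
        raise ContractViolation(f"expected ResearchOutput, got {type(output).__name__}")
    if output.confidence_score < MIN_CONFIDENCE:
        raise ContractViolation(
            f"confidence {output.confidence_score} is below {MIN_CONFIDENCE}"
        )
    return output


good = ResearchOutput("rates are rising", 0.91, ("central bank report",), "2026-01-01T00:00:00Z")
weak = ResearchOutput("rates might fall", 0.40, ("forum post",), "2026-01-01T00:00:00Z")

accepted = accept_handoff(good)
try:
    accept_handoff(weak)
    rejected = False
except ContractViolation:
    rejected = True
```

The low-confidence handoff fails at the boundary between the two agents, which is the point of the pattern: the error surfaces where the bad data was produced.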

16:10

When we work with our customers um

16:13

using Databricks, one way of doing it is

16:16

uh registering these input-output

16:17

schemas in Unity Catalog. Uh so, every

16:20

agent's contract is versioned and

16:22

governed in one place. All right. We

16:24

talked about coordination patterns. We

16:26

talked about state management. Now,

16:28

let's talk about another

16:30

thing that you need to keep in mind, and

16:32

that's failure and recovery. And the

16:34

reason this is important is because

16:36

agents will fail. That's inevitable. The

16:38

LLM will time out. The API will rate

16:41

limit you. The agent will crash

16:43

mid-workflow. What happens then? What

16:45

happens then is what you need to plan

16:47

for and design in the system. Let's talk

16:49

about a few patterns. Let's talk about

16:51

the first pattern, which is a

16:52

circuit breaker pattern, and this comes

16:55

straight from distributed systems. When

16:57

agent A calls agent B, it wraps that

17:00

call in a circuit breaker. If agent B

17:02

fails repeatedly, say five times in a

17:05

row, the circuit breaker opens. Now,

17:07

instead of waiting for a timeout every

17:09

single time, you basically fail fast.

17:12

Circuit open, agent B is down, you just

17:15

try again later. You are not bombarding

17:17

agent B with requests. You're protecting

17:20

your system. After a timeout

17:22

period, let's say 60 seconds, the

17:24

circuit goes half open. Then you test

17:27

agent B again with one request. If it

17:29

succeeds, the circuit closes, and

17:32

normal operation resumes. If it fails,

17:34

the circuit opens again, and it resets

17:37

the timer. This prevents you from

17:39

cascading failures into the system.
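
The closed, open, and half-open states just described can be sketched as a small class. The threshold of five failures and the 60 second timeout are the values from the talk. The class and method names are hypothetical, and the injectable `clock` parameter exists only so the behavior can be exercised without waiting.

```python
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """Raised instead of calling the agent while the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast: circuit is open")
            self.state = "half_open"  # timeout elapsed: allow one trial request
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()  # (re)start the timer
            raise
        self.failures = 0
        self.state = "closed"
        return result


now = [0.0]  # fake clock so the example does not have to sleep
breaker = CircuitBreaker(failure_threshold=5, reset_timeout=60.0, clock=lambda: now[0])


def flaky_agent() -> str:
    raise TimeoutError("agent B timed out")


for _ in range(5):
    try:
        breaker.call(flaky_agent)
    except TimeoutError:
        pass
state_after_failures = breaker.state
```

After the fifth consecutive failure the breaker is open, so further calls raise `CircuitOpenError` immediately instead of waiting on a timeout. Once the fake clock moves past 60 seconds, the next call is let through as the half-open trial.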

17:42

One agent going down doesn't bring your

17:45

entire workflow down. You gracefully

17:48

degrade. Maybe you skip that agent and

17:51

continue with a reduced functionality.

17:54

Uh maybe you use cached results. Maybe

17:57

you alert a human. But, you don't crash

17:59

the entire workflow. Circuit breakers

18:02

are the single most important failure

18:06

recovery pattern for multi-agent

18:08

systems. Every agent call should be

18:10

wrapped with a circuit breaker.

18:11

We enforce these circuit breaker

18:13

policies at the serving layer on

18:15

Databricks through model serving or

18:16

through AI Gateway. Here's how it looks

18:18

like in code. You track the failure

18:20

count, and you track the state. When you

18:22

call an agent, you check the state

18:24

first. If it is open, you fail fast. You

18:27

don't even try. If it is closed, you

18:29

make the call. If the call succeeds, you

18:31

reset the failure count and stay closed.

18:34

If it fails, you increment the failure

18:36

count. If you hit the threshold, you

18:38

open the circuit. After the timeout

18:40

period, you transition to half open. You

18:43

test one request. If it succeeds, you

18:45

close the circuit. If it fails, you open

18:47

it again. This is a simple pattern, but

18:50

it has got a massive impact. And in

18:52

Databricks, you can log every

18:54

open-closed transition in MLflow, so you

18:57

can see when an agent started flaking

19:00

out. Now, let's talk about another

19:02

pattern. We call it the compensation

19:04

pattern. Also called saga pattern. Every

19:07

agent has two methods, execute and

19:10

compensate. Execute does the work.

19:12

Compensate rolls it back, undoes it. The

19:15

orchestrator tracks which

19:17

agents have executed. If the execution

19:20

agent fails, the orchestrator walks

19:23

backward through the executed

19:25

agents.

19:27

And it calls compensate for each one.

19:30

Analysis agent compensates, it deletes

19:33

the draft recommendation from the system

19:34

that it has written originally. And then

19:37

the research agent compensates by

19:38

clearing the cached research data that

19:41

it gathered previously. So, you're back

19:43

to the initial state. No partial

19:45

transactions. No stuck workflows. This

19:48

is a simple rollback pattern that you

19:50

can implement in multi-agent system.
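
A minimal sketch of that rollback, assuming each agent exposes `execute` and `compensate` as described. The `Agent` class, the shared `journal` list, and the `run_saga` function are hypothetical names used for illustration.

```python
class Agent:
    def __init__(self, name: str, journal: list[str], fail: bool = False) -> None:
        self.name = name
        self.journal = journal
        self.fail = fail

    def execute(self) -> None:
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        self.journal.append(f"{self.name}:execute")

    def compensate(self) -> None:
        # Undo whatever execute() did (delete a draft, clear cached data, ...).
        self.journal.append(f"{self.name}:compensate")


def run_saga(agents: list[Agent]) -> bool:
    """Run agents in order; on failure, compensate completed ones in reverse."""
    executed: list[Agent] = []
    for agent in agents:
        try:
            agent.execute()
        except Exception:
            for done in reversed(executed):
                done.compensate()
            return False
        executed.append(agent)
    return True


journal: list[str] = []
ok = run_saga([
    Agent("research", journal),
    Agent("analysis", journal),
    Agent("execution", journal, fail=True),
])
```

Here the third agent fails, so `analysis` is compensated first and `research` second, mirroring the reverse walk described above. A production version would also need to handle a `compensate` call that itself fails, which this sketch ignores.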

19:52

Compensation gives distributed agents.

19:55

It is not sexy, but it's how production

19:57

systems handle partial failures. Every

19:59

orchestrated workflow needs this kind of

20:02

compensation pattern, and you need to

20:04

plan for it depending on what you're

20:05

doing with your workflows. Here's how

20:07

compensation looks in code. Every agent,

20:10

as I mentioned earlier, has got two

20:12

methods, the execution method and the

20:15

compensate method. The execution does

20:17

the work, the compensate undoes it. Uh

20:20

that's the contract. Every operation

20:23

must be reversible. The orchestrator

20:25

tracks which agents

20:29

have run successfully, and then it keeps

20:31

a list. Agent A executes, gets added.

20:34

Agent B executes, gets added. Agent C

20:37

fails, now we walk backward through the

20:39

list in reverse order. Agent B

20:41

compensates first, it undoes the work

20:44

that it has done. Agent A compensates

20:46

next, it undoes the work that Agent A

20:48

has done, and it goes back to the

20:50

initial state. This is the saga pattern from

20:52

distributed databases. Financial

20:54

services require this. Now that we have

20:57

covered these different patterns, I

20:59

wanted to show you what a production

21:00

architecture would look like when you

21:01

bring these things together. You've got

21:03

the orchestrator at the left-hand side.

21:05

Um

21:07

it's the brain of the workflow. It

21:09

contains the workflow engine, it

21:11

contains the state store uh holding

21:14

versions zero through n, and it

21:16

can look into the

21:18

observability layer. It handles the

21:20

observability data. Every call goes

21:22

through the orchestrator. Orchestrator

21:24

calls Agent A. Agent A returns

21:27

state version one to the orchestrator.

21:29

Orchestrator then calls Agent B and C in

21:32

parallel if they need to run in

21:33

parallel.

21:34

Both receive state version one from

21:37

the orchestrator. They return results.

21:40

Orchestrator stores them as versions two

21:42

and three. Finally, orchestrator calls D

21:44

with these combined results. Agents

21:46

never call each other. All coordination

21:48

happens through the orchestrator.

21:50

And this is what gives us control,

21:53

observability, capability to roll back.

21:56

This runs 24/7 across billions of

21:58

transactions because the orchestrator is

22:01

the single source of truth. All right,

22:03

here's a production architecture that

22:06

you could implement with the Databricks

22:08

Data Intelligence Platform.

22:10

In the orchestration layer, you can have

22:12

LangGraph wired into Mosaic AI Agent

22:15

Framework. It handles multi-agent

22:17

orchestration. It manages the workflow

22:19

graph and knows which agents to call in

22:21

what order. Each agent is implemented as

22:24

a Unity Catalog function. It could be

22:27

written in SQL or Python, or it could be

22:29

a model registered in Unity Catalog.

22:32

When you register these

22:35

assets in Unity Catalog, they are

22:38

discoverable centrally within the

22:40

organization. Uh they can be governed in

22:42

one place, and they can be versioned,

22:44

which is really critical uh in terms of

22:46

operating these uh workflows in

22:49

production. We expose these agents

22:51

through Databricks Model Serving or

22:53

Function Serving, and that's where we

22:55

enforce these circuit breaker style

22:57

policies like retries or timeouts or

23:00

rate limits uh at the serving layer,

23:02

typically via AI Gateway configuration.

23:05

Now when we talk about the data layer,

23:06

Delta Lake stores everything. It not

23:09

only stores the state versions from the

23:12

agent, it also stores customer data and,

23:15

you know, all the data that you

23:18

need for your workflows to work.

23:21

Um

23:23

Talking about the state snapshots,

23:25

Delta table

23:27

uh is immutable and versioned. For us,

23:30

those state versions are just rows in a

23:32

Delta table. Uh we never update them in

23:35

place. Each agent run is tied to a state

23:38

version via MLflow Traces, so we can

23:40

step through the evolution when

23:42

something breaks. Now, uh I just wanted

23:45

to touch upon Unity Catalog. It

23:48

governs everything: access control,

23:50

lineage, audit trail for both data and

23:53

agents. MLflow gives us per agent

23:56

tracing, evaluation capabilities with

23:58

out-of-the-box LLM judges,

24:03

and metrics on every call. And as I

24:05

mentioned earlier, um tools like Agent

24:08

Bricks

24:09

are the higher-level way of Databricks

24:12

packaging these orchestration patterns

24:14

for common multi-agent use cases, so you

24:17

don't need to rebuild them every time.

24:19

So just to wrap up this workflow, I see

24:22

the LangGraph orchestrator calls Agent

24:24

A, a Unity Catalog function or model. It

24:27

gets a result, writes version one state

24:30

to Delta. It then calls Agent B with

24:34

state version one, writes version two,

24:36

and so on.

24:37

MLflow traces every call, latency,

24:40

inputs, outputs, token usage. A circuit

24:43

breaker at the serving layer guards each

24:45

call. If Agent C fails, LangGraph

24:48

triggers compensation logic and walks

24:51

backward, calling the compensate

24:53

functions for previous successful steps.

24:55

These kind of patterns run in production

24:57

day in and day out. So thank you for

24:59

hearing me out. You can reach out to me

25:01

over LinkedIn. You can scan this QR code

25:03

that will take you directly to my

25:05

LinkedIn profile.

25:07

Uh

25:07

I would like to leave you with

25:09

three final thoughts. First of all,

25:11

agent chaos is inevitable. When you

25:14

scale past one agent, you will

25:18

hit coordination problems, race

25:20

conditions, cascading failures. That's

25:22

guaranteed. The complexity curve doesn't

25:25

lie. Your agent choreography is a

25:27

choice. You can build systems with

25:30

proper patterns, orchestration,

25:32

choreography, immutable state, circuit

25:35

breakers, compensation patterns, data

25:37

contracts. Make sure you understand

25:40

these patterns and bring them to your

25:42

production architecture. Doing so will

25:44

help you build systems, not demos. Demos

25:47

are easy. You use an LLM to show

25:50

something cool. Everyone can do it.

25:52

These things don't work in production.

25:54

In production, you have to build

25:56

systems, and systems are hard. Systems

25:59

are what create value for businesses.

26:02

Everything I showed you today,

26:03

choreography versus orchestration,

26:05

immutable state, circuit breakers, these

26:07

are all unsexy infrastructure work. You

26:10

won't get applause for implementing a

26:12

circuit breaker, but you make your

26:14

systems more reliable. They don't fail

26:16

at 2:00 a.m. That is what

26:18

people notice over time. Be a systems

26:20

engineer. The patterns here, they work.

26:23

Apply these patterns in your production

26:25

architecture. Thank you very much for

26:27

watching. Bye.
