Multi-Agent Orchestration

Multi-agent systems are, at their core, a bet on decomposition: split a hard task into narrower subtasks, assign each to a specialised agent, and coordinate the results. Whether that coordination is worth the overhead depends almost entirely on how you orchestrate the pieces. This session introduces two fundamentally different approaches to that coordination, explains why we focus on one of them for practical work with local models, and then builds a working constitutional prompt-advisor pipeline step by step using Google ADK.

Two kinds of orchestration

Before writing a single line of code it helps to name the landscape. When people talk about “multi-agent systems” they usually mean one of two very different things, and the choice between them has large practical consequences.[1]

[1] The terms LLM pipeline, multi-step pipeline, and agentic pipeline are all used in the field for multi-step LLM workflows; the distinction between deterministic and LLM-driven routing is rarely explicit in the naming.

| | LLM pipelines with deterministic routing | Agent-driven orchestration |
| --- | --- | --- |
| Who decides routing? | Your Python code | An LLM at runtime |
| Predictability | Generally more predictable | Less predictable |
| Traceability | Routing is explicit and inspectable | LLM decisions are implicit |
| Feasible with small models? | More feasible per node | Needs a capable orchestrator |
| Cost | Generally lower | Generally higher |

Both approaches can involve the same set of agents and components. The distinction is what controls which one runs next, not the agents themselves.
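To make the first column concrete, here is a minimal sketch in plain Python, with no ADK involved. The stub agents, the punctuation rule in `pick_route`, and all names are hypothetical stand-ins; the only point is that the routing decision is ordinary code that can be read, logged, and unit-tested.

```python
from typing import Callable, Dict

# Hypothetical stub "agents": in a real pipeline each would wrap an LLM call.
def summary_agent(text: str) -> str:
    return f"[summary] {text[:40]}"

def search_agent(text: str) -> str:
    return f"[search results for] {text}"

AGENTS: Dict[str, Callable[[str], str]] = {
    "summarise": summary_agent,
    "search": search_agent,
}

def pick_route(task: str) -> str:
    """Deterministic routing: a plain rule decides which agent runs next."""
    return "search" if task.rstrip().endswith("?") else "summarise"

def run(task: str) -> str:
    route = pick_route(task)  # inspectable: the decision is a normal return value
    return AGENTS[route](task)

print(run("What is Google ADK?"))
print(run("ADK is a toolkit for building agents."))
```

In agent-driven orchestration, the body of `pick_route` would itself be an LLM call, so the chosen route is only known at runtime.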

Agent-driven orchestration: a motivating example

Agent-driven orchestration is useful when exact routing of a workflow cannot be specified upfront. A deep-research agent is a good illustration: the orchestrator LLM decides dynamically which sub-agents to call based on what it has learned so far.

flowchart TD
    user[User query] --> orch(Orchestrator LLM)
    orch -->|calls SearchAgent| search(Search Agent)
    orch -->|calls SummaryAgent| sum(Summary Agent)
    orch -->|calls CritiqueAgent| crit(Critique Agent)
    search --> orch
    sum --> orch
    crit --> orch
    orch -->|decides: enough?| out[Final answer]

The orchestrator selects which sub-agent to invoke based on current state rather than a fixed plan (Anthropic Engineering, 2025; Wang et al., 2024).

Why we focus on deterministic pipelines today

Smaller local models, such as those typically served through Ollama or LM Studio, are often unreliable orchestrators. They tend to lose track of state, enter routing loops, or ignore instructions under context pressure. Deterministic pipelines achieve the same task decomposition with predictable, auditable behaviour. For the majority of practical use cases, a well-designed fixed graph is the better engineering choice.

There are real gains to this decomposition. Routing queries to a cheaper model for simpler subtasks rather than sending everything to one large generalist can reduce inference costs by more than 2x without measurable quality loss (Ong et al., 2024). The tradeoff is real: you give up the emergent flexibility of a single capable model and invest more effort in upfront design. The same tradeoff exists between microservices and monoliths, and neither is universally the right answer.

Google ADK: key types

The four types used throughout this session:

| Type | Role |
| --- | --- |
| Agent | Leaf node: one LLM call, one system prompt |
| Workflow | Graph of nodes; edges define sequential, conditional, and looping flows |
| InMemoryRunner | Executes a pipeline against an in-memory session; suitable for notebooks and local dev |
| Session | Holds conversation and pipeline state across agent turns |
Note

LM Studio users: the lm_studio/ prefix is not pre-registered in ADK’s model registry. Add this before creating any agent:

from google.adk.models.lite_llm import LiteLlm
from google.adk.models.registry import LLMRegistry

LLMRegistry._register(r"lm_studio/.*", LiteLlm)

A prompt advisor

The domain for today is prompt writing: an agent that takes a description of a desired AI assistant and produces a well-structured system prompt for it. This is deliberately self-referential: you are building an agent that writes prompts for agents.

Good system prompts share a common structure. Mao et al. (2025) analysed 2,163 real production system prompts from repositories of companies including Uber, Microsoft, and LAION-AI, and identified seven components that appear with measurable frequency in deployed LLM applications. OpenAI, Anthropic, and Google each independently converge on the same set in their prompt engineering documentation.

| Component | Frequency | Meaning |
| --- | --- | --- |
| Directive / Task | 87% | The core instruction: what the agent should do |
| Context / Background | 56% | Information the agent needs to do the task well |
| Output Format | 40% | How the response should be structured or formatted |
| Constraints | 36% | What the agent must not do, or limits on behaviour |
| Role / Profile | 28% | The persona or expertise the agent should adopt |
| Workflow / Steps | 28% | Sequential process when task order matters |
| Examples | 20% | Demonstrations of the desired output |

Your agent receives a Python dict with these fields and produces a complete system prompt.
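As an illustration of what such an input might look like, the sketch below defines a hypothetical spec and renders it into the text message sent to the agent. The field names follow the seven dimensions listed in the task prompt below; the example values, `FIELD_ORDER`, and `spec_to_message` are assumptions made for this sketch, not part of ADK.

```python
# Hypothetical example spec; dimensions that do not apply are simply left out.
spec = {
    "Role": "Customer support agent for a software company",
    "Task": "Answer billing and login questions",
    "Constraints": "Never share internal ticket notes",
}

FIELD_ORDER = ["Role", "Context", "Task", "Output Format",
               "Constraints", "Workflow", "Examples"]

def spec_to_message(spec: dict) -> str:
    """Render the provided dimensions as 'Name: value' lines in a fixed order."""
    unknown = set(spec) - set(FIELD_ORDER)
    if unknown:
        raise ValueError(f"Unknown spec fields: {sorted(unknown)}")
    return "\n".join(f"{name}: {spec[name]}" for name in FIELD_ORDER if name in spec)

print(spec_to_message(spec))
```

Rendering in a fixed order keeps the input stable across runs, which makes it easier to compare outputs when you later vary the model or the constitution.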

Note📝 Task
  1. Define an Agent whose system prompt instructs it to write system prompts. Something like:
     You are an expert prompt engineer. Given a description of an AI assistant
     across up to seven dimensions (Role, Context, Task, Output Format,
     Constraints, Workflow, Examples), write a complete system prompt that
     addresses each provided dimension clearly. Output only the system prompt text.
  2. Call it with a test input, for example a customer support agent for a software company.
  3. Inspect the output: does it address the dimensions you provided?
  4. Open adk web and examine the agent graph.

Harnessing

The prompt advisor works, but it is unguarded. It will happily write a prompt that omits the persona, ignores safety, or produces something grammatically fine but semantically wrong. To make it reliable we need a harness.

A harness is the complete scaffolding around an LLM pipeline: the class or module that holds the rules, runs the agents in order, enforces checks, and decides when to stop. It is everything except the model weights. The entire pipeline we are building today is the harness; the constitution (defined below) is one component inside it.

The Claude Code source leak (April 2026) gave an unusually detailed public view of what a production-grade harness looks like.[2] The Claude Code runtime turned out to be a sophisticated harness with three distinct control layers: guides (feedforward instructions, conventions, and bootstrap prompts injected before the model acts), sensors (linters, type checkers, and test suites that catch bad output after execution), and control checks (deterministic format and length validation first, then slower LLM-powered semantic review). The key observation from the leak analysis is that agents do not mind being micromanaged: more constraints, more checks, and more structure tend to improve performance rather than limit it.[3]

[2] The leak was analysed at https://github.com/0PeterAdel/ClaudeCode-Leak and https://paddo.dev/blog/claude-code-leak-harness-exposed/

[3] A simpler illustration is the Caveman plugin for Claude Code (https://github.com/JuliusBrussee/caveman), which uses three lifecycle hooks and a flag file to constrain the model’s output style, achieving 65–75% token savings per session through prompt injection alone and without any model changes.

Harnessing with ADK Graph API

A Workflow runs its nodes in the order defined by its edges. Within a single edge chain (a tuple), each node’s text output is passed as node_input to the next function in the chain, and as the user message to the next Agent. This is the primary mechanism for sequential data flow.

Stopping conditions are encoded as conditional edges: an edge is only followed when the source node’s routing function returns an Event with a matching route value. Routing is handled by plain Python functions in the edge chain, not by callbacks attached to agents. The tuple syntax makes this concrete: a 4-tuple (a, b, router_fn, {"x": c, "y": d}) runs a, passes its output to b, then calls router_fn to route to c or d. A plain 2-tuple (a, b) is an unconditional edge. A 3-tuple (a, router_fn, {"x": c}) routes a’s output directly without an intermediate agent.

For cross-branch data sharing — when two separate branches of the graph both need access to the same value — use ctx.state. A function writes to ctx.state["key"] and any downstream agent can read it via a template variable {key} in its instruction string. Template variables in {curly_braces} are filled from ctx.state before the instruction reaches the model. Literal braces that should appear in the output (such as in a JSON schema) must be doubled: {{ and }}.

Hard checks

The first line of defence is a deterministic hard check: a plain Python function that rejects obviously bad output before any LLM evaluation happens. ADK automatically treats plain functions placed in an edge chain as nodes, so no wrapper class is needed. Such a function can declare node_input: str to receive the previous node’s text output and ctx to read and write ctx.state. It returns an Event to control routing: Event(route="pass") follows the matching conditional edge; Event(message="...") emits a terminal message and halts the workflow.

The same mechanism works before an agent: a function placed before the advisor in the chain receives the raw user message as node_input. This is the natural place to capture the original spec before the model transforms it.

from google.adk.events.event import Event

def store_spec(node_input: str, ctx):
    ctx.state["original_spec"] = node_input
    return node_input

def hard_check(node_input: str, ctx):
    if len(node_input.strip()) < 50:
        return Event(message="Hard check failed: output shorter than 50 characters.")
    ctx.state["draft_prompt"] = node_input
    return Event(route="pass")

The edge chain ("START", store_spec, advisor, hard_check, {"pass": ...}) runs store_spec first — capturing the spec — then the advisor, then the hard check on the advisor’s output.
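To see the control flow without running a model, the following toy runner imitates the chain in plain Python. It is not ADK's implementation: `ToyEvent`, `run_chain`, and `fake_advisor` are hypothetical stand-ins, and a plain dict replaces `ctx.state`. It only mirrors the behaviour described above, where each step's text output feeds the next step and an event either halts the chain or selects an edge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class ToyEvent:
    """Stand-in for ADK's Event, reduced to the two fields used above."""
    route: Optional[str] = None
    message: Optional[str] = None

def store_spec(node_input: str, state: dict) -> str:
    state["original_spec"] = node_input
    return node_input

def fake_advisor(node_input: str, state: dict) -> str:
    # Stub in place of the LLM call: wraps the spec in a prompt-like sentence.
    return f"You are an assistant. Your task: {node_input}"

def hard_check(node_input: str, state: dict) -> ToyEvent:
    if len(node_input.strip()) < 50:
        return ToyEvent(message="Hard check failed: output shorter than 50 characters.")
    state["draft_prompt"] = node_input
    return ToyEvent(route="pass")

def pass_event(node_input: str, state: dict) -> str:
    return "Passed!"

def run_chain(steps: List[Callable], routes: Dict[str, Callable],
              text: str, state: dict) -> str:
    """Pass each step's output to the next; the first ToyEvent ends the chain."""
    for step in steps:
        result = step(text, state)
        if isinstance(result, ToyEvent):
            if result.message is not None:
                return result.message                 # terminal message: halt
            return routes[result.route](text, state)  # follow the matching edge
        text = result
    return text

steps = [store_spec, fake_advisor, hard_check]
state: dict = {}
print(run_chain(steps, {"pass": pass_event}, "help customers reset their passwords", state))
print(run_chain(steps, {"pass": pass_event}, "hi", {}))
```

The second call fails the hard check because the stubbed advisor output stays under 50 characters, which is the situation the task below asks you to provoke with a real model.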

Note📝 Task
  1. Implement hard_check (or extend it with your own criteria) and store_spec as above.
  2. Wire both into a Workflow: ("START", store_spec, advisor, hard_check, {"pass": pass_event}) where pass_event returns Event(message="Passed!").
  3. Test it: what happens when you pass a minimal spec that produces a short output?
  4. Design sketch — embedding check: You now have both the original spec (ctx.state["original_spec"]) and the generated prompt available. Sketch how an embedding model could check semantic alignment between the two — independently of the LLM judge. In 3–5 sentences, address: what you would embed, what similarity measure and threshold you would use, and which failure modes this catches that the constitution judge misses. No code required.

The constitution

Hard checks catch structural failures but cannot evaluate meaning. For that we encode quality criteria as an explicit, auditable Python object — the constitution. The term comes from Anthropic’s Constitutional AI work (Bai et al., 2022), where a model is evaluated against a written list of principles rather than vague notions of quality. We borrow the concept at the pipeline level, not for training but for evaluation: the judge applies the constitution’s rules to each generated output. The full treatment of Constitutional AI as a training technique is session 11 material; here it functions as an evaluation design pattern.

Rules defined in a dataclass are inspectable, version-controllable, and changeable without touching any agent’s system prompt. The harness injects this constitution into the judge at call time.

from dataclasses import dataclass, field

@dataclass
class PromptConstitution:
    rules: list[str] = field(default_factory=lambda: [
        "Must clearly specify a Role or persona",
        "Must include relevant context or background knowledge",
        "Must state the agent's core task explicitly",
        "Must define output format or communication style",
        "Must include constraints on behaviour or outputs",
    ])

    def as_text(self) -> str:
        return "\n".join(f"- {r}" for r in self.rules)

These five rules map to the five most frequent components from Mao et al. (2025), making the constitution the machine-readable equivalent of the framework.
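Because the rules live in a dataclass, a variant constitution is just another instance. The sketch below repeats the class so that it runs on its own and adds one hypothetical extra rule (the 300-word limit is an invented example, not a recommendation from the source).

```python
from dataclasses import dataclass, field

@dataclass
class PromptConstitution:
    rules: list[str] = field(default_factory=lambda: [
        "Must clearly specify a Role or persona",
        "Must include relevant context or background knowledge",
        "Must state the agent's core task explicitly",
        "Must define output format or communication style",
        "Must include constraints on behaviour or outputs",
    ])

    def as_text(self) -> str:
        return "\n".join(f"- {r}" for r in self.rules)

default = PromptConstitution()
# Build a stricter variant without mutating the default instance.
strict = PromptConstitution(rules=default.rules + ["Must not exceed 300 words"])

print(strict.as_text())
```

Keeping variants as separate instances makes it straightforward to run the same pipeline against several constitutions and compare the verdicts.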

Note📝 Task

Define your own PromptConstitution dataclass with 3–10 rules. You can use the component mapping above as a starting point, but also add quality criteria: a prompt should not only cover the structural aspects but be sensible and appropriately scoped for whatever context the user described.

LLM as a judge

This constitution now has to be plumbed into the pipeline. The go-to approach is to have an LLM judge the output against the criteria we just set.

The idea behind LLM-as-a-judge is straightforward: use a language model to evaluate the quality of some text, e.g., the outputs produced by another language model (Zheng et al., 2023). This is attractive because it scales to large volumes of text, can approximate nuanced qualitative judgement when prompted carefully, and requires no separate annotation pipeline. Zheng et al. (2023) showed that strong LLM judges can achieve over 80% agreement with human raters on instruction-following tasks, comparable to inter-human agreement rates. The pattern has known limitations, however. Models may prefer their own output style, reward verbosity over accuracy, or be manipulated by superficially confident-sounding responses. LLM judgements should therefore be treated as approximate signals rather than ground truth, and ideally calibrated against human ratings for high-stakes applications.

In the constitutional setting introduced by Bai et al. (2022), the judge does not evaluate against vague notions of quality but against an explicit, enumerable list of principles. This makes the evaluation more consistent and the failure modes more interpretable. Our judge receives the generated system prompt directly as node_input (the previous node’s text output passed through the edge chain), and the constitution’s rules are embedded into its instruction string via a Python f-string at definition time. The judge evaluates each rule and returns a structured verdict.

constitution = PromptConstitution()

judge = Agent(
    name="Judge",
    model="ollama_chat/gemma3:latest",
    instruction=f"""You are a strict quality evaluator for AI system prompts.
Evaluate the following system prompt against each rule and return a JSON object
with keys "rule_scores" (dict mapping each rule to 0 or 1),
"approved" (true only if all scores are 1), and
"feedback" (one sentence per failed rule).

Rules:
{constitution.as_text()}

Return only valid JSON: {{"approved": true/false, "rule_scores": {{...}}, "feedback": "..."}}""",
)

The system prompt to evaluate is not a template variable here — the judge receives it as the user message of its conversation turn, delivered automatically by ADK from the previous node’s output. If the judge returns output that cannot be parsed as JSON, the pipeline defaults to approved=False. This deterministic fallback ensures the harness never silently passes a bad output because the judge misbehaved.
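One way to implement that fallback is a small parsing helper that fails closed. The sketch below is an assumption about how you might write it, not ADK functionality: `parse_verdict` is a hypothetical name, and the handling of Markdown code fences reflects the common (but not guaranteed) habit of models to wrap JSON in them.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply; any malformed reply becomes a rejection."""
    rejected = {"approved": False, "rule_scores": {},
                "feedback": "Could not parse verdict."}
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the surrounding backticks and an optional "json" language tag.
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        verdict = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return rejected
    if not isinstance(verdict, dict):
        return rejected
    return {
        # Only a real JSON true counts; strings such as "yes" are rejected.
        "approved": verdict.get("approved") is True,
        "rule_scores": verdict.get("rule_scores", {}),
        "feedback": verdict.get("feedback", ""),
    }

print(parse_verdict('{"approved": true, "rule_scores": {}, "feedback": ""}'))
print(parse_verdict("Sure! Here is my evaluation."))
```

A routing function can then call this helper and branch on the `approved` field alone, so every parsing edge case is handled in one place.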

Note📝 Task
  1. Define a judge (Agent) with the evaluation instruction above, embedding the constitution rules via Python f-string.
  2. Write a route_after_judge routing function that parses the judge’s JSON output, writes judge_feedback to state, and returns Event(route="approved") or Event(route="rejected"). If parsing fails, force approved=False.
  3. Wire the hard check and judge into the graph using the chain tuple syntax: ("START", advisor, hard_check, {"pass": judge}) and continue from there with route_after_judge. The full workflow is now START → advisor →[hard_check]→[pass]→ judge →[route_after_judge]→ ....
  4. Run the full pipeline on two inputs: one you expect to pass and one you expect to fail. Inspect the judge’s feedback for each.

The binary judge (pass/fail) is the simplest form of LLM evaluation. A natural extension is to evaluate each rule independently on a scale, giving partial credit and a more informative failure signal. This rubric approach has a precise analogue in qualitative research methodology.

Rubric-based LLM judging is structurally identical to content coding: a coder (the judge) applies a coding scheme (the rubric or constitution) to a unit of text (the output), producing a code (the score). Qualitative research has decades of accumulated best practice for this workflow, centred on inter-rater reliability. The standard measure is Krippendorff’s alpha (Krippendorff, 2004), a coefficient that corrects for chance agreement and applies to nominal, ordinal, and interval scales. When multiple LLM judges disagree with one another, or with human coders, alpha quantifies how far their agreement exceeds what chance alone would produce. Alpha by itself does not say whether the disagreement is systematic (a codebook problem) or random (noise); inspecting which categories are confused with each other does. Hayes & Krippendorff (2007) provide the computational procedure and a reference implementation; the customary target is alpha ≥ 0.8 before treating coded data as reliable.
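For binary rule scores, alpha can be computed in a few lines. The sketch below covers only the nominal case, using the coincidence-matrix form alpha = 1 − D_o / D_e, where D_o is the observed and D_e the expected disagreement. The function name and the toy ratings are made up for illustration; for real analyses a maintained implementation is the safer choice.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal alpha. `units` is a list of lists: the labels all coders gave one unit."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit coded by fewer than two coders carries no agreement information
        # Every ordered pair of labels from different coders, weighted by 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # only one label was ever used: alpha is undefined
    return 1.0 - (n - 1) * observed / expected

# Two judges score four prompts on one rule (0 = fail, 1 = pass).
ratings = [[0, 0], [1, 1], [0, 1], [1, 0]]
print(krippendorff_alpha_nominal(ratings))  # 0.125
```

The two judges agree on half of the units, yet alpha is only 0.125, because with two equally frequent labels about that much agreement is expected by chance.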

The design principles that make content coding reliable translate directly to constitution design. Rules should be mutually exclusive, covering distinct non-overlapping aspects of quality; collectively exhaustive, together covering all dimensions that matter; and anchored clearly enough that it is unambiguous what counts as a 0 and what counts as a 1. The constitution in our pipeline is the machine-readable equivalent of a codebook. If your judge scores are noisy or inconsistent, the remedy is the same as in qualitative research: refine the codebook until the criteria are unambiguous.

Editor and improvement loop

A single judge pass identifies problems but does not fix them. The natural next step is an improvement loop: the judge evaluates, a feedback agent proposes concrete changes, and an editor applies them. This loop repeats until the judge approves or a maximum iteration count is reached.

This three-role structure is worth keeping explicit rather than merging roles. The judge computes what is wrong with the current output; the feedback agent computes what to change in response; the editor applies the change to produce the next candidate. Keeping them separate makes each role narrower and easier to prompt precisely, and it maps naturally to the structure of Constitutional AI (Bai et al., 2022), where critique and revision are distinct operations.

flowchart LR
    advisor(Prompt\nAdvisor)
    check{Hard\nCheck}
    judge(Judge)
    feedback(Feedback\nAgent)
    editor(Editor)
    out((Approved\nPrompt))
    fail[Reject]

    advisor --> check
    check -->|pass| judge
    check -->|fail| fail
    judge -->|approved| out
    judge -->|rejected| feedback
    feedback --> editor
    editor --> judge

from google.adk import Agent, Workflow
from google.adk.events.event import Event
import json

MAX_ITERATIONS = 3

def route_after_judge(node_input: str, ctx):
    try:
        verdict = json.loads(node_input)
    except (json.JSONDecodeError, ValueError):
        verdict = {"approved": False, "feedback": "Could not parse verdict."}
    ctx.state["judge_feedback"] = verdict.get("feedback", "")
    return Event(route="approved" if verdict.get("approved") else "rejected")

def store_feedback(node_input: str, ctx):
    ctx.state["feedback"] = node_input
    return Event(route="next")

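def loop_or_stop(node_input: str, ctx):
    ctx.state["draft_prompt"] = node_input
    ctx.state["iteration"] = ctx.state.get("iteration", 0) + 1
    if ctx.state["iteration"] < MAX_ITERATIONS:
        return Event(route="continue", output=node_input)
    # Iteration budget exhausted: halt explicitly instead of returning None.
    return Event(message=f"Stopped after {MAX_ITERATIONS} iterations without approval.")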

This pattern generalises: routing functions receive the previous node’s text output as node_input, update state as needed, and return an Event to direct the graph. The same function signature works for both branching and looping.

review_loop = Workflow(
    name="ReviewLoop",
    edges=[
        ("START", judge, route_after_judge, {
            "rejected": feedback_agent,
        }),
        (feedback_agent, store_feedback, {
            "next": editor,
        }),
        (editor, loop_or_stop, {
            "continue": judge,
        }),
    ],
)

root_agent = Workflow(
    name="PromptPipeline",
    edges=[
        ("START", advisor, hard_check, {
            "pass": review_loop,
        }),
    ],
)
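The loop's control flow can be checked without any model by simulating it with stubs. In the sketch below, `stub_judge` and `stub_editor` are hypothetical replacements for the LLM agents (the editor stub also absorbs the feedback agent's role), and a plain dict stands in for `ctx.state`. It follows the same order as the graph: judge, stop if approved, otherwise edit, count the iteration, and stop once the limit is reached.

```python
MAX_ITERATIONS = 3

def stub_judge(prompt: str, required: list) -> dict:
    """Stand-in for the LLM judge: a rule passes if its section label appears."""
    missing = [label for label in required if label not in prompt]
    return {"approved": not missing, "missing": missing}

def stub_editor(prompt: str, missing: list) -> str:
    """Stand-in for feedback agent plus editor: repairs one failed rule per pass."""
    return f"{prompt}\n{missing[0]} TODO"

def run_review_loop(draft: str, required: list):
    state = {"draft_prompt": draft, "iteration": 0}
    while True:
        verdict = stub_judge(state["draft_prompt"], required)
        if verdict["approved"]:
            return "approved", state
        state["draft_prompt"] = stub_editor(state["draft_prompt"], verdict["missing"])
        state["iteration"] += 1
        if state["iteration"] >= MAX_ITERATIONS:
            return "stopped", state  # the last edit is returned without being re-judged

status, state = run_review_loop("Task: answer questions.", ["Role:", "Constraints:"])
print(status, state["iteration"])  # approved 2

status, state = run_review_loop(
    "Task: answer questions.",
    ["Role:", "Context:", "Output Format:", "Constraints:"],
)
print(status, state["iteration"])  # stopped 3
```

Note the asymmetry in the second run: when the limit is hit, the final draft has been edited but not judged again, so a caller should treat a stopped result as unverified.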
Note📝 Task
  1. Define a feedback_agent (Agent) whose instruction reads from {judge_feedback} and {draft_prompt} (via state template variables) and proposes specific, concrete rewrites for each failed rule — a numbered list of changes, not a revised prompt.
  2. Define an editor (Agent) whose instruction reads from {draft_prompt} and {feedback} (via state template variables) and returns the revised system prompt. Write a loop_or_stop routing function that increments the iteration counter, writes the revised prompt back to ctx.state["draft_prompt"], and returns Event(route="continue") until MAX_ITERATIONS is reached.
  3. Build the ReviewLoop as a Workflow using chain tuple syntax: ("START", judge, route_after_judge, {"rejected": feedback_agent}) for conditional branching; (editor, loop_or_stop, {"continue": judge}) for the back-edge.
  4. Build the outer Workflow: ("START", advisor, hard_check, {"pass": review_loop}).
  5. Run the full pipeline end-to-end. Open adk web and trace the execution: how many iterations did it take? Where did the judge’s score change?
Note📝 Task (extension): LLM jury

Instead of a single judge, run multiple judges in parallel, each focusing on a different subset of the constitution’s rules. Aggregate their verdicts by majority vote.

  1. Define three judge (Agent) instances, each responsible for one or two rules.
  2. Wire them in parallel using Workflow’s fan-out syntax: (advisor, (judge1, judge2, judge3)) fans out to all three simultaneously; ((judge1, judge2, judge3), aggregator) fans back in.
  3. Add a plain def aggregator(ctx) function that combines the three verdicts from ctx.state.
  4. Compare the jury’s behaviour to the single judge on a set of test prompts.

“Who Judges the Judge? LLM Jury-on-Demand” (2025) discusses dynamic jury selection. Chan et al. (2023) show that diverse role prompts are essential: judges with identical instructions tend to degrade toward noise.
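The vote itself needs no model. As a starting point for the aggregation step, here is a minimal sketch of a majority vote over parsed verdict dicts; `majority_vote` is a hypothetical helper, and the choices to reject ties and empty input are assumptions that err on the side of caution.

```python
def majority_vote(verdicts: list) -> bool:
    """Approve only if strictly more than half of the judges approved."""
    if not verdicts:
        return False  # no verdicts at all: fail closed
    approvals = sum(1 for v in verdicts if v.get("approved") is True)
    return approvals * 2 > len(verdicts)

jury = [{"approved": True}, {"approved": False}, {"approved": True}]
print(majority_vote(jury))
```

If each judge covers a different subset of rules, a stricter alternative is to require unanimous approval, since a single rejection then means at least one rule failed.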

NoteSide note: safety as a constitution

The same pipeline structure applies to safety evaluation. Replace or extend the quality rules in the constitution with safety criteria: for example, “must not produce instructions for harmful activities”, “must not impersonate real individuals”, or “must not claim capabilities the model does not have”. The judge becomes a safety gate; the constitution is the policy; the harness enforces it. The result is structurally identical to the quality pipeline: different constitution, same code.

Observability

A multi-step pipeline is harder to debug than a single LLM call. Failures can be buried two or three hops deep in the graph: the editor may have applied a change that satisfied the judge locally but broke a different rule in the next iteration.

The ADK Dev UI (launched with adk web) is the primary debugging tool for today’s session. It shows the full agent tree, per-node input and output, and execution timing without any configuration.

For production deployments, Langfuse (https://langfuse.com) provides open-source LLM observability with a self-hostable backend. ADK emits OpenTelemetry traces natively, so connecting it to Langfuse requires only setting two environment variables; no code changes to your agents are needed:

OTEL_EXPORTER_OTLP_ENDPOINT=https://your-langfuse-instance/api/public/otel
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic <base64-encoded-key>
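The base64 value in the header is a standard HTTP Basic credential. The sketch below builds it under the assumption that your Langfuse project's public key serves as the username and its secret key as the password; the key strings shown are placeholders, and `basic_auth_header` is a hypothetical helper name.

```python
import base64

def basic_auth_header(public_key: str, secret_key: str) -> str:
    """Build the value for OTEL_EXPORTER_OTLP_HEADERS from a key pair."""
    credentials = f"{public_key}:{secret_key}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization=Basic {token}"

# Placeholder keys for illustration only; substitute your project's real keys.
header = basic_auth_header("pk-lf-example", "sk-lf-example")
print(f"OTEL_EXPORTER_OTLP_HEADERS={header}")
```

Treat the resulting value as a secret: it decodes trivially back to the key pair, so keep it out of notebooks and version control.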

Systematic variation and testing of your pipeline (different constitutions, different models, different iteration limits) is also where the Examples component from Mao et al. (2025) becomes practically relevant: running the pipeline across a set of test cases and inspecting the traces is how you discover whether your prompt spec produces consistent outputs. Tools like Langfuse make that process tractable by preserving the full trace of each run.

ADK’s Session object carries a state dict that persists across agent turns. The pipeline can accumulate previously approved prompts there, giving the editor access to examples of what has worked before — a lightweight form of episodic memory with no vector store or embeddings required.

# Add this inside route_after_judge, in the branch where the verdict is approved
# (loop_or_stop only runs after the editor, so it never sees an approval):
ctx.state.setdefault("approved_prompts", []).append(ctx.state.get("draft_prompt", ""))

# Inject history into the editor's instruction. The trailing "?" marks
# approved_prompts as optional, so the first run (no history yet) does not
# fail on a missing state key:
editor = Agent(
    name="Editor",
    model="ollama_chat/gemma3:latest",
    instruction="""Apply the proposed changes to the current prompt.

Current prompt: {draft_prompt}
Proposed changes: {feedback}
Previously approved prompts for reference: {approved_prompts?}

Return only the revised system prompt.""",
)
Note📝 Task

Extend your pipeline to accumulate approved prompts in session.state and pass them to the editor. Run the pipeline on three different input specs in sequence and observe whether the editor’s revisions improve.

Further readings

  • Mao et al. (2025): empirical analysis of 2,163 real-world system prompt templates, identifying seven components and their frequencies
  • Huizenga & Yang (2025): the official ADK announcement with architecture overview
  • Anthropic Engineering (2025): Anthropic’s account of building a production multi-agent research system
  • Wang et al. (2024): the Mixture-of-Agents paper, showing that layered multi-agent pipelines can outperform a single large model on benchmarks such as AlpacaEval 2.0 and MT-Bench
  • Bai et al. (2022): the Constitutional AI paper introducing the principle of explicit enumerable rules for LLM evaluation
  • Zheng et al. (2023): the foundational LLM-as-a-judge paper with MT-Bench and Chatbot Arena
  • Masterman et al. (2024): a survey of emerging agent architectures with a taxonomy of scaffolding roles
  • Krippendorff (2004): the standard reference for content analysis methodology and inter-rater reliability

References

Anthropic Engineering. (2025). How we built our multi-agent research system. https://www.anthropic.com/engineering/multi-agent-research-system
Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv Preprint arXiv:2212.08073. https://arxiv.org/abs/2212.08073
Chan, C.-M. et al. (2023). ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv Preprint arXiv:2308.07201. https://arxiv.org/abs/2308.07201
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89. https://doi.org/10.1080/19312450709336866
Huizenga, E., & Yang, B. (2025). Agent development kit: Making it easy to build multi-agent applications. Google. https://developers.googleblog.com/en/agent-development-kit-easy-to-build-multi-agent-applications/
Krippendorff, K. (2004). Content analysis: An introduction to its methodology. Sage.
Mao, Y., He, J., & Chen, C. (2025). From prompts to templates: A systematic prompt template analysis for real-world LLMapps. arXiv Preprint arXiv:2504.02052. https://doi.org/10.48550/arXiv.2504.02052
Masterman, T. et al. (2024). The landscape of emerging AI agent architectures for reasoning, planning, and tool calling. arXiv Preprint arXiv:2404.11584. https://arxiv.org/abs/2404.11584
Ong, I. et al. (2024). RouteLLM: Learning to route LLMs with preference data. arXiv Preprint arXiv:2406.18665. https://arxiv.org/abs/2406.18665
Wang, J. et al. (2024). Mixture-of-agents enhances large language model capabilities. arXiv Preprint arXiv:2406.04692. https://arxiv.org/abs/2406.04692
Who judges the judge? LLM jury-on-demand. (2025). arXiv Preprint arXiv:2512.01786. https://arxiv.org/abs/2512.01786
Zheng, L. et al. (2023). Judging LLM-as-a-judge with MT-bench and chatbot arena. NeurIPS 2023 Datasets and Benchmarks.