```mermaid
flowchart LR
    data["`dataset, metadata, columns, ...`"]
    gen(Generator)
    hyp[Hypothesis]
    data --> gen
    gen --> hyp
```
LLM pipelines
Motivation and learning objectives
In session 5 we focused on agentic decision-making and the EDA agent as a motivating example. In this session, we will:
- distinguish more clearly between agents and LLM pipelines,
- understand the concept of LLM as a judge and why evaluation is a bottleneck in generative AI systems,
- design and discuss evaluation pipelines for our EDA agent and related use cases, and
- connect these ideas to your own project work.
After this session you should be able to:
- design a simple pipeline of LLM calls and tools,
- use an LLM to evaluate model outputs along explicit criteria, and
- reflect critically on the strengths and weaknesses of LLM-based evaluation.
Pipelines: from input to output
As discussed, an agent decides dynamically which tools to use. A pipeline, in contrast, follows a pre-defined sequence of steps:
- Receive input.
- Transform it through a series of modules (some LLM calls, some classic code).
- Produce an output – optionally with internal loops, but without LLM-driven control flow.
Pipelines are attractive because they are:
- easier to test and debug,
- often more robust,
- and usually more efficient than agentic setups when the workflow is well understood.
Example: simple EDA pipeline (no agent)
A non-agentic pipeline for our EDA use case could be:
- Preprocessing: load the CSV, detect column types, basic validation.
- Analytics: compute standard statistics and sample plots using Python code.
- Report generation: call an LLM once to turn the statistics and plots into a narrative EDA report.
No tool selection is delegated to the LLM; it only writes the report. This is already powerful – but the quality of the report still varies, and we need a way to evaluate it.
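A minimal sketch of such a pipeline, assuming a local OpenAI-compatible endpoint (for example the server LM Studio provides at http://localhost:1234/v1) and placeholder file and model names; plots are left out here:

```python
import pandas as pd
from openai import OpenAI

# Assumption: a local OpenAI-compatible server is running (e.g. started from LM Studio).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
MODEL = "local-model"  # placeholder: use the name of the model you actually loaded

# 1. Preprocessing: load the CSV, detect column types, basic validation
df = pd.read_csv("data.csv")  # placeholder path
df = df.dropna(how="all")     # drop completely empty rows
column_types = df.dtypes.to_string()

# 2. Analytics: compute standard statistics with plain pandas code
stats = df.describe(include="all").to_string()

# 3. Report generation: a single LLM call turns the numbers into a narrative report
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You write concise, well-structured EDA reports."},
        {"role": "user", "content": f"Column types:\n{column_types}\n\nStatistics:\n{stats}"},
    ],
)
print(response.choices[0].message.content)
```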
LLM as a judge
The core idea of LLM as a judge is simple:
Use a (possibly different) LLM to evaluate the quality of outputs produced by an LLM-based system.
Why is this useful?
Consider a small experiment:
- Open a notebook and connect it to a local LLM using LM Studio (or another tool).
- Ask it to generate one story containing fried eggs at sunrise.
- Informally evaluate the story: What works well? What doesn’t? How did you decide?
- Now ask the model to generate 50–100 such stories (use a higher temperature).
- Try to evaluate all of them manually…
Reflect: How long does it take? How consistent are your own judgements?
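If you want to script the batch part of this experiment, a sketch could look like this (same assumptions as above: a local OpenAI-compatible server, e.g. from LM Studio, and a placeholder model name):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
MODEL = "local-model"  # placeholder for whichever model you loaded

stories = []
for _ in range(50):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write a short story containing fried eggs at sunrise."}],
        temperature=1.2,  # higher temperature for more variation between stories
    )
    stories.append(response.choices[0].message.content)

print(f"Generated {len(stories)} stories. Now try to judge them all by hand ...")
```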
Typical pattern:
- We generate large amounts of text, code, or structured output.
- We suspect that quality is variable.
- We need a way to assess that quality.
- Reading and judging everything manually is too slow and doesn’t scale. (We don’t have time for this.)
Instead, we can use an LLM in an evaluation role:
- It receives the output to be evaluated (e.g. an EDA report).
- It receives criteria (e.g. correctness, completeness, clarity).
- It returns scores, labels, or structured feedback.
The evaluation can then be used to:
- accept or reject outputs,
- select the best out of several candidates,
- give automated feedback for improvement,
- or serve as a signal for fine-tuning or reinforcement learning (e.g. Constitutional AI (Constitutional AI, n.d.)).
This approach is called LLM as a judge.
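As a minimal sketch of this pattern, assuming the client and model from the EDA example above and a judge that is asked to reply with JSON (real judge models will not always return valid JSON, so in practice you would add error handling):

```python
import json

# client and MODEL as set up in the EDA pipeline sketch above
JUDGE_SYSTEM_PROMPT = (
    "You are an evaluator. Rate the following output on correctness, completeness "
    "and clarity, each from 1 (poor) to 5 (excellent). Reply with JSON only, e.g. "
    '{"correctness": 3, "completeness": 4, "clarity": 5, "feedback": "..."}'
)

def judge(output_text: str) -> dict:
    """One LLM call that returns structured scores plus short feedback."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": output_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```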
Benefits and drawbacks
Benefits:
- Scales to large numbers of outputs.
- Can approximate nuanced, qualitative judgements when well-prompted.
- Easy to combine with existing systems (few additional components).
- Flexible: different prompts and criteria for different tasks.
Drawbacks:
- Cost: multiple LLM calls per sample.
- Bias: models might prefer their own style or longer answers.
- Subjectivity: evaluation depends on the prompt and the judge model.
- Mismatch with humans: high scores do not always mean humans agree.
Because of these limitations, LLM-based judgements should be:
- designed carefully (clear rubrics),
- and, ideally, calibrated against human ratings.
Evaluation pipelines: general pattern
Many applications follow a similar structure:
- Generator: produces a candidate output (text, code, EDA report, answer, …).
- Judge: evaluates the output (via an LLM with evaluation prompt).
- Editor (optional): improves the output based on the feedback.
- Loop (optional): repeat the judge and editor steps until a stopping criterion is met.
This pattern applies to:
- EDA reports,
- answers in Q&A systems,
- code generation,
- and many more.
Example: Dataset hypothesis creation pipeline
We will now modify our EDA agent/pipeline from earlier to illustrate this, building a pipeline that generates a hypothesis based on superficial knowledge of the dataset.
A hypothesis is a testable prediction or proposed explanation for a phenomenon, based on limited evidence that serves as a starting point for further investigation. It typically takes the form of an “if-then” statement that can be supported or refuted through experimentation and observation. A good scientific hypothesis must be falsifiable — meaning there must be a possible way to prove it wrong through empirical evidence.
For now, we will not try to falsify (or verify) the hypothesis; we only check whether the generated output makes a good hypothesis. Our workflow looks like this:
- Hypothesis generator
  - Receives the dataset and additional metadata about the data.
  - Produces a hypothesis.
- Judge-LLM
  - Receives the hypothesis + a rubric, e.g. with criteria:
    - Is this a prediction?
    - Does it explain something?
    - Is it falsifiable?
  - Returns:
    - scores (e.g. 1–5) per criterion, and
    - short textual feedback for each.
- Optional editor step
  - Another LLM call (possibly with a different model) revises the hypothesis.
- Stopping condition
  - For example: stop after 2–3 improvement iterations or if all scores exceed a threshold.
Generator
Let’s have a closer look at the generator:
A system prompt for the generator could be:
You are a research assistant tasked with generating a first hypothesis on a given dataset.
You will receive a brief summary of the dataset in question. This might include metadata, example rows, a list of columns etc.
Your task is to generate a hypothesis based on this superficial knowledge that will be tested at a later stage.
When calling the generator, you should give the relevant information in the user prompt:
Metadata:
{dataset_metadata}
Columns:
{dataset.columns}
Example rows:
{dataset.head()}
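Putting the two prompts together, one way to call the generator could look like this (client and MODEL as in the earlier sketches; the temperature value is just an example):

```python
# client and MODEL as set up in the earlier sketches
GENERATOR_SYSTEM_PROMPT = (
    "You are a research assistant tasked with generating a first hypothesis on a given dataset. "
    "You will receive a brief summary of the dataset in question. This might include metadata, "
    "example rows, a list of columns etc. Your task is to generate a hypothesis based on this "
    "superficial knowledge that will be tested at a later stage."
)

def generate_hypothesis(df, dataset_metadata: str) -> str:
    """One LLM call: dataset summary in, hypothesis out."""
    user_prompt = (
        f"Metadata:\n{dataset_metadata}\n\n"
        f"Columns:\n{list(df.columns)}\n\n"
        f"Example rows:\n{df.head().to_string()}"
    )
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0.9,  # some variation between hypotheses
        messages=[
            {"role": "system", "content": GENERATOR_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```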
Your turn!
In your notebook, implement a hypothesis generator.
- Get a dataset. You can use the Titanic dataset from Kaggle again, or something else.
- Compile some relevant information, e.g. `df.head()`, `df.describe()` etc., along with other information. The Titanic dataset has a very good description on the Kaggle website.
- Send a good system prompt and the relevant information to an LLM.
- Let it generate a hypothesis (or several). You may want to increase the model’s temperature to get some variation.
- Evaluate the results.
Reviewer (LLM as a judge)
```mermaid
flowchart LR
    hyp[Hypothesis]
    rev(Reviewer)
    feedback[Feedback]
    hyp --> rev
    rev --> feedback
```
The reviewer is a judge-LLM with a rubric for a good hypothesis. Here is an example:
You are a research assistant. Your task is to review a hypothesis based on these rules:
1. Is this a prediction?
2. Does it explain something?
3. Is it falsifiable?
4. Is it specific?
Provide concise feedback on how to improve the hypothesis.
You may want to adapt the rubric to get the desired result. You can use one reviewer as shown here, or several reviewers, each with a specific criterion from the rubric (a sketch of that variant follows below).
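A sketch of the multi-reviewer variant, where each reviewer checks exactly one criterion; the criteria list mirrors the rubric above, and client and MODEL are the ones set up earlier:

```python
# client and MODEL as set up in the earlier sketches
CRITERIA = [
    "Is this a prediction?",
    "Does it explain something?",
    "Is it falsifiable?",
    "Is it specific?",
]

def review_hypothesis(hypothesis: str) -> list[str]:
    """Run one single-criterion reviewer per rubric item and collect the feedback."""
    feedback = []
    for criterion in CRITERIA:
        system_prompt = (
            "You are a research assistant. Review the given hypothesis with respect to "
            f"exactly one question: {criterion} Provide concise feedback."
        )
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": hypothesis},
            ],
        )
        feedback.append(f"{criterion} {response.choices[0].message.content}")
    return feedback
```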
Let’s build us a very judgemental robot!
- In the same notebook, initialize a reviewer as well.
- Let the reviewer review the hypothesis generated by the generator.
- Adjust the rubric until the reviewer finds stuff to improve.
Editor
```mermaid
flowchart LR
    data["`dataset, metadata, columns, ...`"]
    hyp[Hypothesis]
    feedback[Feedback]
    editor(Editor)
    new_hyp[Hypothesis]
    data --> editor
    hyp --> editor
    feedback --> editor
    editor --> new_hyp
```
The editor uses the original input + hypothesis + feedback to produce a revised hypothesis. Its prompt encodes this behaviour explicitly (incorporate feedback, keep focus, improve wording, etc.).
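A sketch of the editor call, reusing the client and model from the earlier sketches; the prompt wording is just one possible option:

```python
# client and MODEL as set up in the earlier sketches
EDITOR_SYSTEM_PROMPT = (
    "You are a research assistant. You receive a dataset description, a draft hypothesis "
    "and reviewer feedback. Revise the hypothesis so that it addresses the feedback while "
    "staying focused on the dataset. Output only the revised hypothesis."
)

def edit_hypothesis(dataset_description: str, hypothesis: str, feedback: str) -> str:
    """One LLM call that turns (input, draft, feedback) into a revised hypothesis."""
    user_prompt = (
        f"Dataset description:\n{dataset_description}\n\n"
        f"Current hypothesis:\n{hypothesis}\n\n"
        f"Feedback:\n{feedback}"
    )
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": EDITOR_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```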
Time to improve!
- In the same notebook, implement the editor as well. Make a new LLM call/prompt or adapt your original generator.
- Let the editor generate a new hypothesis or improve an existing one based on the feedback from the reviewer.
- Get the editor to actually generate something that is different from the generator’s version!
Full pipeline
We now basically have a working LLM-based pipeline.
```mermaid
flowchart LR
    data["`dataset, metadata, columns, ...`"]
    gen(Generator)
    hyp[Hypothesis]
    rev(Reviewer)
    feedback[Feedback]
    editor(Editor)
    new_hyp[Hypothesis]
    data --> gen
    gen --> hyp
    hyp --> rev
    rev --> feedback
    data --> editor
    hyp --> editor
    feedback --> editor
    editor --> new_hyp
```
A good next step would be to add a loop around the review process, so that it only stops if the reviewer is happy. This could look like this:
```python
def refine_hypothesis(dataset_description, max_iter=3):
    """Generate, review, and edit until the reviewer is satisfied or max_iter is reached."""
    hypothesis = generator.generate(dataset_description)
    for _ in range(max_iter):
        review = reviewer.review(hypothesis)
        if review_is_positive_enough(review):
            break
        hypothesis = editor.edit(dataset_description, hypothesis, review)
    return hypothesis
```

```mermaid
flowchart LR
    data["`dataset,
    metadata,
    columns,
    ...`"]
    gen(Generator)
    hyp[Hypothesis]
    rev(Reviewer)
    feedback[Feedback]
    editor(Editor)
    if{good enough?}
    stop((stop))
    data --> gen
    gen --> hyp
    hyp --> rev
    rev --> feedback
    data --> editor
    hyp --> editor
    feedback --> if
    if -- no --> editor
    if -- yes --> stop
    editor --> hyp
```
The missing piece is `review_is_positive_enough`, where you might (see the sketch after this list):
- search for a specific “OK” marker in the review,
- ask the reviewer to output a score and threshold it,
- or run an additional small LLM classification step.
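Two of these variants sketched in one function, assuming the reviewer was instructed either to write an explicit [OK] marker when it is satisfied or to give scores in the form “4/5”:

```python
import re

def review_is_positive_enough(review: str, threshold: int = 4) -> bool:
    # Variant 1: the reviewer was told to write "[OK]" when it is satisfied.
    if "[OK]" in review:
        return True
    # Variant 2: the reviewer was told to score each criterion as "<score>/5";
    # accept the hypothesis only if the lowest score reaches the threshold.
    scores = [int(match) for match in re.findall(r"([1-5])\s*/\s*5", review)]
    return bool(scores) and min(scores) >= threshold
```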
Alternatively, you can just generate a lot of hypotheses and choose the best one (or the best n).
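A sketch of this best-of-n variant, reusing the generator and reviewer interfaces from the loop above; score_review is a hypothetical helper that turns a review into a single number (for example the minimum of the extracted per-criterion scores):

```python
def best_of_n(dataset_description: str, n: int = 10) -> str:
    """Generate n candidate hypotheses, review each one, and keep the best-scoring one."""
    candidates = [generator.generate(dataset_description) for _ in range(n)]
    reviews = [reviewer.review(candidate) for candidate in candidates]
    # score_review is a hypothetical helper that turns a review into a number
    best_index = max(range(n), key=lambda i: score_review(reviews[i]))
    return candidates[best_index]
```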
Exactly the same pattern can be transferred to your own project outputs.
Build it!
- Think of a way to stop the review process once the hypothesis is good enough.
- Implement the loop.
This is the minimum goal for today, so when you are done, you can upload your notebook to Moodle. Alternatively, go on and implement the agent (see below) as well!
From pipeline to orchestrated agents
The example above is still essentially a pipeline: the order of steps is fixed (Generator → Reviewer → Editor …). We could, however, introduce an orchestrator agent that decides which component to call next[^1]:

[^1]: Note that this is, of course, only one way to implement this as an agent system.
You are tasked with creating high-quality scientific hypotheses for a given dataset. Use the tools at your disposal to ensure good quality.
Available tools:
– Generator – creates hypotheses
– Reviewer – evaluates hypothesis quality
– Editor – improves hypothesis based on feedback
Decision guidelines:
– Use Generator to create a hypothesis.
– Use Reviewer to generate feedback.
– Use Editor to improve the hypothesis.
– Choose END when the hypothesis is good enough.
Output only the next agent to run ("Generator", "Reviewer", "Editor", or "END").
The agent is now just a loop around this orchestrator:
```
# pseudocode
initialize:
    tools
    history

while true:
    response = orchestrator(history)
    if response == "END":
        break
    if response == "Generator":
        history += generator.generate()
    # same for the other tools
```
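In real code, the orchestrator decision is itself just one more LLM call. A minimal sketch, assuming the routing prompt above is stored as ORCHESTRATOR_PROMPT and that client, MODEL, generator, reviewer and editor are the pieces built earlier:

```python
# ORCHESTRATOR_PROMPT: the routing prompt shown above, stored as a Python string.
# client, MODEL, generator, reviewer, editor: the components built earlier in this session.

def run_agent(dataset_description: str, max_steps: int = 10) -> str:
    """Let the orchestrator LLM decide which component to call until it outputs END."""
    history = f"Dataset description:\n{dataset_description}"
    hypothesis, feedback = "", ""
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": ORCHESTRATOR_PROMPT},
                {"role": "user", "content": history},
            ],
        )
        decision = response.choices[0].message.content.strip()
        if decision == "END":
            break
        if decision == "Generator":
            hypothesis = generator.generate(dataset_description)
            history += f"\nHypothesis: {hypothesis}"
        elif decision == "Reviewer":
            feedback = reviewer.review(hypothesis)
            history += f"\nFeedback: {feedback}"
        elif decision == "Editor":
            hypothesis = editor.edit(dataset_description, hypothesis, feedback)
            history += f"\nRevised hypothesis: {hypothesis}"
    return hypothesis
```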
This is essentially the agentic version of the same pipeline. The pros and cons are analogous to those discussed in the session on Agents and Pipelines:
- more flexibility,
- but also more complexity and less predictability.
For your projects, you should decide consciously whether you need this flexibility – or whether a simple pipeline with a clear evaluation step is sufficient.
- In the same notebook, initialize the orchestrator agent as well.
- Implement the workflow shown above in real code.
- Watch happily as it all works without any issues whatsoever.
- Upload to Moodle.
Project work
Time to work on your projects!
- Discuss your project.
- Set up the repository
- Plan the project
- Start collecting data or finding a data set
- Start implementing
Happy coding!
Further Readings
- The Tiny Agents blog post really helps with understanding agents.
- Here is a video describing other multi-agent systems, including an agent hospital and a multi-agent translator.
