```mermaid
flowchart LR
    data["`dataset, metadata, columns, ...`"]
    gen(Generator)
    hyp[Hypothesis]
    data --> gen
    gen --> hyp
```
LLM pipelines
Motivation and learning objectives
In session 5 we focused on agentic decision-making and the EDA agent as a motivating example. In this session, we will:
- distinguish more clearly between agents and LLM pipelines,
- understand the concept of LLM as a judge and why evaluation is a bottleneck in generative AI systems,
- design and discuss evaluation pipelines for our EDA agent and related use cases, and
- connect these ideas to your own project work.
After this session you should be able to:
- design a simple pipeline of LLM calls and tools,
- use an LLM to evaluate model outputs along explicit criteria, and
- reflect critically on the strengths and weaknesses of LLM-based evaluation.
Pipelines: from input to output
As discussed, an agent decides dynamically which tools to use. A pipeline, in contrast, follows a pre-defined sequence of steps:
- Receive input.
- Transform it through a series of modules (some LLM calls, some classic code).
- Produce an output – optionally with internal loops, but without LLM-driven control flow.
Pipelines are attractive because they are:
- easier to test and debug,
- often more robust,
- and usually more efficient than agentic setups when the workflow is well understood.
Example: simple EDA pipeline (no agent)
A non-agentic pipeline for our EDA use case could be:
- Preprocessing: load the CSV, detect column types, basic validation.
- Analytics: compute standard statistics and sample plots using Python code.
- Report generation: call an LLM once to turn the statistics and plots into a narrative EDA report.
No tool selection is delegated to the LLM; it only writes the report. This is already powerful – but the quality of the report still varies, and we need a way to evaluate it.
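A minimal sketch of such a pipeline, assuming a local OpenAI-compatible endpoint (for example the server LM Studio provides at http://localhost:1234/v1) and placeholder file and model names; plots are left out here:

```python
import pandas as pd
from openai import OpenAI

# Assumption: a local OpenAI-compatible server is running (e.g. started from LM Studio).
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
MODEL = "local-model"  # placeholder: use the name of the model you actually loaded

# 1. Preprocessing: load the CSV, detect column types, basic validation
df = pd.read_csv("data.csv")  # placeholder path
df = df.dropna(how="all")     # drop completely empty rows
column_types = df.dtypes.to_string()

# 2. Analytics: compute standard statistics with plain pandas code
stats = df.describe(include="all").to_string()

# 3. Report generation: a single LLM call turns the numbers into a narrative report
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You write concise, well-structured EDA reports."},
        {"role": "user", "content": f"Column types:\n{column_types}\n\nStatistics:\n{stats}"},
    ],
)
print(response.choices[0].message.content)
```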
LLM as a judge
The core idea of LLM as a judge is simple:
Use a (possibly different) LLM to evaluate the quality of outputs produced by an LLM-based system.
Why is this useful?
Consider a small experiment:
- Open a notebook and connect it to a local LLM using LM Studio (or another tool).
- Ask it to generate one story containing fried eggs at sunrise.
- Informally evaluate the story: What works well? What doesn’t? How did you decide?
- Now ask the model to generate 50–100 such stories (use a higher temperature).
- Try to evaluate all of them manually…
Reflect: How long does it take? How consistent are your own judgements?
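If you want to script the batch part of this experiment, a sketch could look like this (same assumptions as above: a local OpenAI-compatible server, e.g. from LM Studio, and a placeholder model name):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
MODEL = "local-model"  # placeholder for whichever model you loaded

stories = []
for _ in range(50):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Write a short story containing fried eggs at sunrise."}],
        temperature=1.2,  # higher temperature for more variation between stories
    )
    stories.append(response.choices[0].message.content)

print(f"Generated {len(stories)} stories. Now try to judge them all by hand ...")
```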
Typical pattern:
- We generate large amounts of text, code, or structured output.
- We suspect that quality is variable.
- We need a way to assess that quality.
- Reading and judging everything manually is too slow and doesn’t scale. (We don’t have time for this.)
Instead, we can use an LLM in an evaluation role:
- It receives the output to be evaluated (e.g. an EDA report).
- It receives criteria (e.g. correctness, completeness, clarity).
- It returns scores, labels, or structured feedback.
The evaluation can then be used to:
- accept or reject outputs,
- select the best out of several candidates,
- give automated feedback for improvement,
- or serve as a signal for fine-tuning or reinforcement learning (e.g. Constitutional AI (Constitutional AI, n.d.)).
This approach is called LLM as a judge.
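As a minimal sketch of this pattern, assuming the client and model from the EDA example above and a judge that is asked to reply with JSON (real judge models will not always return valid JSON, so in practice you would add error handling):

```python
import json

# client and MODEL as set up in the EDA pipeline sketch above
JUDGE_SYSTEM_PROMPT = (
    "You are an evaluator. Rate the following output on correctness, completeness "
    "and clarity, each from 1 (poor) to 5 (excellent). Reply with JSON only, e.g. "
    '{"correctness": 3, "completeness": 4, "clarity": 5, "feedback": "..."}'
)

def judge(output_text: str) -> dict:
    """One LLM call that returns structured scores plus short feedback."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": output_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```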
Benefits and drawbacks
Benefits:
- Scales to large numbers of outputs.
- Can approximate nuanced, qualitative judgements when well-prompted.
- Easy to combine with existing systems (few additional components).
- Flexible: different prompts and criteria for different tasks.
Drawbacks:
- Cost: multiple LLM calls per sample.
- Bias: models might prefer their own style or longer answers.
- Subjectivity: evaluation depends on the prompt and the judge model.
- Mismatch with humans: high scores do not always mean humans agree.
Because of these limitations, LLM-based judgements should be:
- designed carefully (clear rubrics),
- and, ideally, calibrated against human ratings.
Evaluation pipelines: general pattern
Many applications follow a similar structure:
- Generator: produces a candidate output (text, code, EDA report, answer, …).
- Judge: evaluates the output (via an LLM with evaluation prompt).
- Editor (optional): improves the output based on the feedback.
- Loop (optional): repeat the judge and editor steps until a stopping criterion is met.
This pattern applies to:
- EDA reports,
- answers in Q&A systems,
- code generation,
- and many more.
Example: Dataset hypothesis creation pipeline
We will now modify our EDA agent/pipeline from earlier to illustrate this, building a pipeline that generates a hypothesis based on superficial knowledge of the dataset.
A hypothesis is a testable prediction or proposed explanation for a phenomenon, based on limited evidence that serves as a starting point for further investigation. It typically takes the form of an “if-then” statement that can be supported or refuted through experimentation and observation. A good scientific hypothesis must be falsifiable — meaning there must be a possible way to prove it wrong through empirical evidence.
For now, we will not try to falsify (or verify) the hypothesis; we only check whether the generated output makes a good hypothesis. Our workflow looks like this:
- Hypothesis generator
  - Receives the dataset and additional metadata about the data.
  - Produces a hypothesis.
- Judge-LLM
  - Receives the hypothesis + a rubric, e.g. with criteria:
    - Is this a prediction?
    - Does it explain something?
    - Is it falsifiable?
  - Returns:
    - scores (e.g. 1–5) per criterion, and
    - short textual feedback for each.
- Optional editor step
  - Another LLM call (possibly with a different model) revises the hypothesis.
- Stopping condition
  - For example: stop after 2–3 improvement iterations or if all scores exceed a threshold.
Generator
Let’s have a closer look at the generator:
A system prompt for the generator could be:
You are a research assistant tasked with generating a first hypothesis on a given dataset.
You will receive a brief summary of the dataset in question. This might include metadata, example rows, a list of columns etc.
Your task is to generate a hypothesis based on this superficial knowledge that will be tested at a later stage.
When calling the generator, you should give the relevant information in the user prompt:
Metadata:
{dataset_metadata}
Columns:
{dataset.columns}
Example rows:
{dataset.head()}
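Putting the two prompts together, one way to call the generator could look like this (client and MODEL as in the earlier sketches; the temperature value is just an example):

```python
# client and MODEL as set up in the earlier sketches
GENERATOR_SYSTEM_PROMPT = (
    "You are a research assistant tasked with generating a first hypothesis on a given dataset. "
    "You will receive a brief summary of the dataset in question. This might include metadata, "
    "example rows, a list of columns etc. Your task is to generate a hypothesis based on this "
    "superficial knowledge that will be tested at a later stage."
)

def generate_hypothesis(df, dataset_metadata: str) -> str:
    """One LLM call: dataset summary in, hypothesis out."""
    user_prompt = (
        f"Metadata:\n{dataset_metadata}\n\n"
        f"Columns:\n{list(df.columns)}\n\n"
        f"Example rows:\n{df.head().to_string()}"
    )
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0.9,  # some variation between hypotheses
        messages=[
            {"role": "system", "content": GENERATOR_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```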
Your turn!
In your notebook, implement a hypothesis generator.
- Get a dataset. You can use the Titanic dataset from Kaggle again, or something else.
- Compile some relevant information, e.g. `df.head()`, `df.describe()` etc., along with other information. The Titanic dataset has a very good description on the Kaggle website.
- Send a good system prompt and the relevant information to an LLM.
- Let it generate a hypothesis (or several). You may want to increase the model’s temperature to get some variation.
- Evaluate the results.
Reviewer (LLM as a judge)
```mermaid
flowchart LR
    hyp[Hypothesis]
    rev(Reviewer)
    feedback[Feedback]
    hyp --> rev
    rev --> feedback
```
The reviewer is a judge-LLM with a rubric for a good hypothesis. Here is an example:
You are a research assistant. Your task is to review a hypothesis based on these rules:
1. Is this a prediction?
2. Does it explain something?
3. Is it falsifiable?
4. Is it specific?
Provide concise feedback on how to improve the hypothesis.
You may want to adapt the rubric to get the desired result. You can use one reviewer as shown here, or several reviewers, each with a specific criterion from the rubric (a sketch of that variant follows below).
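A sketch of the multi-reviewer variant, where each reviewer checks exactly one criterion; the criteria list mirrors the rubric above, and client and MODEL are the ones set up earlier:

```python
# client and MODEL as set up in the earlier sketches
CRITERIA = [
    "Is this a prediction?",
    "Does it explain something?",
    "Is it falsifiable?",
    "Is it specific?",
]

def review_hypothesis(hypothesis: str) -> list[str]:
    """Run one single-criterion reviewer per rubric item and collect the feedback."""
    feedback = []
    for criterion in CRITERIA:
        system_prompt = (
            "You are a research assistant. Review the given hypothesis with respect to "
            f"exactly one question: {criterion} Provide concise feedback."
        )
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": hypothesis},
            ],
        )
        feedback.append(f"{criterion} {response.choices[0].message.content}")
    return feedback
```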
Let’s build us a very judgemental robot!
- In the same notebook, initialize a reviewer as well.
- Let the reviewer review the hypothesis generated by the generator.
- Adjust the rubric until the reviewer finds stuff to improve.
Editor
```mermaid
flowchart LR
    data["`dataset, metadata, columns, ...`"]
    hyp[Hypothesis]
    feedback[Feedback]
    editor(Editor)
    new_hyp[Hypothesis]
    data --> editor
    hyp --> editor
    feedback --> editor
    editor --> new_hyp
```
The editor uses the original input + hypothesis + feedback to produce a revised hypothesis. Its prompt encodes this behaviour explicitly (incorporate feedback, keep focus, improve wording, etc.).
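A sketch of the editor call, reusing the client and model from the earlier sketches; the prompt wording is just one possible option:

```python
# client and MODEL as set up in the earlier sketches
EDITOR_SYSTEM_PROMPT = (
    "You are a research assistant. You receive a dataset description, a draft hypothesis "
    "and reviewer feedback. Revise the hypothesis so that it addresses the feedback while "
    "staying focused on the dataset. Output only the revised hypothesis."
)

def edit_hypothesis(dataset_description: str, hypothesis: str, feedback: str) -> str:
    """One LLM call that turns (input, draft, feedback) into a revised hypothesis."""
    user_prompt = (
        f"Dataset description:\n{dataset_description}\n\n"
        f"Current hypothesis:\n{hypothesis}\n\n"
        f"Feedback:\n{feedback}"
    )
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": EDITOR_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```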
Time to improve!
- In the same notebook, implement the editor as well. Make a new LLM call/prompt or adapt your original generator.
- Let the editor generate a new hypothesis or improve an existing one based on the feedback from the reviewer.
- Get the editor to actually generate something that is different from the generator’s version!
Full pipeline
We now basically have a working LLM-based pipeline.
```mermaid
flowchart LR
    data["`dataset, metadata, columns, ...`"]
    gen(Generator)
    hyp[Hypothesis]
    rev(Reviewer)
    feedback[Feedback]
    editor(Editor)
    new_hyp[Hypothesis]
    data --> gen
    gen --> hyp
    hyp --> rev
    rev --> feedback
    data --> editor
    hyp --> editor
    feedback --> editor
    editor --> new_hyp
```
A good next step would be to add a loop around the review process, so that it only stops if the reviewer is happy. This could look like this:
```python
def refine_hypothesis(dataset_description, max_iter=3):
    """Generate, review, and edit until the reviewer is satisfied or max_iter is reached."""
    hypothesis = generator.generate(dataset_description)
    for _ in range(max_iter):
        review = reviewer.review(hypothesis)
        if review_is_positive_enough(review):
            break
        hypothesis = editor.edit(dataset_description, hypothesis, review)
    return hypothesis
```

```mermaid
flowchart LR
    data["`dataset,
    metadata,
    columns,
    ...`"]
    gen(Generator)
    hyp[Hypothesis]
    rev(Reviewer)
    feedback[Feedback]
    editor(Editor)
    if{good enough?}
    stop((stop))
    data --> gen
    gen --> hyp
    hyp --> rev
    rev --> feedback
    data --> editor
    hyp --> editor
    feedback --> if
    if -- no --> editor
    if -- yes --> stop
    editor --> hyp
```
The missing piece is `review_is_positive_enough`, where you might (see the sketch after this list):
- search for a specific “OK” marker in the review,
- ask the reviewer to output a score and threshold it,
- or run an additional small LLM classification step.
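Two of these variants sketched in one function, assuming the reviewer was instructed either to write an explicit [OK] marker when it is satisfied or to give scores in the form “4/5”:

```python
import re

def review_is_positive_enough(review: str, threshold: int = 4) -> bool:
    # Variant 1: the reviewer was told to write "[OK]" when it is satisfied.
    if "[OK]" in review:
        return True
    # Variant 2: the reviewer was told to score each criterion as "<score>/5";
    # accept the hypothesis only if the lowest score reaches the threshold.
    scores = [int(match) for match in re.findall(r"([1-5])\s*/\s*5", review)]
    return bool(scores) and min(scores) >= threshold
```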
Alternatively, you can just generate a lot of hypotheses and choose the best one (or the best n).
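A sketch of this best-of-n variant, reusing the generator and reviewer interfaces from the loop above; score_review is a hypothetical helper that turns a review into a single number (for example the minimum of the extracted per-criterion scores):

```python
def best_of_n(dataset_description: str, n: int = 10) -> str:
    """Generate n candidate hypotheses, review each one, and keep the best-scoring one."""
    candidates = [generator.generate(dataset_description) for _ in range(n)]
    reviews = [reviewer.review(candidate) for candidate in candidates]
    # score_review is a hypothetical helper that turns a review into a number
    best_index = max(range(n), key=lambda i: score_review(reviews[i]))
    return candidates[best_index]
```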
Exactly the same pattern can be transferred to your own project outputs.
Build it!
- Think of a way to stop the review process once the hypothesis is good enough.
- Implement the loop.
This is the minimum goal for today, so when you are done, you can upload your notebook to Moodle. Alternatively, go on and implement the agent (see below) as well!
From pipeline to orchestrated agents
The example above is still essentially a pipeline: the order of steps is fixed (Generator → Reviewer → Editor …). We could, however, introduce an orchestrator agent that decides which component to call next[^1]:

[^1]: Note that this is, of course, only one way to implement this as an agent system.
You are tasked with creating high-quality scientific hypotheses for a given dataset. Use the tools at your disposal to ensure good quality.
Available tools:
– Generator – creates hypotheses
– Reviewer – evaluates hypothesis quality
– Editor – improves hypothesis based on feedback
Decision guidelines:
– Use Generator to create a hypothesis.
– Use Reviewer to generate feedback.
– Use Editor to improve the hypothesis.
– Choose END when the hypothesis is good enough.
Output only the next agent to run ("Generator", "Reviewer", "Editor", or "END").
The agent is now just a loop around this orchestrator:
```
# pseudocode
initialize:
    tools
    history

while true:
    response = orchestrator(history)
    if response == "END":
        break
    if response == "Generator":
        history += generator.generate()
    # same for the other tools
```
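In real code, the orchestrator decision is itself just one more LLM call. A minimal sketch, assuming the routing prompt above is stored as ORCHESTRATOR_PROMPT and that client, MODEL, generator, reviewer and editor are the pieces built earlier:

```python
# ORCHESTRATOR_PROMPT: the routing prompt shown above, stored as a Python string.
# client, MODEL, generator, reviewer, editor: the components built earlier in this session.

def run_agent(dataset_description: str, max_steps: int = 10) -> str:
    """Let the orchestrator LLM decide which component to call until it outputs END."""
    history = f"Dataset description:\n{dataset_description}"
    hypothesis, feedback = "", ""
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": ORCHESTRATOR_PROMPT},
                {"role": "user", "content": history},
            ],
        )
        decision = response.choices[0].message.content.strip()
        if decision == "END":
            break
        if decision == "Generator":
            hypothesis = generator.generate(dataset_description)
            history += f"\nHypothesis: {hypothesis}"
        elif decision == "Reviewer":
            feedback = reviewer.review(hypothesis)
            history += f"\nFeedback: {feedback}"
        elif decision == "Editor":
            hypothesis = editor.edit(dataset_description, hypothesis, feedback)
            history += f"\nRevised hypothesis: {hypothesis}"
    return hypothesis
```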
This is essentially the agentic version of the same pipeline. The pros and cons are analogous to those discussed in the session on Agents and Pipelines:
- more flexibility,
- but also more complexity and less predictability.
For your projects, you should decide consciously whether you need this flexibility – or whether a simple pipeline with a clear evaluation step is sufficient.
- In the same notebook, initialize the orchestrator agent as well.
- Implement the workflow shown above in real code.
- Watch happily as it all works without any issues whatsoever.
- Upload to Moodle.
Project work
Time to work on your projects!
- Discuss your project.
- Set up the repository
- Plan the project
- Start collecting data or finding a data set
- Start implementing
Happy coding!
Further Readings
- The Tiny Agents blog post really helps with understanding agents.
- Here is a video describing other multi-agent systems, including an agent hospital and a multi-agent translator.
