```mermaid
flowchart LR
A[Data description] --> B(Hypothesis Generator)
B --> C[Low-level questions]
C --> D(Query Agent)
D --> E["`Intermediate answers
(tables, stats, text)`"]
E --> F(Summarizer)
F --> G[Insight]
```
LLM-based data analysis
Introduction & Motivation
In this session we focus on LLM-based data analysis: using a Large Language Model as a core component in a data analysis pipeline.
By now you have seen:
- basic LLM usage and prompting,
- chatbots and function calling,
- embedding-based retrieval and RAG,
- agent systems and LLM pipelines,
- LLM-as-a-judge style evaluation.
Today we zoom in on the “data analysis” capability that your final project should support:
A system that can talk to a dataset and decide which analysis steps to perform, then execute them and turn the results into human-readable insights.
We will:
- clarify key concepts (hypotheses, questions, insights, claims),
- design and implement a simple LLM-driven data analysis pipeline with:
  - a hypothesis generator,
  - a query agent (natural language → pandas),
  - a summarizer,
  - an optional verifier (LLM as judge),
- and connect this pipeline explicitly to your own project.
Throughout, we use a running example from e-commerce analytics.
Key Concepts
In LLM-based data analysis we need clear language to talk about what the system is doing. We will use the following terms, adapted from Pérez et al. (2025) and Bai et al. (2025):
Hypothesis
A hypothesis is a testable, domain-specific assumption about the data.
- It usually relates multiple variables and suggests a pattern.
- It should be falsifiable – there must be a way to check whether it is wrong.
Example (e-commerce):
“Mobile users have a higher return rate than desktop users.”
This is a prediction about the relationship between device_type and is_return.
Question
A question is a concrete, answerable request about the dataset.
- It is closer to an actual query.
- It often mentions specific columns or aggregations.
Example:
“What is the average return rate by device_type?”
Questions are typically what the query agent will turn into pandas or SQL code.
Hypotheses vs. questions
- Hypotheses are high-level, interpretive.
- Questions are low-level, operational steps that help test or explore hypotheses.
Insight
An insight is a short textual finding derived from the data.
- It should be concise (e.g. no more than 2–3 sentences).
- It often synthesizes multiple low-level questions / analyses.
Example:
“Mobile users show a higher return rate (18%) than desktop users (12%), especially in the last quarter.
This suggests that mobile purchasing behaviour is more exploratory or less informed.”
Insights are what you ultimately want to show to users or stakeholders.
Claim
An insight consists of one or more claims.
A claim is an atomic factual statement that can be true or false.
Example claims from the insight above:
- “Mobile users have an 18% return rate.”
- “Desktop users have a 12% return rate.”
- “The difference is more pronounced in the last quarter.”
Each of these can, in principle, be checked against the data.
Structuring insights as lists of claims is extremely helpful for evaluation and verification.
Correctness of an insight
Given that an insight is a set of claims, we can define its correctness as:
- the fraction of claims that are factually correct with respect to the data.
If an insight has 4 claims and 3 are true, correctness = 3/4.
In practice we often approximate this using:
- direct checks on numeric aggregates, and/or
- an LLM acting as a judge (given evidence).
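As a minimal sketch, correctness is just a fraction. The snippet below assumes the claim labels come from manual checks or from a verifier like the one at the end of this session:

```python
def insight_correctness(claim_labels):
    # claim_labels: e.g. ["correct", "correct", "correct", "incorrect"]
    return sum(label == "correct" for label in claim_labels) / len(claim_labels)

insight_correctness(["correct", "correct", "correct", "incorrect"])  # 3 of 4 -> 0.75
```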
Insightfulness of an insight
Insightfulness is more subjective.
Intuitively, an insight is more insightful if it is:
- relevant to the user’s goals,
- novel (not obvious),
- actionable (it could change a decision),
- or provides a meaningful explanation.
Insightfulness depends on:
- the user (a beginner vs. expert),
- the domain,
- and the context of use.
Unlike correctness, these properties cannot be calculated from the dataset alone and must be estimated by the user (Pérez et al., 2025).
Completeness of an answer
Completeness asks:
“Does this answer cover the important aspects of the question?”
For example, if the question is:
“How do return rates differ by device type and country?”
then an answer that only discusses device type but ignores country is incomplete.
Completeness becomes especially important when:
- multiple sub-questions are combined,
- or when the system is supposed to give a high-level summary.
Why these concepts matter for LLM-based data analysis
When you plug an LLM into a data analysis pipeline, it will happily produce text. Without structure, however, it is hard to tell:
- whether the text is factually correct,
- whether it is actually insightful,
- whether it is complete enough to support decisions.
By explicitly using hypotheses, questions, insights, claims, correctness, insightfulness and completeness:
- you can design better pipelines,
- you can build simple metrics for quality,
- and you can integrate LLM-as-a-judge components more systematically.
We will now see how these ideas shape a concrete pipeline architecture.
Pipeline Architecture
We will use a 4-component architecture:
- Hypothesis generator
- Query agent
- Summarizer
- Optional verifier
Data flow (intuitively):
- The hypothesis generator takes a high-level question or a description of the dataset and proposes high-level hypotheses and low-level questions.
- The query agent turns each low-level question into code (e.g. pandas), executes it safely and returns concise answers.
- The summarizer aggregates low-level answers into human-readable insights and explicit claims.
- The optional verifier (not shown) checks these claims for correctness and flags issues.
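As a rough orientation, this data flow can be written down as a short orchestration sketch. All function names below are placeholders for the components implemented in the following sections (or your own versions), and the exact return formats are assumptions:

```python
def run_pipeline(df, dataset_description, high_level_question):
    # 1. Hypothesis generator: hypotheses plus low-level questions per hypothesis
    #    (assumed to return a dict {hypothesis_text: [low_level_questions]})
    hypotheses = generate_hypotheses(dataset_description, high_level_question)

    # 2. Query agent: answer each low-level question against the DataFrame
    answers = {}
    for hypothesis, questions in hypotheses.items():
        for question in questions:
            answers[question] = run_query_agent(question, df)

    # 3. Summarizer: (question, answer) pairs -> insights with explicit claims
    insights = summarize_insights(answers)

    # 4. Optional verifier: label each claim as correct / incorrect / uncertain
    verdicts = [verify_claims(insight["claims"], answers) for insight in insights]
    return insights, verdicts
```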
Running Example (Use Case)
We use a realistic e-commerce analytics scenario as a running example.
Dataset description
Imagine a transactional dataset of an online shop. Each row is a line item in an order.
Example columns:
- `order_id` – unique order identifier
- `customer_id` – unique customer identifier
- `order_date` – timestamp of the order
- `country` – country of the customer
- `device_type` – “mobile” or “desktop”
- `channel` – “web”, “app”, “marketplace”, …
- `product_category` – category of the item
- `quantity` – number of units
- `unit_price` – price per unit
- `revenue` – computed revenue (quantity * unit_price)
- `is_return` – boolean, whether the item was returned
- `is_new_customer` – boolean, first-time customer or not
We assume tens of thousands of rows.
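If you do not have real data at hand yet, a small synthetic DataFrame with these columns is enough to develop the pipeline against. The value ranges and probabilities below are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000  # the real dataset would have tens of thousands of rows

df = pd.DataFrame({
    "order_id": np.arange(n),
    "customer_id": rng.integers(1, 3_000, n),
    "order_date": pd.Timestamp("2024-01-01") + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
    "country": rng.choice(["DE", "FR", "NL", "US"], n),
    "device_type": rng.choice(["mobile", "desktop"], n, p=[0.6, 0.4]),
    "channel": rng.choice(["web", "app", "marketplace"], n),
    "product_category": rng.choice(["clothing", "electronics", "home", "toys"], n),
    "quantity": rng.integers(1, 5, n),
    "unit_price": rng.uniform(5, 200, n).round(2),
    "is_new_customer": rng.random(n) < 0.3,
})
df["revenue"] = df["quantity"] * df["unit_price"]
# made-up return probabilities, chosen so the running hypothesis has something to find
df["is_return"] = rng.random(n) < np.where(df["device_type"] == "mobile", 0.18, 0.12)
```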
Example high-level questions
- “Which customer segments contribute most to total revenue?”
- “How do return rates differ by country and device type?”
- “Are new customers more likely to return items than returning customers?”
- “Did revenue patterns change over the last 6 months?”
Example low-level questions
For the high-level question:
“Which customer segments contribute most to total revenue?”
low-level questions might include:
- Q1: “What is the total revenue per country?”
- Q2: “What is the total revenue per device_type?”
- Q3: “What is the average revenue per order for new vs. returning customers?”
- Q4: “How many unique customers are active in each country?”
For the hypothesis:
“Mobile users have a higher return rate than desktop users.”
we might ask:
- Q5: “What is the overall return rate per device_type?”
- Q6: “How has the return rate for mobile vs. desktop changed over time?”
We will use this fictitious dataset and these types of questions in the code examples.
Hypothesis Generator
We introduced a hypothesis generator in the last session. Today, we will expand it to additionally break a hypothesis down into smaller questions. The hypothesis generator is responsible for:
- turning a high-level task or dataset description into
  - high-level, domain-relevant hypotheses and/or
  - concrete low-level questions that the query agent can answer.
Concept
Input:
- a short description of the dataset (columns, domain),
- optionally a high-level question or user goal.
Output:
- a list of high-level hypotheses, and,
- for each hypothesis, a list of low-level questions.
Prompt design matters here:
- If the dataset description is too detailed, the model may stay “shallow” and largely paraphrase what it was given.
- If it is too vague, the generated questions may not be answerable from the available columns.
Example prompts
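A prompt for the low-level breakdown could look roughly like the sketch below. It assumes a helper `call_llm(prompt)` that wraps whatever chat-completion client you used in earlier sessions and returns the model’s reply as a string; the prompt wording and the JSON output format are only suggestions.

```python
import json

LOW_LEVEL_PROMPT = """You are a data analyst. The dataset has these columns: {columns}.
High-level question or hypothesis: {question}
Propose 3-5 low-level questions that can each be answered with a single pandas
aggregation over these columns. Return a JSON list of strings, nothing else."""

def generate_low_level_questions(question, columns):
    # call_llm(prompt) -> str is assumed to exist (see earlier sessions)
    prompt = LOW_LEVEL_PROMPT.format(columns=", ".join(columns), question=question)
    reply = call_llm(prompt)
    return json.loads(reply)  # in practice: add error handling for malformed JSON
```

For example, `generate_low_level_questions("Which customer segments drive the most revenue?", list(df.columns))` should return questions similar to Q1–Q4 above.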
Build a hypothesis generator. You already built the high-level generator last time; expand it to include the low-level breakdown.
- Find an interesting dataset (e.g. from the World Bank) and load it. You can, of course, also use the fictitious e-commerce dataset from the example, but it is more fun with real data.
- Implement a function `generate_low_level_questions` similar to the example above.
- Test it with at least two different high-level questions, for example:
- “Which customer segments drive the most revenue?”
- “Do new customers behave differently from returning customers?”
- Inspect the generated questions:
- Are they answerable with the available columns?
- Do they match the high-level question?
Goal: Practice the distinction between high-level and low-level questions and build a first pipeline component.
Query Agent
The query agent is responsible for:
- turning low-level natural language questions into pandas code,
- executing this code safely,
- returning short, interpretable answers.
Concept
For each low-level question:
- Generate pandas code using an LLM.
- Execute the code in a restricted environment.
- Extract a small, human-readable answer (e.g. `result.head()`).
Step 1: Natural language → pandas code
Typical pattern:
User question:
"What is the return rate per device_type?"
Expected code:

```python
result = (
    df.groupby("device_type")["is_return"]
      .mean()
      .sort_values(ascending=False)
)
```

We want the LLM to:
- use the correct column names,
- use simple, safe pandas operations,
- assign the final result to a variable `result`.
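A corresponding code-generation function might look like the following sketch, again assuming the `call_llm(prompt)` helper; the prompt wording is an assumption:

```python
CODE_PROMPT = """You are a careful data analyst. A pandas DataFrame `df` has these columns: {columns}.
Write pandas code that answers the question below.
- Use only `df` and `pd`; no imports, no file or network access.
- Assign the final result to a variable named `result`.
Question: {question}
Return only the code, without explanations."""

def generate_pandas_code(question, columns):
    prompt = CODE_PROMPT.format(columns=", ".join(columns), question=question)
    code = call_llm(prompt)
    # models often wrap code in markdown fences; strip backticks and a leading "python"
    code = code.strip().strip("`").strip()
    if code.startswith("python"):
        code = code[len("python"):].strip()
    return code
```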
Step 2: Safe execution
The dangerous version would be:

```python
exec(generated_code)
```

This allows arbitrary code execution. We need safeguards:
- Restrict the globals and locals dictionaries passed to `exec`.
- Forbid imports and access to sensitive modules like `os`, `sys`, `subprocess`.
- Optionally inspect the generated code before execution.
Minimal example idea:

```python
SAFE_GLOBALS = {
    "__builtins__": {},  # or a small subset
    "pd": pd,
}
SAFE_LOCALS = {
    "df": df,
}
exec(code, SAFE_GLOBALS, SAFE_LOCALS)
result = SAFE_LOCALS["result"]
```

This is still not bullet-proof, but much safer than unconstrained `exec`.
Alternative safeguards
- Whitelist allowed functions and methods (e.g. `df.groupby`, `mean`, `sum`, `count`, `value_counts`).
- Blacklist certain patterns (`import`, `__`, `open(`, `os.`).
- Let an LLM-as-a-judge check whether the code is safe before execution.
- Use sandboxing (a separate process or container, e.g. a Docker environment) with resource limits.
Each approach trades off safety, flexibility and implementation complexity. The safest option, of course, is to ask an expert. (On the other hand, the expert will almost certainly just say “NO”.)
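One way to combine a crude pattern blacklist with the restricted `exec` from above is sketched below. This is a teaching sketch, not a production sandbox; the forbidden-pattern list and the truncation to `result.head(10)` are arbitrary choices, and `generate_pandas_code` is the function sketched in Step 1.

```python
import pandas as pd

FORBIDDEN_PATTERNS = ["import", "__", "open(", "os.", "sys.", "subprocess", "eval(", "exec("]

def safe_execute(code, df):
    lowered = code.lower()
    for pattern in FORBIDDEN_PATTERNS:
        if pattern in lowered:
            raise ValueError(f"Refusing to execute code containing {pattern!r}")
    safe_globals = {"__builtins__": {}, "pd": pd}  # no builtins, only pandas
    safe_locals = {"df": df}
    exec(code, safe_globals, safe_locals)
    return safe_locals.get("result")

def run_query_agent(question, df):
    code = generate_pandas_code(question, list(df.columns))
    result = safe_execute(code, df)
    # keep the answer small and human-readable for the summarizer
    return str(result.head(10)) if hasattr(result, "head") else str(result)
```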
Example
- Load your dataset into a DataFrame `df`. Here, you will want an actual dataset.
- Implement the functions:
  - `generate_pandas_code(question, columns)`
  - `safe_execute(code, df)`
  - `run_query_agent(question, df)`
- Test your query agent on at least three questions, for example:
- “What is the total revenue per country?”
- “What is the return rate per device_type?”
- “What is the average order revenue for new vs. returning customers?”
- Inspect:
- Does the generated code execute without errors?
- Are the results correct and interpretable?
- Did your safety checks catch any problematic code?
Goal: See how LLM-driven code generation and controlled execution can work together.
Summarizer
The summarizer turns multiple low-level answers into:
- more coherent, high-level insights and
- explicitly structured claims.
Input:
- a set of (low-level question, answer) pairs.
Output:
- 1–3 short insight texts,
- for each insight, a list of claims.
Why structure matters
Without structure:
- it is hard to evaluate correctness and completeness,
- we cannot easily let another component (verifier) check each statement.
With explicit claims:
- we can later run a verification step claim-by-claim,
- we can trace each claim back to the questions and data that support it.
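A summarizer along these lines could be prompted as in the sketch below, again assuming the `call_llm(prompt)` helper; the JSON schema is one possible choice, not a fixed format:

```python
import json

SUMMARY_PROMPT = """You are a data analyst. Below are low-level questions and their answers:
{qa_block}
Write 1-3 short insights (at most 2-3 sentences each). For every insight, also list the
atomic claims it contains. Return JSON of the form:
[{{"insight": "...", "claims": ["...", "..."]}}]"""

def summarize_insights(answers):
    qa_block = "\n".join(f"- {question}: {answer}" for question, answer in answers.items())
    reply = call_llm(SUMMARY_PROMPT.format(qa_block=qa_block))
    return json.loads(reply)  # list of {"insight": ..., "claims": [...]} dicts
```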
- Use your query agent (or hand-written pandas code) to generate answers to at least 2–3 low-level questions.
- Build a dictionary `answers = {"Q1": "...", "Q2": "...", ...}`.
- Implement `summarize_insights(answers)`.
- Inspect the output:
- Are the insights short and understandable?
- Are the claims atomic and clearly grounded in the answers?
- (Optional) Manually label each claim as correct/incorrect/uncertain based on the data.
Goal: Practice turning fragmented analysis results into structured insights and claims.
Verifier
Even with structured claims, we often lack a clear ground truth. We can still ask an LLM to act as a verifier or judge.
Concept
The verifier receives:
- the list of claims,
- the low-level answers or relevant tables/statistics as evidence.
It outputs, for each claim:
- a label: “correct”, “incorrect”, or “uncertain”,
- a short explanation.
This is a form of LLM-as-a-judge applied to data analysis.
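A minimal verifier sketch, once more assuming the `call_llm(prompt)` helper, with the evidence passed as the dictionary of low-level answers and a suggested JSON output format:

```python
import json

VERIFY_PROMPT = """You are a strict fact checker. Evidence from the data analysis:
{evidence}
For each claim below, decide whether it is "correct", "incorrect" or "uncertain" given
only this evidence, and give a one-sentence explanation. Return JSON of the form:
[{{"claim": "...", "label": "...", "explanation": "..."}}]
Claims:
{claims}"""

def verify_claims(claims, evidence):
    prompt = VERIFY_PROMPT.format(
        evidence="\n".join(f"- {q}: {a}" for q, a in evidence.items()),
        claims="\n".join(f"- {claim}" for claim in claims),
    )
    return json.loads(call_llm(prompt))
```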
- Take 3–4 claims generated by your summarizer.
- Collect the corresponding low-level answers as evidence.
- Design a prompt (or implement a function) similar to `verify_claims` above.
- Run the verifier and inspect the labels and explanations.
- Reflect:
- Does the verifier correctly identify clearly true/false claims?
- When does it answer uncertain?
- Where could this be integrated into your project (e.g. as a filter, a warning, or part of an evaluation pipeline)?
Goal: Explore the idea of using an LLM as a judge for data analysis results.
Reflection & Project Transfer
In your final projects you can integrate a non-trivial data analysis as a core component. The pipeline in this session is one candidate building block.
Use the following questions to reflect and sketch how LLM-based data analysis can fit your project.
In your project teams:
- Data & scope
- Which tables or datasets in your project are suitable for LLM-based hypothesis generation and exploration?
- Are they clean and structured enough, or do you need a preprocessing step first?
- Role of the LLM
- Which parts of the analysis do you want the LLM to handle (hypotheses, questions, code generation, interpretation)?
- Which parts should remain classical, hand-written data analysis code?
- Error impact & responsibility
- What happens if your query agent generates wrong code or your summarizer produces incorrect insights?
- How critical would these mistakes be for your intended users?
- Interpretability & transparency
- How will you expose underlying questions, code and data to users so they can understand where an insight comes from?
- Will you show claims, confidence scores, or links to data?
- Safeguards
- Which safeguards will you implement (whitelists, sandboxing, LLM-based verification, human review loops)?
- At which points in the pipeline (code generation, execution, summarization) are safeguards most important?
Summary
In this session you:
- learned a vocabulary for LLM-based data analysis:
- hypotheses, questions, insights, claims,
- correctness, insightfulness, completeness,
- designed and implemented a simple pipeline:
- Hypothesis generator → Query agent → Summarizer → (Verifier),
- explored safeguards for executing LLM-generated code, and
- reflected on how to integrate such pipelines into your final project.
In your project work, you can now:
- embed a “talk to your dataset” feature,
- evaluate it using LLM-as-a-judge components,
- and reason more systematically about the quality and safety of LLM-based data analysis.