Prompting

Prompting refers to using the ability of language models to perform a task from zero- or few-shot instructions alone. This ability, which we briefly touched on when discussing the history of language models (i.e., the paper by Radford et al. (2019)), is one of the most important aspects of modern large language models.

Prompting can be used for various tasks such as text generation, summarization, question answering, and many more.

Instruct-tuned models

Instruct-tuned models are trained on a dataset (for an example, see Figure 4.1) that consists of instructions and their corresponding outputs, separated by special tokens. This is different from the pretraining phase of language models, where they are trained on large amounts of text data without any specific task in mind. The goal of instruct-tuning is to make the model better at following instructions and at generating more accurate and relevant outputs.

Fig 4.1: An example of a dataset that can be used for instruct-finetuning. This dataset can be found on Hugging Face

These finetuning datasets are formatted into a specific structure, usually in the form of a chat template. As you can see in the following quite simple example from the SmolLM2 Hugging Face repo, the messages are separated by special tokens and divided into a system message and further messages marked with the relevant role:

{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}

There are usually three types of messages used in the context of instruction tuning and the usage of instruct models (a minimal sketch of how these roles are used follows the list):

  1. System prompts which tell the model its general role and behavior.
  2. User prompts, which contain the actual instructions or questions for the model to respond to.
  3. Assistant prompts, which are the responses generated by the model based on the user’s input, or, during the training phase, the answers that the model should learn to generate.
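To see how these roles and the chat template fit together, here is a minimal sketch (assuming the transformers library and the public SmolLM2-135M-Instruct checkpoint) that renders a system/user message list into the string the model actually sees:

from transformers import AutoTokenizer

# Assumption: the SmolLM2 instruct checkpoint is used; any instruct model with
# a chat template would work the same way.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Write one sentence about language models."},
]

# apply_chat_template renders the messages with the template shown above;
# add_generation_prompt=True appends the '<|im_start|>assistant' header so the
# model knows it should produce the assistant message next.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)

The printed string contains the same <|im_start|>/<|im_end|> special tokens as the template shown above.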
Note📝 Task

Test the difference between instruct and non-instruct models.

Do this by trying to get a GPT-2 version (e.g., “QuantFactory/gpt2-xl-GGUF”) and a small instruct model (e.g., “Qwen/Qwen3-0.6B”) to write a small poem about the inception of the field of language modelling.

Use LM Studio to test this. Also play around with the system prompt and note the effect of changing it.

(a) A poem written by Qwen3 0.6B - a model with Instruct-Finetuning
(b) A “poem” written by GPT2 - a model without Instruct-Finetuning
Fig 4.2: A poem and a “poem”


Prompting strategies

The result of a prompted call to an LM is highly dependent on the exact wording of the prompt. This is especially true for more complex tasks, where the model needs to perform multiple steps in order to solve the task. It is not for naught that the field of “prompt engineering” has emerged. There is a veritable plethora of resources available online that discuss different strategies for prompting LMs. It has to be said, though, that the strategies that do and do not work can vary greatly between models and tasks. A bit of general advice that holds true for nearly all models is to

  1. break the task down into as many small steps as possible,
  2. be as literal and descriptive as possible, and
  3. provide examples if possible.

Since the quality of results is so highly dependent on the chosen model, it is good practice to test candidate strategies against each other and therefore to define a target on which the quality of results can be evaluated. One example of such a target could be a benchmark dataset that contains multiple examples of the task at hand.

Note📝 Task

1. Test the above-mentioned prompting strategies on the MTOP Intent Dataset and evaluate the results against each other. The dataset contains instructions and labels indicating which task each instruction was intended to trigger. Use a Python script to call one of the following three models in LM Studio for this:

  1. Phi 4 mini
  2. Qwen3 0.6B
  3. Llama 3.2 1B

Use the F1-score implemented in scikit-learn to evaluate your results (a minimal evaluation sketch follows the snippet below).

Since the dataset has a whole series of labels, use the following Python snippet (or your own approach) to reduce the label set to “GET_MESSAGE” vs. “OTHER”:

import json

# Read the MTOP test split (one JSON object per line)
data = []
with open('data/de_test.jsonl', 'r') as f:
    for line in f:
        data.append(json.loads(line))

# All labels occurring in the dataset (for reference)
possible_labels = list(set([entry['label_text'] for entry in data]))

# Reduce the task to a binary classification: 'GET_MESSAGE' vs. 'OTHER'
texts_to_classify = [
    {'example': entry['text'],
     'label': 'GET_MESSAGE' if entry['label_text'] == 'GET_MESSAGE' 
              else 'OTHER'} for entry in data
]

2. You sometimes read very specific tips on how to improve your results. Here are three that you can find from time to time:

  • Promise rewards (e.g., monetary tips) instead of threatening punishments
  • Formulate instructions affirmatively (“Do the task”) instead of negating behaviours to be avoided (“Don’t make this mistake”)
  • Let the model reason about the problem before giving an answer

Check whether these strategies improve your results. If your first instruction already results in near-perfect classification, brainstorm a more difficult task that you can validate qualitatively; let the model write a recipe or describe Kiel, for example.

3. Present your results

4. Upload your code to Moodle

Generation of synthetic texts

As we discussed before, small models can perform at an acceptable level if they are finetuned appropriately.

A good way to do this is to use a larger model to generate synthetic data that you then use for training the smaller model. This approach, sometimes called “distillation” (Xu et al., 2024), has been used successfully in many applications, for example for improving graph-database queries (Zhong et al., 2024), for improving dataset search (Silva & Barbosa, 2024), or for the generation of spreadsheet formulas (Singh et al., 2024).

Since even the largest LLMs are not perfect in general and might be even worse on some specific niche tasks, evidence suggests that a validation strategy for data generated in this way is beneficial (Kumar et al., 2024; Singh et al., 2024).

Strategies to validate the synthetic data include:

  • Using a human annotator to label part of the data to test the model’s output
  • Forcing the model to answer in a structured way that is automatically testable (e.g., by using JSON - see Tip 4.1 for an example of how to generate structured output using an API that follows the OpenAI API scheme)
  • Forcing the model to return two or more answers and checking for consistency (see the sketch after this list)
  • Combining the two approaches above (i.e., forcing the model to return multiple structured outputs (JSON, XML, YAML, …) and checking for consistency)
  • Using a second LLM/different prompt to rate the answers
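For the consistency idea, a minimal sketch could look like the following. It assumes the classify helper from the evaluation sketch above (or something like it) and simply keeps only the samples for which repeated calls agree:

def consistent_label(text, n=3):
    # Query the model several times; the sampling randomness (see the section
    # on temperature below) means the answers can differ between calls.
    answers = [classify(text) for _ in range(n)]
    # Accept the label only if every call agrees, otherwise flag it for review.
    return answers[0] if len(set(answers)) == 1 else None

checked = [(item["example"], consistent_label(item["example"]))
           for item in texts_to_classify[:20]]
flagged = [text for text, label in checked if label is None]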

There are ways to force a language model to only generate output conforming to a specific format. We have already seen one in the examples of models being instruct-tuned to conform to a given chat template. Another frequently used method is to use regular expressions to set the probability of the next token to zero if it would not conform to a given (JSON) scheme. The nicest way to use this feature in Python is to define a Pydantic model that defines the possible output formats and describes the expected field content to the model:

from openai import OpenAI
from pydantic import BaseModel, Field

# The Pydantic model defines the allowed output structure; the field
# descriptions and the regex pattern constrain what the model may generate.
class ClassificationResult(BaseModel):
    description: str = Field(description="The description of the classification result.")
    label: str = Field(pattern=r'^(GET_MESSAGE|OTHER)$', description="The classified label of the text.")


client = OpenAI(
    api_key='lm-studio',
    base_url="http://localhost:1234/v1"
)

def classify_one_sample(sample):
    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                # /no_think disables Qwen3's reasoning mode for faster answers
                "content": f"Classify the following text: '{sample}' as 'GET_MESSAGE' or 'OTHER'. /no_think"
            }
        ],
        model="qwen3-0.6B",
        response_format=ClassificationResult
    )
    return chat_completion.choices[0].message.content

classify_one_sample("Gib mir die Nachricht von Peter aus!")

The descriptions are then internally added as context to the prompt, and the output is limited to the given range of values. For more possibilities for setting constraints, see the Pydantic docs; for a more detailed explanation of how the token limitations work, see this article by Outlines.
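Note that the helper above returns the raw JSON string. The parse endpoint of the OpenAI client also exposes the already validated Pydantic object, which is usually more convenient to work with; a small sketch, reusing the client and model from above:

result = client.beta.chat.completions.parse(
    messages=[{"role": "user",
               "content": "Classify the following text: 'Lies mir die SMS vor' as 'GET_MESSAGE' or 'OTHER'. /no_think"}],
    model="qwen3-0.6B",
    response_format=ClassificationResult,
)

parsed = result.choices[0].message.parsed  # a ClassificationResult instance
print(parsed.label, "-", parsed.description)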

Note📝 Task

Using your script for batch-testing different prompts, generate synthetic data for an emotion detection task based on Paul Ekman’s six basic emotions: anger, disgust, fear, happiness, sadness and surprise¹.

The generated data should consist of a sentence and the emotion that is expressed in it. Start by generating two examples for each emotion. Validate these results and adapt them if necessary. Then use these examples to generate 10 samples for each emotion.
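A possible starting point for the generation step is sketched below. It reuses the client and the structured-output approach from above; the Literal type restricting the emotion field, the temperature value and the model name are assumptions you can adapt:

from typing import Literal
from pydantic import BaseModel, Field

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

class EmotionSample(BaseModel):
    sentence: str = Field(description="A single sentence expressing the emotion.")
    emotion: Literal["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def generate_sample(emotion):
    completion = client.beta.chat.completions.parse(
        messages=[{
            "role": "user",
            "content": f"Write one short everyday sentence that clearly expresses "
                       f"the emotion '{emotion}'. /no_think"
        }],
        model="qwen3-0.6B",   # assumption: adapt to the model loaded in LM Studio
        response_format=EmotionSample,
        temperature=0.9,      # some randomness to avoid near-duplicate sentences
    )
    return completion.choices[0].message.parsed

# Two examples per emotion as a first, manually checkable batch
dataset = [generate_sample(e) for e in EMOTIONS for _ in range(2)]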

Use one of the above-mentioned (non-manual) strategies to validate the data you generated.

Upload your results to Moodle.

¹ Though this nomenclature has fallen a bit out of fashion

Temperature

You might have encountered eerily similar answers from the language model, especially in the last task. Speaking of which: why does the model return different answers to the same prompt at all, given that we use pretrained models in the first place? Shouldn’t running the same input through the frozen weight matrix result in the same answer every time?

Yes, it should. And it does.

Remember that a language model trained on language generation, as we discussed in the first session, ends in a softmax layer that returns probabilities for each token in the vocabulary. The generation pipeline does not just pick the token with the highest probability, though, but samples from this distribution. This means that even if the input is identical, the output can differ every time you run the model.

The temperature parameter controls the sharpness of the softmax function and thus the randomness of the sampling process. A higher temperature value results in more random outputs, while a lower temperature value results in more “deterministic” outputs. The temperature, given as a float between 0 and 1², is used to modulate the probabilities of the next token. This is done by multiplying the model outputs (the logits) by a factor of \(\frac{1}{Temp}\) before applying the softmax.

² Depending on the implementation, temperatures above 1 are also allowed. Temperatures above 1 result in strange behaviours - see Figure 4.3.

This effectively changes the softmax formula from

\[ p_{Token} = \frac{e^{z_{Token}}}{\sum_{i=1}^k e^{z_{i}}} \]

to \[ p_{Token}(Temp) = \frac{e^{\frac{z_{Token}}{Temp}}}{\sum_{i=1}^k e ^{\frac{z_{i}}{Temp}}} \]

Where

  • \(z_{Token}\) is the output for a given token
  • \(k\) is the size of the vocabulary
  • \(Temp\) is the temperature parameter (0 < \(Temp\) <= 1)
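To make the effect tangible, here is a small self-contained sketch (using NumPy and a made-up logit vector) that applies the formula above for different temperatures:

import numpy as np

def softmax_with_temperature(logits, temp):
    # Divide the logits by the temperature before applying the softmax;
    # subtracting the maximum is only for numerical stability.
    scaled = np.asarray(logits) / temp
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [4.0, 2.0, 1.0, 0.5]  # made-up model outputs for four tokens

for temp in (0.1, 0.5, 1.0, 2.0):
    print(temp, softmax_with_temperature(logits, temp).round(3))

# Low temperatures concentrate almost all probability on the most likely token;
# high temperatures flatten the distribution towards uniform.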

The effect of this temperature can be seen in Figure 4.3.

Fig 4.3: The effect of the temperature parameter on the softmax-output for a given input. The x-axis represents the temperature, the y-axis represents the token-position and the color represents the probability of the token.

Most generation frameworks additionally provide parameters called top_k and top_p. These parameters limit the set of tokens that can be selected as the next token: the probabilities are sorted in descending order, and only the k most probable tokens (top_k) or the smallest set of tokens whose cumulative probability exceeds p (top_p, also known as nucleus sampling) are considered.
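A minimal sketch of both filters on a toy probability vector (again with NumPy and made-up numbers):

import numpy as np

def top_k_filter(probs, k):
    # Keep only the k most probable tokens and renormalise.
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]
    mask = np.zeros_like(probs)
    mask[keep] = probs[keep]
    return mask / mask.sum()

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability exceeds p.
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    return mask / mask.sum()

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
print(top_k_filter(probs, 2))    # only the two most probable tokens remain
print(top_p_filter(probs, 0.9))  # tokens are added until 90% cumulative mass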

Temperature is the major setting to control an LLM’s “creativity”, though.

Note📝 Task

Using the script provided for generating synthetic data, test the effect of the temperature parameter on the output of the model.

  • Use the same prompt and the same model
  • Run the model with temperature values of 0.1, 0.5, 1.0 and 2.0 (the sketch below shows where to set the parameter)
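The temperature is simply passed as a parameter of the chat completion call; a minimal sketch, assuming the LM Studio client set up earlier:

def generate_with_temperature(prompt, temp):
    response = client.chat.completions.create(
        model="qwen3-0.6B",      # assumption: the model loaded in LM Studio
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,        # the parameter under investigation
    )
    return response.choices[0].message.content

prompt = "Write one sentence expressing surprise."
for temp in (0.1, 0.5, 1.0, 2.0):
    print(temp, "->", generate_with_temperature(prompt, temp))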

Understanding and Mitigating Hallucinations

While temperature controls the randomness of model outputs, a more fundamental challenge emerges when language models generate plausible yet incorrect information - a phenomenon known as hallucination.

What are Hallucinations?

Hallucinations occur when language models produce confident, plausible-sounding outputs that are factually incorrect or unsupported by their training data. These aren’t random errors - they’re systematic failures that arise from the statistical nature of language modeling itself.

Consider this example: When asked “What is Adam Tauman Kalai’s birthday?”, state-of-the-art models confidently produce different incorrect dates across multiple attempts, even when explicitly asked to respond only if they know the answer (Kalai et al., 2025).

Why Hallucinations Occur: A Statistical Perspective

Kalai et al. (2025) demonstrate that hallucinations emerge from the fundamental objective of language model training. They show that generating valid outputs is inherently harder than classifying output validity - a task where errors are well-understood in machine learning.

The key insight: even with perfect training data, the cross-entropy objective used in pretraining naturally leads to errors on certain types of facts. Specifically:

Arbitrary Facts: Information without learnable patterns (like birthdays of obscure individuals) will be hallucinated at rates approximately equal to the fraction of such facts appearing exactly once in training data. If 20% of birthday facts appear only once, expect ~20% hallucination rate on birthdays.

Poor Models: When model architectures cannot adequately represent certain patterns, systematic errors emerge. For example, models using only token-based representations struggle with character-level tasks like counting letters in “DEEPSEEK”.

Hallucinations as Compression Failures

Chlon et al. (2025) provide a complementary information-theoretic perspective. They show that transformers minimize expected conditional description length over input orderings rather than the permutation-invariant description length. This makes them “Bayesian in expectation, not in realization.”

Their framework introduces practical metrics for predicting hallucinations:

  • Information Sufficiency Ratio (ISR): The ratio of available information to required information for a target reliability threshold
  • Bits-to-Trust (B2T): The amount of information needed to achieve a specific confidence level

A key finding: hallucinations decrease by approximately 0.13 per additional nat of information, making the phenomenon quantitatively predictable rather than mysterious.

Why Hallucinations Persist After Training

Beyond pretraining, Kalai et al. (2025) argue that post-training and evaluation procedures actively reinforce hallucinations. Most benchmarks use binary grading (correct/incorrect) with no credit for expressing uncertainty. This creates an “epidemic” of penalizing honest uncertainty - models that guess when unsure outperform those that appropriately abstain.

Consider two models:

  • Model A: Accurately signals uncertainty, never hallucinates
  • Model B: Same as A but guesses instead of expressing uncertainty

Model B will outperform A on most current benchmarks, despite being less trustworthy.

Note📝 Task

Test hallucination behavior on a small model using LM-Studio:

  1. Use a small model (Qwen3-0.6B or similar) to answer 10 factual questions about rare entities
  2. For each question, generate 3 responses with the same temperature (a minimal batch-querying sketch follows this list)
  3. Document:
    • How often does the model give confident but incorrect answers?
    • How often does it appropriately express uncertainty?
    • How does response consistency relate to likely correctness?
  4. Now modify the prompt to explicitly encourage uncertainty expression (e.g., “Only answer if you’re very confident, otherwise say you don’t know”)
  5. Compare the results
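A small sketch for the repeated-querying part, again assuming the LM Studio client from above and a hypothetical list of questions:

questions = [
    "On which exact date was the town hall of Eckernförde inaugurated?",
    # ... add nine more questions about rare entities
]

def ask(question, n=3, temp=0.7):
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="qwen3-0.6B",  # assumption: adapt to your loaded model
            messages=[{"role": "user", "content": question}],
            temperature=temp,
        )
        answers.append(response.choices[0].message.content.strip())
    return answers

for q in questions:
    answers = ask(q)
    consistent = len(set(answers)) == 1
    print(q, "| consistent:", consistent)
    for a in answers:
        print("  -", a)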

Upload your observations to Moodle.

Further Readings

References

Chlon, L., Karim, A., & Chlon, M. (2025). Predictable Compression Failures: Why Language Models Actually Hallucinate (arXiv:2509.11208). arXiv. https://doi.org/10.48550/arXiv.2509.11208
Heidloff, N. (2023). Fine-tuning small LLMs with Output from large LLMs. In Niklas Heidloff. https://heidloff.net/article/fine-tune-small-llm-with-big-llm/.
Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why Language Models Hallucinate (arXiv:2509.04664). arXiv. https://doi.org/10.48550/arXiv.2509.04664
Kumar, B., Amar, J., Yang, E., Li, N., & Jia, Y. (2024). Selective Fine-tuning on LLM-labeled Data May Reduce Reliance on Human Annotation: A Case Study Using Schedule-of-Event Table Detection (arXiv:2405.06093). arXiv. https://doi.org/10.48550/arXiv.2405.06093
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Silva, L., & Barbosa, L. (2024). Improving dense retrieval models with LLM augmented data for dataset search. Knowledge-Based Systems, 294, 111740. https://doi.org/10.1016/j.knosys.2024.111740
Singh, U., Cambronero, J., Gulwani, S., Kanade, A., Khatry, A., Le, V., Singh, M., & Verbruggen, G. (2024). An Empirical Study of Validating Synthetic Data for Formula Generation (arXiv:2407.10657). arXiv. https://doi.org/10.48550/arXiv.2407.10657
Xu, X., Li, M., Tao, C., Shen, T., Cheng, R., Li, J., Xu, C., Tao, D., & Zhou, T. (2024). A Survey on Knowledge Distillation of Large Language Models (arXiv:2402.13116). arXiv. https://doi.org/10.48550/arXiv.2402.13116
Zhong, Z., Zhong, L., Sun, Z., Jin, Q., Qin, Z., & Zhang, X. (2024). SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task (arXiv:2406.10710). arXiv. https://doi.org/10.48550/arXiv.2406.10710