Prompting

Prompting describes the utilization of the ability of language models to use zero or few-shot instrutions to perform a task. This ability, which we briefly touched on when we were discussing the history of language models (i.e., the paper by Radford et al. (2019)), is one of the most important aspects of modern large language models.

Prompting can be used for various tasks such as text generation, summarization, question answering, and many more.

Instruct-tuned models

Instruct-tuned models are trained on a dataset (for an example, see Figure 4.1) that consists of instructions and their corresponding outputs, seperated by special tokens. This is different from the pretraining phase of language models where they are trained on large amounts of text data without any specific task in mind. The goal of instruct-tuning is to make the model better at following instructions and generating more accurate and relevant outputs.

Fig 4.1: An example for a dataset that can be used for instruct-finetuning. This dataset can be found on huggingface

These finetuning-datasets are formatted into a specific structure, usually in the form of a chat template. As you can see in the following quite simple example from the SmolLM2-Huggingface-repo, the messages are separated by special tokens and divided into system-message and messages indicated by the relevant role:

{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}

There are usually three types of messages used in the context of instruction tuning and the usage of instruct models:

System prompts which tell the model its general role and behavior.
User prompts, which contain the actual instructions or questions for the model to respond to.
Assistant prompts, which are the responses generated by the model based on the user’s input, or, in terms of the training phase, the answers that the model should learn to generate.

📝 Task

Test the difference between instruct and non-instruct-models.

Do this by trying to get a gpt2-version (i.e., “QuantFactory/gpt2-xl-GGUF”) and a small Instruct-Model (i.e., “Qwen/Qwen3-0.6B” to write a small poem about the inception of the field of language modelling.

Use LM-Studio to test this. Do also play around with the system prompt and note the effect of changing it.

(a) A poem written by Qwen3 0.6B - a model with Instruct-Finetuning

Show answer

Prompting strategies

The results of a prompted call to a LM is highly dependent on the exact wording of the prompt. This is especially true for more complex tasks, where the model needs to perform multiple steps in order to solve the task. It is not for naught that the field of “prompt engineering” has emerged. There is a veritable plethora of resources available online that discuss different strategies for prompting LMs. It has to be said though, that the strategies that work and don’t work can vary greatly between models and tasks. A bit of general advice that holds true for nearly all models though, is to

define the task in as many small steps as possible
to be as literal and descriptive as possible and
to provide examples if possible.

Since the quality of results is so highly dependent on the chosen model, it is good practice to test candidate strategies against each other and therefore to define a target on which the quality of results can be evaluated. One example for such a target could be a benchmark dataset that contains multiple examples of the task at hand.

📝 Task

1. Test the above-mentioned prompting strategies on the MTOP Intent Dataset and evaluate the results against each other. The dataset contains instructions and labels indicating on which task the instruction was intended to prompt. Use a python script to call one of the following three models in LM-Studio for this:

Use the F1-score implemented in scikit learn to evaluate your results.

Since the dataset has a whole series of labels, use the following python-snippet (or your own approach) to extract only examples using the “GET_MESSAGE”-label:

import json
data = []
with open('data/de_test.jsonl', 'r') as f:
    for line in f:
        data.append(json.loads(line))

possible_labels = list(set([entry['label_text'] for entry in data]))

texts_to_classify = [
    {'example': entry['text'],
     'label': 'GET_MESSAGE' if entry['label_text'] == 'GET_MESSAGE' 
              else 'OTHER'} for entry in data
]

2. You do sometimes read very specific tips on how to improve your results. Here are three, that you can find from time to time:

Do promise rewards (i.e., monetary tips) instead of threatening punishments
Do formulate using affirmation (“Do the task”) instead of negating behaviours to be avoided (“Don’t do this mistake”)
Let the model reason about the problem before giving an answer

Check these strategies on whether they improve your results. If your first instruction already results in near-perfect classification, brainstorm a difficult task that you can validate qualitatively. Let the model write a recipe or describe Kiel for example.

3. Present your results

3. Upload your code to moodle

Generation of synthetic texts

As we discussed before, small models can perform on an acceptable level, if they are finetuned appropriately.

A good way to do this is to use a larger model to generate synthetic data that you then use for training the smaller model. This approach, sometimes called “distillation” (Xu et al., 2024) has been used successfully in many applications, for example for improving graph-database queries (Zhong et al., 2024), for improving dataset search (Silva & Barbosa, 2024) or the generation of spreadsheet-formulas (Singh et al., 2024).

Since even the largest LLMs are not perfect in general and might be even worse on some specific niche tasks, evidence suggests that a validation strategy for data generated in this way is beneficial (Kumar et al., 2024; Singh et al., 2024).

Strategies to validate the synthetic data include:

Using a human annotator to label part of the data to test the models output
Forcing the model to answer in a structured way that is automatically testable (e.g., by using JSON - see Tip 4.1 for an example on how to generate structured output using an API that follows the OpenAI-API-scheme.)
Forcing the model to return 2 or more answers and checking for consistency
Combining the two approaches above (i.e., forcing the model to return multiple structured outputs (JSON, XML, YAML, …) and checking for consistency)
Using a second LLM/different prompt to rate the answers

Tip 4.1: Structured Output

There are ways on forcing a language model to only generate output conforming to a specific format. We have already seen one in the examples of models being instruct-tuned to conform to a given chat-template. Another often used method is to use regular expressions to set the probabilities of the next token to be zero if it would not conform to a given (JSON-)scheme. The nicest way to use this feature in python is to define a pydantic-dataclass that defines the possible output formats and describes the expected field content to the model:

import os
from openai import OpenAI
from pydantic import BaseModel, Field

class ClassificationResult(BaseModel):
    description: str = Field(description="The description of the classification result.")
    label: str = Field(pattern=r'^(GET_MESSAGE|OTHER)$', description="The classified label of the text.")



client = OpenAI(
    api_key='lm-studio',  
    base_url="http://localhost:1234/v1"
)

test_dict = {"example": "This is a test", "example2": "This is another test"}
def classify_one_sample(sample):
    chat_completion = client.beta.chat.completions.parse(
        messages=[
            {
                "role": "user",
                "content": f"Classify the label of the following text: '{sample}' as 'GET_MESSAGE', 'OTHER'. /no_think {json_string}"
            }
        ],
        model="qwen3-0.6B",
        response_format=ClassificationResult
    )
    return chat_completion.choices[0].message.content

classify_one_sample("Gib mir die Nachricht von Peter aus!")

The descriptions are the internally added as context to the prompt and the output is limited to the given data-range. For more possibilities on setting constraints see the pydantic-docs, for a more detailed explanation on how the token limitations work, see this article by outlined.

📝 Task

Using your script for batch-testing different prompts, generate synthetic data for a emotion detection task based on Paul Ekman’s six basic emotions: anger, disgust, fear, happiness, sadness and surprise¹.

The generated data should consist of a sentence and the emotion that is expressed in it. Start by generating two examples for each emotion. Validate these results and adapt them if necessary. Then use these examples to generate 100 samples for each emotion.

Use one of the above mentioned (non-manual) strategies to validate the data you generated.

Upload your results to Moodle.

¹ Though this nomenclature has fallen a bit out of fashion

Temperature

You might have encountered eerily similar answers from the language model, especially in the last task. Talking of it - why does the model return different answers to the same prompt at all if we do use pretrained-models in the first place? Shouldn’t the utilization of the frozen weight-matrix result in the same answer, every time we run the model with the same input?

Yes, it should. And it does.

Remember that a language model trained on language generation as we discussed in the first session ends in a softmax-layer that returns probabilities for each token in the vocabulary. The generation-pipeline does not just use the token with the highest probability though, but samples from this distribution. This means, that even if the input is identical, the output will be different every time you run the model.

The temperature parameter controls the steepness of the softmax-function and thus the randomness of the sampling process. A higher temperature value results in more random outputs, while a lower temperature value results in more “deterministic” outputs. The temperatur, indicated as a float between 0 and 1², is used to modulate the probabilities of the next token. This is done by adding a \(\frac{1}{Temp}\) factor to the model-outputs before applying the softmax.

² Depending on the implementation, temperatures above 1 are also allowed. Temperatures above 1 are resultsing in strange behaviours - see Figure 4.3.

This effectively changes the Sofmax-fomula from

\[ p_{Token} = \frac{e^{z_{Token}}}{\sum_{i=1}^k e^{z_{i}}} \]

to \[ p_{Token}(Temp) = \frac{e^{\frac{z_{Token}}{Temp}}}{\sum_{i=1}^k e ^{\frac{z_{i}}{Temp}}} \]

Where

\(z_{Token}\) is the output for a given token
\(k\) is the size of the vocabulary
\(Temp\) is the temperature parameter (0 < \(Temp\) <= 1)

The effect of this temperature can be seen in Figure 4.3.

A heatmap illustrating the effect of the temperature parameter on the softmax-output for a given input. The x-axis represents the temperature, the y-axis represents the token-position and the color represents the probability of the token. — Fig 4.3: The effect of the temperature parameter on the softmax-output for a given input. The x-axis represents the temperature, the y-axis represents the token-position and the color represents the probability of the token.

Most generation-frameworks do additionally provide a parameter called top_k or top_p. These parameters are used to limit the number of tokens that can be selected as the next token. This is done by sorting the probabilities in descending order and only considering the top k tokens or the top p percent of tokens.

Temperature is the mayor setting to control a LLMs “creativity” though.

📝 Task

Using the script provided for generating synthetic data, test the effect of the temperature parameter on the output of the model.

Use the same prompt and the same model
Run the model with a temperature value of 0.1, 0.5, 1.0 and 2.0

References

Heidloff, N. (2023). Fine-tuning small LLMs with Output from large LLMs. In Niklas Heidloff. https://heidloff.net/article/fine-tune-small-llm-with-big-llm/.

Kumar, B., Amar, J., Yang, E., Li, N., & Jia, Y. (2024). Selective Fine-tuning on LLM-labeled Data May Reduce Reliance on Human Annotation: A Case Study Using Schedule-of-Event Table Detection (arXiv:2405.06093). arXiv. https://doi.org/10.48550/arXiv.2405.06093

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Silva, L., & Barbosa, L. (2024). Improving dense retrieval models with LLM augmented data for dataset search. Knowledge-Based Systems, 294, 111740. https://doi.org/10.1016/j.knosys.2024.111740

Singh, U., Cambronero, J., Gulwani, S., Kanade, A., Khatry, A., Le, V., Singh, M., & Verbruggen, G. (2024). An Empirical Study of Validating Synthetic Data for Formula Generation (arXiv:2407.10657). arXiv. https://doi.org/10.48550/arXiv.2407.10657

Xu, X., Li, M., Tao, C., Shen, T., Cheng, R., Li, J., Xu, C., Tao, D., & Zhou, T. (2024). A Survey on Knowledge Distillation of Large Language Models (arXiv:2402.13116). arXiv. https://doi.org/10.48550/arXiv.2402.13116

Zhong, Z., Zhong, L., Sun, Z., Jin, Q., Qin, Z., & Zhang, X. (2024). SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task (arXiv:2406.10710). arXiv. https://doi.org/10.48550/arXiv.2406.10710

Instruct-tuned models

Prompting strategies

Generation of synthetic texts

Temperature

Further Readings

References