Prompting

Prompting describes the use of a language model's ability to perform a task from zero- or few-shot instructions alone. This ability, which we briefly touched on when discussing the history of language models (i.e., the paper by Radford et al. (2019)), is one of the most important aspects of modern large language models.

Prompting can be used for various tasks, such as text generation, summarization, question answering, and many more.

Instruct-tuned models

Instruct-tuned models are trained on a dataset (for an example, see Figure 4.1) that consists of instructions and their corresponding outputs. This differs from the pretraining phase, in which language models are trained on large amounts of text data without any specific task in mind. The goal of instruct-tuning is to make the model better at following instructions and generating more accurate and relevant outputs.

Fig 4.1: An example of a dataset that can be used for instruct-finetuning. This dataset can be found on Hugging Face.
📝 Task

Test the difference between instruct and non-instruct models.

Do this by trying to get a GPT-2 version (i.e., “QuantFactory/gpt2-xl-GGUF”) and a small Llama 3.2 Instruct model (i.e., “hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF”) to write a small poem about the inception of the field of language modelling.

Use LM Studio to test this.

(a) A poem written by Llama 3.2 1B - a model with instruct-finetuning
(b) A “poem” written by GPT-2 - a model without instruct-finetuning
Fig 4.2: A poem and a “poem”


Prompting strategies

The result of a prompted call to an LM is highly dependent on the exact wording of the prompt. This is especially true for more complex tasks, where the model needs to perform multiple steps to solve the task. It is not for naught that the field of “prompt engineering” has emerged. There is a veritable plethora of resources available online that discuss different strategies for prompting LMs. It has to be said, though, that the strategies that do and don't work can vary greatly between models and tasks. A bit of general advice that holds true for nearly all models is to

  1. define the task in as many small steps as possible,
  2. be as literal and descriptive as possible, and
  3. provide examples if possible (see the example prompt below).
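
As an illustration (a made-up sentiment-labelling task, not one from this chapter), a prompt that follows all three points might look like this:

```python
# A hypothetical prompt applying all three pieces of advice:
# small steps, literal instructions, and a worked example.
prompt_template = """You will classify the sentiment of a sentence.
Work in small steps:
1. Read the sentence carefully.
2. Decide whether it expresses a positive or a negative feeling.
3. Answer with exactly one word: "positive" or "negative".

Example:
Sentence: "The lecture was surprisingly fun."
Answer: positive

Sentence: "{sentence}"
Answer:"""

print(prompt_template.format(sentence="The coffee was cold again."))
```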

Since the quality of the results depends so heavily on the chosen model, it is good practice to test candidate strategies against each other and therefore to define a target against which the quality of the results can be evaluated. One example of such a target is a benchmark dataset that contains multiple examples of the task at hand.

📝 Task

1. Test the above-mentioned prompting strategies on the MTOP Intent Dataset and evaluate the results against each other. The dataset contains instructions and labels indicating which task each instruction was intended to trigger. Use a Python script to call one of the following three models in LM Studio for this:

  1. Phi 3.1 mini
  2. Gemma 2 2B
  3. Llama 3.2 1B

Use the F1-score implemented in scikit-learn to evaluate your results (a minimal sketch of such a script follows).
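
A minimal sketch of such a script, assuming LM Studio's local server is running on its default port (1234) with one of the models loaded; the model identifier, prompt wording, and intent subset are placeholders you will need to adapt:

```python
# Minimal sketch: classify MTOP utterances via LM Studio's OpenAI-compatible
# server and score the predictions with scikit-learn's F1.
from openai import OpenAI
from sklearn.metrics import f1_score

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

LABELS = ["alarm", "messaging", "music"]  # hypothetical subset of MTOP intents

def classify(utterance: str) -> str:
    response = client.chat.completions.create(
        model="llama-3.2-1b-instruct",  # whichever model you loaded
        messages=[
            {"role": "system",
             "content": f"Classify the request into one of: {', '.join(LABELS)}. "
                        "Answer with the label only."},
            {"role": "user", "content": utterance},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content.strip().lower()

gold = ["alarm", "music"]  # replace with the dataset's labels
pred = [classify("wake me at 7 am"), classify("play some jazz")]
print(f1_score(gold, pred, labels=LABELS, average="macro"))
```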

2. You sometimes read very specific tips on how to improve your results. Here are three that you can find from time to time:

  • Promise rewards (e.g., monetary tips) instead of threatening punishments
  • Formulate instructions affirmatively (“Do the task”) instead of negating behaviours to be avoided (“Don’t make this mistake”)
  • Let the model reason about the problem before giving an answer

Check whether these strategies improve your results. If your first instruction already results in near-perfect classification, brainstorm a more difficult task that you can validate qualitatively. Let the model write a recipe or describe Kiel, for example.

3. Present your results

4. Upload your code to Moodle

Generation of synthetic texts

As we discussed before, small models can perform at an acceptable level if they are finetuned appropriately.

A good way to do this is to use a larger model to generate synthetic data that you then use for training the smaller model (Heidloff, 2023). This approach has been used successfully in many applications, for example for improving graph-database queries (Zhong et al., 2024), for improving dataset search (Silva & Barbosa, 2024), or for generating spreadsheet formulas (Singh et al., 2024).

Since even the largest LLMs are not perfect in general and might be even worse on some specific niche tasks, evidence suggests that a validation strategy for data generated in this way is beneficial (Kumar et al., 2024; Singh et al., 2024).

Strategies to validate the synthetic data include (a sketch of two of them follows the list):

  • Using a human annotator to label part of the data to test the model’s output
  • Forcing the model to answer in a structured way that is automatically testable (e.g., by using JSON)
  • Forcing the model to return two or more answers and checking for consistency
  • Combining the two approaches above (i.e., forcing the model to return multiple structured outputs (JSON, XML, YAML, …) and checking for consistency)
  • Using a second LLM or a different prompt to rate the answers
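
As an illustration of the second and third strategies combined (structured answers plus a consistency check), here is a minimal sketch. The `ask` callable, which sends a prompt to your model and returns the raw response string, is assumed; the label set anticipates the emotion task below:

```python
# Minimal sketch: force a JSON answer and sample it twice; a sample is only
# accepted if both answers parse, are valid labels, and agree.
import json

EMOTIONS = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}

def validated_label(sentence: str, ask) -> str | None:
    prompt = (
        "Which of these emotions does the sentence express: "
        f"{', '.join(sorted(EMOTIONS))}? "
        'Answer as JSON only: {"emotion": "<label>"}\n'
        f"Sentence: {sentence}"
    )
    answers = []
    for _ in range(2):  # ask twice to check for consistency
        try:
            label = json.loads(ask(prompt))["emotion"]
        except (json.JSONDecodeError, KeyError):
            return None  # not parseable -> reject the sample
        if label not in EMOTIONS:
            return None  # not a valid label -> reject the sample
        answers.append(label)
    return answers[0] if answers[0] == answers[1] else None
```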
📝 Task

Using your script for batch-testing different prompts, generate synthetic data for an emotion-detection task based on Paul Ekman’s six basic emotions: anger, disgust, fear, happiness, sadness and surprise¹.

The generated data should consist of a sentence and the emotion that is expressed in it. Start by generating two examples for each emotion. Validate these results and adapt them if necessary. Then use these examples to generate 100 samples for each emotion.

Use one of the above-mentioned (non-manual) strategies to validate the data you generated.

Upload your results to Moodle.

¹ Though this nomenclature has fallen a bit out of fashion.

Temperature

You might have encountered eerily similar answers from the language model, especially in the last task. Speaking of which: why does the model return different answers to the same prompt at all, given that we use pretrained models in the first place? Shouldn’t using the frozen weight matrix result in the same answer every time we run the model with the same input?

Yes, it should. And it does.

Remember that a language model trained on language generation, as we discussed in the first session, ends in a softmax layer that returns probabilities for each token in the vocabulary. The generation pipeline does not simply take the token with the highest probability, though, but samples from this distribution. This means that even with identical input, the output can differ every time you run the model.

The temperature parameter controls the steepness of the softmax function and thus the randomness of the sampling process. A higher temperature value results in more random outputs, while a lower temperature value results in more “deterministic” outputs. The temperature, indicated as a float between 0 and 1², is used to modulate the probabilities of the next token. This is done by multiplying the model outputs by a factor of \(\frac{1}{Temp}\) before applying the softmax.

² Depending on the implementation, temperatures above 1 are also allowed. Temperatures above 1 result in strange behaviours - see Figure 4.3.

This effectively changes the softmax formula from

\[ p_{Token} = \frac{e^{z_{Token}}}{\sum_{i=1}^k e^{z_{i}}} \]

to \[ p_{Token}(Temp) = \frac{e^{\frac{z_{Token}}{Temp}}}{\sum_{i=1}^k e^{\frac{z_{i}}{Temp}}} \]

Where

  • \(z_{Token}\) is the output for a given token
  • \(k\) is the size of the vocabulary
  • \(Temp\) is the temperature parameter (\(0 < Temp \leq 1\))
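
A small numerical sketch of this rescaling (the logits are made up) shows how lower temperatures sharpen the distribution:

```python
# Minimal sketch: temperature-scaled softmax over made-up logits.
import numpy as np

def softmax_with_temperature(z: np.ndarray, temp: float) -> np.ndarray:
    scaled = z / temp        # divide the logits by the temperature
    scaled -= scaled.max()   # subtract the maximum for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.5, 0.1])  # model outputs for four tokens
for temp in (0.1, 0.5, 1.0):
    print(temp, np.round(softmax_with_temperature(z, temp), 3))
# At temp=1.0 the ordinary softmax is recovered; at temp=0.1 nearly all
# probability mass sits on the most likely token. Sampling then draws the
# next token with e.g. np.random.choice(len(z), p=...).
```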

The effect of this temperature can be seen in Figure 4.3.

Fig 4.3: The effect of the temperature parameter on the softmax-output for a given input. The x-axis represents the temperature, the y-axis represents the token-position and the color represents the probability of the token.

Most generation frameworks additionally provide parameters called top_k and top_p. These parameters limit the set of tokens that can be selected as the next token: the probabilities are sorted in descending order, and top_k keeps only the k most probable tokens, while top_p (also known as nucleus sampling) keeps the smallest set of most probable tokens whose cumulative probability reaches p. A sketch of both filters is shown below.
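
A minimal sketch of both filters, assuming `p` is a probability vector as produced by the softmax above:

```python
# Minimal sketch: top-k and top-p (nucleus) filtering of a probability vector.
import numpy as np

def top_k_filter(p: np.ndarray, k: int) -> np.ndarray:
    keep = np.argsort(p)[-k:]   # indices of the k most probable tokens
    out = np.zeros_like(p)
    out[keep] = p[keep]
    return out / out.sum()      # renormalize before sampling

def top_p_filter(p: np.ndarray, top_p: float) -> np.ndarray:
    order = np.argsort(p)[::-1]                      # descending probability
    cumulative = np.cumsum(p[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest set with mass >= top_p
    out = np.zeros_like(p)
    out[order[:cutoff]] = p[order[:cutoff]]
    return out / out.sum()
```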

Temperature is the major setting to control an LLM's “creativity”, though.

📝 Task

Using the script provided for generating synthetic data, test the effect of the temperature parameter on the output of the model (a sketch follows below).

  • Use the same prompt and the same model
  • Run the model with a temperature value of 0.1, 0.5, 1.0 and 2.0
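
A minimal sketch of such a sweep, reusing the LM Studio client from the classification example above; the model identifier and the prompt are placeholders:

```python
# Minimal sketch: run the same prompt at several temperatures and compare.
for temp in (0.1, 0.5, 1.0, 2.0):
    response = client.chat.completions.create(
        model="llama-3.2-1b-instruct",  # whichever model you loaded
        messages=[{"role": "user",
                   "content": "Write one sentence expressing happiness."}],
        temperature=temp,
    )
    print(f"temp={temp}: {response.choices[0].message.content}")
```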

Further Readings

References

Heidloff, N. (2023). Fine-tuning small LLMs with output from large LLMs. Niklas Heidloff. https://heidloff.net/article/fine-tune-small-llm-with-big-llm/
Kumar, B., Amar, J., Yang, E., Li, N., & Jia, Y. (2024). Selective Fine-tuning on LLM-labeled Data May Reduce Reliance on Human Annotation: A Case Study Using Schedule-of-Event Table Detection (arXiv:2405.06093). arXiv. https://doi.org/10.48550/arXiv.2405.06093
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Silva, L., & Barbosa, L. (2024). Improving dense retrieval models with LLM augmented data for dataset search. Knowledge-Based Systems, 294, 111740. https://doi.org/10.1016/j.knosys.2024.111740
Singh, U., Cambronero, J., Gulwani, S., Kanade, A., Khatry, A., Le, V., Singh, M., & Verbruggen, G. (2024). An Empirical Study of Validating Synthetic Data for Formula Generation (arXiv:2407.10657). arXiv. https://doi.org/10.48550/arXiv.2407.10657
Zhong, Z., Zhong, L., Sun, Z., Jin, Q., Qin, Z., & Zhang, X. (2024). SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task (arXiv:2406.10710). arXiv. https://doi.org/10.48550/arXiv.2406.10710