Function Calling

Motivation

Sometimes you want your AI assistant to handle more complex tasks than the LLM can manage out of the box. For example, you might want to ask questions about documents or datasets on your hard drive, or ask about recent events. This is a hard task for an LLM because, if it was not trained on this information, i.e. the information wasn’t part of its training data, it won’t be able to answer these questions correctly.

πŸ“ Task

Try it!

  1. Open a notebook and connect to a local LLM using LM Studio.
  2. Ask the LLM about the current weather in your location.

(I mean, sure, you could just look out the window, but we are developers here, we don’t have windows!)
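
To get started, here is a minimal sketch for connecting to LM Studio from a notebook. It assumes the LM Studio local server is running on its default port (1234) and exposes the usual OpenAI-compatible API; the model name is a placeholder for whatever model you have loaded. Without any tools, the model will typically answer that it has no access to real-time data, which is exactly the point of this exercise.

from openai import OpenAI

# LM Studio serves an OpenAI-compatible API; port 1234 is its default.
# The API key is not checked by LM Studio, so any non-empty string works.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder: use the model identifier shown in LM Studio
    messages=[{"role": "user", "content": "What is the current weather in Berlin?"}],
)
print(response.choices[0].message.content)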

To solve this problem, we need to give the LLM some tools it can use to access this additional information. This is where function calling comes into play. Function calling allows the LLM to call a function with specific parameters to get the required information. The function then returns the requested data or performs the necessary task, and the LLM can continue generating its response based on that information.

A basic workflow may look like this:

  1. The user inputs a question or task.
  2. The LLM determines if it needs additional information or assistance from external tools.
  3. If needed, the LLM calls a function with specific parameters to retrieve the required data or perform the necessary task. This means the LLM generates an answer that contains either some executable code or a JSON object with the name of the function and its parameters. These functions have to be defined in advance.
  4. The LLM response is then scanned for these elements (code or JSON object) and these are executed if possible.
  5. The function’s return value is then fed back into the LLM (usually as a new user message) for further processing.
  6. The LLM uses the returned information to continue generating its response.

Note: As stated above, the functions need to be predefined. In theory, we could just give the LLM its task and let it generate arbitrary code to be executed. This is, however, not the best idea for security reasons.
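
To make the workflow concrete, here is a rough sketch of steps 3 to 5. The function name, the JSON format, and the error handling are illustrative assumptions; agent frameworks implement this loop for you.

import json

def get_current_weather(location: str, unit: str = "celsius") -> str:
    # Hypothetical stub: a real implementation would call a weather API.
    return f"Sunny, 21 degrees {unit} in {location}"

# Only predefined functions may be called (see the note above).
TOOLS = {"get_current_weather": get_current_weather}

def run_tool_call(llm_output: str):
    """Scan the LLM response for a JSON tool call and execute it if possible."""
    try:
        call = json.loads(llm_output)
        result = TOOLS[call["name"]](**call["arguments"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # no valid tool call found; treat the output as a normal answer
    # The result would now be fed back into the LLM as a new message (step 5).
    return result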

There are two main ways to implement function calling:

  1. Structured output: Here, the LLM is tasked to generate a tool call in the form of a JSON object containing the name of the function and its parameters.
  2. Code generation: Here, we ask the LLM to generate the function calls in the form of executable Python code. Usually, we still want to restrict the LLM to using only predefined functions. Nevertheless, this can pose a severe security issue, because this approach hinges on running generated code on your machine.

Still, the challenge is to get the LLM to generate valid output. There are two main strategies to facilitate that:

  1. using a large, generalized LLM (e.g. GPT-4) with good prompt engineering and
  2. using a smaller model fine-tuned to generate function calls.

Structured output

The traditional way1 of doing function calling is to generate a JSON object containing the name of the function and its parameters. Until recently, all major agent frameworks (more on agents next time) used this approach. Here, the LLM response is scanned for a JSON object. Function name and arguments are extracted and the function is executed, if possible.

1 In this context, traditional means: people have been using it for more than a year.

Function definition

The first step in using function calling is to define the functions that the LLM can call. This is done by providing a JSON schema that describes the name of the function, its arguments and their types. The JSON schema should be provided to the LLM in the system prompt. Here is an example: 2

2 Note that this is not an executable implementation, but just a description of the function for the LLM.

{
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "arguments": {
        "location": {"type": "string"},
        "unit": {"type": "string"}
    }
}

Function name and description should be as clear as possible to make it easier for the LLM to decide which function to use and how to properly use it. Argument names and types should be as precise as possible to avoid ambiguity in the function call.

Prompting

The second step is to provide a good prompt. The prompt should make it clear to the LLM that it must generate only valid output that follows the JSON schema. Here is an example of a prompt that can be used for function calling:

You are a helpful assistant that generates function calls based on user input. Only use the functions you have been provided with.

{function definition as described above}

User: What's the weather like in Berlin?

Assistant: {
    "name": "get_current_weather",
    "arguments": {"location": "Berlin", "unit": "celsius"}
}

Another way of forcing the LLM to produce structured output is to use pydantic classes, as described last time.
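
For example, a pydantic model (assuming pydantic v2 here) can both generate the JSON schema you paste into the prompt and validate the arguments the LLM produced. The class and field names below are just made up for the weather example.

from pydantic import BaseModel, Field

class GetCurrentWeather(BaseModel):
    """Get the current weather in a given location."""
    location: str = Field(description="City name, e.g. 'Berlin'")
    unit: str = Field(default="celsius", description="'celsius' or 'fahrenheit'")

# Paste this schema into the system prompt instead of writing the JSON by hand:
print(GetCurrentWeather.model_json_schema())

# ...and validate the arguments the LLM generated:
args = GetCurrentWeather.model_validate_json('{"location": "Berlin", "unit": "celsius"}')
print(args.location, args.unit)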

πŸ“ Task

Try it!

  1. Open a notebook and connect to a local LLM using LM Studio.
  2. Define the function get_current_weather as shown above.
  3. Write a prompt that asks the LLM to generate a function call based on user input. Use prompt engineering as shown above or pydantic classes as shown last time.
  4. Test the prompt with an example input.
  5. Define other functions and try other inputs and see if the LLM generates valid output.

Challenges, finetuned models and the influence of size

The main challenge here is to get the LLM to generate a valid answer. This is not always easy, as LLMs are not usually super safe coders 😃.

  • They can hallucinate functions or arguments that do not exist.
  • They can forget to call a function.
  • They can forget to provide all required arguments.
  • They can provide the wrong type for an argument.
  • They can provide invalid values for an argument.

There are several strategies to mitigate these issues:

  1. Prompt engineering: A good prompt can help to guide the LLM towards generating valid output. This is especially true for larger models, as they have a better understanding of the world and can therefore generate more accurate responses.
  2. Finetuning: Finetuning a model on a specific task can improve its performance on that task. This is especially useful for smaller models, since finetuning makes them less likely to hallucinate functions or arguments that do not exist.
  3. Size: Larger models are better at generating valid output than smaller models. However, larger models are also more expensive to run and require more computational resources.
πŸ“ Task

Test it! (We can do it together if your hardware does not allow you to run the models.)

As above, but this time

  1. use a very small model (e.g. a small Llama model),
  2. use a model finetuned for the task (you could try this one), and
  3. use a larger model (a larger Llama in this case).

Code Generation

The exception mentioned above is the smolagents framework. Here, the default mode is code generation, but JSON mode is also supported. (We will get to know agents and the smolagents framework next time.) When using this approach, the function definitions and descriptions are given to the LLM as Python code, and the LLM is expected to generate the function call as valid Python code as well. As with structured output, function names and descriptions should be as clear as possible. Type hints might also help.
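
Here is a simplified sketch of this approach. The prompt wording, the stub implementation, and the execution step are my own assumptions; frameworks such as smolagents handle this far more carefully (sandboxing, allowed imports, retries).

import inspect

def get_current_weather(location: str, unit: str = "celsius") -> str:
    """Get the current weather in a given location."""
    return f"Sunny, 21 degrees {unit} in {location}"  # hypothetical stub

# The function's source code (not just a JSON description) goes into the prompt:
system_prompt = (
    "You are a helpful assistant. Answer by writing Python code that only calls "
    "the following function and prints the result. Output nothing but the code.\n\n"
    + inspect.getsource(get_current_weather)
)

def run_generated_code(code: str) -> None:
    """Execute the generated code, exposing only the predefined function."""
    # Security note: exec() runs arbitrary code. Only use this with trusted models,
    # ideally inside a sandboxed environment.
    exec(code, {"get_current_weather": get_current_weather})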

πŸ“ Task

Try it!

  1. In your notebook, define the weather function (and/or some other function of your choice) in Python code.
  2. Write an appropriate prompt that makes it clear that you expect Python code calling the defined function(s).
  3. Test your prompt with an example input.

As mentioned above (several times already), giving clear names and descriptions for functions, parameters, etc., will help the model generate more accurate code snippets. (PRO TIP: it will help your human coworkers understand your code as well.) Here, you have the opportunity to see the consequences in action in a safe environment, without angering fellow humans or your future self!

πŸ“ Task

Try it!

  1. In your notebook, write a well-written Python function with clear names, a descriptive docstring, and type hints. You can use the one you wrote earlier, because of course you wrote clean code!
  2. Test the function with your prompt and example inputs.
  3. Now write a badly written Python function, without clear names, descriptions, or type hints. Test it with your example inputs too. Are the results better or worse? Why do you think that’s happening?
  4. Upload your notebook to Moodle.

Further Readings

  • Here is a very nice paper about generating structured output.