
The Essential Guide to Large Language Models Structured Output, and Function Calling

Abstract painting by Fons Heijnsbroek

After almost a year of building production LLM-powered software systems, I’ve collected some thoughts I’d like to share. Some information is hard to find, other information is incomplete, and there is some misunderstanding of concepts in both management and engineering roles regarding certain LLM capabilities.

LLM structured output and function calling are two of the most efficient and powerful ways of overcoming shortcomings and bottlenecks of LLMs when you build software systems using them. They truly enable a new level of LLM-powered software.

On my team, I design and build LLM-based Customer Success / Support software systems aimed at solving 80% of user tickets, which number in the tens of thousands. Even though LLMs have been with us for some time now, sources on production systems that deliver actual business value are limited.

Production software systems with LLMs are novel, yet they hide a great deal of business value. The best practices, architectures, and approaches are quite greenfield even today, so let’s make the situation less obscure and dive deep into arguably one of the most powerful capabilities some LLMs have — structured output and function calling.

The first half of this article is non-technical and can be useful for both engineers and leaders who help their teams build LLM-based products.

The second half is technical. We’ll write code to see it all in action. Basic understanding of LLMs and some basics of Python are required. The code can be translated to any other language since we won’t use any specialized Python-exclusive libraries. It won’t contain in-depth mathematics of algorithms and can be followed by non-technical professionals as well.

At the end, we will implement two features:

  1. Parse structured data out of natural language using structured output.
  2. Build a Q&A system of stock prices using function calling.


Getting it Straight: Structured Output and Function Calling

There are many fancy names for and explanations of function calling, yet it all boils down to one statement: “Function calling is a type of structured output capability of a large language model.”

The “function calling” naming is confusing. LLMs don’t call any functions themselves; they suggest which function you should call from the pre-defined functions you provide to the LLM in a prompt. It is the same inference process that produces any other output tokens. Therefore, function calling is nothing but structured output, or rather a special case of it. Yet structured output is what enables LLMs to be useful as part of software systems. Largely, structured output enables us to integrate LLMs with classical software systems implemented in any programming language.

Citing OpenAI documentation on function calling:

Function calling allows you to more reliably get structured data back from the model.

Note the phrasing: more reliably does not equal reliably. Keep that in mind while designing deterministic software. You might look into OpenAI’s JSON mode and the seed parameter to make your completions more consistent.

Before diving any deeper, remember: all an LLM does is next-token prediction. Due to the fundamental LLM architecture, all it can possibly output is the token with the highest probability, one token at a time.

In its documentation, OpenAI refers to function calling as a capability, and for good reason. The function calling capability is achieved by fine-tuning a model, which enables it to output data in a specific way. It is not an extension or a stand-alone feature. To oversimplify, a structured-output-capable model was just trained on more JSON.

Therefore, function calling is not a real thing, it can’t hurt you; it is merely structured output: JSON-formatted output which contains the name of a function to call and the parameters for it. We will dive properly into it starting from the section How Function Calling Works. Nevertheless, structured output enables LLMs to be widely useful in production software systems, not only in demos and fundraisers.

We will see both features in action soon, and I promise you, they are simpler than they sound.

Time to build intuition. Here is the simplest explanation of both capabilities:

  1. Structured output fine-tuning allows models to output JSON more reliably, discussed in the How Structured Output Works section.
  2. Function calling is structured output with extra steps that act as Retrieval Augmented Generation (RAG), discussed in the How Function Calling Works section.

Why Do We Need All That

Three main reasons:

  1. Models don’t have all the data for various reasons
  2. Models need to be integrated with other systems
  3. Architecture of LLMs has some limitations

Let’s unpack these statements in three short sections.

Models Don’t Have All The Data

The main reasons are:

  1. Cut-off date
  2. Proprietary and private data

Models acquire their “knowledge” during the training process, and that “knowledge” is stored in the model’s weights, which cannot be easily updated on demand. To update “knowledge,” you effectively need to somehow update the model’s weights and run all the post-training routines, such as fine-tuning and human alignment.

This process is an operational and engineering-heavy effort; therefore, every model has a cut-off time point. Models have no idea what happened in the world after their training was done. Imagine that you have model X and you trained it on January 12th, 2024. Then a user asks the model about events from January 13th. The issue is that the model knows something about the past but nothing about the future; for the model, everything after January 12th is unknown.

Models get their training data largely from open sources, such as public internet data. Yet the world is full of proprietary data as well. To be useful at your organization, an LLM needs access to some of your data, but that data was not in the training set (hopefully). Classical Retrieval Augmented Generation (RAG) can be used in such cases, but it is not always a direct fit. In fact, function calling is essentially RAG-like in its purpose, and it addresses shortcomings of existing RAG methodologies.

Example: you are building an LLM-driven financial assistant system and you need to know the current price of the S&P 500 index. You use function calling to get up-to-date prices.

Models Need To Be Integrated With Other Systems

Being part of some system means a need to receive input and produce output. That is generally all that programs do, turning an input into an output. Working with LLM output without a structured output capability in software systems is possible, but integration with existing parts of the system can be a separate problem on its own.

The elephant in the room is LLM testing: unit testing, integration testing, regression testing. It is hard to test LLM-powered or LLM-enabled systems. I strongly recommend thinking testing through beforehand. It will all get out of hand faster than you imagine, and manual testing won’t cut it.

Structured output capability enables output to be integrated with other parts of the system via JSON. You can even treat a structured output-enabled LLM as some sort of API which returns JSON.

Example: you are creating a financial assistant system and you need to get user data from an opening conversation, such as name, email, and age. Here are some options at your disposal: do some post-processing and extract data using classical NLP algorithms, employ funky regular expressions, ask the LLM to process this natural language input for you, or use a structured-output-capable LLM to parse the natural language into pre-defined JSON.

There is a world of difference between output such as:

Sure, my name is Pavel Bazin, I'm 33, and my email is [email protected]

and

{
    "name": "Pavel Bazin",
    "age": 33,
    "email": "[email protected]"
}

Structured output not only makes integrations significantly simpler, it also makes tasks that would normally require elaborate NLP systems easy.

Architecture of LLMs Has Some Limitations

Models are great at semantics: meanings, concepts, and relations; but not numbers. Let’s first get some examples, and then discuss why it is the case.

Let’s take some different models, and ask them the same question: “What was the closing price of the S&P 500 on January 12th, 2021?” This question is particularly good because most of the models have financial data in their training set.

  • LLAMA 3.1 70B model says 3,803.47
  • Mixtral 8x7B model says 3,798.41

The correct answer is 3,801.19.

How come? Weren’t models trained on financial data? They were. The reason is that models are lossy. You can think of an LLM as a compression algorithm: it takes petabytes of data and squeezes them into tens or hundreds of gigabytes of weights. Lossless compression means you can get the exact same data back after decompression; lossy compression means you can’t recover the original data exactly. This is one of the properties that make models hallucinate.

Closed-source models aren’t interesting here; they are fine-tuned and invoke online lookup to cover this downside. The latest LLAMA and the somewhat outdated Mixtral both provide approximations. These approximations are representations of lossy data: the values are imprinted in the models’ weights, but they are not exact.


Another shade of the same issue is calculations. All a model does is generate one token at a time; it can’t calculate; it approximates.

Example: You are creating a financial assistant system, and you need to calculate $ 8.08^6 * 0.1^2 $. Let’s use two different models and see how they do that, and how consistently. The framework is: ask the LLM three times to compute the expression above. If the LLM uses a calculator tool, as ChatGPT does, ask it not to use it.

LLAMA 3.1 70B, question “What 8.08^6 * 0.1^2 equals to?”

  • Round 1: 2,797,941.29
  • Round 2: couldn’t reply, stuck in a loop
  • Round 3: 0.0004096

ChatGPT 4o , question “What 8.08^6 * 0.1^2 equals to?”

  • Round 1: 2621.44
  • Round 2: 2621.44
  • Round 3: 2621.44

The results for LLAMA 3.1 were horrible, and ChatGPT 4o consistently gave an incorrect answer. The correct answer is $2782.71$. That is a computational inaccuracy, off by $161.27$, which is enough to render plain LLMs unusable for any number-related work.
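
For reference, the exact value is trivial to verify outside the model with a one-line check in Python:

# Exact value of the expression the models were asked to approximate
print(round(8.08**6 * 0.1**2, 2))  # 2782.71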

Frameworks, or It’s All Simpler Than It Looks

Before diving into examples and code, let’s spend a few minutes talking about frameworks. At my job, I do not use them, and the results have been fantastic. Applied LLMs are simpler than frameworks make them look! Working with LLMs boils down to input and output; it is no harder than add(a: int, b: int) -> int. By understanding how it all works inside, you are doing yourself a favor. Instead of learning added complexity, focus on what actually delivers value.

I’m not saying you should not use them, although there are quite a few reasons not to. My view is this: there are frameworks which make the difficult simple, and there are frameworks which make the simple difficult.

Complexity comes from not understanding what the workflow with LLMs looks like “under the hood.” Engineers have gotten used to frameworks which hide great complexity under some abstractions which were in development for decades, be it Spring Framework or Unreal Engine. It is okay; we all, whenever facing an unknown task, tend to open Google up and type “PROBLEM_X frameworks” after all.

In most cases, the functionality you need from LLM frameworks can be built in under 100 lines of concise, clear, and documented code. In my production code, the RAG part takes under 40 lines. It removes all the magic from already complex systems, making them ultimately simpler to grasp and explain. 100 lines of code here is a solid trade-off. In my experience, those frameworks collectively work absolutely great at the demo level, but start to fall short even before the first user-facing feature.

How Structured Output Works

💡

On August 6th, OpenAI introduced a successor to structured output. The idea remains generally the same as what they now call JSON mode, but the API can now return already-serialized data based on a schema you’ve provided. However, I’ve tried it in one of our products and it is not working as documented yet. The API namespace suggests it’s a beta feature, namespaced as client.beta.chat.completions.parse instead of the regular client.chat.completions.create. Additionally, the new capability is limited to the GPT-4o family of models. The last section covers these caveats.
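
For reference, here is roughly what that beta call looks like with the openai Python client at the time of writing; treat it as a sketch of the API shape under discussion, not a recommendation, since the regular client.chat.completions.create path used throughout this article is what I rely on:

from pydantic import BaseModel
from openai import OpenAI


class Groceries(BaseModel):
    groceries: list[str]


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    # A Pydantic model is passed instead of {"type": "json_object"}
    response_format=Groceries,
    messages=[
        {"role": "system", "content": "Extract the groceries."},
        {"role": "user", "content": "Bread, a pack of eggs, and a bottle of milk."},
    ],
)
# ``parsed`` holds a Groceries instance rather than a raw JSON string
groceries = completion.choices[0].message.parsed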

Structured output is a model’s capability to output JSON, acquired during fine-tuning. Let’s dive into it first. As you’ll see, Function Calling, which we will discuss in the next section, is structured output itself.

Use Case

Natural language processing used to be a privilege of companies with more engineering resources, bigger budgets, and stronger HR pipelines. LLMs make NLP tasks significantly easier.

Consider this problem: implement a natural language processing parser that allows users to create a grocery list out of natural language input. The user provides a list of groceries in written or spoken form, and the program outputs an HTML-formatted list.

Without LLMs, that is not such an easy task to tackle. It’s easy to build a demo, but not easy to build a high-quality product that handles edge cases well.

Today we have LLMs, so let’s do it: we pipe user input into the LLM, the LLM outputs JSON, and Python picks it up and formats the JSON into HTML. Easy enough.

Let’s quickly set up the OpenAI API and write a helper function, eval. Make sure your environment has OPENAI_API_KEY set, or substitute OPENAI_API_KEY with your API key.

To run all the code below, you’ll need to install the openai dependency: pip install openai or python3 -m pip install openai. That is the only third-party dependency we’ll be using.

# Further down these imports will be omitted for brevity
import os

from openai import OpenAI
from openai.types.chat import ChatCompletion


def eval(prompt: str, message: str, model: str = "gpt-4o") -> ChatCompletion:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": message},
    ]

    return client.chat.completions.create(
        model=model,
        messages=messages
    )

Let’s implement the first iteration of our task.

prompt = """
You are a data parsing assistant. 
User provides a list of groceries. 
Your goal is to output it as JSON.
"""
message = "I'd like to buy some bread, pack of eggs, few apples, and a bottle of milk."

res = eval(prompt=prompt, message=message)
json_data = res.choices[0].message.content

print(json_data)

The LLM completion returned the following data.

    ```json
    {
        "groceries": [
            "bread",
            "pack of eggs",
            "few apples",
            "bottle of milk"
        ]
    }
    ```

You can see that it didn’t return JSON; it returned a markdown formatted string containing JSON. The reason is that we didn’t enable structured output in the API call.

def eval(prompt: str, message: str, model: str = "gpt-4o") -> ChatCompletion:
    # ...
    return client.chat.completions.create(
        model=model,
        messages=messages,
        # Enable structured output capability
        response_format={"type": "json_object"},
    )

Now, running the same code will return plain JSON. That is great not only because we don’t need to parse anything extra, but it also guarantees that the LLM won’t include any free-form text such as “Sure, here is your data! {…}”.

Let’s make a simple parser.

import json


def render(data: str) -> str:
    data_dict = json.loads(data)
    # Join the list items first: backslashes inside f-string expressions
    # are a syntax error before Python 3.12
    items = "\n\t".join(f"<li>{x}</li>" for x in data_dict["groceries"])

    return f"""
    <ul>
        {items}
    </ul>
    """

By running render(json_data) you’ll get valid HTML:

<ul>
    <li>bread</li>
    <li>pack of eggs</li>
    <li>apples</li>
    <li>bottle of milk</li>
</ul>

The problem is, we don’t have the data shape defined; let’s call it schema. Our schema is now up to the LLM, and it might change based on user input. Let’s rephrase the user query to see it in action. Instead of asking, “I’d like to buy some bread, a pack of eggs, a few apples, and a bottle of milk,” let’s ask, “12 eggs, 2 bottles of milk, 6 sparkling waters.”

message = "12 eggs, 2 bottles of milk, 6 sparkling waters"
res = eval(prompt=prompt, message=message)

json_data = res.choices[0].message.content
print(json_data)

Output now is quite different from the former example.

{
  "groceries": [
    {
      "item": "eggs",
      "quantity": 12,
      "unit": ""
    },
    {
      "item": "milk",
      "quantity": 2,
      "unit": "bottles"
    },
    {
      "item": "sparkling waters",
      "quantity": 6,
      "unit": ""
    }
  ]
}

Structured output not only enables JSON output, it also helps with schema.

💡

OpenAI has introduced the next step for structured output. You can get the same results using response_format={"type": "json_object"} and parsing the data yourself, without using the beta version of the API, which was not performing reliably in our products. The beta version uses response_format as well, so refactoring later would take a few minutes.

The way we state and define the schema is via prompt engineering.

old_prompt = """
You are a data parsing assistant. 
User provides a list of groceries. 
Your goal is to output it as JSON.
"""

prompt = """
You are a data parsing assistant. 
User provides a list of groceries. 
Use the following JSON schema to generate your response:

{{
    "groceries": [
        { "name": ITEM_NAME, "quantity": ITEM_QUANTITY }
    ]
}}

Name is any string, quantity is a numerical value.
"""

Let’s run a test with two inputs and bring it all together:

  • I’d like to buy some bread, pack of eggs, few apples, and a bottle of milk.
  • 12 eggs, 2 bottles of milk, 6 sparkling waters.

prompt = """
You are a data parsing assistant. 
User provides a list of groceries. 
Use the following JSON schema to generate your response:

{{
    "groceries": [
        { "name": ITEM_NAME, "quantity": ITEM_QUANTITY }
    ]
}}

Name is any string, quantity is a numerical value.
"""

inputs = [
    "I'd like to buy some bread, pack of eggs, few apples, and a bottle of milk.",
    "12 eggs, 2 bottles of milk, 6 sparkling waters.",
]

for message in inputs:
    res = eval(prompt=prompt, message=message)
    json_data = res.choices[0].message.content
    print(json_data)

Input: “I’d like to buy some bread, pack of eggs, few apples, and a bottle of milk”.

# "I'd like to buy some bread, pack of eggs, few apples, and a bottle of milk"
{
    "groceries": [
        { "name": "bread", "quantity": 1 },
        { "name": "pack of eggs", "quantity": 1 },
        { "name": "apples", "quantity": 3 },
        { "name": "bottle of milk", "quantity": 1 }
    ]
}

Input: “12 eggs, 2 bottles of milk, 6 sparkling waters”.

# "12 eggs, 2 bottles of milk, 6 sparkling waters"
{
    "groceries": [
        { "name": "eggs", "quantity": 12 },
        { "name": "bottles of milk", "quantity": 2 },
        { "name": "sparkling waters", "quantity": 6 }
    ]
}

Now we have a defined schema and some parsing rules. It is very important to write as many tests for your LLM applications as possible to ensure correctness and avoid regressions. Keep in mind that user input affects LLM output; therefore, it’s better to write more than one test for the same feature to see how it works with different inputs. Complexity tends to skyrocket, and debugging LLM completions is not fun, trust me. Do write tests — they will save you time, money, and sanity.
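
As a minimal sketch of what such tests could look like, assuming pytest and that the eval helper from the listings above is importable into the test module (the test name, prompt constant, and assertions here are illustrative):

import json

import pytest

# Assumes the ``eval`` helper with JSON mode enabled is importable from your module
PROMPT = """
You are a data parsing assistant.
User provides a list of groceries.
Use the following JSON schema to generate your response:

{ "groceries": [ { "name": ITEM_NAME, "quantity": ITEM_QUANTITY } ] }
"""


# These tests hit the real API: slow and not free, so mark or schedule them accordingly
@pytest.mark.parametrize("message, expected_item", [
    ("I'd like to buy some bread and a bottle of milk.", "bread"),
    ("12 eggs, 2 bottles of milk, 6 sparkling waters.", "eggs"),
])
def test_parser_keeps_schema(message, expected_item):
    res = eval(prompt=PROMPT, message=message)
    data = json.loads(res.choices[0].message.content)

    # Assert on the shape rather than exact wording, which varies between runs
    assert "groceries" in data
    names = [item["name"] for item in data["groceries"]]
    assert any(expected_item in name for name in names)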

Serialization

Usually, JSON output won’t cut it in software systems. It is just a string after all. You have to ensure that the LLM indeed returns correctly formed data. When the system grows bigger, it will be increasingly harder to track all the moving pieces.

Normally, we use some domain schemas, DTOs, etc. Let’s serialize our output into a schema in a generic way.

# Imports for this listing (``os`` and ``OpenAI`` are still assumed from the first listing)
import json
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Type, TypeVar

# Define generic type variable
T = TypeVar("T")


# Immutable grocery item container
@dataclass(frozen=True)
class Item:
    name: str
    quantity: int


# Immutable groceries container
@dataclass(frozen=True)
class Groceries:
    groceries: List[Item]

    @staticmethod
    def serialize(data: Any) -> "Groceries":
        """JSON serialization function."""
        json_data = json.loads(data)
        items = [Item(**item) for item in json_data["groceries"]]

        return Groceries(groceries=items)


# Edited `eval` function to handle types and serialization
def eval(
    prompt: str,
    message: str,
    schema: Type[T],
    serializer: Optional[Callable[[str], T]] = None,
    model: str = "gpt-4o",
    client: OpenAI = OpenAI(api_key=os.environ["OPENAI_API_KEY"]),
) -> Optional[T]:
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": message},
    ]

    completion = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=messages,
    )

    try:
        completion_data = completion.choices[0].message.content
        json_data = json.loads(completion_data)

        if serializer is not None:
            return serializer(completion_data)
        else:
            return schema(**json_data)
    except TypeError as type_error:
        # Happens when dictionary data shape doesn't match provided schema layout.
        return None
    except json.JSONDecodeError as json_error:
        # Happens when LLM outputs incorrect JSON, or ``json`` module fails
        # to parse it for some other reason.
        return None

First we declare a type variable which will be used to pass the schema class as a parameter. Then we describe our JSON data using the dataclass decorator. The Groceries class has serialize defined on it to parse the nested JSON correctly. Our eval function is, in principle, the same as before. It gets a schema and an optional serializer for data serialization. If a serialization function is present, eval will use it; if not, it will attempt to destructure the dictionary into the data class.

# Evaluate user input
res = eval(
    prompt=prompt,
    message="I'd like to buy some bread, pack of eggs, few apples, and a bottle of milk.",
    schema=Groceries,
    serializer=Groceries.serialize,
)

# Pretty-print it (requires ``from pprint import pp``)
pp(res)

# Output:
Groceries(groceries=[Item(name='bread', quantity=1),
                     Item(name='pack of eggs', quantity=1),
                     Item(name='apples', quantity=3),
                     Item(name='bottle of milk', quantity=1)])

And that is all there is to it. This is how structured output works. It is simple enough, yet it opens up great power if used correctly.

I use this approach to build complex multi-agent systems in which all communication is done via Python code. This approach makes software testable and more transparent than alternatives. It makes Domain-Driven Design (DDD) easier. You don’t want to glue all the domains of logic together. Applications where LLMs deliver differentiation and business value are still software systems, like any other non-LLM application. Market estimates suggest that LLM-related code and functionality take up 10% to 20% of the codebase; the rest is our old “classical” systems.

How Function Calling Works

Function calling is the same mechanism as structured output, with a few extra steps. Instead of plain JSON, the LLM returns the function name and parameters for it from a pre-defined schema.

Workflow:

  1. You provide a description of available functions to the LLM
  2. Whenever the LLM needs any of them, it returns a request to call a selected function
  3. You call the function the LLM asked you for
  4. You provide the return of the function back to the LLM
  5. The LLM generates the final answer

You can see that it is a very similar mechanism to structured output. In essence, you provide some data specification to the LLM. You can do function calling via plain structured output, without the function calling API; it won’t be pretty, but it is possible. Function calling is merely a specialized use case of structured output.

Let’s build an intuition visually by drawing the simplest possible diagram of the function calling system. The callee, such as a user, requests a completion from a controller. The controller is some abstract domain logic. The controller gets a prompt and an LLM interface. The interface could be the OpenAI API, a remote HTTP call, or a local call, depending on the LLM of your choice.

flowchart LR
    Callee ----- Controller
    Controller <--> Prompt
    Controller <--> LLM
    Controller <--> Function{{Function}}

Let’s translate this into some draft code. To be more concrete, let’s make the LLM respond to the following question:

  • You: What is the S&P 500 index price today?
  • LLM: 🤷‍♂️

As discussed in Why Do We Need All That, LLMs can’t reply to that question for two reasons, but function calling comes to the rescue. Let’s dive into it.


The best way to explain it is a sequence diagram. If you don’t know how to read these diagrams, don’t shy away; it’s simple. You have “headers” such as Callee, Controller, etc. Those are “participants” of the system; in this case, the Callee is an entity that calls the system, such as a user who asks for a stock price. All the other participants are modules. Vertical rectangular boxes represent participant activity.

sequenceDiagram
    autonumber
    
    Callee ->>+ Controller: What is the price of the S&P500 index today?

    Controller ->>+ Prompt: Fetch prompt
    Prompt -->>- Controller: Return prompt

    Controller ->>+ Function: Fetch available functions
    Function -->>- Controller: Return available functions

    Controller ->>+ LLM: Pass prompt, available functions, user message
    LLM -->>- Controller: Request to call `get_stock_price` function with `$SPX` argument
    Controller ->>+ Function: execute `get_stock_price(index="$SPX")`
    Function -->>- Controller: return `5,186.33`

    Controller ->>+ LLM: Provide `5,186.33` as the result of function calling
    LLM -->>- Controller: "S&P500 today valued at `5,186.33`"
    Controller -->>- Callee: "S&P500 today valued at `5,186.33`"

Here, the callee asks the LLM system to provide the price of the S&P 500. The LLM is supplied with: i) the prompt, ii) the available functions, iii) the user message. During the inference process, the LLM realizes that it needs to call a function to fulfill the request. LLMs can’t call functions; therefore, it returns a completion asking you to call a function and pass its result back. Once you provide the requested value, the LLM generates the final completion: “S&P500 today is valued at 5,186.33.”

Note:

  1. The LLM receives a list of functions it has at its disposal
  2. The LLM doesn’t call any function; that is your responsibility
  3. When the LLM decides to call a function, you need to run inference twice to generate one completion

Let’s translate all this into code by building a stock prices Q&A system.

The problem at hand is a bit more elaborate than structured output; therefore, let’s go in parts. Note that this code is not good for production; however, I’ll be making some notes along the way about what would be good to have in a user-facing system.

As the first step, let’s outline the actions we’ll need to take at a high level. For unknowns, I use ellipsis syntax temporarily as a placeholder.

# Get system prompt
def get_prompt() -> ...:
    pass

# Function which will power our Q&A system which takes ticker (stock name)
# and returns its price as a string
def get_stock_price(ticker: str) -> str:
    pass

# Get a list of functions specifications for function calling
def get_llm_functions() -> ...:
    pass

# Get completion from the messages. It is a helper function which wraps a call to LLM
def get_completion(messages: List[dict[str, str]], tools=None) -> ...:
    pass

# Controller which will execute the logic which receives user input and a list
# of functions where key is function name and value is a reference to a function
# which takes string and outputs string
def controller(user_input: str, functions: dict[str, Callable[[str], str]]) -> ...:
    pass

# Entry point
if __name__ == "__main__":
    pass

First, let’s tackle get_prompt and get_stock_price. get_prompt returns a prompt, so let’s make it first. The prompt starts with a classical “You are a helpful assistant” and continues with an indication that functions are available. We do a similar thing with structured output, where the use of the word “JSON” is mandatory. We’ll return the stock price from a local dictionary to avoid extra code which is not related to the topic. In real life, chances are you’d want to call a remote API, a database, etc.

def get_prompt() -> str:
    return "You are a helpful assistant. Use provided function if response is not straightforward."

def get_stock_price(ticker: str) -> str:
    local_data = {"DJI": "40,345.41", "MSFT": "421.53", "AAPL": "225.89"}
    
    # This is not safe to do. In production it's a good idea to return
    # Optional[str] and check whether the key is present in the dictionary,
    # e.g. `if ticker not in local_data: return None`.
    return local_data[ticker]
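
A slightly safer variant, along the lines the comment above suggests, could look like this sketch:

from typing import Optional


def get_stock_price(ticker: str) -> Optional[str]:
    local_data = {"DJI": "40,345.41", "MSFT": "421.53", "AAPL": "225.89"}

    # Unknown tickers yield None instead of raising KeyError; the controller
    # can then tell the LLM that the data is unavailable.
    return local_data.get(ticker)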

One caveat: always check what kind of response the LLM provides to your question via the API. Chat platforms from OpenAI, Anthropic, etc., are pre-prompted, and sometimes they may use external calls under the hood to serve your request, so results from the raw API might be quite different. Always verify with the API you use, either via cURL or via a provider platform such as the OpenAI Platform Playground. For example, take “What is the price of X today?”: one stock may have a variety of tickers. Dow Jones can be referred to by four trading symbols: ^DJI, $INDU.DJI, DJIA, and DJI. The LLM might use any of these, and it is the responsibility of engineers to consider such cases.

When you have a tiny time frame to deliver, automated tests help the most. Make them as comprehensive as possible and run them as often as possible.

Next stop is get_llm_functions. In this function, we describe available function schemas. In our particular example, we’ll use one function. You can add hundreds of them without any changes in the LLM logic code. First, you write a schema or function definition for the LLM. It’s done via a JSON structure. If you use the Python client, you do the same, just with dictionaries. Let’s write the get_stock_price function definition. You can think of it as a description of a function as if you’d explain it to your colleague. You tell the LLM that there exists a function with such a name, it does XYZ, and it receives certain parameters. The same way you would talk to a colleague, but in a structured way.

{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get current stock index price",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "stock index ticker in format of TICKER, without prefixes such ^ or $",
                }
            },
            "required": ["ticker"],
        },
    },
}

First, we state that the type is a function schema object, and then we go to the embedded function object.

A function is defined by its:

  • name, as it appears in your code
  • description, which the LLM takes into consideration
  • parameters the function accepts

Let’s go line by line. We are describing get_stock_price we’ve implemented above, so its name is get_stock_price.

Description is very important in the context of function calling. Describing a function via some schema is not new; however, in the context of LLMs, the description is part of the prompt. Remember the analogy of the function definition schema and your coworker? The LLM takes the description into consideration while generating the completion. What you put in the description might, and likely will, affect the output. As with everything with LLMs, it is a guideline, not an enforcement. The description of a function will be considered by the LLM, and if it semantically fits what the user asked, the LLM will ask you to call that function to get the missing data. We will write this logic shortly.

In the parameters field, you describe what get_stock_price expects as its parameters. In our case, the function has the following signature: get_stock_price(ticker: str) -> str. The return type here is irrelevant because, for the LLM, everything is a token that gets returned as a string. Be it a number, or JSON – the return type is always a string. We have a parameter ticker, which is called a property. Let’s zoom in.

"properties": {
    "ticker": {
        "type": "string",
        "description": "stock index ticker in format of TICKER, without prefixes such ^ or $",
    }
},

The ticker property’s type is string, and its description contains instructions for the LLM about this property. We tell it that ticker is a stock index ticker represented in a certain format.

Lastly, we specify which function parameters are required; in this case, ticker is required. Time to bring it all together.

def get_llm_functions() -> List[dict[str, str]]:
    return [
        {
            "type": "function",
            "function": {
                "name": "get_stock_price",
                "description": "Get current stock index price",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "ticker": {
                            "type": "string",
                            "description": "stock index ticker in format of TICKER, without prefixes such ^ or $",
                        }
                    },
                    "required": ["ticker"],
                },
            },
        }
    ]

Before we go any further, a few words about the List[dict[str, str]] return type definition. It is not entirely correct; I’ve used it for the sake of semi-correct simplification. There is an old GitHub issue in the Python project where Guido van Rossum replied that ultimately a JSON value type has to be defined as Any. It doesn’t bring much use either way, so feel free to choose whatever fits your data best. Other options could be List[dict[str, Any]] or List[dict[str, str | List[str]]], which I find the most descriptive for this case, yet unnecessarily complex. At the end of the day, type definitions in Python don’t do much other than serve as documentation.
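
If you want a single, more honest name for it across the codebase, a small alias is one option; this is a matter of taste, not something the API requires:

from typing import Any, List

# JSON objects can nest arbitrarily, so Any is the pragmatic value type
JSON = dict[str, Any]


def get_llm_functions() -> List[JSON]:
    ...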

Next up is the get_completion function, a helper which wraps the API call. It relies on a module-level client (created the same way as earlier) and a GPT_MODEL constant holding the model name.

def get_completion(messages: List[dict[str, str]], tools=None) -> ChatCompletion:
    res = client.chat.completions.create(
        model=GPT_MODEL, 
        messages=messages, 
        # Note ``tools``, that is where we provide our functions 
        # definition schema for the function calling.
        tools=tools
    )
    return res

The last function, and the most interesting one, is controller. You can use any terminology here: domain logic, controller, service, module, whatever makes more sense in your architecture; here we will stick with controller. The controller’s implementation contains more comments to help you understand what is happening at each logical scope of the function.

def controller(
    user_input: str, functions: dict[str, Callable] = None
) -> ChatCompletion:
    # Fetch prompt, functions
    prompt = get_prompt()
    llm_functions = get_llm_functions()
    # Set up first messages
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": user_input},
    ]
    # Generate LLM response with messages and functions
    # Prompt is already in the messages stored as system role.
    completion = get_completion(
        messages=messages,
        tools=llm_functions,
    )

    # Verify if completion has `tool_calls` which is
    # List[ChatCompletionMessageToolCall] or None
    is_tool_call = completion.choices[0].message.tool_calls
    if is_tool_call:
        tool_call = completion.choices[0].message.tool_calls[0]

        # We need the call ID and the function out of it. The ID has to be sent back to the LLM later
        fn = functions[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)

        # Call the function
        res = fn(**args)

        # Add messages. Both of them are essential for the correct call.
        # Add assistant's response message
        messages.append(completion.choices[0].message)
        # Add function calling result
        messages.append(dict(role="tool", tool_call_id=tool_call.id, content=res))

        # Run completion again to get the answer
        tool_completion = get_completion(messages=messages)

        # Return response which was generated with help of function calling
        return tool_completion.choices[0].message.content

    # Return response without function calling
    return completion.choices[0].message.content

The controller’s anatomy is similar to the eval function from the structured output example, with just an extra LLM call if the completion contains a tool_call. The function receives user input and a set of Python functions defined as dict[str, Callable], meaning it is a map where a string is mapped onto a Callable, which is a reference to a function implementation.

💡

This way of working with function calling is universal and works equally well with other providers. It is a transferable skill. Anthropic maintains a cookbook repository on GitHub with examples of how you can accomplish one task or another. You can take a look at Creating a Customer Service Agent with Client-Side Tools. At Step 5: Interact with the chatbot you can find code very similar to what we wrote here.

The messages array contains two messages: one is the system message with the prompt, and the other is the user message with the user’s input. Messages play a crucial role in the OpenAI API, and in other LLM APIs for that matter. The OpenAI API uses a pattern of messages represented by a dictionary with two keys: role and content. Here are the roles you need to work with LLMs and function calling:

# System prompt
dict(role="system", content="system prompt content")

# User message
dict(role="user", content="user message content")

# Assistant message
dict(role="assistant", content="assistant message content")

# Function calling result
dict(role="tool", content="function calling result", tool_call_id="ID of the call")

Here and in the rest of the article I use the dict initialization syntax. It is equivalent to the {} literal syntax, e.g. dict(a=1) == {"a": 1} is True.

Messages also have to be fed into the LLM sequentially, one after another, in chronological order, maintaining the correct roles.

We pass messages into the get_completion function with two arguments: messages and tools. Here the terminology gets confusing: tools is the new name for functions. In previous API versions there were function_call and functions; today they are deprecated and you should use tools instead, although the API still returns the old fields containing None.

If the LLM decides that a tool has to be called, completion.choices[0].message.tool_calls will contain a list of ChatCompletionMessageToolCall. If no tool has to be called, we simply return the message content and the cycle is over. However, if a tool call is present, we go into the function calling logic, which starts from the if is_tool_call statement.

First we get the tool to call. In our case we provided only one tool, therefore we pick the first item from the array. It has the type ChatCompletionMessageToolCall. Let’s zoom into the type to see what is so special about function calling after all (spoiler: between not much and nothing). It is a straightforward schema definition:

class ChatCompletionMessageToolCall(BaseModel):
    id: str
    function: Function
    type: Literal["function"]

# Where `function` is
class Function(BaseModel):
    arguments: str
    name: str

That is it. They are just Pydantic schemas. In our concrete example, here is what tool_call looks like:

# This is what object ``completion.choices[0].message.tool_calls[0]`` holds
tool_call = ChatCompletionMessageToolCall(
    id='call_Mr0NT31AHPv4yTGXuULtZ9bA', 
    function=Function(arguments='{"ticker":"DJI"}', name='get_stock_price'), 
    type='function')

Now, let’s dive into how we actually perform the function call:

# Get function from map of functions by `tool_call.function.name` key. 
fn = functions[tool_call.function.name]

# Call the function (args explained below)
res = fn(**args)

functions is defined as a parameter of the controller: a dictionary with string keys and callable values. We map the function name returned by the LLM to the actual function implementation. Here is the main principle behind it in isolation:

# Implementation of a "tool"
def get_price(args):
    pass

# Map of function names to function reference
functions = { "get_price": get_price, "other_function": ... }
# Pick function by its name as a key. Now `fn` holds reference to the implementation
fn = functions[tool_call.function.name]
# Get arguments for the function call if any
args = json.loads(tool_call.function.arguments)

# Call the function
fn(**args)

Let’s focus on args: why do we even need JSON, and what the hell is **? The LLM completion returns tokens, which are glued together into a string. It returns arguments as a JSON-formatted string, where the key is the name of an argument and the value is the argument’s value. To get them, we have to turn the string into a Python object first, such as a dictionary. json.loads does exactly that; loads stands for “load string”. It takes a JSON string and returns a dictionary. Then we have **, which is called dictionary unpacking. You may be familiar with this concept under a different name: destructuring.

# Mock of our function
def get_stock_price(ticker: str) -> str:
    return ticker

# Call it with concrete value of a map
get_stock_price(**{"ticker":"DJI"})

# Equivalent to 
get_stock_price(ticker="DJI")

# Returns
'DJI'

After we call fn(**args), it returns the result of get_stock_price, which is the current value of the stock. Remember I said how important messages are? To return the result back to the LLM so it can finally generate a completion, res has to be added to the messages.

# Add LLM's message where it asks for a function call into messages
# It has the role of assistant, next code listing shows its internals
messages.append(completion.choices[0].message)
# Add function calling result
messages.append(dict(role="tool", tool_call_id=tool_call.id, content=res))

The tool call ID is mandatory; it links the tool call request to the result, which is provided via the content parameter.

Finally, we call get_completion(messages=messages) again with our updated messages and return the resulting completion. Before we put it all together, let’s zoom into the state of messages at this point to solidify the intuition.

[
    # System prompt
    {
        'role': 'system',
        'content': 'You are a helpful assistant. Use provided functions if response is not clear.'
    },
    # Initial user message
    {
        'role': 'user', 
        'content': 'What is the price of Dow Jones today?'
    },
    # LLM response with a request to call a function to get the missing information
    ChatCompletionMessage(
        content=None, 
        role='assistant', 
        function_call=None, 
        tool_calls=[
            ChatCompletionMessageToolCall(
                id='call_Wvtx0DYHnLT9AWujhXn4AwIl', 
                function=Function(
                    arguments='{"ticker":"DJI"}', 
                    name='get_stock_price'), 
                type='function')
            ], 
            refusal=None),
    # Our response to the LLM with the missing data.
    # Note that the IDs are the same; that is what links the request and the data.
    {
        'role': 'tool',
        'tool_call_id': 'call_Wvtx0DYHnLT9AWujhXn4AwIl',
        'content': '40,345.41'
    }
]

How did the LLM understand which function to call? Remember the function schema definition from the get_llm_functions function? That schema contains a description of the function. Based on the description, the LLM figures out which function it needs to “call” to get the missing data.

Now it’s time to put it all together; let’s run the code:

if __name__ == "__main__":
    # Make a map of available functions
    available_functions = dict(get_stock_price=get_stock_price)
    # Pretty-print return result of controller
    pp(
        controller(
            "What is the price of Dow Jones today?",
            functions=available_functions,
        )
    )

The output is:

The price of the Dow Jones Industrial Average (DJI) today is 40,345.41.

Congrats, now you know what structured output and function calling are, how to use them, and why function calling is a specific case of structured output.

🔗

You can find the final code in this Gist.

Before closing I’d like to share some tips on the covered topics in the appendix.


Appendix

Model Support

Not all models are built the same; or, more correctly, they are trained and fine-tuned differently. Support for JSON structured output and function calling is not a given for all LLMs; it is rather a feature.

If you are relying on open-source models or can only use them for some reason, it would be a good idea to first verify how consistently and correctly your model works with JSON.

I’ll provide a short example with Mixtral 8x7B, a popular and powerful model. You can test many open-source models at the GroqCloud Playground.

Let’s give the model three runs with the following settings:

  • Prompt: Extract user groceries into JSON object.
  • User: I’d like to buy water, bread, and some eggs.

First run

{
  "groceries": [
    "water",
    "bread",
    "eggs"
  ],
  "quantity": {
    "eggs": "some"
  }
}

Second run

{
  "groceries": [
    "water",
    "bread",
    "eggs"
  ],
  "quantity": [
    1,
    1,
    some
  ]
}

Third run

{
  "groceries": [
    "water",
    "bread",
    "eggs"
  ],
  "quantity": [
    null,
    null,
    1 // assuming you want to buy "some" eggs as a quantity of 1
  ]
}

Three runs, three different outputs, two of which are not legal JSON. It would be hard to build a robust system out of such a model. Prompt engineering will definitely help to make the situation better, yet not great.

Try it yourself before choosing a particular model.
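
A quick way to try it is to run the same request a number of times and count how often the output parses. Below is a rough sketch using the first eval helper from the article (the one returning a raw ChatCompletion), without JSON mode enabled, which is the point of the exercise; swap in whatever client your model provider offers:

import json


def json_success_rate(prompt: str, message: str, runs: int = 10) -> float:
    """Share of completions that come back as valid JSON."""
    ok = 0
    for _ in range(runs):
        res = eval(prompt=prompt, message=message)
        try:
            json.loads(res.choices[0].message.content)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / runs


# Anything noticeably below 1.0 is a warning sign for a JSON-heavy workload
print(json_success_rate(
    "Extract user groceries into JSON object.",
    "I'd like to buy water, bread, and some eggs.",
))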

OpenAI Update of Structured Output

I would suggest not using it for now, while it’s under the “beta” namespace, client.beta. Two issues I’ve encountered in production use:

  1. Model behaviour was inconsistent between client.beta, client, and OpenAI’s Playground, almost as if it were some other model
  2. The only model family which can be used is GPT-4o

One day, most of the regression tests started to fail while the codebase had no major changes. The model was being called by its versioned name gpt-4o-2024-08-06. It turned out the culprit was the new API. After switching to the old API and introducing the serialization method I’ve shown in the Serialization section, all tests went back to normal. This is a strong argument against using client.beta even though it is what OpenAI suggests in their documentation.

Don’t Trust LLMs Blindly

LLMs are still non-deterministic entities, even when fine-tuned on structured output; sometimes they don’t return what you’d expect. There are some ways of dealing with it:

  1. Prompt engineering
  2. Solid test coverage
  3. Try catch / except

Prompt engineering is a straightforward way of improving LLM output. Add some rules, a few-shot example or two, or rephrase the prompt; it all may yield great results. Do not rush to write code; on its own, the model’s output might be inconsistent. It is not uncommon for models to return 9 out of 10 responses correctly. That 1 malformed or otherwise wrong response might cause quite significant issues in your system and business if you don’t build hedges around it.
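
For instance, a couple of few-shot examples added to the groceries prompt from earlier can pin the schema down further; the wording below is illustrative, tune it for your own case:

prompt = """
You are a data parsing assistant.
User provides a list of groceries.
Use the following JSON schema to generate your response:

{ "groceries": [ { "name": ITEM_NAME, "quantity": ITEM_QUANTITY } ] }

Examples:

User: a loaf of bread and two lemons
Response: { "groceries": [ { "name": "bread", "quantity": 1 }, { "name": "lemons", "quantity": 2 } ] }

User: 3 bottles of water
Response: { "groceries": [ { "name": "bottles of water", "quantity": 3 } ] }
"""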

Even though I said not to rush to write code, do rush to write tests. LLM systems need to be covered by tests extensively: unit, integration, and E2E tests. While classical software does not really need tests for $1 + 1$ kinds of features, LLM systems do. Tests are the only way to ensure your system does what it is designed to do; they are essential. You’ll be surprised how many times you’ll say, “It was working just an hour ago!” In fact, the final code we wrote might crash because the LLM may return the ticker not as DJI but as DJIA; it will happen rarely, but it will happen. This is a great example of why tests are essential and why they should be run as frequently as possible; otherwise, debugging won’t be fun.

The last tip here is to try/except all the mappings from LLM output onto structures such as data classes or function parameters. When you do something like fn(**args), you are relying on LLM output, and once in a while it will explode. Solutions depend on your case, but a uniform one might be good old retry: LLMs make mistakes sometimes, but the mistakes are not persistent, so simply re-running inference is good enough in many cases.
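
A retry wrapper for that can be tiny. The sketch below is generic; the exception types and the number of attempts are assumptions to tune for your own system:

import json
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(parse: Callable[[], T], attempts: int = 3) -> Optional[T]:
    """Re-run an LLM call plus its parsing step until it yields usable data."""
    for _ in range(attempts):
        try:
            return parse()
        except (json.JSONDecodeError, KeyError, TypeError):
            # A malformed completion is usually not persistent,
            # so running inference again is often enough.
            continue
    return None


# Usage with the eval helper and prompt from the structured output section:
# data = with_retries(lambda: json.loads(
#     eval(prompt=prompt, message=message).choices[0].message.content))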