Paper: ToolFormer

Although this paper is a few years old, it turned out to be an interesting and insightful read. In today’s age of powerful models like Opus 4.7 and GPT 5.5, it is easy to take tool calling for granted given how accurate and reliable it has become.

Rewind a couple of years, though, and LLMs were mostly limited to pure text generation and instruction following. ToolFormer helped pave the way for more grounded and reliable responses through external tool usage. Before ToolFormer, tool usage was heavily handcrafted: the developer had to manually write rules or prompts telling the model which tool to use in which case.

For example:

If the user asks for math:
- Use Calculator API

If the user asks about weather:
- Use Weather API
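
To make the handcrafted approach concrete, here is a minimal sketch of what such developer-written routing might look like. The function name `route` and the keyword rules are my own illustration, not something from the paper.

```python
import re

def route(user_message: str) -> str:
    """Pick a tool name using fixed, developer-written rules."""
    text = user_message.lower()
    # Rule 1: anything that looks like "number operator number" goes to the calculator.
    if re.search(r"\d+\s*[-+*/]\s*\d+", text):
        return "Calculator"
    # Rule 2: weather keywords go to the weather API.
    if any(word in text for word in ("weather", "forecast", "temperature")):
        return "Weather"
    # No rule matched, so no tool is used.
    return "none"

print(route("What is 400 / 1400?"))        # Calculator
print(route("Will the weather be nice?"))  # Weather
print(route("What is a third of 90?"))     # none
```

The last call shows the weakness: the question is clearly math, but no rule anticipates that phrasing, so no tool is picked.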

With ToolFormer, the LLM can do three main things that streamline tool usage:

  1. Learn when tools are useful.
  2. Learn where tools should be inserted.
  3. Self-supervise tool usage from only a small number of demonstrations.

How ToolFormer Works

There are four main steps to this self-supervised pipeline. We start with a corpus of text.

(AI Generated Example)
1. Out of 1400 participants, 400 passed the test, which is about 29%.
2. The Nile has an approximate length of 6,853 kilometers.
3. The capital of Japan is Tokyo.
4. The name derives from "la tortuga", the Spanish word for turtle.
5. The shop will reopen tomorrow, March 10.

Tools:
Calculator(expression)
QA(question)
MT(text)
Calendar()

Step 1: Sampling API Calls

Given the corpus, the model is prompted with a few demonstrations of each tool and proposes positions in the text where an API call might help. The key thing to note is that it can suggest multiple candidate calls at any position, some useful and some not. At this stage the LLM is only exploring possibilities.

1. Out of 1400 participants, 400 passed the test, which is about
[Calculator(400 / 1400)]
[QA("Who passed the test?")]
29%.

2. The capital of Japan is
[QA("What is the capital of Japan?")]
[Calculator(5 * 12)]
Tokyo.

3. The name derives from "la tortuga", the Spanish word for
[MT("tortuga")]
[Calendar()]
turtle.

4. The shop will reopen tomorrow,
[Calendar()]
[QA("What day is tomorrow?")]
March 10.
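
As I understand the paper, candidate positions are picked by looking at how likely the model thinks it is that an API call starts at each position, keeping only positions above a sampling threshold and at most the top k of them. Below is a rough sketch of that selection step. The probabilities are made up, and the name `pick_positions`, the threshold, and the `top_k` values are arbitrary choices of mine.

```python
def pick_positions(api_start_probs, threshold=0.05, top_k=2):
    """Return up to top_k positions whose API-start probability exceeds threshold."""
    candidates = [(p, i) for i, p in enumerate(api_start_probs) if p > threshold]
    # Highest probability first, then keep the best top_k.
    candidates.sort(reverse=True)
    return sorted(i for _, i in candidates[:top_k])

tokens = ["The", "capital", "of", "Japan", "is", "Tokyo", "."]
# Made-up probabilities that the model would start an API call before each token.
probs = [0.01, 0.02, 0.01, 0.08, 0.04, 0.40, 0.02]

print(pick_positions(probs, top_k=1))  # [5]
print(pick_positions(probs, top_k=2))  # [3, 5]
```

Position 5 is right before "Tokyo", which matches where the QA call sits in the example above. For each kept position the model would then sample several candidate calls, which is where the wrong ones like Calculator(5 * 12) come from.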

Step 2: Executing API Calls

The sampled API calls are then executed, and each result is recorded next to its call.

1. Out of 1400 participants, 400 passed the test, which is about
[Calculator(400 / 1400) -> 0.29]
[QA("Who passed the test?") -> 400 participants]
29%.

2. The capital of Japan is
[QA("What is the capital of Japan?") -> Tokyo]
[Calculator(5 * 12) -> 60]
Tokyo.

3. The name derives from "la tortuga", the Spanish word for
[MT("tortuga") -> turtle]
[Calendar() -> Today is Thursday, March 9, 2023.]
turtle.

4. The shop will reopen tomorrow,
[Calendar() -> Today is Thursday, March 9, 2023.]
[QA("What day is tomorrow?") -> Friday]
March 10.
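
Here is a small sketch of what this execution step could look like in code, using the bracket notation from the examples above. The `TOOLS` registry, the `execute` helper, and the stubbed Calendar are my own assumptions for illustration. A real QA or MT tool would call another model, so they are left out here.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator accepts.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str) -> str:
    """Evaluate +, -, *, / on numbers without eval(), rounded to two decimals."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return str(round(walk(ast.parse(expression, mode="eval").body), 2))

TOOLS = {
    "Calculator": calculator,
    # Stub: returns the fixed date used in the example instead of the real date.
    "Calendar": lambda _arg: "Today is Thursday, March 9, 2023.",
}

CALL_PATTERN = re.compile(r"\[(\w+)\((.*)\)\]")

def execute(call: str) -> str:
    """Turn '[Tool(args)]' into '[Tool(args) -> result]'."""
    match = CALL_PATTERN.fullmatch(call)
    if match is None:
        raise ValueError(f"not an API call: {call!r}")
    name, arg = match.groups()
    return f"[{name}({arg}) -> {TOOLS[name](arg)}]"

print(execute("[Calculator(400 / 1400)]"))  # [Calculator(400 / 1400) -> 0.29]
print(execute("[Calculator(5 * 12)]"))      # [Calculator(5 * 12) -> 60]
print(execute("[Calendar()]"))
```

Note that the irrelevant call Calculator(5 * 12) executes just fine. Execution does not judge usefulness; that is the job of the filtering step.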

Step 3: Filtering API Calls

Now we need to decide whether each candidate call is useful and whether a tool is needed at all. The key idea is to keep a call only if it makes the LLM better at predicting the tokens that follow it. This is measured with a cross-entropy loss over those tokens.

Here is an example without a tool call:

prefix: Out of 1400 participants, 400 passed the test, which is about ...

correct answer: 29%

Given just the prefix, it is very hard for the model to correctly predict 29%, so the loss is expected to be high.

With the tool call:

prefix: Out of 1400 participants, 400 passed the test, which is about
[Calculator(400 / 1400) -> 0.29]

Since the model can see 0.29, the probability that it will predict the next token as 29% is much higher, so the expected loss is lower.

For each candidate call, ToolFormer computes two quantities: the loss with the API call and its result provided (L+), and the minimum of the loss without any API call and the loss with the API call but without its result (L-).

The usefulness of a call is the loss reduction L- - L+. If this difference is at least the filtering threshold, the tool call is kept.
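
The filtering rule is simple enough to write down directly. This is a minimal sketch: the function name `keep_call`, the threshold of 1.0, and all loss values are made up for illustration, and in practice the three losses would come from running the model on each variant of the prefix.

```python
def keep_call(loss_no_call, loss_call_only, loss_call_and_result, threshold=1.0):
    """Keep an API call only if seeing its result lowers the loss by at least threshold."""
    l_plus = loss_call_and_result
    # Best the model can do without seeing the result.
    l_minus = min(loss_no_call, loss_call_only)
    return (l_minus - l_plus) >= threshold

# Made-up losses for the two calls sampled in sentence 1.
# Calculator(400 / 1400) -> 0.29 makes "29%" much easier to predict.
print(keep_call(loss_no_call=4.2, loss_call_only=4.5, loss_call_and_result=1.1))  # True
# QA("Who passed the test?") -> 400 participants barely helps.
print(keep_call(loss_no_call=4.2, loss_call_only=4.6, loss_call_and_result=4.1))  # False
```

Taking the minimum for L- matters: a call only counts as useful if its result helps, not merely the presence of the call text itself.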

Step 4: Model Finetuning and Inference

After filtering, the surviving API calls are merged back into the text, and the model is finetuned on this augmented corpus using standard next-token prediction. This teaches the LLM patterns such as reaching for the calculator when there is arithmetic and for the QA tool when a fact is needed.

ToolFormer does not replace the original training corpus. API calls are interleaved into the existing text, which preserves the model’s general language modeling abilities while teaching it how and when to use tools.
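
The interleaving can be pictured as a plain string insertion. In this sketch the helper `insert_call` and the use of character offsets are my own simplification, since the real pipeline works on token positions.

```python
def insert_call(text: str, position: int, call_with_result: str) -> str:
    """Insert an executed API call at a character offset, keeping the original text around it."""
    return f"{text[:position]}{call_with_result} {text[position:]}"

sentence = "Out of 1400 participants, 400 passed the test, which is about 29%."
call = "[Calculator(400 / 1400) -> 0.29]"

# Place the call right before the number it helps predict.
augmented = insert_call(sentence, sentence.index("29%"), call)
print(augmented)

# Removing the bracketed call gives back the original sentence unchanged.
assert augmented.replace(call + " ", "") == sentence
```

Because every original token is still present, finetuning on the augmented text still trains the model on the same language as before, with the tool calls as extra context.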

At inference time, this lets the model decode as usual until it predicts a tool call. Decoding then pauses, the external tool is called, its result is inserted into the text, and decoding continues.
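
Below is a toy version of that inference loop. Everything here is a stand-in: `fake_model` is a hardcoded function pretending to be a finetuned LLM, `fake_calculator` only handles division, and I assume the model signals that it wants a result by emitting "->".

```python
def fake_model(prefix: str) -> str:
    """Stand-in for an LLM: returns the next chunk of text for a given prefix."""
    if prefix.endswith("which is about "):
        return "[Calculator(400 / 1400) ->"
    if prefix.endswith("]"):
        return " 29%."
    return ""  # nothing more to generate

def fake_calculator(expression: str) -> str:
    """Toy tool that only understands 'a / b'."""
    left, right = expression.split("/")
    return str(round(float(left) / float(right), 2))

def generate(prompt: str, max_steps: int = 5) -> str:
    text = prompt
    for _ in range(max_steps):
        chunk = fake_model(text)
        if not chunk:
            break
        text += chunk
        if chunk.endswith("->"):
            # Decoding pauses here: run the tool and splice its result into the text.
            expression = chunk[chunk.index("(") + 1 : chunk.rindex(")")]
            text += f" {fake_calculator(expression)}]"
    return text

print(generate("Out of 1400 participants, 400 passed the test, which is about "))
```

The printed text ends with "[Calculator(400 / 1400) -> 0.29] 29%.", so the final "29%" is generated with the tool result already sitting in the context.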

My Thoughts

Before I read this paper, I treated tool calling as a black box. Somehow the model knew what tool to call and I did not really question it. This paper cleared the mist for me.

Some takeaways:

  1. I feel the key component of ToolFormer is the filtering. Given a large corpus, it is inevitable that there will be many noisy, irrelevant, or unhelpful tool calls. For the model to have the best accuracy after training, it is vital that filtering keeps only the useful tool calls, meaning the ones that reduce loss. In a way, it is data cleaning done by the LLM itself.
  2. While ToolFormer was legendary for its time, we have come a long way in terms of tool calling. I am interested in learning more about the progression from ToolFormer to more complex capabilities such as chaining tools, retrying failed calls, interactive browsing, and executing multi-step tool workflows.
  3. With the focus on harnesses these days, I used to think tool calling mainly belonged to the jurisdiction of external runtimes: define tools, route calls, execute steps, and manage agent workflows. However, ToolFormer offers another perspective: tool calling can also be framed as a learning problem, where the model learns when and how to call tools from data rather than relying only on hardcoded orchestration rules.