The idea of LLMs acting as agents that interact with tools and software (for example, navigating a website, editing spreadsheets, or fixing code) is exciting. Unfortunately, many agentic LLMs still struggle with these tasks. For example, the best agentic framework on SWE-bench [1], which tests how well LLMs can fix GitHub issues, achieves only about 65% accuracy, and that often requires specialized prompting or additional enhancement techniques. In basic zero-shot or few-shot settings, performance can be even lower (49% of issues resolved for Claude 3.5 Sonnet).
Model improvements are probably the most direct path forward, but some research also aims to boost performance using additional techniques. One example is Learn-by-Interact [2], which automatically generates task-specific training data without human annotation or labeling.
The key insight is that high-quality training data for agents can be created by letting an LLM interact with an environment—clicking buttons, running terminal commands, or performing other relevant actions. These interaction logs (called trajectories) then serve as demonstration data for how to handle similar situations in the future. This is how it works:
1. Generate Task Instructions
An LLM begins by analyzing documentation or FAQs for a given environment (e.g., GitHub, Chrome, or a BigQuery console). It then synthesizes realistic tasks users might want to do, such as "Upload a CSV from Google Drive to BigQuery." This ensures the instructions remain grounded in real-world domains.
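To make this concrete, here is a minimal sketch of the instruction-synthesis step. The prompt wording and the `call_llm` helper are illustrative stand-ins rather than the paper's actual implementation; plug in whatever LLM client you use (Claude, Gemini, etc.).

```python
from typing import List

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM API call; returns the model's text response."""
    raise NotImplementedError("plug in your LLM client here")

def generate_instructions(doc_chunks: List[str], per_chunk: int = 5) -> List[str]:
    """Ask the LLM to propose realistic user tasks grounded in the docs."""
    instructions: List[str] = []
    for chunk in doc_chunks:
        prompt = (
            "Below is a piece of documentation for this environment:\n\n"
            f"{chunk}\n\n"
            f"Propose {per_chunk} realistic tasks a user might want to accomplish here. "
            "Return one task per line."
        )
        response = call_llm(prompt)
        instructions.extend(line.strip() for line in response.splitlines() if line.strip())
    return instructions
```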
2. Collect Interaction Trajectories
With the generated instructions in hand, the LLM attempts to complete them by interacting with the environment. Each step produces an (observation, action, next_observation, next_action, ...) sequence, known as a trajectory. For instance, in a web-based GUI, each action might be "Click button X," and each observation would be the resulting webpage.
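A rollout loop for collecting such trajectories might look roughly like the sketch below. The `env.reset`/`env.step` interface, the `agent.next_action` call, and the `STOP` convention are assumptions made for illustration, not the paper's exact API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    observation: str   # e.g. the current page's accessibility tree, or terminal output
    action: str        # e.g. "Click button X", or a shell command

@dataclass
class Trajectory:
    instruction: str
    steps: List[Step] = field(default_factory=list)

def collect_trajectory(env, agent, instruction: str, max_steps: int = 20) -> Trajectory:
    """Roll out the agent in the environment until it stops or hits the step limit."""
    traj = Trajectory(instruction)
    observation = env.reset(instruction)
    for _ in range(max_steps):
        action = agent.next_action(instruction, observation, traj.steps)
        traj.steps.append(Step(observation, action))
        if action == "STOP":        # assumed convention for "the agent is done"
            break
        observation = env.step(action)
    return traj
```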
3. Backward Construction
Sometimes, the LLM fails to achieve the original instruction. Backward Construction then creates new instructions from the actual trajectory. For example, if the initial goal was "Upload a CSV from Google Drive" but the LLM used Google Cloud Storage instead, the system revises the instruction to "Link CSV file in Google Cloud Storage to BigQuery."
This fixes the mismatch: even if the LLM made a mistake, the resulting trajectory is still valid for a newly created (aligned) task. Partial snippets of the trajectory can yield sub-instructions like "Summarize steps 2–4" or "Replicate exactly what happened from steps 3–5."
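A hedged sketch of this idea, reusing the `Step` objects from the collection loop above: the LLM is asked to write an instruction that the (possibly off-target) trajectory actually accomplishes, both for the full trajectory and for short slices of it. The prompt wording and the slicing window are illustrative.

```python
def backward_construct(call_llm, steps) -> str:
    """Ask the LLM for an instruction that the given steps actually accomplish."""
    rendered = "\n".join(
        f"Step {i + 1}: observed {s.observation!r}, did {s.action!r}"
        for i, s in enumerate(steps)
    )
    prompt = (
        "An agent took the following actions in an environment:\n\n"
        f"{rendered}\n\n"
        "Write a concise task instruction that this sequence correctly accomplishes."
    )
    return call_llm(prompt).strip()

def backward_pairs(call_llm, trajectory, window: int = 3):
    """Pair the full trajectory, plus short slices of it, with freshly written instructions."""
    pairs = [(backward_construct(call_llm, trajectory.steps), trajectory.steps)]
    for start in range(max(0, len(trajectory.steps) - window + 1)):
        chunk = trajectory.steps[start:start + window]
        pairs.append((backward_construct(call_llm, chunk), chunk))
    return pairs
```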
4. Filtering Out Low-Quality Data
Not every generated trajectory is valid or useful. The LLM might get stuck in loops or repeatedly click the same buttons. Learn-by-Interact uses an LLM committee to check whether each instruction–trajectory pair is coherent, aligned, and realistic. Both Gemini 1.5 Pro and Claude 3.5 Sonnet vote yes or no based on criteria such as:
Does the trajectory actually achieve the instruction’s goal?
Are the actions coherent, natural, and not repetitive?
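A minimal sketch of such a committee filter is shown below. Here `judges` is simply a list of callables that take a prompt and return text (in the paper, Gemini 1.5 Pro and Claude 3.5 Sonnet); requiring unanimous YES votes is one possible design choice, and the criteria wording is illustrative.

```python
def committee_keep(judges, instruction: str, rendered_trajectory: str) -> bool:
    """Keep the instruction-trajectory pair only if every judge answers YES."""
    prompt = (
        f"Instruction: {instruction}\n\n"
        f"Trajectory:\n{rendered_trajectory}\n\n"
        "Does the trajectory actually achieve the instruction's goal, and are the "
        "actions coherent, natural, and not repetitive? Answer YES or NO."
    )
    return all(judge(prompt).strip().upper().startswith("YES") for judge in judges)
```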
5. Data Usage: In-Context Learning and Fine-Tuning
Once you have a high-quality set of (instruction, trajectory) pairs, you can use them in two ways:
In-Context Learning (ICL)
At inference time, retrieve the most relevant examples (via the instruction or the current observation) and prepend them to your LLM’s prompt. Learn-by-Interact suggests two retrieval methods:
Observation-based Retrieval: Look for states or observations that match the current environment.
Model-based Retrieval: Ask the LLM to form a query based on the instruction, history, and current observation, and retrieve examples that match.
Both approaches provide the LLM with helpful hints for the next action.
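The following sketch illustrates both retrieval variants. The embedding function, the cosine-similarity scoring, and the query-writing prompt are assumptions for illustration; the paper's retrievers may differ in detail.

```python
from typing import Callable, List, Sequence, Tuple

def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb + 1e-9)

def top_k(query: str,
          corpus: Sequence[Tuple[str, str]],          # (retrieval key, demonstration text)
          embed: Callable[[str], List[float]],
          k: int = 3) -> List[Tuple[str, str]]:
    """Return the k stored examples whose keys are most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda kv: _cosine(embed(kv[0]), q), reverse=True)[:k]

def observation_based(current_observation, corpus, embed, k=3):
    # Match the agent's current state against stored observations.
    return top_k(current_observation, corpus, embed, k)

def model_based(call_llm, instruction, history, current_observation, corpus, embed, k=3):
    # Let the LLM phrase the retrieval query from the task, the history, and the state.
    query = call_llm(
        f"Task: {instruction}\nHistory so far: {history}\nCurrent state: {current_observation}\n"
        "Write a short search query for retrieving relevant past examples."
    )
    return top_k(query, corpus, embed, k)
```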
Fine-Tuning
For smaller or open-source models, you can do full or LoRA-based fine-tuning on the synthetic data. This allows the model to handle recurring scenarios directly, without relying on lengthy prompts at inference time.
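As an illustration, a LoRA setup with Hugging Face transformers and peft might look like the sketch below. The base model name, the target modules, and the way (instruction, trajectory) pairs are flattened into training text are assumptions to adapt to your own setup; the resulting strings can then be fed to any standard causal-LM trainer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder: any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which modules get adapters is model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

def to_text(instruction, steps):
    """Flatten one synthetic pair into a training string (Step objects from the sketch above)."""
    turns = "\n".join(f"Observation: {s.observation}\nAction: {s.action}" for s in steps)
    return f"Instruction: {instruction}\n{turns}"
```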
Performance and Limitations
Tests on agent benchmarks like SWE-bench (GitHub issue resolution) and OSWorld (desktop tasks) show that adding synthetic data from Learn-by-Interact can significantly boost success rates. In some experiments, performance nearly doubled for specific tasks. However, agent performance is still far from perfect: Claude 3.5 Sonnet’s best results on four benchmarks were only around 60% accuracy. While promising, the method leaves considerable room for improvement.
Read the full papers:
[1] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
[2] Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments