When an LLM Stops Talking and Starts Deleting Files — Enter the Age of LAMs
Omri & Mike’s Daily Paper: 16.08.25, Large Action Models: From Inception to Implementation
What are Large Action Models (LAMs), and how are they different from LLMs?
Bottom line: a LAM is an LLM, but one that is trained and adapted specifically to produce executable actions in a real environment. While a standard LLM is trained to produce coherent, high-quality text, a LAM is trained to generate programs and commands that can be executed through an agent, whether that’s a click, a keystroke, or an API call, directly affecting the world rather than just “talking about it.”
The authors argue that instead of connecting LLMs to agent environments, we should connect a LAM that essentially serves as the decision-making engine inside the agent loop: the agent collects observations from the environment (e.g., screen state, list of available buttons, or API data), feeds them into the LAM, and the LAM outputs the next action to take. The agent then executes the action and returns feedback about the outcome, enabling the LAM to update its subsequent decisions.
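To make the loop concrete, here is a minimal sketch in Python. The interfaces (lam.predict_action, env.observe, env.execute, the task_complete flag) are placeholders assumed for illustration, not the paper’s actual API:

```python
# Minimal sketch of the agent loop described above. The interfaces
# (lam.predict_action, env.observe, env.execute) are hypothetical placeholders
# for illustration, not the paper's actual API.

def run_agent(lam, env, task, max_steps=20):
    history = []
    for _ in range(max_steps):
        state = env.observe()                      # screen state, controls, API data
        action = lam.predict_action(task=task,     # LAM decides the next action
                                    state=state,
                                    history=history)
        feedback = env.execute(action)             # click / keystroke / API call
        history.append((state, action, feedback))
        if feedback.get("task_complete"):          # environment signals completion
            break
    return history
```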
This is where the critical difference lies, and with it the risk. A “classic” LLM error typically manifests as a wrong answer or hallucination, harming understanding or trust but without direct real-world consequences. By contrast, an error by a LAM can cause real damage: deleting an important file, sending a message to the wrong address, or executing an unwanted business operation.
The researchers’ experiments were conducted only on Windows, focusing on Microsoft Word tasks. They connected the LAM to UFO, a GUI agent for Windows. The agent reads the UI state (a list of controls with type, title, and index), passes this information to the LAM for decision-making, and then executes the chosen action (mouse click, typing, or API call).
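As a rough illustration of what the agent and the LAM might exchange, the observation and the grounded action could look something like the following; the field names are my guesses, not UFO’s actual schema:

```python
# Illustrative shapes for the observation the agent passes to the LAM and the
# grounded action it gets back. Field names are assumptions, not UFO's schema.

observation = {
    "task": "Make the selected text bold",
    "controls": [
        {"index": 0, "type": "Button", "title": "Bold"},
        {"index": 1, "type": "Button", "title": "Italic"},
        {"index": 2, "type": "Edit", "title": "Document body"},
    ],
}

# The LAM answers with a single action over one of those controls.
action = {
    "operation": "click",    # or "type_text", or an API call
    "control_index": 0,      # the "Bold" button above
    "arguments": {},
}
```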
The proposed pipeline has 5 stages: Data → Training → Integration & Grounding → Offline Eval → Online Eval.
Throughout the paper, a clear distinction is drawn between Task Planning and Task Action. During data collection, they first gathered Task→Plan data, then converted these steps into concrete trajectories in Word: selecting a specific button, defining the action type and parameters, so the agent can run them and verify success or failure. This process is called Grounding: anchoring the model’s textual output to a real UI and deterministic operational actions.
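A toy example of what grounding might look like in code, assuming a naive title-matching heuristic (the paper grounds steps through the UFO agent, not through a function like this):

```python
# Toy illustration of grounding: mapping a textual plan step onto a concrete
# control in the current UI state. The title-matching heuristic is mine; the
# paper grounds steps through the UFO agent, not through a function like this.

def ground_step(plan_step, controls):
    """Return a deterministic click action for the control named in the step."""
    step = plan_step.lower()
    for control in controls:
        if control["title"].lower() in step:
            return {"operation": "click", "control_index": control["index"]}
    return None  # no matching control: this step cannot be grounded

controls = [{"index": 0, "type": "Button", "title": "Bold"}]
print(ground_step("Click the Bold button on the Home tab", controls))
# -> {'operation': 'click', 'control_index': 0}
```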
LAM1: trained with SFT only on Task→Plan (𝑡ᵢ→𝑃ᵢ). The idea is to first teach the model how to break tasks into logical steps before choosing actual actions. They used ~76.7K examples from sources like help guides, WikiHow, and historical queries, carefully cleaned and processed for consistency and quality.
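For illustration, a single Task→Plan record might look roughly like this (the schema is my assumption, not the paper’s exact format):

```python
# Illustrative Task -> Plan SFT record (the schema is an assumption; the paper
# drew such pairs from help guides, WikiHow, and historical queries).

task_plan_example = {
    "task": "Insert a 3x3 table into the document",
    "plan": [
        "Open the Insert tab on the ribbon",
        "Click the Table button",
        "Select a 3x3 grid in the table picker",
    ],
}
```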
LAM2: shifted focus to State→Action, imitating successful GPT-4o trajectories (𝑠ₜ→𝑎ₜ). Each example contained the current UI state (list of controls with type, title, index + task text) and the exact action executed. These success trajectories were derived from LAM1’s Task→Plan data, then grounded into real Word UI actions, executed live, and filtered for successful runs. The dataset ended with 2,192 successful trajectories used for SFT.
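An illustrative State→Action record, again with an assumed schema, might look like this:

```python
# Illustrative State -> Action record: the UI state at step t, the task, and
# the action that was actually executed in a successful run (schema assumed).

state_action_example = {
    "task": "Make the heading bold",
    "state": {
        "controls": [
            {"index": 0, "type": "Button", "title": "Bold"},
            {"index": 1, "type": "Button", "title": "Italic"},
        ],
    },
    "action": {"operation": "click", "control_index": 0, "arguments": {}},
    "outcome": "success",  # only successful trajectories were kept for SFT
}
```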
LAM3: continued SFT on State→Action, but introduced Self-Boosting: they took GPT-4o failure trajectories, let the LAM2-trained model retry, and collected new successes. This produced high-quality additional data without manual annotation, covering harder cases.
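A minimal sketch of the self-boosting idea, reusing the run_agent loop sketched earlier and a hypothetical succeeded check:

```python
# Sketch of the self-boosting step: retry tasks where GPT-4o failed, using the
# LAM2-stage model, and keep newly successful trajectories as extra SFT data.
# run_agent is the loop sketched earlier; succeeded() is a placeholder check
# that the environment verified the task as completed.

def self_boost(lam2, env, failed_tasks, succeeded):
    new_examples = []
    for task in failed_tasks:
        trajectory = run_agent(lam2, env, task)
        if succeeded(trajectory):  # keep verified successes only
            new_examples.extend(
                {"task": task, "state": state, "action": action}
                for state, action, _ in trajectory
            )
    return new_examples  # additional State -> Action SFT data, no manual labels
```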
LAM4: moved from SFT to RL, applying Offline PPO guided by a Reward Model (RM). The RM was based on LAM3, with an added layer that scores each action. It was fine-tuned with LoRA on both success and failure trajectories: +1 for successful steps, −1 for failed ones, and trained with MSE to predict these scores. Then they used the RM to train LAM4 with Offline PPO, this time focusing on 1,788 failure trajectories from LAM3 — in order to “learn from mistakes.” The training format was (𝑠ₜ, 𝑟ₜ) → 𝑎ₜ, with the RM providing the reward.
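As a toy sketch of the reward-model objective, a small linear head below stands in for the LAM3 backbone with LoRA; only the MSE-against-±1-labels part mirrors the paper:

```python
# Toy sketch of the reward-model objective: a scalar head trained with MSE
# against +1 (successful step) / -1 (failed step) labels. In the paper the RM
# is built on LAM3 and fine-tuned with LoRA; this linear head is a stand-in.

import torch
import torch.nn as nn

hidden = torch.randn(8, 16)   # stand-in for per-step representations
labels = torch.tensor([1., -1., 1., 1., -1., 1., -1., 1.])  # step outcomes

reward_head = nn.Linear(16, 1)
optimizer = torch.optim.Adam(reward_head.parameters(), lr=1e-3)

for _ in range(100):
    scores = reward_head(hidden).squeeze(-1)        # predicted per-step reward
    loss = nn.functional.mse_loss(scores, labels)   # regress toward +1 / -1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```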
The paper reports three types of evaluation: Planning, Offline Eval, and Online Eval.
In the first two, they measured task-level planning success, the quality of step breakdowns, and accuracy in selecting the correct object and action; across the four training stages, the models improved steadily, from roughly matching the baselines to showing consistent gains over them. In the third, live runs in Windows/Word, the text-only LAM performed competitively against GPT-4o, even surpassing it in some purely text-based configurations. However, when GPT-4o was enhanced with vision, it achieved higher success rates, albeit at slower speed and with reduced efficiency.
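For a sense of what the offline step-level metrics might compute, here is a rough sketch; the record format and metric names are my assumptions based on the description above:

```python
# Rough sketch of offline step-level metrics: did the predicted action pick
# the right object (control) and the right operation? Record format and
# metric names are assumptions based on the description above.

def offline_metrics(predictions, references):
    object_ok = operation_ok = step_ok = 0
    for pred, ref in zip(predictions, references):
        obj_match = pred["control_index"] == ref["control_index"]
        op_match = pred["operation"] == ref["operation"]
        object_ok += obj_match
        operation_ok += op_match
        step_ok += obj_match and op_match
    n = len(references)
    return {
        "object_accuracy": object_ok / n,
        "operation_accuracy": operation_ok / n,
        "step_success_rate": step_ok / n,
    }
```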
The authors suggest that as agents become more common and capable, we will likely see more of this type of work: connecting models to real environments, training on task-specific data, and adapting them to particular domains. It’s not clear that the exact recipe of three SFT stages followed by one RL stage is optimal, but the direction of turning LLMs into more task-oriented, structured, domain-adapted models seems like a natural step in an era where more and more agents act in the real world.
https://arxiv.org/abs/2412.10047