Can Language Models Learn How to Teach Themselves?
Eden and Mike’s Daily Paper, 29.12.25, SEAL: Self-Adapting Language Models
Introduction
LLMs are trained on massive amounts of data and, as a result, exhibit excellent capabilities across a wide range of tasks and domains. However, they often perform poorly on tasks that require information absent from their training data. Consequently, the authors of the paper pose the following question:
Can the models themselves generate the training data they need for a task and decide how to train/teach themselves using that information? The researchers argue that just as humans studying for an exam organize information in whatever way is most convenient for them to learn (e.g., a diagram, a table, etc.), language models should be given the same capability. They frame this ability as a Reinforcement Learning (RL) problem, where an RL algorithm learns, given new information, the best way to teach the language model. In this context, “teaching” means updating the model’s weights. They call this method SEAL.
The Method
The method operates given a task context and an evaluation procedure for the task’s solution. An example is the ARC benchmark, where the context consists of the few-shot examples provided for each question.
At the heart of the method lies the concept of Self-edits (SE). This term describes the action the model can perform: rewriting the given context to improve its own learning on the task. For example, given a paragraph about which questions will be asked, the model can rewrite the paragraph into a different format (following a specific instruction from the researchers) and fine-tune on this rewritten form (the SE) if doing so leads to improved performance.
The researchers also experimented with creating SEs in other ways, such as rewriting the paragraph or writing questions and answers based on the paragraph.
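As a toy illustration, generating candidate SEs amounts to prompting the model with a rewriting instruction and sampling several completions. The prompt template and the `model_generate` callable below are assumptions for illustration, not the paper’s exact implementation:

```python
# Sketch: sample candidate self-edits (SEs) for one passage.
SE_PROMPT = (
    "Rewrite the following passage as a list of standalone implications "
    "that could be used as training data:\n\n{passage}\n\nImplications:"
)

def build_self_edit_prompt(passage: str) -> str:
    """Format the instruction asking the model to produce a self-edit."""
    return SE_PROMPT.format(passage=passage)

def generate_self_edits(model_generate, passage: str, n: int = 5) -> list:
    """Sample n candidate SEs; `model_generate` is any prompt -> completion
    callable (e.g., a wrapper around an LLM API)."""
    prompt = build_self_edit_prompt(passage)
    return [model_generate(prompt) for _ in range(n)]
```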
Technical Implementation:
Parallel Processing: For a given context, the authors sample many candidate SEs in parallel.
LoRA: They use LoRA (a fast and memory-efficient fine-tuning method) to perform a lightweight supervised fine-tuning update on each SE separately; the RL itself happens in the outer loop. This creates a large number of different versions of the same model, each corresponding to a fine-tune on its specific SE.
Evaluation: Each version undergoes an evaluation phase and receives a score based on its defined task.
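This inner loop can be sketched as follows, with stub functions standing in for the actual LoRA fine-tuning (e.g., via the `peft` library) and the benchmark evaluation:

```python
def finetune_lora(base_model, self_edit):
    """Stub: a real version would train a fresh LoRA adapter on this SE."""
    return {"base": base_model, "trained_on": self_edit}

def score_self_edits(base_model, self_edits, evaluate):
    """Fine-tune one adapter per SE and evaluate each adapted model.
    `evaluate` maps an adapter to a task score; in SEAL this would run
    the held-out questions for the context."""
    results = []
    for se in self_edits:
        adapter = finetune_lora(base_model, se)  # one LoRA adapter per SE
        results.append((se, evaluate(adapter)))
    return results
```

Because each SE gets its own small adapter rather than a full copy of the weights, many candidates can be trained and scored cheaply in parallel.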
The Reward System:
The training reward was defined as binary:
1: If using the SE led to a model update (via Fine-Tuning) that improved results (based on the evaluation performed).
0: If the SE did not lead to improvement.
This reward is used by the RL algorithm to learn the optimal way to write an SE so that it serves the model best during the Fine-Tuning (FT) stage.
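In code, the binary reward is just a comparison of evaluation scores before and after the SE-driven fine-tune (a sketch, assuming the scores are task accuracies):

```python
def binary_reward(score_after: float, score_before: float) -> int:
    """1 if fine-tuning on the self-edit improved the evaluation score,
    0 otherwise (no credit for ties or regressions)."""
    return 1 if score_after > score_before else 0
```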
RL Algorithm:
The authors chose ReST (Reinforced Self-Training; the paper uses the ReST^EM variant) as their RL algorithm. In this approach, the current model generates several candidate SEs, and only the candidates that yielded a positive reward are used for the learning step.
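One such round can be sketched as rejection sampling followed by supervised fine-tuning on the survivors; the SFT call itself is stubbed out here:

```python
def rest_round(self_edits, rewards, sft_step):
    """One outer-loop round: keep only SEs with positive reward, then
    fine-tune the model on the surviving SEs via `sft_step` (stub for a
    real SFT update, i.e., behavioral cloning on successful samples)."""
    kept = [se for se, r in zip(self_edits, rewards) if r > 0]
    if kept:
        sft_step(kept)
    return kept
```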
Self-Improving Capabilities
The researchers demonstrate how the method can also be used when the context consists of several solved examples for the task (i.e., few-shot). They use the ARC benchmark as an example, where the model must demonstrate reasoning and generalization to answer various tasks using a small set of solved examples.
In this environment, the SEs the model could create included:
Tool Use: Applying tools to augment the context examples to create additional examples (e.g., inversion, grouping, etc.).
Training Parameters: Optimizing parameters for its own training, such as the number of epochs, learning rate, and more.
This creates a scenario where the model can learn how to optimize its own training process for a task to improve its performance. In other words, we obtain a type of self-improving model.
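A toy sketch of what such an ARC self-edit might look like: a set of augmentation tools applied to the few-shot grids, plus chosen training hyperparameters. The transform names and config fields are illustrative assumptions, not the paper’s schema:

```python
def augment_grids(grids, ops):
    """Expand few-shot ARC-style grids with simple transforms (toy set)."""
    transforms = {
        "flip_h": lambda g: [row[::-1] for row in g],   # mirror left-right
        "flip_v": lambda g: list(reversed(g)),          # mirror top-bottom
        "transpose": lambda g: [list(r) for r in zip(*g)],
    }
    out = list(grids)
    for op in ops:
        out.extend(transforms[op](g) for g in grids)
    return out

# The SE also picks its own training parameters (hypothetical fields).
self_edit = {
    "augmentations": ["flip_h", "transpose"],
    "training": {"epochs": 3, "learning_rate": 1e-4},
}
```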
Results
To test their method on ARC, the researchers used the Llama-3.2-1B model. They showed that even though the model encounters a task (ARC) it has never seen before, it learns and improves using the proposed method; the paper claims no ARC data was used in the model’s initial training.
They compared their method against the following approaches:
In-context learning (ICL)
SEAL without RL (i.e., including only the SFT stage)
The results achieved using SEAL were superior to the other methods.
For the first setting, the one involving paragraphs, they performed a comparison on the SQuAD dataset. Here they used Qwen2.5-7B models and tested their method against several baselines, including one that uses GPT-4.1 as a “teacher model” to create the SEs (without RL). In some cases SEAL yielded better results, while in others the teacher-model approach performed better.




