Encoder-Only Transformers as Semantic Reward Models
Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO, Mike’s daily paper review: 20.10.25
Aligning LLMs to produce explanations that are not just correct but also pedagogically sound is a significant challenge. Conventional approaches to this alignment tend to sit at one of two extremes. On one hand, using a powerful “LLM-as-a-judge” to score outputs is slow, computationally expensive, and prone to bias. On the other, relying on simple lexical-overlap metrics like ROUGE is efficient but shallow, since it rewards shared wording rather than conceptual understanding. This paper carves out a third path.
The central innovation is the use of a small, efficient, encoder-only transformer as a specialized semantic reward model. Instead of judging an explanation with natural language or counting keywords, this approach operates in the abstract space of vector embeddings. The methodology is elegantly simple: both the model-generated explanation and a ground-truth reference explanation are passed through the encoder, which converts each piece of text into a dense, numerical vector. The cosine similarity between these two vectors is then calculated. This single numerical score, which captures the conceptual and structural alignment between the two explanations, becomes the primary reward signal. A high similarity indicates that the generated text is semantically close to the expert-written ideal, incentivizing the model to learn the underlying reasoning structure, not just surface-level features.
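As a minimal sketch of the similarity step described above, the snippet below computes the cosine similarity between two embedding vectors. The short toy vectors stand in for the encoder's outputs and are purely illustrative; the review does not say which encoder or embedding dimension the paper uses, and the function name is an assumption.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy 4-dimensional vectors standing in for encoder embeddings (illustrative only).
generated = [0.2, 0.7, 0.1, 0.4]     # embedding of the model's explanation
reference = [0.25, 0.65, 0.05, 0.5]  # embedding of the ground-truth explanation
unrelated = [0.9, -0.3, 0.6, -0.1]   # embedding of an off-topic text

close_score = cosine_similarity(generated, reference)
far_score = cosine_similarity(generated, unrelated)
print(close_score > far_score)
```

In a real pipeline the vectors would come from pooling the encoder's token representations, a choice this sketch deliberately leaves out.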
This semantic reward mechanism is deployed within the Group Relative Policy Optimisation (GRPO) framework. GRPO is a reinforcement learning algorithm that, for each prompt, samples a group of responses and scores each one. It then updates the model’s policy based on how each response’s score compares to the average score of its group. Because the group average serves as the baseline, no separate learned value model is needed, which keeps applying the reward signal simple and efficient.
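The group-relative comparison can be sketched as follows. This is an illustration of the commonly used GRPO advantage, which subtracts the group mean and also divides by the group standard deviation; the division and the `eps` stabiliser are assumptions taken from the standard formulation, not details stated in this review.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and std of its own group."""
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    # eps avoids division by zero when every response gets the same reward.
    return centered / (rewards.std() + eps)


# Rewards for four responses sampled from the same prompt (made-up numbers).
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
print(advantages)
```

Responses scoring above the group average receive a positive advantage and are reinforced; those below it are pushed down. If all responses score the same, every advantage is zero and the group contributes no learning signal.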
The paper’s proposed reward function is not monolithic; it’s a multi-component system designed to shape the model’s output holistically. The total reward is a sum of four distinct signals:
Semantic Similarity: The core component, calculated as described above using the encoder model. To make the signal more discriminative, the final reward is an adjusted cosine similarity, where a baseline similarity (calculated from an average of random explanation embeddings) is subtracted from the raw score. This is a crucial detail, as it prevents the model from being rewarded for generating text that is generically similar to any explanation.
Factual Accuracy: A simple binary reward for whether the final answer is correct.
Structural Correctness: A rule-based reward that checks if the output is formatted correctly with the required XML tags.
Reasoning Transparency: A reward for including a non-empty “chain-of-thought” within designated tags.
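The four signals above could be combined roughly as in the sketch below. Several details are assumptions made for illustration: the tag names `reasoning` and `answer` are hypothetical (the review only mentions "required XML tags"), the components are summed with equal weight, exact string match is used for answer correctness, and the baseline is read as the generated text's similarity to the mean of random explanation embeddings, which is one plausible interpretation of the adjustment.

```python
import re
import numpy as np

# Hypothetical tag names: the review does not specify the actual tags.
REASONING_TAG = "reasoning"
ANSWER_TAG = "answer"


def extract_tag(text, tag):
    """Return the stripped content of <tag>...</tag>, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def reward_components(output, gold_answer, gen_emb, ref_emb, random_embs):
    reasoning = extract_tag(output, REASONING_TAG)
    answer = extract_tag(output, ANSWER_TAG)

    # Assumed reading of the baseline: similarity to a "generic" explanation,
    # taken as the mean of random explanation embeddings.
    generic_emb = np.mean(np.asarray(random_embs, dtype=float), axis=0)
    semantic = cosine(gen_emb, ref_emb) - cosine(gen_emb, generic_emb)

    components = {
        "semantic": semantic,
        "accuracy": 1.0 if answer == gold_answer else 0.0,
        "structure": 1.0 if reasoning is not None and answer is not None else 0.0,
        "transparency": 1.0 if reasoning else 0.0,
    }
    components["total"] = sum(components.values())  # equal weights (assumed)
    return components


# Toy embeddings and outputs, for illustration only.
ref_emb = [0.25, 0.65, 0.05, 0.5]
gen_emb = [0.2, 0.7, 0.1, 0.4]
random_embs = [[0.9, -0.3, 0.6, -0.1], [-0.2, 0.1, 0.8, 0.3]]
good = "<reasoning>Warm air is less dense, so it rises.</reasoning><answer>B</answer>"
scores = reward_components(good, "B", gen_emb, ref_emb, random_embs)
print(scores)
```

Returning the components separately, rather than only the total, makes it easier to see which signal is driving the reward during training.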
In essence, the novelty of this work lies in its pragmatic and effective solution to reward shaping. It demonstrates that a small, computationally inexpensive encoder model can provide a semantically rich reward signal that is more nuanced than lexical metrics and more efficient and stable than a large LLM-as-a-judge. By integrating this semantic score into a multi-faceted reward function within the GRPO framework, the paper presents a powerful methodology for guiding LLMs toward producing higher-quality, conceptually aligned explanations.
https://arxiv.org/abs/2509.13081

