Introduction
In traditional supervised machine learning (ML), models are trained on a fixed dataset, learning patterns from known examples to generate predictions on unseen data. This paradigm follows three fundamental stages: data collection, training, and inference. However, real-world data is rarely static—it evolves, introducing challenges that conventional ML approaches may struggle to address effectively.
To address this challenge, we adopt online learning, an ML paradigm in which models are continuously updated as new data becomes available. Instead of being trained once on a fixed dataset, online learning allows models to evolve dynamically, adapting to changes in data distribution and maintaining accuracy over time.
What is Online Learning?
Online learning techniques allow models to learn incrementally from a continuous stream of data, rather than being trained on a fixed dataset. This approach offers significant advantages in scenarios that demand real-time model adaptation.
The online learning approach plays a pivotal role in a wide range of practical applications. Below is a partial list of such use cases:
Streaming analytics, including stock market predictions and sensor-based monitoring.
Dynamic environments, such as recommendation systems and fraud detection.
Real-time decision-making, applied in autonomous systems and personalized healthcare.
By continuously updating models with new data, online learning ensures adaptability and responsiveness in dynamic and evolving environments.
Online learning is particularly valuable in applications where immediate adaptation to new data is critical for maintaining accuracy while minimizing processing delays. Notable examples include anomaly detection in cybersecurity, real-time fraud detection, and adaptive medical diagnostics. In these domains, continuous model updates not only enhance responsiveness but also improve time efficiency by eliminating the need for frequent full retraining. This enables timely identification of emerging threats, fraudulent activities, or evolving medical conditions, ensuring that decision-making remains both accurate and computationally efficient.
Why is Online Learning Necessary?
A model trained on past data may perform well on historical patterns but struggle with new, unseen data. To maintain accuracy, periodic updates are needed. This raises two fundamental questions:
When should we update the model?
How should we update the model?
A separate blog post on Medium addresses the former question, while this post focuses on the latter.
The primary approach to maintaining model performance is retraining it to adapt to newly available data. However, this process is computationally intensive and time-consuming, which makes frequent retraining impractical. Furthermore, the latency associated with model updates can degrade model performance, particularly in real-time applications where timely responses are critical.
As a result, alternative strategies such as incremental or online learning are often preferred. These approaches allow models to adapt gradually through successive updates, eliminating the need for complete retraining and thereby offering greater efficiency when handling continuous data streams. However, in scenarios where the underlying data distribution experiences significant changes, incremental updates alone may prove insufficient. In such cases, full retraining of the model may be necessary to maintain accuracy and ensure reliable performance.
To address this challenge, two main approaches are commonly used:
On-device updates, where the model is updated directly on the device, ensuring real-time adaptability.
Offline training, where model updates occur in the background, often on a server, allowing for more computationally intensive adjustments.
Choosing the right approach depends on factors such as computational constraints, update frequency, and system latency requirements.
Fully online learning
On-device updates are particularly well-suited for algorithms with minimal computational demands, rendering them a critical enabler of fully online learning. Fully online learning refers to a learning paradigm in which a model incrementally updates its parameters after the arrival of each new data point, rather than processing data in large batches or relying on periodic retraining. This continuous learning process ensures that the model can adapt promptly to new information.
Unlike conventional background training, which typically requires an offline server or cloud infrastructure to perform computationally intensive updates, fully online learning allows updates to be performed directly on the device in real-time. This approach eliminates the latency and privacy concerns associated with transmitting data to external servers.
As a result, the model remains consistently aligned with the most recent user behavior or environmental context, making it particularly advantageous for applications that demand immediate adaptability—such as mobile personalization, adaptive user interfaces, and edge AI deployments. Nevertheless, due to inherent hardware and energy constraints, on-device learning is currently best suited to lightweight models with streamlined architectures.
Fully Online Learning Example: Exponential Moving Average (EMA)
The first example comes from the field of financial analysis, specifically the Exponential Moving Average. In this approach, more recent data points are assigned higher weights, while older data points gradually diminish in importance over time. Mathematically, the weighted average WA(t) at time t is given by:
WA(t) = x(t) + w·x(t−1) + w²·x(t−2) + … + wᵗ·x(0),
where 0 < w < 1 is a damping coefficient, ensuring that older values contribute progressively less to the moving average.
Because each update involves only a lightweight computation, the EMA fits naturally into the fully online setting, where model parameters are refreshed as soon as each new data point arrives.
A common optimization technique employed in such settings is Horner's method, also known as Horner representation. This method is a computationally efficient algorithm for evaluating polynomials by restructuring them into a nested form that significantly reduces the number of arithmetic operations required. Specifically, instead of evaluating a polynomial in its standard form:
P(x) = aₙxⁿ + aₙ₋₁xⁿ⁻¹ + ... + a₁x + a₀,
Horner’s method rewrites it as:
P(x) = (...((aₙx + aₙ₋₁)x + aₙ₋₂)x + ...) + a₀,
This transformation minimizes the number of multiplications, thereby reducing computational complexity. The saving is particularly valuable for on-device learning, where computational resources, memory, and energy are constrained and updates must happen in real time without relying on external servers or cloud infrastructure, provided the model architecture remains lightweight enough for the device.
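For concreteness, here is a minimal Python sketch of Horner's method; the function name and the example polynomial are ours, purely for illustration.

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x using Horner's nested form.

    coeffs are ordered from the highest-degree term down to the constant,
    i.e. [a_n, a_{n-1}, ..., a_1, a_0].
    """
    result = 0.0
    for a in coeffs:
        # One multiplication and one addition per coefficient.
        result = result * x + a
    return result


# P(x) = 2x^3 - 5x + 1 evaluated at x = 3: 2*27 - 15 + 1 = 40
print(horner([2, 0, -5, 1], 3))  # 40.0
```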
By leveraging the Horner representation, this equation can be rewritten in a more computationally efficient recursive form:
WA(t) = x(t) + w·WA(t−1).
This reformulation highlights a key advantage: each update requires only two arithmetic operations (one multiplication and one addition), making it extremely lightweight. Due to its efficiency, EMA is widely used in real-time financial analytics, such as stock price smoothing, trend detection, and volatility analysis, where rapid updates are essential without significant computational overhead.
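Below is a minimal sketch of this recursive update, assuming the unnormalized weighted average defined above; the class name and the sample prices are illustrative. (Practical financial EMAs usually rescale the new observation by a 1 − w factor, which does not change the two-operation structure.)

```python
class OnlineEMA:
    """Exponentially weighted average updated one observation at a time."""

    def __init__(self, damping=0.9):
        self.w = damping   # 0 < w < 1: how strongly older values persist
        self.value = None  # current WA(t)

    def update(self, x):
        if self.value is None:
            self.value = x  # first observation initializes the average
        else:
            # Horner-style recurrence: one multiplication and one addition.
            self.value = x + self.w * self.value
        return self.value


ema = OnlineEMA(damping=0.9)
for price in [100.0, 101.5, 99.8, 102.3]:
    print(ema.update(price))
```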
Fully Online Learning Example: Stochastic Gradient Descent (SGD)
Another important case in the context of online learning arises when the problem is convex, meaning that every local minimum of the objective function is also a global minimum. In such scenarios, stochastic gradient descent provides an effective mechanism for continuously updating model parameters as new data arrives. It is worth highlighting that in machine learning, SGD-based techniques (such as mini-batch gradient descent, Adam, or RMSProp) are also commonly used to optimize highly non-convex models, such as deep neural networks, in the standard, non-online setting.
With SGD, given a loss function L and a set of weights w, the weights are updated incrementally using the rule:
w_new = w_old − η∇L(w_old),
where η is the learning rate (step size), and ∇L(w_old) represents the gradient of the loss function computed at the current weight values. This iterative update process allows models to adapt efficiently in real time, making SGD particularly well suited to online learning applications where continuous adjustments are necessary to accommodate evolving data distributions.
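A minimal sketch of how this rule is applied one example at a time is shown below, assuming a linear model with squared loss; all variable and function names are illustrative.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.01):
    """One online SGD update for a linear model with squared loss.

    Per-example loss: L(w) = 0.5 * (w @ x - y)**2
    Gradient:         dL/dw = (w @ x - y) * x
    """
    grad = (w @ x - y) * x
    return w - lr * grad  # w_new = w_old - eta * grad(L)(w_old)


# Stream of (x, y) pairs arriving one at a time.
w = np.zeros(3)
stream = [(np.array([1.0, 2.0, 0.5]), 3.0),
          (np.array([0.5, 1.0, 1.0]), 2.0)]
for x, y in stream:
    w = sgd_step(w, x, y, lr=0.1)
print(w)
```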
Fully Online Learning Example: Support Vector Machines (SVM)
Support Vector Machines are supervised learning algorithms widely used for classification and regression tasks. They operate by identifying the optimal hyperplane that maximally separates data points from different classes in a high-dimensional feature space.
SVMs can be effectively adapted to the online learning setting, wherein the decision boundary, or separating hyperplane, is incrementally updated as new data points become available. This continuous update mechanism enables the model to remain responsive to shifts in the underlying data distribution without necessitating complete retraining from scratch. Such adaptability is particularly beneficial in dynamic environments where timely updates are essential, such as in real-time systems or streaming data applications.
A key challenge in the implementation of online SVMs lies in maintaining computational efficiency. Since support vectors are critical in defining the decision boundary, their retention and update strategies must be carefully designed to avoid excessive memory usage and prohibitively slow updates. Without effective pruning mechanisms, the number of support vectors can increase substantially over time, undermining the scalability of the algorithm. Therefore, achieving a balance between adaptability and resource constraints is essential for the efficient deployment of online SVMs. For further discussion, see Lu et al., 2012 and Cauwenberghs & Poggio, 2001.
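One practical way to approximate online SVM behavior (a sketch, not the only formulation) is to train a linear SVM incrementally with scikit-learn's SGDClassifier, using hinge loss and partial_fit on each chunk of arriving data; the simulated stream below is purely illustrative, and no support-vector pruning is performed.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hinge loss + L2 penalty corresponds to a linear SVM trained with SGD.
clf = SGDClassifier(loss="hinge", alpha=1e-4)

rng = np.random.default_rng(0)
classes = np.array([0, 1])

for step in range(100):
    # Simulated stream: a small chunk of labeled data arrives at each step.
    X = rng.normal(size=(10, 5))
    y = (X[:, 0] > 0).astype(int)
    # `classes` must be supplied on the first call to partial_fit.
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(100, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print(clf.score(X_test, y_test))
```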
Up to this point, we have discussed fully online learning, a widely used approach for continuously updating models in real time. However, in many cases, the model may be too large or computationally demanding for this method to be practical.
Background and Batch Training
An alternative strategy to fully online learning is background training, where new data is collected and processed on a separate server rather than updating the model directly on the device. This approach allows for more efficient use of computational resources while still incorporating new information. Background training can be implemented through two primary methods: batch training and incremental background training.
Batch Training: The model is periodically retrained on newly collected data, treating it as a separate batch. This method ensures consistency but requires careful tuning of batch size to balance efficiency and model freshness.
Incremental Background Training: Instead of retraining the entire model, parameter-efficient fine-tuning updates only a small subset of parameters (sometimes newly added, as in LoRA's low-rank matrices). These techniques typically modify far fewer parameters than traditional transfer learning while maintaining comparable performance. This dramatically reduces memory requirements, training time, and computational costs, making it feasible to adapt large foundation models to specific domains or tasks even with limited resources.
In batch training, the model undergoes periodic updates by incorporating newly collected data as a separate batch. This method is particularly well-suited for deep learning, as it enables consistent model refinement without requiring a complete retraining process for every update. However, selecting the appropriate batch size presents trade-offs. If the batch size is too small, frequent updates may become inefficient due to the computational and communication overhead associated with uploading new models. Conversely, if the batch size is too large, the model may become outdated before the next update, leading to suboptimal performance.
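The sketch below shows one way batch-style background training might be orchestrated; the train_step callable, batch size, and names are assumptions for illustration, not any specific framework's API.

```python
def background_batch_update(train_step, data_stream, batch_size=512):
    """Collect newly arrived examples and trigger a model update per batch.

    `train_step(batch)` is assumed to update the current model on the new
    batch (e.g., a few epochs of fine-tuning on a background server).
    """
    batch = []
    for example in data_stream:
        batch.append(example)
        if len(batch) >= batch_size:
            train_step(batch)   # periodic background update on fresh data
            batch = []          # start collecting the next batch
    if batch:
        train_step(batch)       # flush any remaining examples
```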
Anyone who has trained a deep neural network is familiar with the term batch (sometimes called a minibatch), and it carries a similar meaning here. As with training deep learning models on time series data, however, it is essential to note that here the batch refers to newly arrived data rather than a slice of a fixed training set.
Background Training Techniques: Transfer Learning (LoRA and Adapters)
A more computationally efficient alternative within background training is transfer learning, where only a subset of the model's parameters is updated instead of retraining the entire network. This approach is particularly advantageous for deep learning models, as it allows for selective adaptation by fine-tuning only specific layers, typically the final or task-specific layers, while keeping the earlier layers frozen (i.e., unchanged during fine-tuning). By leveraging pre-trained representations, transfer learning facilitates faster adaptation to new data while significantly reducing computational costs, making it a practical solution for large-scale applications that require efficient background model updates.
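A minimal PyTorch-style sketch of this idea is shown below, using a small stand-in network rather than a real pre-trained model; only the final layer's parameters remain trainable.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained network: a feature extractor plus a task head.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),  # earlier layers, assumed pre-trained
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),              # final, task-specific layer
)

# Freeze everything, then unfreeze only the final layer.
for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # 330
```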
To further enhance computational efficiency during model adaptation, a class of techniques known as Parameter-Efficient Fine-Tuning (PEFT) has been developed. PEFT methods aim to fine-tune large pre-trained models by updating only a small subset of parameters, thereby reducing memory usage and computational demands while preserving model performance.
Among the most widely adopted PEFT approaches are Low-Rank Adaptation (LoRA) and adapter modules. LoRA introduces low-rank matrix decompositions into the weight update process, allowing fine-tuning with significantly fewer trainable parameters. This method reduces the memory footprint and accelerates training, making it particularly suitable for deployment in resource-constrained environments. Adapter layers, by contrast, are lightweight trainable components inserted into a frozen pre-trained network. They capture task-specific knowledge while leaving the majority of the model parameters unchanged, offering a modular and extensible approach to transfer learning.
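To make the idea concrete, here is a toy sketch of a LoRA-style layer in PyTorch: the pre-trained weight matrix W stays frozen while a low-rank product B·A is trained. The class name, dimensions, and scaling convention are simplifications for illustration, not any specific library's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: y = x (W + B A)^T with W frozen and A, B trainable."""

    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        # Frozen "pre-trained" weight (random here, purely for illustration).
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Low-rank trainable correction: B starts at zero so training begins
        # from the unmodified pre-trained behavior.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)


layer = LoRALinear(in_features=512, out_features=512, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8192 trainable parameters versus 262,144 in the frozen W
```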
Both LoRA and adapter-based methods are particularly valuable in scenarios where rapid model adaptation is required but full retraining is computationally infeasible, such as in personalized applications or real-time deployment settings. Additionally, LoRA-based fine-tuning can be applied sequentially, allowing a model to be incrementally adapted to a series of tasks without the need to retrain or modify the entire network at each step.
This approach significantly reduces computational costs and speeds up adaptation, making it ideal for scenarios where new data is available but full retraining is impractical. However, if the underlying data distribution undergoes a significant shift, incremental updates may no longer suffice. In such cases, retraining the entire model becomes necessary to ensure that it continues to generalize well to new data. Striking a balance between selective fine-tuning and full retraining is essential for maintaining both efficiency and accuracy in dynamic environments.
Putting it all together
It is important to recognize that the online learning methods discussed above are not mutually exclusive. In practical applications, hybrid approaches are frequently employed to leverage the strengths of multiple techniques. One prominent example is the Passive-Aggressive (PA) algorithm, which dynamically adjusts its learning behavior based on the model’s performance.
In this framework, the model behaves passively—applying minimal or no updates—when its predictions are accurate or when errors are negligible. However, when a significant misclassification occurs, the algorithm responds aggressively, applying a more substantial update, often delegated to a server with greater computational capacity. This dual-mode strategy allows the model to remain efficient during normal operation while still addressing substantial errors through more intensive updates when necessary.
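For reference, a minimal sketch of the classical Passive-Aggressive (PA-I) update for binary classification is shown below; the on-device versus server-side split described above is omitted, and the example stream is illustrative.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) update for binary classification.

    y must be +1 or -1. If the hinge loss is zero, the model stays passive;
    otherwise it takes the smallest step that fixes the mistake, capped by C.
    """
    loss = max(0.0, 1.0 - y * np.dot(w, x))
    if loss == 0.0:
        return w                        # passive: prediction was good enough
    tau = min(C, loss / np.dot(x, x))   # aggressive but bounded step size
    return w + tau * y * x


w = np.zeros(4)
stream = [(np.array([1.0, 0.0, 2.0, 1.0]), +1),
          (np.array([0.5, 1.0, 0.0, 1.0]), -1)]
for x, y in stream:
    w = pa_update(w, x, y)
print(w)
```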
Such blended approaches are especially valuable in applications requiring real-time responsiveness, including fraud detection, adaptive recommendation systems, and speech recognition. In these contexts, minor adjustments can be executed efficiently on-device, while larger corrections are offloaded to server-side infrastructure, maintaining a balance between adaptability and resource management.
Furthermore, online learning is a highly effective approach for addressing the cold-start problem, which occurs when a model is initially deployed with insufficient data. In such cases, the model begins with a simple or generic framework and progressively refines its parameters as new data becomes available. This incremental learning process enables the model to adapt over time, gradually capturing the complexities of the underlying data distribution and leading to improved predictive performance.
Potential Shortcoming of Online Learning: Catastrophic forgetting
A significant limitation of online learning is the risk of catastrophic forgetting, a phenomenon in which a model loses previously acquired knowledge as it continuously updates with new data. This issue stems from the sequential nature of online learning, where data is processed one instance (or a very small number of instances) at a time without access to the entire training dataset. Without appropriate mitigation strategies, online learning models are prone to overwriting earlier representations, potentially resulting in substantial performance degradation on previously learned tasks.
Catastrophic forgetting is particularly problematic in non-stationary environments where older patterns remain relevant even as new ones emerge. For example, in fraud detection, a model trained only on recent fraudulent patterns might fail to recognize older fraud techniques that still pose a threat. Similarly, in robotics and reinforcement learning, an agent that continuously adapts to new tasks may lose proficiency in previously learned tasks.
To mitigate catastrophic forgetting, several strategies can be employed:
Replay Mechanisms (Experience Replay) – Storing a subset of past data and periodically reintroducing it during training helps maintain knowledge of older patterns (a minimal buffer sketch follows this list).
Regularization Techniques – Methods like Elastic Weight Consolidation (EWC) prevent drastic changes to important model parameters, preserving previously acquired knowledge.
Progressive Learning & Multi-Model Approaches – Instead of modifying a single model, new data can be incorporated using an ensemble of models or modular architectures.
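As an illustration of the first strategy, here is a minimal sketch of a replay buffer; the capacity, sample size, and stand-in data stream are arbitrary choices for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer holding a sample of past examples for rehearsal."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest items are evicted first

    def add(self, example):
        self.buffer.append(example)

    def sample(self, k):
        # Random mix of older examples to interleave with new data.
        return random.sample(list(self.buffer), min(k, len(self.buffer)))


buffer = ReplayBuffer(capacity=500)
for new_example in range(2000):        # stand-in for an incoming data stream
    buffer.add(new_example)
    rehearsal = buffer.sample(32)      # replayed alongside the new example
    # ... update the model on [new_example] + rehearsal ...
```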
Balancing adaptation to new data while retaining previous knowledge remains one of the biggest challenges in online learning, requiring careful design choices depending on the application.
Conclusion
Online learning plays a pivotal role in machine learning by enabling models to dynamically adapt as new data arrives. Depending on the application, updates can be performed either on-device, utilizing fully online learning, or on a server through background training.
When designing an online learning system, it is essential to assess the trade-offs between computational efficiency and update frequency carefully. On-device learning provides real-time adaptability but is constrained by memory and processing limitations. Conversely, server-based updates allow for more sophisticated model refinements but introduce challenges such as latency, bandwidth costs, and computational overhead.
Achieving the right balance between these factors ensures that the model remains efficient, up-to-date, and scalable for real-world applications. As data streams continue to grow in volume and velocity, online learning is becoming increasingly vital in modern machine learning systems. Integrating these techniques into data-driven projects can significantly enhance adaptability and performance.