DeepSeek-OCR: Is Optical Compression the Future of Long-Context?
DeepSeek-OCR: Contexts Optical Compression, Mike’s daily paper review: 25.10.25
Today’s paper tackles a fundamental bottleneck of long-context language models: the compute cost of self-attention grows quadratically with sequence length, so attention becomes prohibitively expensive as context windows grow. This paper asks a radical question: why process text as text at all? What if we could render a 10K-token essay as an image that encodes to only 1K vision tokens and have a model read that instead?
This is the core idea of “optical compression.” DeepSeek-OCR is a proof-of-concept designed to explore the feasibility and boundaries of this idea. The central mathematical concept is the “compression ratio”: the number of ground-truth text tokens divided by the number of vision tokens the model uses. The entire paper is a methodological exploration of how high this ratio can be pushed before the signal is lost.
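To make the definition concrete, here is a minimal sketch of the ratio as described above. The function name is my own, and the numbers are the illustrative 10K-to-1K example from the introduction, not results from the paper.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ground-truth text tokens divided by the vision tokens used to encode them."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Illustrative example from the introduction: a 10K-token essay read
# through 1K vision tokens.
print(compression_ratio(10_000, 1_000))
```

A ratio of 10 here means each vision token has to carry, on average, the information of ten text tokens.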
The novelty lies almost entirely in the DeepEncoder, a new vision encoder architecture purpose-built for this task. It’s not a standard Vision Transformer; it’s a clever, three-stage cascade designed to manage computation at high resolutions.
Here’s how it works:
Stage 1: High-Res Perception (Local Attention). First, the high-resolution document image (e.g., 1024x1024) is fed into a SAM-base model. This stage relies predominantly on windowed (local) attention, which is a critical choice: it can process the image at full resolution and generate a large number of initial patch tokens (4,096 for a 1024x1024 input with 16x16 patches) without the quadratic memory explosion of global attention. Its job is to handle the raw visual perception cheaply.
Stage 2: The Bottleneck (Convolutional Compression). Next, the 4,096 tokens from the SAM block are passed through a simple 2-layer convolutional module. This acts as a hard, attention-free compressor: its weights are learned, but it involves no attention at all. It cuts the token count by 16x (4x along each spatial dimension), from 4,096 down to just 256. This is the “optical compression” step made concrete.
Stage 3: Knowledge Extraction (Global Attention). Only after this aggressive compression are the remaining 256 tokens fed into a CLIP-large model. This component uses dense global attention, allowing it to perform the “deep understanding” and reasoning. Because it only sees 256 tokens, this computationally brutal step becomes manageable.
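The token counts in the three stages can be checked with the standard convolution output-size formula. This is a sketch under assumptions: the review only says "2-layer convolutional module" and "16x", so the kernel size 3, stride 2, padding 1 and the 16-pixel patch size used here are my assumed values, chosen because they reproduce the 4,096 to 256 reduction.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output side length of a 2D convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def tokens_after_compressor(image_side: int = 1024, patch: int = 16, layers: int = 2) -> int:
    """Token count after patchifying and applying `layers` stride-2 convolutions.

    patch=16 and the conv hyperparameters are assumptions, not taken from the review.
    """
    side = image_side // patch          # 1024 // 16 = 64 patches per side
    for _ in range(layers):
        side = conv_out(side)           # 64 -> 32 -> 16
    return side * side


print((1024 // 16) ** 2)                # patch tokens entering the compressor
print(tokens_after_compressor())        # tokens handed to the global-attention stage
```

Each stride-2 layer halves the grid along both axes, so two layers give the 4x-per-dimension, 16x-overall reduction described in Stage 2.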
This serial, hybrid-attention design is the main architectural takeaway. It intelligently separates the task: cheap local attention handles the high-bandwidth pixel data, a convolutional bottleneck forces compression, and expensive global attention works only on the resulting low-bandwidth latent representation.
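A rough count of query-key pairs shows why the ordering matters. This is a back-of-the-envelope sketch, not a profile of the real model: the 16x16-token window is a hypothetical value picked for clean arithmetic, and the count ignores heads, layers, and hidden size.

```python
def global_attention_pairs(num_tokens: int) -> int:
    """Every token attends to every token: cost grows as N^2."""
    return num_tokens * num_tokens


def windowed_attention_pairs(num_tokens: int, window_side: int) -> int:
    """Every token attends only to the tokens inside its own window."""
    return num_tokens * window_side * window_side


HIGH_RES_TOKENS = 4096    # tokens before the convolutional bottleneck
COMPRESSED_TOKENS = 256   # tokens after the bottleneck
WINDOW_SIDE = 16          # hypothetical window size, in tokens per side

print("windowed, 4096 tokens:", windowed_attention_pairs(HIGH_RES_TOKENS, WINDOW_SIDE))
print("global,   4096 tokens:", global_attention_pairs(HIGH_RES_TOKENS))
print("global,    256 tokens:", global_attention_pairs(COMPRESSED_TOKENS))
```

Under these assumptions, running global attention after the bottleneck instead of before it cuts the pair count by a factor of 256 (16 squared), which is the whole point of putting the compressor in the middle.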
The other half of the system is the decoder, a 3B Mixture-of-Experts model (DeepSeek3B-MoE). Its novelty is not in its architecture but in its task. It is trained to be a “decompressor”: it takes the compressed vision tokens from the encoder (256 in the example above) and reconstructs the original, multi-thousand-token text sequence. The paper investigates whether such a non-linear “decompression” mapping can even be learned, especially by a relatively small model.
The final piece of methodological novelty is the multi-resolution training. To explicitly test the limits of the compression ratio, the model is trained on several input modes simultaneously.
Native Modes (Tiny, Small, Base, Large): These exist mainly to probe the compression limits. The Tiny (512x512) and Small (640x640) modes downscale the original image, which degrades the data. This forces the model to learn to decode text from blurry, low-information inputs, directly testing the boundaries of high compression. The Base and Large modes pad the image instead, preserving its content for more practical, high-fidelity use cases.
Dynamic Modes (Gundam, Gundam-Master): These are the modes designed for practical performance. They use a tiling method that combines n local-view tiles with one global-view image. This “secondary windowing” allows the model to process ultra-high-resolution documents (like newspapers) that wouldn’t fit in memory even with the DeepEncoder’s cascade.
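The vision-token budget of each mode follows from the same arithmetic as the encoder cascade. The sketch below assumes 16-pixel patches and the 16x compression described earlier; the tile and global-view resolutions passed to `tiled_tokens` are hypothetical placeholders, since this review does not state them.

```python
PATCH = 16         # assumed patch size in pixels
COMPRESSION = 16   # token reduction of the convolutional bottleneck


def vision_tokens(side: int) -> int:
    """Vision tokens produced for a square input of `side` pixels."""
    return (side // PATCH) ** 2 // COMPRESSION


def tiled_tokens(n_tiles: int, tile_side: int, global_side: int) -> int:
    """Budget for a tiled mode: n local-view tiles plus one global view."""
    return n_tiles * vision_tokens(tile_side) + vision_tokens(global_side)


# Resolutions mentioned in this review: Tiny, Small, and the 1Kx1K example.
for name, side in [("Tiny", 512), ("Small", 640), ("1Kx1K example", 1024)]:
    print(name, side, vision_tokens(side))

# Hypothetical tiled configuration: 4 tiles at 640px plus a 1024px global view.
print("tiled:", tiled_tokens(n_tiles=4, tile_side=640, global_side=1024))
```

The takeaway is that tiling scales the token budget linearly with the number of tiles, rather than feeding one enormous image through the encoder.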
In short, DeepSeek-OCR isn’t just another OCR model. It’s a structured investigation into the “channel capacity” of vision as a medium for text. The novel encoder architecture is the tool it uses to probe this question, efficiently managing computation while forcing text information through a compressed visual bottleneck.
https://arxiv.org/abs/2510.18234


Thanks for writing this, it clarifies a lot. Radical idea, good exploration.