Researchers at DeepSeek have released a new experimental AI model, V3.2-exp, designed to dramatically lower inference costs in long-context operations. The model was announced Monday via Hugging Face, accompanied by an academic paper posted on GitHub.
At the core of the release is DeepSeek Sparse Attention, a new attention mechanism built around two modules. The first, a “lightning indexer,” prioritizes the most relevant excerpts from the context window. The second, a “fine-grained token selection system,” picks specific tokens from within those excerpts to load into the model’s limited attention window. Together, the two modules let the model process long inputs with comparatively modest server load.
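For readers who want a feel for the mechanics, the sketch below illustrates the general shape of an index-then-select sparse attention scheme: a cheap scoring pass over coarse blocks of the context, followed by a fine-grained top-k selection of tokens, with full attention computed only over the survivors. This is a simplified illustration under assumptions, not DeepSeek's implementation; the scoring rules, block size, and token budget here are placeholders chosen for clarity.

```python
import numpy as np

def sparse_attention_sketch(q, keys, values, block_size=64, top_k_tokens=256):
    """Illustrative index-then-select sparse attention for a single query.

    Stage 1: a lightweight "indexer" scores fixed-size blocks of the context.
    Stage 2: a fine-grained pass picks the top-k tokens from the best blocks.
    Stage 3: ordinary softmax attention runs over only those selected tokens.
    All scoring heuristics below are placeholders, not DeepSeek's method.
    """
    n, d = keys.shape

    # Stage 1: cheap block-level scores (dot product with each block's mean key).
    block_scores = []
    for start in range(0, n, block_size):
        block = keys[start:start + block_size]
        block_scores.append(q @ block.mean(axis=0))
    block_scores = np.array(block_scores)

    # Keep only the most promising blocks (top quarter, an arbitrary budget).
    n_keep = max(1, len(block_scores) // 4)
    kept_blocks = np.argsort(block_scores)[-n_keep:]

    # Stage 2: fine-grained token selection within the kept blocks.
    candidate_idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n))
        for b in kept_blocks
    ])
    token_scores = keys[candidate_idx] @ q
    top = candidate_idx[np.argsort(token_scores)[-top_k_tokens:]]

    # Stage 3: standard softmax attention over only the selected tokens.
    logits = keys[top] @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[top]

# Example: a 4,096-token context attended through a 256-token budget.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
keys = rng.standard_normal((4096, 64))
values = rng.standard_normal((4096, 64))
out = sparse_attention_sketch(q, keys, values)
print(out.shape)  # (64,)
```

The point of the pattern is that the expensive softmax runs over a few hundred selected tokens rather than the full context, which is where the reduced server demands in long-context serving would come from.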
Preliminary testing suggests the system could cut API costs in half for long-context scenarios. While more robust third-party testing is needed, the model is open-weight and freely available, making independent verification likely in the near term.
The breakthrough adds to a growing set of innovations targeting the cost of inference, that is, the ongoing expense of serving a trained model, as distinct from the upfront cost of training it. DeepSeek’s approach focuses on making the core transformer architecture itself run more efficiently.
DeepSeek, based in China, has positioned itself as an unconventional player in the global AI race. Earlier this year, it gained attention with its R1 model, trained primarily through reinforcement learning at a fraction of U.S. competitors’ costs. While R1 did not deliver the predicted upheaval in AI training, V3.2-exp may influence how providers approach the persistent problem of inference expenses.
Though unlikely to generate the same level of debate as R1, DeepSeek’s sparse attention method could still offer practical lessons for U.S. and global AI developers struggling with long-context efficiency.

