Somewhere on the upper floors of MIT's Stata Center, a graduate student named Makram Chahine has been working on a problem that has plagued the AI community for years, one whose seriousness nobody really wants to acknowledge. Training a large model is extremely costly. Everyone knows that.
The hardware, the electricity, the months of compute time on clusters that hum like industrial freezers: none of it is sustainable, yet the field keeps paying because the alternatives have been worse. Either you train something massive and then shave it down afterward, or you train something small and accept that it won't be as capable. There hasn't really been a third option.
| Field | Detail |
|---|---|
| Technique Name | CompreSSM |
| Lead Author | Makram Chahine, PhD candidate, EECS |
| Senior Author | Daniela Rus, Director of CSAIL |
| Institutions Involved | MIT CSAIL, Max Planck Institute for Intelligent Systems, ELLIS, ETH Zurich, Liquid AI |
| Architecture Targeted | State-space models (including Mamba) |
| Compression Stage | During training (after ~10% of training steps) |
| Mathematical Tool | Hankel singular values, drawn from control theory |
| Reported Speedup | Up to 4x on Mamba; 1.5x on image benchmarks |
| Benchmark Highlight | 85.7% accuracy on CIFAR-10 at 1/4 state dimension |
| Comparison Baselines | Pruning, knowledge distillation, Hankel nuclear norm regularization |
| Year Announced | 2026 |
Chahine and his colleagues recently published a technique called CompreSSM, an attempt at that third option. The idea, put simply, is to let the model decide, while it is still learning, which parts of itself are doing useful work and which are essentially riding along for free. Then, surprisingly early in training, the parts that aren't useful are cut, and the rest of training proceeds at the pace of a considerably smaller model. It sounds almost too neat to be true, and anyone who has worked in machine learning will immediately assume there is a hidden cost somewhere. The researchers, drawing on methods from control theory, contend that there isn't.
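In outline, the workflow looks something like the sketch below. It illustrates the idea rather than the team's code; the helper names (`build_ssm`, `train_steps`, `state_importance`, `truncate_states`) and the specific numbers are hypothetical placeholders.

```python
# Illustrative sketch of during-training compression (not the authors' implementation).
# All helpers here are hypothetical: build_ssm, train_steps, state_importance, truncate_states.

total_steps = 100_000
model = build_ssm(state_dim=128)                 # start at the full state dimension

# Train normally for roughly the first 10% of steps, while importance rankings settle.
train_steps(model, num_steps=total_steps // 10)

# Score each state dimension by how much it contributes (CompreSSM's criterion is described below).
scores = state_importance(model)

# Cut the low-importance dimensions and finish training as a much smaller model.
keep = scores.argsort()[::-1][:12]               # e.g. keep the 12 most important states
model = truncate_states(model, keep)
train_steps(model, num_steps=total_steps - total_steps // 10)
```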
The trick lies in a mathematical quantity called the Hankel singular value, which measures how much each internal state of a model contributes to its overall input-output behavior. These values come from a field that usually deals with chemical plants and aircraft rather than language models, and they turn out to stabilize remarkably early, after about 10 percent of training. That part surprised the team. Once the rankings settle, the low-importance dimensions can be removed with little impact; rather than being trimmed back later by a separate process, the model essentially grows into its smaller, faster self.
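For a single linear state-space layer with matrices A, B, and C, the Hankel singular values come from the controllability and observability Gramians, a textbook computation in control theory. The snippet below is a minimal NumPy/SciPy sketch of that computation for a small discrete-time system; it shows what the quantity is, not how CompreSSM evaluates it inside a large network.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hankel_singular_values(A, B, C):
    """Hankel singular values of a stable discrete-time system x[k+1] = A x[k] + B u[k], y[k] = C x[k]."""
    # Controllability Gramian P solves  A P A^T - P + B B^T = 0
    P = solve_discrete_lyapunov(A, B @ B.T)
    # Observability Gramian Q solves  A^T Q A - Q + C^T C = 0
    Q = solve_discrete_lyapunov(A.T, C.T @ C)
    # The Hankel singular values are the square roots of the eigenvalues of P @ Q.
    eigs = np.linalg.eigvals(P @ Q).real
    return np.sort(np.sqrt(np.clip(eigs, 0.0, None)))[::-1]

# Tiny example: a random stable system with 8 internal states.
rng = np.random.default_rng(0)
A = 0.9 * np.linalg.qr(rng.standard_normal((8, 8)))[0]   # orthogonal matrix scaled inside the unit circle
B = rng.standard_normal((8, 1))
C = rng.standard_normal((1, 8))
print(hankel_singular_values(A, B, C))  # states with tiny values are candidates for removal
```

States whose Hankel singular values sit near zero barely influence the input-output map, which is exactly what makes them safe to drop.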

It's difficult to ignore the numbers. A compressed model operating at roughly a quarter of its original state dimension reached 85.7 percent accuracy on CIFAR-10, while a model trained at that smaller size from the beginning managed only 81.8 percent. The team reports training speedups of about 4x on Mamba, one of the more popular state-space architectures at the moment, shrinking a 128-dimensional model to about 12 dimensions while keeping competitive performance. CompreSSM also ran more than 40 times faster than a recent regularization method that does something similar in spirit, because that method requires costly eigenvalue computations at every gradient step.
It is the kind of result the AI sector seems to have been quietly waiting for. The standard way of shrinking models, knowledge distillation, trains a large teacher and then a smaller student on top of it; it roughly doubles the training expense and has always felt like duct tape, a workaround rather than a fix. CompreSSM is different: it folds compression into the learning process itself, closer to how biological systems actually develop. Daniela Rus, the senior author, described it as a radically different approach to developing AI, and that isn't just press-release rhetoric; the supporting math, which uses Weyl's theorem to show that state importance evolves smoothly over training, lends the claim real weight.
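For readers who want the flavor of that argument, here is one plausible way a Weyl-type bound gives smoothness (a sketch, with notation that is mine rather than the paper's): the squared Hankel singular values are the eigenvalues of a symmetric matrix built from the two Gramians, and Weyl's inequality says a small perturbation of a symmetric matrix moves each eigenvalue by at most the size of the perturbation.

```latex
% Sketch of the smoothness intuition; notation is illustrative, not taken from the paper.
% Let $M_t = Q_t^{1/2} P_t Q_t^{1/2}$, a symmetric positive semidefinite matrix whose
% eigenvalues are the squared Hankel singular values at training step $t$.
% Weyl's inequality for symmetric matrices gives
\[
  \bigl|\,\lambda_i(M_{t+1}) - \lambda_i(M_t)\,\bigr| \;\le\; \bigl\lVert M_{t+1} - M_t \bigr\rVert_2 ,
\]
% so if a gradient step perturbs the Gramians only slightly, every state's importance
% score moves only slightly, and the ranking cannot jump around between steps.
```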
Whether industry adopts it quickly remains to be seen. State-space models occupy a smaller share of the market than transformers, and most of the money still flows to the larger architectures. But the appetite for anything that cuts compute is obvious. It's hard not to notice that in the same week researchers were celebrating leaner training, large labs elsewhere were ordering more GPUs. I'm still getting used to the contrast.
For now, CompreSSM sits in the stage where promising research often lives: published, tested, intriguing, but not yet wired into the production pipelines that would make it matter at scale. It's still unclear whether the method will generalize as smoothly to other architectures, or whether some hidden brittleness will surface once it leaves the benchmark suite. Researchers tend to be optimistic about their own work; the field tends to be skeptical, usually with good reason. Either way, the core finding, that a model can work out what to discard before it has even finished learning, feels like the kind of small idea that travels farther than anyone expects.
