On May 20, Sapient Intelligence (Singapore) and its collaborators from MIT released the HRM-Text paper, simultaneously open-sourcing the model weights and full training framework on GitHub and HuggingFace. Built upon the Hierarchical Reasoning Model (HRM) architecture, HRM-Text replaces the standard Transformer’s single forward pass with a dual-timescale recursive design: higher-level modules handle slow, abstract planning while lower-level modules perform rapid, fine-grained computations; both layers execute nested recursion within a continuous latent space before producing outputs, effectively enabling nearly infinite computational depth despite a fixed parameter count. The XL variant, featuring 1 billion parameters, was trained on roughly 40 billion effective tokens over 46 hours using two nodes equipped with 16 H100 GPUs at a cost of approximately $1,472, achieving scores of 60.7% on MMLU, 84.5% on GSM8K, 81.9% on ARC-Challenge, and 56.2% on MATH. Meanwhile, the L (TRM) variant with 600 million parameters required just about $800 in training expenses yet outperformed numerous 3-billion-parameter models trained via conventional Transformers across multiple downstream benchmarks.
In terms of training data, HRM-Text utilized merely around 40 billion structured tokens—merely 1/100th to 1/1000th of the volume used in mainstream pre-training runs involving 4 trillion to 36 trillion tokens—resulting in computational savings of 130 to 600 times and data consumption reductions of 150 to 900 times. The open-source release includes tools for data extraction, PrefixLM sequence packing, PyTorch FSDP2-based distributed training, FlashAttention 3 kernels, and checkpoint conversion utilities; currently, it is compatible solely with Hopper-architecture GPUs such as H100 and H800. It’s worth noting that the available weights represent pure pre-trained models lacking instruction tuning or RLHF alignment, thus supporting only prefix continuation capabilities; after quantization into int4 format, the model size drops to roughly 0.6 GiB, making local deployment feasible. The creators position this initiative as tangible proof of ‘democratizing foundational model pre-training,’ given that prior to now, such large-scale pre-training efforts remained largely exclusive to elite research labs due to prohibitive computational costs.