On May 22, Zhipu AI launched its flagship high-speed model API, ‘GLM-5.1-highspeed’, which is currently available only to select enterprise clients. The model reaches an output speed of 400 tokens per second, a figure Zhipu claims sets a new global benchmark for API speed among large language model providers. In the industry, ‘high speed’ usually implies reduced capability; the high-speed version of GLM-5.1, by contrast, fully retains the reasoning and coding capabilities of the standard GLM-5.1 model. It also supports a 200K context window and a maximum output length of 128K tokens. This marks the first time a domestic Chinese LLM has combined flagship-level performance with ultra-low latency at production level.
On the technical side, the high-speed version was jointly developed by Zhipu’s GLM team and TileRT team. The key breakthrough came from the TileRT inference engine, which uses static compilation orchestration and Tile-level microtask scheduling to eliminate redundant overhead and push hardware utilization close to its limit. At the system level, dynamic batching and KV cache scheduling were introduced to minimize tail latency, and further optimizations to the cluster and network configuration ensure that 400 tokens/s is a stable, production-grade speed rather than a peak value. Empirical tests indicate that code generation is roughly ten times more efficient than with regular models: complex web code can now be generated in under 30 seconds. In Agent Swarm scenarios, up to 50 distinct AI agents can run simultaneously. For now, the API targets latency-sensitive applications such as AI-powered programming, real-time interaction, business decision-making, and instant voice processing. Pricing has not been disclosed; enterprises must apply for access through Zhipu’s BigModel Open Platform.
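
For a sense of scale, the quoted figures can be turned into rough timings. The sketch below is illustrative arithmetic only, not a benchmark or a call to Zhipu's API. It assumes that ‘128K’ means 128,000 tokens, that the 400 tokens/s rate is sustained for the whole response, and that time to first token and network overhead are ignored. The implied baseline speed further assumes that the roughly tenfold gain comes entirely from output speed, which the announcement does not state. All names in the code are hypothetical.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` output tokens at a constant decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


HIGHSPEED_TPS = 400          # output speed quoted in the announcement
MAX_OUTPUT_TOKENS = 128_000  # assumption: "128K" read as 128,000 tokens
SPEEDUP = 10                 # "roughly tenfold", as quoted

# Upper bound on tokens that fit in the quoted 30-second window.
tokens_in_30s = HIGHSPEED_TPS * 30

# Time to emit a maximum-length response at the quoted rate.
full_output_seconds = generation_seconds(MAX_OUTPUT_TOKENS, HIGHSPEED_TPS)

# Assumption: the tenfold gain is purely a decode-speed gain.
implied_baseline_tps = HIGHSPEED_TPS / SPEEDUP

print(f"Tokens in 30 s at {HIGHSPEED_TPS} tok/s: {tokens_in_30s}")
print(f"Seconds for a maximum-length output: {full_output_seconds:.0f}")
print(f"Implied baseline speed: {implied_baseline_tps:.0f} tok/s")
```

Under these assumptions, 30 seconds corresponds to at most 12,000 output tokens, and a maximum-length response would take a little over five minutes to stream.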