Zhipu AI launches high-speed version of GLM-5.1 API; claims 400 tokens/s processing speed sets new global benchmark for large models

ref · May 22, 2026, 8:31am

On May 22, Zhipu AI launched its flagship high-speed model API, ‘GLM-5.1-highspeed’, which is currently available only to select enterprise clients. The model reaches an output speed of 400 tokens per second, a figure Zhipu claims sets a new global benchmark for API speed among large language model providers. Whereas ‘high speed’ in the industry typically implies reduced capabilities, the high-speed version of GLM-5.1 is said to retain all the reasoning and coding capabilities of the standard GLM-5.1 model. It also supports a 200K context window and a maximum output length of 128K tokens, marking the first time a domestic Chinese LLM has combined flagship-level performance with ultra-low latency at production level.

Technically, this high-speed version was jointly developed by Zhipu’s GLM team and TileRT team. A key breakthrough came from the TileRT inference engine, which uses static compilation orchestration and tile-level microtask scheduling to eliminate redundant overhead, pushing utilization close to hardware limits. At the system level, dynamic batching and KV cache scheduling were implemented to minimize tail latency, and cluster and network configurations were further optimized so that 400 tokens/s is a stable, production-grade capability rather than just a peak value. Empirical tests indicate that code generation is roughly ten times faster than with regular models, and complex web code can now be generated in under 30 seconds. In Agent Swarm scenarios, up to 50 distinct AI agents can be deployed simultaneously. The API currently targets latency-sensitive applications such as AI-powered programming, real-time interaction, business decision-making, and instant voice processing. Pricing has not been disclosed; enterprises must apply for access through Zhipu’s BigModel Open Platform.
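
To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch of what a steady 400 tokens/s would mean for generation time. It assumes a constant decode rate, ignores time to first token and network overhead, and reads "128K" as 128,000 tokens; the function names are illustrative only, and the 40 tokens/s baseline is simply one tenth of the claimed rate, not a number from the announcement.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` output tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def tokens_within(seconds: float, tokens_per_second: float) -> int:
    """Output tokens that fit in a time budget at a steady decode rate."""
    if seconds < 0 or tokens_per_second <= 0:
        raise ValueError("seconds must be >= 0 and rate must be positive")
    return int(seconds * tokens_per_second)


CLAIMED_RATE = 400.0         # tokens/s, as stated in the announcement
ASSUMED_BASELINE = 40.0      # tokens/s, hypothetical: one tenth of the claim
MAX_OUTPUT_TOKENS = 128_000  # assumption: "128K" taken as 128,000 tokens

if __name__ == "__main__":
    # A full-length response at the claimed rate.
    print(generation_seconds(MAX_OUTPUT_TOKENS, CLAIMED_RATE))  # 320.0

    # Output that fits in the "under 30 seconds" window.
    budget = tokens_within(30, CLAIMED_RATE)
    print(budget)  # 12000

    # The same output at the assumed tenfold-slower baseline.
    print(generation_seconds(budget, ASSUMED_BASELINE))  # 300.0
```

Under these assumptions, a 30-second window corresponds to at most about 12,000 output tokens, and a maximum-length response would still take a little over five minutes to stream.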

IT Home
