The Qwen team at Alibaba has released Qwen3.7-Max, a new flagship model built for the era of AI agents, which will soon be available through the Alibaba Cloud Bailian API. Positioned as an “all-purpose agent foundation,” the model targets three scenarios: programming (from frontend prototyping to complex multi-file projects), office productivity (via MCP integration and multi-agent collaborative workflows), and long-horizon autonomous execution. On SWE-Verified it scores 80.4, roughly on par with Claude Opus 4.6 (80.8) and DeepSeek V4 Pro (80.6), though slightly below both. It posts higher scores on several specialized benchmarks: GPQA Diamond (92.4 versus 91.3 for Claude Opus 4.6), Terminal Bench 2.0-Terminus (69.7 versus 67.9 for DeepSeek V4 Pro), and MCP-Atlas (76.4 versus 75.8 for Claude Opus 4.6). The team emphasizes that these results were obtained across multiple frameworks, including Claude Code, OpenClaw, and Qwen Code, rather than tuned for any single one, which it presents as evidence of genuine cross-framework generalization.
To demonstrate long-horizon autonomy, the team presented three real-world case studies. The most notable is kernel optimization for SGLang's Extend Attention operators: on the previously unseen T-Head (Pingtouge) Zhenwu M890 PPU hardware platform, Qwen3.7-Max made 1,158 tool calls over roughly 35 hours and completed 432 kernel evaluations. It ended with a geometric mean speedup of 10.0x over the Triton reference implementation; under identical conditions, GLM 5.1 reached 7.3x, Kimi K2.6 5.0x, DeepSeek V4 Pro 3.3x, and Qwen3.6-Plus only 1.1x. In a second case, the model monitored reinforcement learning training for more than 80 hours, autonomously writing 13 heuristic rules and flagging 1,618 instances of reward hacking. In a third, simulating the operation of a startup in YC-Bench, it grew annual revenue to $2.08 million, nearly double the $1.05 million achieved by its predecessor, Qwen3.6-Plus.
On the training side, the team decoupled each training instance into three orthogonal components: the task, the execution framework, and the validator. Because any task can then be paired with any framework, reinforcement learning can be run across frameworks, which pushes the model toward broadly applicable problem-solving strategies rather than shortcuts tied to a specific framework.
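The decoupling of training instances into tasks, execution frameworks, and validators can be pictured with a small sketch. The team's actual data format is not described, so every name here (`Task`, `Harness`, `Validator`, `TrainingInstance`, `build_instances`) is hypothetical, and pairing every validator with every task is a simplifying assumption made only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    """What the agent is asked to do, independent of any framework."""
    task_id: str
    prompt: str


@dataclass(frozen=True)
class Harness:
    """Hypothetical stand-in for an execution framework (tools, prompt format)."""
    name: str


@dataclass(frozen=True)
class Validator:
    """Scores a final output without knowing which harness produced it."""
    name: str
    check: Callable[[str], float]


@dataclass(frozen=True)
class TrainingInstance:
    task: Task
    harness: Harness
    validator: Validator


def build_instances(tasks: List[Task],
                    harnesses: List[Harness],
                    validators: List[Validator]) -> List[TrainingInstance]:
    """Combine the three components freely (assumed: every combination is valid)."""
    return [TrainingInstance(t, h, v)
            for t, h, v in product(tasks, harnesses, validators)]


# Illustrative placeholder data, not taken from the source.
tasks = [Task("fix-bug", "Make the failing test pass."),
         Task("add-cli", "Add a --verbose flag.")]
harnesses = [Harness("harness-a"), Harness("harness-b"), Harness("harness-c")]
validators = [Validator("non-empty", lambda output: 1.0 if output.strip() else 0.0)]

instances = build_instances(tasks, harnesses, validators)
print(len(instances))  # 2 tasks x 3 harnesses x 1 validator
```

Because the same task shows up under several harnesses with the same validator, a reward can only be earned consistently by behavior that does not depend on the quirks of one harness, which is the intuition behind the cross-framework training described above.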
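The kernel case study reports a geometric mean speedup rather than a plain average. A minimal sketch of that statistic follows, assuming each per-kernel speedup is the baseline latency divided by the optimized latency; the source does not spell out how the individual speedups were measured, and the function name and latencies below are made up for illustration.

```python
import math
from typing import Sequence


def geometric_mean_speedup(baseline_ms: Sequence[float],
                           optimized_ms: Sequence[float]) -> float:
    """Geometric mean of per-kernel speedups (baseline / optimized)."""
    if not baseline_ms or len(baseline_ms) != len(optimized_ms):
        raise ValueError("need two non-empty latency lists of equal length")
    # Averaging in log space keeps one very large speedup from dominating.
    log_ratios = [math.log(b / o) for b, o in zip(baseline_ms, optimized_ms)]
    return math.exp(sum(log_ratios) / len(log_ratios))


# Made-up latencies giving per-kernel speedups of 2x, 4x, and 8x.
baseline = [8.0, 8.0, 8.0]
optimized = [4.0, 2.0, 1.0]
print(geometric_mean_speedup(baseline, optimized))
```

For these three kernels the geometric mean is about 4x, while the arithmetic mean of the same speedups is about 4.67x. The geometric mean is the usual choice for ratios because it is less sensitive to a single outlier kernel.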