Anthropic Releases Opus 4.7: Four Core Upgrades, Token Costs Up 35%
Anthropic has officially released Opus 4.7, its strongest publicly available Claude model to date. (A higher-spec model, Mythos, remains internal and is not yet open to the public.)
This release focuses on four areas:

- Task self-checking: after finishing a long task, the model automatically validates its output before returning a result, significantly reducing hallucination rates.
- Token budget control: you can set a token ceiling, and the model autonomously allocates the share spent on thinking versus tool calls, avoiding wasted consumption.
- Adaptive thinking depth: reasoning time adjusts dynamically to task complexity, with no manual configuration required.
- High-resolution image input: native support for high-definition images.

Note: Opus 4.7 uses a new tokenizer, so the same content consumes roughly 35% more tokens than on the previous generation. Re-evaluate your cost budget before adopting it.
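For context on the budget-control feature: Anthropic's current Messages API already exposes an explicit thinking budget (`thinking.budget_tokens`) alongside the overall `max_tokens` cap. The announcement suggests Opus 4.7 can split a single cap on its own, but assuming the request shape stays the same (an assumption, and the model id `claude-opus-4-7` below is a guess, not a confirmed identifier), a capped request might look like:

```python
# Sketch of a token-budget-capped request body, assuming Opus 4.7 keeps
# the Messages API's existing extended-thinking shape. The model id is
# hypothetical; only the field names mirror the current API.
request = {
    "model": "claude-opus-4-7",     # hypothetical model id
    "max_tokens": 16_000,           # hard ceiling on the whole response
    "thinking": {
        "type": "enabled",
        "budget_tokens": 8_000,     # portion the model may spend thinking
    },
    "messages": [{"role": "user", "content": "Refactor this module."}],
}

# The thinking budget must fit inside the overall cap; the remainder is
# available for visible output and tool calls.
assert request["thinking"]["budget_tokens"] < request["max_tokens"]
```

Under the announced behavior, the thinking/tool-call split inside the cap would be chosen by the model rather than hard-coded by the caller.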
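The ~35% tokenizer overhead compounds directly into spend, since billing is per token. A minimal back-of-envelope re-estimate (the token volume and per-million-token price below are placeholders, not Anthropic's published rates):

```python
# Re-estimate monthly spend under the ~35% tokenizer overhead mentioned
# in the announcement. All figures here are illustrative placeholders.

TOKENIZER_OVERHEAD = 1.35  # same content -> ~35% more tokens on Opus 4.7

def reestimate_monthly_cost(tokens_per_month: int,
                            price_per_mtok: float) -> tuple[float, float]:
    """Return (old_cost, new_cost) given a per-million-token price."""
    old_cost = tokens_per_month / 1_000_000 * price_per_mtok
    new_cost = old_cost * TOKENIZER_OVERHEAD
    return old_cost, new_cost

old, new = reestimate_monthly_cost(tokens_per_month=200_000_000,
                                   price_per_mtok=15.0)  # placeholder price
print(f"before: ${old:,.2f}/mo  after: ${new:,.2f}/mo")
# → before: $3,000.00/mo  after: $4,050.00/mo
```

Any per-token price change announced alongside the model would multiply on top of this, so the overhead factor alone understates or overstates the real delta.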
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | Mythos Preview |
|---|---|---|---|---|---|
| Agentic coding (SWE-bench Pro) | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| Agentic coding (SWE-bench Verified) | 87.6% | 80.8% | — | 80.6% | 93.9% |
| Agentic terminal coding (Terminal-Bench 2.0) | 69.4% | 65.4% | 75.1% (self-reported harness) | 68.5% | 82.0% |
| Multidisciplinary reasoning (Humanity's Last Exam, no tools) | 46.9% | 40.0% | 42.7% (no tools, Pro) | 44.4% | 56.8% |
| Multidisciplinary reasoning (Humanity's Last Exam, with tools) | 54.7% | 53.3% | 58.7% (with tools, Pro) | 51.4% | 64.7% |
| Agentic search (BrowseComp) | 79.3% | 83.7% | 89.3% (Pro) | 85.9% | 86.9% |
| Scaled tool use (MCP-Atlas) | 77.3% | 75.8% | 68.1% | 73.9% | — |
| Agentic computer use (OSWorld-Verified) | 78.0% | 72.7% | 75.0% | — | 79.6% |
| Agentic financial analysis (Finance Agent v1.1) | 64.4% | 60.1% | 61.5% (Pro) | 59.7% | — |
| Cybersecurity vulnerability reproduction (CyberGym) | 73.1% | 73.8% | 66.3% | — | 83.1% |
| Graduate-level reasoning (GPQA Diamond) | 94.2% | 91.3% | 94.4% (Pro) | 94.3% | 94.6% |
| Visual reasoning (CharXiv Reasoning, no tools) | 82.1% | 69.1% | — | — | 86.1% |
| Visual reasoning (CharXiv Reasoning, with tools) | 91.0% | 84.7% | — | — | 93.2% |
| Multilingual Q&A (MMMLU) | 91.5% | 91.1% | — | 92.6% | — |

Opus 4.7 improves on 4.6 across the board, with two exceptions: it slips slightly on BrowseComp (search) and CyberGym (cybersecurity). Scaled tool use (MCP-Atlas) is the only benchmark where Opus 4.7 clearly leads every model with a reported score, which speaks to the value of the new budget-control mechanism. Mythos is the strongest on nearly every benchmark where it has numbers, but the many "—" entries strongly suggest its results were published selectively.