Anthropic has officially released Opus 4.7, the most capable publicly available Claude model to date. (A higher-spec model, Mythos, remains internal and is not yet open to the public.)
This release focuses on four areas:

- Task self-verification: after completing a long task, the model automatically validates its output before returning results, significantly reducing hallucination rates.
- Token budget control: a token cap can be set, within which the model autonomously allocates resources between thinking and tool calls, avoiding wasted consumption.
- Adaptive thinking depth: reasoning time adjusts dynamically to task complexity, with no manual configuration required.
- High-resolution image input: native support for high-definition images.
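As a rough illustration of how a token cap might be enforced on the client side, the sketch below tracks a shared budget across thinking and tool-call phases. All names here (`TokenBudget`, `spend`, `remaining`) are hypothetical; this note does not describe the actual API surface of Opus 4.7's budget control.

```python
# Hypothetical client-side sketch of a shared token cap split between
# "thinking" and "tool_calls" phases. Names and the usage pattern are
# illustrative only, not the actual Opus 4.7 API.

class TokenBudget:
    def __init__(self, cap: int):
        self.cap = cap
        self.spent = {"thinking": 0, "tool_calls": 0}

    def spend(self, phase: str, tokens: int) -> bool:
        """Record usage for a phase; return False if the cap would be exceeded."""
        if self.total_spent() + tokens > self.cap:
            return False
        self.spent[phase] += tokens
        return True

    def total_spent(self) -> int:
        return sum(self.spent.values())

    def remaining(self) -> int:
        return self.cap - self.total_spent()


budget = TokenBudget(cap=10_000)
budget.spend("thinking", 6_000)         # accepted
budget.spend("tool_calls", 3_000)       # accepted
ok = budget.spend("tool_calls", 2_000)  # rejected: would exceed the cap
print(ok, budget.remaining())           # False 1000
```

The point of the sketch is the shared cap: thinking and tool calls draw from one pool rather than two fixed quotas, which matches the release's claim that the model decides the split itself.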
Note: Opus 4.7 uses a new tokenizer, so the same content consumes roughly 35% more tokens than the previous generation. Re-evaluate your cost budget before adopting it.
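To make the ~35% tokenizer overhead concrete, here is a back-of-the-envelope cost adjustment. The 1.35 multiplier comes from the note above; the per-million-token price in the example is a placeholder, not an actual Anthropic rate.

```python
# Rough cost re-estimate under the new tokenizer: the same content now
# consumes ~35% more tokens, so scale the old token count by 1.35.

TOKENIZER_OVERHEAD = 1.35  # ~35% more tokens, per the release note

def estimate_new_cost(old_tokens: int, price_per_mtok: float) -> float:
    """Project the cost of the same workload after migrating to Opus 4.7."""
    new_tokens = old_tokens * TOKENIZER_OVERHEAD
    return new_tokens / 1_000_000 * price_per_mtok

# Example: a workload of 100M tokens/month at a placeholder $15/MTok
print(estimate_new_cost(100_000_000, 15.0))  # 2025.0 (vs 1500.0 before)
```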
Benchmarks
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | Mythos Preview |
| --- | --- | --- | --- | --- | --- |
| Agentic coding (SWE-bench Pro) | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| Agentic coding (SWE-bench Verified) | 87.6% | 80.8% | — | 80.6% | 93.9% |
| Agentic terminal coding (Terminal-Bench 2.0) | 69.4% | 65.4% | 75.1% (self-reported harness) | 68.5% | 82.0% |
| Multidisciplinary reasoning (Humanity's Last Exam, no tools) | 46.9% | 40.0% | 42.7% (no tools Pro) | 44.4% | 56.8% |
| Multidisciplinary reasoning (Humanity's Last Exam, with tools) | 54.7% | 53.3% | 58.7% (with tools Pro) | 51.4% | 64.7% |
| Agentic search (BrowseComp) | 79.3% | 83.7% | 89.3% (Pro) | 85.9% | 86.9% |
| Scaled tool use (MCP-Atlas) | 77.3% | 75.8% | 68.1% | 73.9% | — |
| Agentic computer use (OSWorld-Verified) | 78.0% | 72.7% | 75.0% | — | 79.6% |
| Agentic financial analysis (Finance Agent v1.1) | 64.4% | 60.1% | 61.5% (Pro) | 59.7% | — |
| Cybersecurity vulnerability reproduction (CyberGym) | 73.1% | 73.8% | 66.3% | — | 83.1% |
| Graduate-level reasoning (GPQA Diamond) | 94.2% | 91.3% | 94.4% (Pro) | 94.3% | 94.6% |
| Visual reasoning (CharXiv Reasoning, no tools) | 82.1% | 69.1% | — | — | 86.1% |
| Visual reasoning (CharXiv Reasoning, with tools) | 91.0% | 84.7% | — | — | 93.2% |
| Multilingual Q&A (MMMLU) | 91.5% | 91.1% | — | 92.6% | — |
Opus 4.7 improves on 4.6 almost across the board, yet slips slightly on two benchmarks: BrowseComp (search) and CyberGym (cybersecurity). Scaled tool use (MCP-Atlas) is the only benchmark where Opus 4.7 clearly leads every competitor, reflecting the value of the new budget-control mechanism. Mythos is the strongest on nearly every benchmark with reported numbers, but with so many cells marked "—", the disclosure looks distinctly selective.