Anthropic Releases Opus 4.7: Four Core Upgrades, Token Costs Up 35%
Anthropic has officially released Opus 4.7, its strongest publicly available Claude model to date. (A higher-spec model, Mythos, remains internal and is not yet open to the public.)
This release focuses on four areas:

- Task self-checking: after finishing a long task, the model automatically validates its output before returning a result, significantly reducing hallucination rates.
- Token budget control: you can set a token ceiling, and the model autonomously allocates the share spent on thinking versus tool calls, avoiding wasted consumption.
- Adaptive thinking depth: reasoning time adjusts dynamically to task complexity, with no manual configuration required.
- High-resolution image input: native support for high-definition images.

Note: Opus 4.7 uses a new tokenizer, so the same content consumes roughly 35% more tokens than on the previous generation. Re-evaluate your cost budget before adopting it.
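For context on the budget-control feature: Anthropic's current Messages API already exposes an explicit thinking budget (`thinking.budget_tokens`) alongside the overall `max_tokens` cap. The announcement suggests Opus 4.7 can split a single cap on its own, but assuming the request shape stays the same (an assumption, and the model id `claude-opus-4-7` below is a guess, not a confirmed identifier), a capped request might look like:

```python
# Sketch of a token-budget-capped request body, assuming Opus 4.7 keeps
# the Messages API's existing extended-thinking shape. The model id is
# hypothetical; only the field names mirror the current API.
request = {
    "model": "claude-opus-4-7",     # hypothetical model id
    "max_tokens": 16_000,           # hard ceiling on the whole response
    "thinking": {
        "type": "enabled",
        "budget_tokens": 8_000,     # portion the model may spend thinking
    },
    "messages": [{"role": "user", "content": "Refactor this module."}],
}

# The thinking budget must fit inside the overall cap; the remainder is
# available for visible output and tool calls.
assert request["thinking"]["budget_tokens"] < request["max_tokens"]
```

Under the announced behavior, the thinking/tool-call split inside the cap would be chosen by the model rather than hard-coded by the caller.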
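The ~35% tokenizer overhead compounds directly into spend, since billing is per token. A minimal back-of-envelope re-estimate (the token volume and per-million-token price below are placeholders, not Anthropic's published rates):

```python
# Re-estimate monthly spend under the ~35% tokenizer overhead mentioned
# in the announcement. All figures here are illustrative placeholders.

TOKENIZER_OVERHEAD = 1.35  # same content -> ~35% more tokens on Opus 4.7

def reestimate_monthly_cost(tokens_per_month: int,
                            price_per_mtok: float) -> tuple[float, float]:
    """Return (old_cost, new_cost) given a per-million-token price."""
    old_cost = tokens_per_month / 1_000_000 * price_per_mtok
    new_cost = old_cost * TOKENIZER_OVERHEAD
    return old_cost, new_cost

old, new = reestimate_monthly_cost(tokens_per_month=200_000_000,
                                   price_per_mtok=15.0)  # placeholder price
print(f"before: ${old:,.2f}/mo  after: ${new:,.2f}/mo")
# → before: $3,000.00/mo  after: $4,050.00/mo
```

Any per-token price change announced alongside the model would multiply on top of this, so the overhead factor alone understates or overstates the real delta.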
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | Mythos Preview |
|---|---|---|---|---|---|
| Agentic coding (SWE-bench Pro) | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| Agentic coding (SWE-bench Verified) | 87.6% | 80.8% | — | 80.6% | 93.9% |
| Agentic terminal coding (Terminal-Bench 2.0) | 69.4% | 65.4% | 75.1% (self-reported harness) | 68.5% | 82.0% |
| Multidisciplinary reasoning (Humanity's Last Exam, no tools) | 46.9% | 40.0% | 42.7% (no tools, Pro) | 44.4% | 56.8% |
| Multidisciplinary reasoning (Humanity's Last Exam, with tools) | 54.7% | 53.3% | 58.7% (with tools, Pro) | 51.4% | 64.7% |
| Agentic search (BrowseComp) | 79.3% | 83.7% | 89.3% (Pro) | 85.9% | 86.9% |
| Scaled tool use (MCP-Atlas) | 77.3% | 75.8% | 68.1% | 73.9% | — |
| Agentic computer use (OSWorld-Verified) | 78.0% | 72.7% | 75.0% | — | 79.6% |
| Agentic financial analysis (Finance Agent v1.1) | 64.4% | 60.1% | 61.5% (Pro) | 59.7% | — |
| Cybersecurity vulnerability reproduction (CyberGym) | 73.1% | 73.8% | 66.3% | — | 83.1% |
| Graduate-level reasoning (GPQA Diamond) | 94.2% | 91.3% | 94.4% (Pro) | 94.3% | 94.6% |
| Visual reasoning (CharXiv Reasoning, no tools) | 82.1% | 69.1% | — | — | 86.1% |
| Visual reasoning (CharXiv Reasoning, with tools) | 91.0% | 84.7% | — | — | 93.2% |
| Multilingual Q&A (MMMLU) | 91.5% | 91.1% | — | 92.6% | — |

Opus 4.7 improves on 4.6 across the board, with two exceptions: it slips slightly on BrowseComp (search) and CyberGym (cybersecurity). Scaled tool use (MCP-Atlas) is the only benchmark where Opus 4.7 clearly leads every model with a reported score, which speaks to the value of the new budget-control mechanism. Mythos is the strongest on nearly every benchmark where it has numbers, but the many "—" entries strongly suggest its results were published selectively.