Anthropic has officially released Opus 4.7, the most capable publicly available Claude model to date. (A higher-spec model, Mythos, remains internal and is not yet open to the public.)
This release focuses on four areas:

- Task self-verification: after completing a long task, the model automatically validates its output before returning results, significantly reducing hallucination rates.
- Token budget control: a token cap can be set, within which the model autonomously allocates resources between thinking and tool calls, avoiding wasted consumption.
- Adaptive thinking depth: reasoning time adjusts dynamically to task complexity, with no manual configuration required.
- High-resolution image input: native support for high-definition images.
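As a rough illustration of how a token cap might be enforced on the client side, the sketch below tracks a shared budget across thinking and tool-call phases. All names here (`TokenBudget`, `spend`, `remaining`) are hypothetical; this note does not describe the actual API surface of Opus 4.7's budget control.

```python
# Hypothetical client-side sketch of a shared token cap split between
# "thinking" and "tool_calls" phases. Names and the usage pattern are
# illustrative only, not the actual Opus 4.7 API.

class TokenBudget:
    def __init__(self, cap: int):
        self.cap = cap
        self.spent = {"thinking": 0, "tool_calls": 0}

    def spend(self, phase: str, tokens: int) -> bool:
        """Record usage for a phase; return False if the cap would be exceeded."""
        if self.total_spent() + tokens > self.cap:
            return False
        self.spent[phase] += tokens
        return True

    def total_spent(self) -> int:
        return sum(self.spent.values())

    def remaining(self) -> int:
        return self.cap - self.total_spent()


budget = TokenBudget(cap=10_000)
budget.spend("thinking", 6_000)         # accepted
budget.spend("tool_calls", 3_000)       # accepted
ok = budget.spend("tool_calls", 2_000)  # rejected: would exceed the cap
print(ok, budget.remaining())           # False 1000
```

The point of the sketch is the shared cap: thinking and tool calls draw from one pool rather than two fixed quotas, which matches the release's claim that the model decides the split itself.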
Note: Opus 4.7 uses a new tokenizer, so the same content consumes roughly 35% more tokens than the previous generation. Re-evaluate your cost budget before adopting it.
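To make the ~35% tokenizer overhead concrete, here is a back-of-the-envelope cost adjustment. The 1.35 multiplier comes from the note above; the per-million-token price in the example is a placeholder, not an actual Anthropic rate.

```python
# Rough cost re-estimate under the new tokenizer: the same content now
# consumes ~35% more tokens, so scale the old token count by 1.35.

TOKENIZER_OVERHEAD = 1.35  # ~35% more tokens, per the release note

def estimate_new_cost(old_tokens: int, price_per_mtok: float) -> float:
    """Project the cost of the same workload after migrating to Opus 4.7."""
    new_tokens = old_tokens * TOKENIZER_OVERHEAD
    return new_tokens / 1_000_000 * price_per_mtok

# Example: a workload of 100M tokens/month at a placeholder $15/MTok
print(estimate_new_cost(100_000_000, 15.0))  # 2025.0 (vs 1500.0 before)
```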
Benchmarks
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro | Mythos Preview |
| --- | --- | --- | --- | --- | --- |
| Agentic coding (SWE-bench Pro) | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| Agentic coding (SWE-bench Verified) | 87.6% | 80.8% | — | 80.6% | 93.9% |
| Agentic terminal coding (Terminal-Bench 2.0) | 69.4% | 65.4% | 75.1% (self-reported harness) | 68.5% | 82.0% |
| Multidisciplinary reasoning (Humanity's Last Exam, no tools) | 46.9% | 40.0% | 42.7% (no tools Pro) | 44.4% | 56.8% |
| Multidisciplinary reasoning (Humanity's Last Exam, with tools) | 54.7% | 53.3% | 58.7% (with tools Pro) | 51.4% | 64.7% |
| Agentic search (BrowseComp) | 79.3% | 83.7% | 89.3% (Pro) | 85.9% | 86.9% |
| Scaled tool use (MCP-Atlas) | 77.3% | 75.8% | 68.1% | 73.9% | — |
| Agentic computer use (OSWorld-Verified) | 78.0% | 72.7% | 75.0% | — | 79.6% |
| Agentic financial analysis (Finance Agent v1.1) | 64.4% | 60.1% | 61.5% (Pro) | 59.7% | — |
| Cybersecurity vulnerability reproduction (CyberGym) | 73.1% | 73.8% | 66.3% | — | 83.1% |
| Graduate-level reasoning (GPQA Diamond) | 94.2% | 91.3% | 94.4% (Pro) | 94.3% | 94.6% |
| Visual reasoning (CharXiv Reasoning, no tools) | 82.1% | 69.1% | — | — | 86.1% |
| Visual reasoning (CharXiv Reasoning, with tools) | 91.0% | 84.7% | — | — | 93.2% |
| Multilingual Q&A (MMMLU) | 91.5% | 91.1% | — | 92.6% | — |
Opus 4.7 improves on 4.6 almost across the board, yet slips slightly on two benchmarks: BrowseComp (search) and CyberGym (cybersecurity). Scaled tool use (MCP-Atlas) is the only benchmark where Opus 4.7 clearly leads every competitor, reflecting the value of the new budget-control mechanism. Mythos is the strongest on nearly every benchmark with reported numbers, but with so many cells marked "—", the disclosure looks distinctly selective.