On May 22, X user @cozyblaze265065 posted unofficial benchmark results for multi-digit multiplication. With no external tools, GPT-5.5 set to ‘medium reasoning’ was given 400 multiplication problems, each involving two 20-digit numbers, with 7 samples per problem, and reached 99.46% accuracy; errors occurred only in a very small number of cases involving extremely large numbers. The accompanying heatmap shows that the range over which the model stays accurate is far wider under ‘medium reasoning’ than under lower reasoning settings, which suggests that a larger chain-of-thought budget substantially improves accuracy in multi-digit arithmetic. AI researcher Raphaël Millière retweeted the post, commenting, ‘I still occasionally hear people claim LLMs can’t do arithmetic at all; this once again reminds us that we’re no longer in 2022.’ The test was run independently by a community member and is not an official OpenAI benchmark, but its clear methodology and noteworthy results have attracted widespread attention.
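For readers who want to try a similar measurement, here is a minimal sketch of how such a test could be scored. It is not the original poster's harness, which the post does not describe in detail. The names `make_problems`, `parse_int`, `score`, and `ask_model` are hypothetical, and `exact_model` is a stand-in that would be replaced by a real call to whichever model is under test. The sketch assumes accuracy is counted per sample, against Python's exact integer product.

```python
import random


def make_problems(count, digits, seed=0):
    """Generate `count` pairs of random integers with exactly `digits` decimal digits."""
    rng = random.Random(seed)
    low, high = 10 ** (digits - 1), 10 ** digits
    return [(rng.randrange(low, high), rng.randrange(low, high)) for _ in range(count)]


def parse_int(text):
    """Parse a model reply as an integer, ignoring commas and spaces; None if not a number."""
    cleaned = text.replace(",", "").replace(" ", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def score(problems, ask_model, samples_per_problem):
    """Return the fraction of samples whose answer equals the exact product."""
    correct = 0
    total = 0
    for a, b in problems:
        expected = a * b  # Python ints are arbitrary precision, so this is exact
        for _ in range(samples_per_problem):
            total += 1
            if parse_int(ask_model(a, b)) == expected:
                correct += 1
    return correct / total


def exact_model(a, b):
    """Hypothetical stand-in for a model call; always answers correctly."""
    return str(a * b)


problems = make_problems(count=400, digits=20)
accuracy = score(problems, exact_model, samples_per_problem=7)
```

Under this per-sample assumption, 400 problems at 7 samples each give 2,800 attempts, and 99.46% would correspond to roughly 15 wrong answers. That figure is an inference from the reported numbers, not something stated in the post.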