Community releases RuneScape-Bench: Gemini 3.5 Flash scores 5.4 overall in game agent tests

Developer Max Bittker has released RuneScape-Bench (RuneBench), an open-source benchmark designed to evaluate AI agents' multi-step planning, tool use, and code generation. Agents must autonomously complete quests in a simulated environment based on the classic version of RuneScape. The simulation server runs at 8x speed; agents control their characters with TypeScript scripts and plan their strategies by consulting Wiki documentation. Scores are based on the best rate of experience points earned per 15-second interval. Community testing results show that Gemini 3.5 Flash achieved an overall score of 5.4, outperforming GPT-5.5, GPT-5.4, and Claude Opus 4.7 across multiple early-game skill categories. A detailed per-skill comparison table is also available on the results page.
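The exact scoring formula is not given here, so the following is only a minimal sketch of one possible reading of "best rate of experience per 15-second interval". It assumes cumulative XP is sampled once per interval and that the best single-interval gain is what counts. The names `XpSamples`, `xpGainPerInterval`, and `peakXpRate` are hypothetical and are not taken from the RuneBench source.

```typescript
// Hypothetical sketch: none of these names come from RuneBench itself.
// Assumption: cumulative XP is recorded at time zero and then at the end
// of every 15-second interval.

/** Cumulative experience points, one sample per interval boundary. */
type XpSamples = number[];

/** XP gained within each interval: the difference between consecutive samples. */
function xpGainPerInterval(samples: XpSamples): number[] {
  const gains: number[] = [];
  for (let i = 1; i < samples.length; i++) {
    gains.push(samples[i] - samples[i - 1]);
  }
  return gains;
}

/** Largest single-interval XP gain; 0 when fewer than two samples exist. */
function peakXpRate(samples: XpSamples): number {
  const gains = xpGainPerInterval(samples);
  return gains.length === 0 ? 0 : Math.max(...gains);
}

// Example: gains per interval are 40, 60, 30, and 120.
const samples: XpSamples = [0, 40, 100, 130, 250];
console.log(peakXpRate(samples)); // 120
```

Under this reading a single strong interval sets the score, which would reward agents that find an efficient training loop even after a slow start. The benchmark's actual aggregation may differ.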

The idea behind RuneScape-Bench is to use open-world games as a real-world stress test of agent capabilities, a departure from conventional benchmarks built around code completion or multiple-choice tasks. Nevertheless, some community members remain skeptical about how representative the results are, arguing that the benchmark has not yet received sufficient peer review. The leaderboard and per-skill data for each model are publicly available at maxbittker.github.io; the source code is hosted on GitHub, so anyone can integrate new models and run the tests independently.

GitHub: maxbittker/runebench