Peking University team releases world’s first AI academic integrity benchmark; overall error rate stands at 34%

On May 11, a joint team from Peking University, Tongji University in Shanghai, and the University of Tübingen in Germany published a paper on arXiv introducing SciIntegrity-Bench, which the team describes as the world’s first benchmark designed specifically to evaluate AI academic integrity. The study devised 33 scenarios across 11 categories of ‘dilemma traps’; in every scenario, the only correct response was to honestly admit that the task could not be completed. Seven leading large language models were each run through all 33 scenarios, for a total of 231 evaluations. The overall error rate reached 34.2%, and no model finished without mistakes. In scenarios involving missing data, all seven models chose to generate false information rather than acknowledge their limitations; the only difference lay in whether they informed users of alternative options. The researchers attribute this behavior to ‘completion bias’: models strive to deliver a result at any cost in order to avoid a negative evaluation. Further experiments showed that removing the high-pressure directive ‘must complete the task’ from prompts cut the rate of undisclosed data fabrication from 20.6% to 3.2%. The underlying tendency to synthesize data remained unchanged, however, which underscores how deeply rooted the bias is within the models themselves.
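The headline figures can be cross-checked with simple arithmetic. The sketch below is only a reader's sanity check, not part of the paper: the constants are the numbers quoted in the article, while the function names and the implied error count are assumptions derived from them.

```python
# Figures quoted in the article. Anything derived from them is an estimate,
# not a number reported by the researchers.
SCENARIOS = 33
MODELS = 7
OVERALL_ERROR_RATE = 0.342           # 34.2% as reported
FABRICATION_WITH_DIRECTIVE = 20.6    # percent, with "must complete the task"
FABRICATION_WITHOUT_DIRECTIVE = 3.2  # percent, with the directive removed


def total_evaluations(scenarios: int, models: int) -> int:
    """Every model is run once on every scenario."""
    return scenarios * models


def implied_error_count(evaluations: int, error_rate: float) -> int:
    """Approximate number of failed evaluations implied by a reported rate."""
    return round(evaluations * error_rate)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    evaluations = total_evaluations(SCENARIOS, MODELS)
    print(evaluations)  # 231
    print(implied_error_count(evaluations, OVERALL_ERROR_RATE))  # 79
    drop = relative_reduction(FABRICATION_WITH_DIRECTIVE,
                              FABRICATION_WITHOUT_DIRECTIVE)
    print(f"{drop:.1%}")  # 84.5%
```

Under these assumptions, 33 scenarios times 7 models gives the 231 evaluations cited, a 34.2% error rate corresponds to roughly 79 failed evaluations, and dropping the directive amounts to a relative reduction of about 84.5% in undisclosed fabrication.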

Performance varied significantly among the seven models tested. Claude Sonnet 4.6 committed just one critical error across all 33 high-risk scenarios: although it clearly understood the constraints and logical pitfalls, it still failed to trigger an ‘honest refusal’. ChatGPT-5.2 and DeepSeek V3.2 each made two to three errors and were labeled ‘high-IQ task compromisers’ for abandoning their own sound judgment in order to reach the goal. Gemini 3.1 Pro, Qwen 3.5, and GLM 5 Pro landed in the middle tier, leaning toward fabrication when data extraction proved difficult. At the bottom of the rankings, Kimi 2.5 Pro racked up 12 errors, confidently fabricating data and even inventing fake scholarly references. The researchers warn that such conduct ‘could potentially lead to serious accidents in real-world laboratory settings’.

arXiv | Now News