

Poster

CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective

Jiayu Liu · Zhenya Huang · Wei Dai · Cheng Cheng · Jinze Wu · Jing Sha · Song Li · Qi Liu · Shijin Wang · Enhong Chen

Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose \textbf{CogMath}, which comprehensively assesses LLMs' mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into three stages: \emph{problem comprehension}, \emph{problem solving}, and \emph{solution summarization}. To carry out a scientific evaluation within each stage, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals to design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an ``\emph{Inquiry}-\emph{Judge}-\emph{Reference}'' multi-agent system to generate inquiries that assess LLMs' mastery of that dimension. An LLM is considered to truly master a problem only when it excels in all inquiries from the 9 dimensions. By applying CogMath to three benchmarks that cover the full K-12 math curriculum, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30\%-40\%. Moreover, we locate their strengths and weaknesses across specific stages and dimensions, offering in-depth insights to further enhance their reasoning abilities.
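To make the mastery criterion described in the abstract concrete, here is a minimal sketch of how per-dimension pass/fail judgments could be aggregated into an "authentic" accuracy score. It assumes a simple boolean result per dimension; the data layout, function names, and toy values are illustrative assumptions, not the paper's actual agent implementation or evaluation code.

```python
# Minimal sketch of the all-dimension mastery criterion (assumed data layout):
# a problem counts as truly mastered only when the LLM passes the inquiries
# from all 9 evaluation dimensions. Dimension ordering here is a placeholder.

NUM_DIMENSIONS = 9


def masters_problem(dimension_passes: list[bool]) -> bool:
    """True only if the model passed the inquiry of every dimension."""
    assert len(dimension_passes) == NUM_DIMENSIONS
    return all(dimension_passes)


def authentic_accuracy(all_results: list[list[bool]]) -> float:
    """Fraction of problems truly mastered, as opposed to plain answer accuracy."""
    if not all_results:
        return 0.0
    return sum(masters_problem(r) for r in all_results) / len(all_results)


if __name__ == "__main__":
    # Toy example: both problems were answered correctly, but the second fails
    # one dimension's inquiry, so authentic accuracy drops to 0.5 rather than 1.0.
    results = [[True] * 9, [True] * 8 + [False]]
    print(authentic_accuracy(results))  # -> 0.5
```

This stricter aggregation is what drives the reported gap: a model can score high on raw answer accuracy yet fail inquiries in individual dimensions, lowering its authentic accuracy.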
