The Limits of Mathematical Reasoning in Large Language Models
Illustration of the GSM-Symbolic template creation process. This dataset serves as a tool to investigate the presumed reasoning capabilities of LLMs, enabling the design of controllable mathematical reasoning evaluations with more reliable metrics. Our results reveal that all state-of-the-art LLMs exhibit significant performance variations, suggesting fragility or a lack of genuine reasoning.
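To make the template idea concrete, the sketch below shows one way a symbolic template could generate controlled question variants: proper names and numeric values are treated as placeholders, sampled within constraints, and the ground-truth answer is recomputed for each instance. The template wording, placeholder names, and value ranges here are illustrative assumptions, not the actual GSM-Symbolic format.

```python
import random

# Minimal sketch of a symbolic question template (illustrative only; the
# wording, name pool, and value ranges are assumptions, not the paper's
# actual template specification).
TEMPLATE = (
    "{name} picked {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples did {name} pick in total?"
)

NAMES = ["Sophia", "Liam", "Ava"]  # hypothetical name pool


def generate_instance(seed: int) -> dict:
    """Sample placeholder values and recompute the ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    x = rng.randint(2, 50)  # value ranges act as sampling constraints
    y = rng.randint(2, 50)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y  # answer derived symbolically from the sampled values
    return {"question": question, "answer": answer}


if __name__ == "__main__":
    for s in range(3):
        inst = generate_instance(s)
        print(inst["question"], "->", inst["answer"])
```

Because every variant shares the same underlying reasoning structure while surface details change, comparing a model's accuracy across such variants gives a controlled measure of how sensitive its answers are to superficial changes in names and numbers.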