Loading...



Illustration of the GSM-Symbolic template creation process. This dataset serves as a tool to investigate the presumed reasoning capabilities of LLMs, enabling the design of controllable mathematical reasoning evaluations with more reliable metrics. Our results reveal that all state-of the-art LLMs exhibit significant performance variations, suggesting the fragility or lack of reasoning.