♟️ ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

arXiv preprint 2025

Abstract

Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills—particularly complex strategic reasoning—or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess requires complex strategic reasoning capabilities including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework where LLMs play against each other under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard, and can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs are evaluated in ChessArena under different modes, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), while some even fail to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.

Overview

ChessArena is a competitive arena for Large Language Models (LLMs) built on model-versus-model gameplay. By having LLMs play against each other like humans, we can evaluate their strategic reasoning, instruction compliance, and multi-turn conversational memory—capabilities that play a crucial role in today's LLMs. Following Lichess, we offer four distinct play modes, allowing models to engage in diverse matches. Furthermore, to assess the gameplay abilities of models more comprehensively, we designed three fine-grained evaluation tasks: Basic Understanding, Move Selection, and Puzzle Solving. We found that many models cannot even defeat a random player, and no model has been able to surpass Maia-1100 (a human-like chess engine), indicating significant room for improvement in the strategic reasoning of LLMs. By distilling data from powerful models in ChessArena and applying post-training (SFT+RL) to Qwen3-8B, we established a strong baseline model within ChessArena. In summary, we have launched the ChessArena benchmark—a chess platform that supports LLMs competing against each other under various settings. To submit your model, please use the interface below.

Key Features

♟️ Competition Sampling

An automated matchmaking system based on the Glicko formula that pairs players of similar skill, so ratings converge faster.
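For illustration, a minimal pairing sketch in Python is given below; the greedy nearest-neighbor heuristic and the data layout are our own assumptions, not ChessArena's exact sampler.

```python
def pair_candidates(players, max_gap=200.0):
    """Greedy nearest-neighbor pairing (illustrative sketch, not
    ChessArena's exact sampler): match players whose ratings are
    within `max_gap` points so game outcomes stay informative."""
    pool = sorted(players, key=lambda p: p["rating"])
    pairs, i = [], 0
    while i + 1 < len(pool):
        a, b = pool[i], pool[i + 1]
        if abs(a["rating"] - b["rating"]) <= max_gap:
            pairs.append((a["name"], b["name"]))
            i += 2
        else:
            i += 1  # no close opponent; skip this player for now
    return pairs

players = [
    {"name": "GPT-4.1 (Blitz)", "rating": 1686},
    {"name": "DeepSeek-V3 (Blitz)", "rating": 1553},
    {"name": "Random Player", "rating": 1524},
    {"name": "Qwen3-8B (Blitz)", "rating": 1335},
]
print(pair_candidates(players))
# [('Qwen3-8B (Blitz)', 'Random Player'), ('DeepSeek-V3 (Blitz)', 'GPT-4.1 (Blitz)')]
```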

📊 Glicko-1 Rating System

Robust rating calculation system for comprehensive model evaluation and comparison across different configurations.
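The update itself follows Glickman's published Glicko-1 formulas; the Python below is a faithful transcription of those equations for a single rating period (rating-period scheduling and RD inflation over time are omitted).

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant

def g(rd):
    """Attenuation factor for an opponent's rating deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)

def expected(r, r_j, rd_j):
    """Expected score against opponent (r_j, rd_j)."""
    return 1.0 / (1.0 + 10 ** (-g(rd_j) * (r - r_j) / 400.0))

def glicko1_update(r, rd, results):
    """One Glicko-1 rating period. `results` is a list of
    (opponent_rating, opponent_rd, score) with score in {1, 0.5, 0}.
    Returns the new (rating, RD)."""
    d2_inv = sum(
        Q**2 * g(rd_j)**2 * expected(r, r_j, rd_j) * (1 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in results
    )
    denom = 1.0 / rd**2 + d2_inv
    delta = (Q / denom) * sum(
        g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in results
    )
    return r + delta, math.sqrt(1.0 / denom)

# A win against a stronger opponent raises the rating and shrinks the RD.
print(glicko1_update(1500, 200, [(1600, 80, 1.0)]))
```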

🎯 Fine-grained Evaluation

Three specialized tasks: basic understanding, move selection, and puzzle solving to identify specific failure modes.

🏋️ Training Pipeline

Complete training framework with SFT and RL stages, including trained models and datasets on HuggingFace.

Four Game Modes

We introduce four distinct game modes with different evaluation focuses. Each mode presents unique challenges and is suitable for different types of models:

⚡ Bullet Mode

Lightning Fast

Models receive the current board state (FEN) and must output moves (UCI or SAN) directly without any thinking process.

Suitable for: Non-thinking models

⚡⚡ Blitz Mode

Quick Thinking

Models receive the current board state and are allowed to think before outputting moves.

Suitable for: Non-thinking models

🎯 Standard Mode

Deep Thinking

Models receive the current board state and can perform long Chain-of-Thought (CoT) reasoning before outputting moves.

Suitable for: Thinking models

🎭 Blindfold Mode

Most Challenging

Move history is provided through multi-turn dialogue. Models must reconstruct the board from move history and think before outputting moves.

Suitable for: Both thinking and non-thinking models

These four game modes have different evaluation focuses, with Blindfold being the most difficult mode as it requires models to maintain mental board representation.
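As a concrete illustration of the board-state protocol, the sketch below validates a single Bullet-style turn with the python-chess library; the prompt wording and the helper function are illustrative assumptions, not ChessArena's actual interface.

```python
import chess

def apply_model_move(board, reply):
    """Try the model's reply as UCI first, then SAN; return True if
    the move was legal and applied to the board."""
    reply = reply.strip()
    try:
        move = chess.Move.from_uci(reply)
        if move in board.legal_moves:
            board.push(move)
            return True
    except ValueError:
        pass  # not valid UCI; fall through to SAN parsing
    try:
        board.push_san(reply)  # raises ValueError if illegal/unparseable
        return True
    except ValueError:
        return False

board = chess.Board()
# In Bullet mode the prompt is essentially just the FEN:
prompt = f"You are White. Board (FEN): {board.fen()}. Reply with a single move."
print(apply_model_move(board, "e2e4"))  # True (UCI)
print(apply_model_move(board, "Nf6"))   # True (SAN, Black's reply)
```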

Leaderboard Results

Through more than 800 games, we established a comprehensive leaderboard evaluating various LLMs. Key findings include:

  • Thinking models generally outperform non-thinking models
  • All models are inferior to Maia-1100 (a chess-specific model)
  • Our Qwen3-8B-Chess is a relatively strong baseline within ChessArena
| Rank | Model | Mode | Rating | RD | 95% CI | Games |
|------|-------|------|--------|----|--------|-------|
| 1 | Maia-1100 | - | 2220 | 82 | (2058, 2382) | 44 |
| 2 | O3 | Standard | 1948 | 78 | (1793, 2101) | 28 |
| 3 | Doubao-Seed-1-6-Thinking | Standard | 1830 | 50 | (1729, 1929) | 60 |
| 4 | Gemini-2.5-Pro | Standard | 1819 | 81 | (1659, 1979) | 18 |
| 5 | Qwen3-8B-Chess (baseline) | Blitz | 1776 | 93 | (1593, 1959) | 16 |
| 6 | Doubao-Seed-1-6-Thinking | Standard | 1743 | 66 | (1612, 1873) | 36 |
| 7 | GPT-4.1 | Blindfold | 1699 | 50 | (1601, 1797) | 60 |
| 8 | Doubao-Seed-1-6-Thinking | Blindfold | 1687 | 73 | (1542, 1831) | 24 |
| 9 | GPT-4.1 | Blitz | 1686 | 50 | (1588, 1784) | 182 |
| 10 | Claude-3.7-Sonnet | Blitz | 1654 | 50 | (1555, 1751) | 74 |
| 11 | Claude-3.7-Sonnet | Blindfold | 1625 | 66 | (1493, 1756) | 30 |
| 12 | GPT-4.1 | Blitz | 1623 | 50 | (1525, 1721) | 106 |
| 13 | Gemini-2.5-Pro | Standard | 1616 | 74 | (1469, 1762) | 28 |
| 14 | Seed-Coder-8B-Chess | Blitz | 1614 | 63 | (1490, 1738) | 30 |
| 15 | Qwen3-8B-SFT-Stage2 (Ours) | Blitz | 1612 | 56 | (1501, 1721) | 40 |
| 16 | Claude-3.7-Sonnet | Blindfold | 1588 | 72 | (1445, 1729) | 28 |
| 17 | GPT-4.1 | Bullet | 1583 | 50 | (1485, 1681) | 54 |
| 18 | DeepSeek-V3 | Blitz | 1553 | 50 | (1454, 1650) | 174 |
| 19 | Random Player (Weak baseline) | - | 1524 | 50 | (1425, 1621) | 284 |
| 20 | Qwen3-235B-A22B | Blitz | 1483 | 50 | (1385, 1581) | 146 |
| 21 | DeepSeek-V3 | Blitz | 1482 | 58 | (1367, 1597) | 48 |
| 22 | DeepSeek-V3 | Blindfold | 1437 | 75 | (1290, 1584) | 24 |
| 23 | DeepSeek-V3 | Bullet | 1382 | 80 | (1224, 1540) | 22 |
| 24 | Qwen3-235B-A22B | Bullet | 1369 | 54 | (1261, 1476) | 46 |
| 25 | Qwen3-8B | Blitz | 1335 | 65 | (1205, 1463) | 32 |
| 26 | Seed-Coder-8B-Instruct | Blitz | 1009 | 106 | (800, 1218) | 30 |

Note: RD = Rating Deviation; 95% CI = 95% Confidence Interval. Entries that repeat a model and mode correspond to different configurations (with vs. without legal moves provided).

Distribution of Terminations

This figure displays the distribution of game termination conditions—including checkmate, forfeit, move limit draw, and others—offering insight into the models' playing behavior.
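A rough idea of how such terminations can be classified with python-chess is sketched below; the category names and the ply limit are illustrative guesses rather than ChessArena's exact rules.

```python
import chess

def classify_termination(board, forfeited, max_plies=200):
    """Bucket a finished game roughly the way the figure does; the
    category names and ply limit are illustrative guesses."""
    if forfeited:
        return "forfeit"  # e.g. a player kept producing illegal moves
    if board.is_checkmate():
        return "checkmate"
    if board.is_stalemate():
        return "stalemate"
    if board.is_insufficient_material():
        return "insufficient material"
    if board.can_claim_fifty_moves() or board.can_claim_threefold_repetition():
        return "claimable draw"
    if len(board.move_stack) >= max_plies:
        return "move limit draw"
    return "other"
```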

🎯 Fine-Grained Evaluation

Observing the issues models encountered in the main chess competitions, we designed three fine-grained evaluation tasks to analyze specific weaknesses:

  • Basic Understanding: Assesses the model's fundamental comprehension of the chessboard state.
  • Move Selection: Evaluates the model's single-step move choice ability.
  • Puzzle Solving: Tests the model's capability to solve multi-step chess puzzles (tactical reasoning).

The specific results for these tasks are detailed below.

1. Basic Understanding

Basic understanding results. Thinking models such as O3 and Doubao-Seed-1-6-Thinking show strong chessboard understanding capabilities. Our post-training significantly improves the basic understanding capability.
| Model | PMA (%) | Precision (%) | Recall (%) |
|-------|---------|---------------|------------|
| GPT-4.1 | 98.0 | 89.3 | 92.1 |
| O3 | 98.5 | 98.5 | 98.5 |
| DeepSeek-V3 | 97.0 | 81.8 | 75.3 |
| DeepSeek-V3.1 | 89.0 | 87.5 | 87.4 |
| DeepSeek-R1 | 100.0 | 99.2 | 98.4 |
| Doubao-1-5-Pro-32k | 76.0 | 50.6 | 56.2 |
| Doubao-1-5-Lite-32k | 51.5 | 33.3 | 30.3 |
| Doubao-1-5-Thinking-Pro | 99.5 | 98.0 | 98.0 |
| Doubao-Seed-1-6-Thinking | 100.0 | 99.9 | 99.9 |
| Qwen3-235B-A22B | 80.5 | 50.7 | 49.3 |
| Claude-3.7-Sonnet | 98.0 | 87.6 | 87.3 |
| Gemini-2.5-Pro | 100.0 | 98.5 | 96.7 |
| Qwen3-8B | 36.0 | 14.1 | 18.8 |
| Qwen3-8B-Chess-SFT-Stage1 | 63.5 (+31.5) | 20.6 (+5.9) | 29.5 (+14.3) |
| Qwen3-8B-Chess-SFT-Stage2 | 70.5 (+7.0) | 51.9 (+31.3) | 45.3 (+15.8) |
| Qwen3-8B-Chess (SFT+RL) | 79.0 (+8.5) | 52.6 (+0.7) | 50.1 (+4.8) |
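Since the table reports precision and recall, one plausible way to score a piece-placement probe is sketched below with python-chess; the question format and the `piece_squares` helper are our assumptions, and ChessArena's actual probe may differ.

```python
import chess

def piece_squares(fen):
    """Ground-truth (piece, square) pairs from a FEN, e.g. 'P@e4'."""
    board = chess.Board(fen)
    return {f"{piece.symbol()}@{chess.square_name(sq)}"
            for sq, piece in board.piece_map().items()}

def precision_recall(pred, gold):
    """Score the model's predicted piece placements against the FEN."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = piece_squares("8/8/8/8/4P3/8/8/K1k5 w - - 0 1")  # K a1, k c1, P e4
pred = {"P@e4", "K@a1", "k@c1", "q@d8"}                 # one hallucinated queen
print(precision_recall(pred, gold))                     # (0.75, 1.0)
```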

2. Move Selection

Move selection performance across the four play modes, with and without legal moves provided. LLMs still have significant room for improvement, especially when the legal moves are not provided. In the table below, the first three metric columns are measured with legal moves provided and the last three without; "/" marks configurations not evaluated for that player.
| Mode | Model or Engine | LR (%) | TR (%) | MAR (%) | LR (%) | TR (%) | MAR (%) |
|------|-----------------|--------|--------|---------|--------|--------|---------|
| Blitz | Random Player | 100.0 | 14.8 | -1.1 | / | / | / |
| | Maia-1100 | / | / | / | 100.0 | 78.3 | +107.6 |
| | GPT-4.1 | 97.5 | 25.9 | +20.5 | 71.6 | 29.3 | +6.2 |
| | Claude-3.7-Sonnet | 99.6 | 26.1 | +25.6 | 68.4 | 18.2 | -17.7 |
| | DeepSeek-V3 | 99.1 | 18.5 | +10.7 | 64.5 | 12.9 | -27.7 |
| | DeepSeek-V3.1 | 93.4 | 26.7 | +18.6 | 63.7 | 16.9 | -23.6 |
| | Qwen3-235B-A22B | 89.8 | 24.9 | +29.0 | 64.2 | 17.0 | -25.3 |
| | Qwen3-8B | 96.2 | 13.4 | +1.8 | 9.8 | 2.1 | -79.5 |
| | Qwen3-8B-Chess-SFT-Stage1 | 86.8 | 13.6 | -9.6 | 15.1 | 2.6 | -74.9 |
| | Qwen3-8B-Chess-SFT-Stage2 | 96.9 | 23.4 | +15.1 | 66.3 | 13.3 | -22.1 |
| | Qwen3-8B-Chess (SFT+RL) | 92.9 | 40.2 | +41.1 | 87.6 | 20.2 | -1.2 |
| | Seed-Coder-8B-Instruct | 59.3 | 8.5 | -36.1 | 4.5 | 1.0 | -85.4 |
| | Seed-Coder-8B-Chess (SFT+RL) | 99.5 | 29.5 | +35.7 | 85.1 | 12.4 | -9.0 |
| Bullet | GPT-4.1 | 98.7 | 25.0 | +20.8 | 74.0 | 28.7 | +5.7 |
| | Claude-3.7-Sonnet | 98.6 | 22.5 | +16.8 | 75.2 | 17.9 | -9.4 |
| | DeepSeek-V3 | 98.9 | 18.8 | +11.3 | 66.2 | 13.3 | -21.8 |
| | DeepSeek-V3.1 | 80.6 | 16.1 | -8.0 | 56.3 | 12.7 | -35.7 |
| | Qwen3-235B-A22B | 95.9 | 17.8 | +4.5 | 69.1 | 15.9 | -18.5 |
| Standard | DeepSeek-R1 | 100.0 | 32.7 | +34.7 | 82.5 | 23.7 | -1.0 |
| | Doubao-1-5-Thinking-Pro | 99.7 | 32.9 | +35.4 | 78.0 | 24.8 | +3.0 |
| | Doubao-Seed-1-6-Thinking | 99.8 | 39.1 | +53.7 | 90.7 | 36.0 | +32.0 |
| | Gemini-2.5-Pro | 99.4 | 37.6 | +46.5 | 85.5 | 40.5 | +36.5 |
| | O3 | 99.6 | 58.7 | +80.1 | 98.0 | 62.0 | +80.2 |
| Blindfold | GPT-4.1 | 96.8 | 20.1 | +12.7 | 72.7 | 20.2 | +1.2 |
| | Claude-3.7-Sonnet | 98.2 | 23.9 | +21.5 | 77.3 | 18.9 | -9.1 |
| | DeepSeek-V3 | 95.1 | 19.2 | +16.2 | 78.5 | 14.9 | -7.8 |
| | DeepSeek-V3.1 | 96.5 | 26.0 | +27.2 | 66.0 | 13.7 | -18.0 |
| | DeepSeek-R1 | 94.7 | 22.7 | +14.0 | 44.6 | 10.9 | -36.9 |
| | Qwen3-235B-A22B | 96.1 | 19.9 | +17.4 | 75.3 | 17.2 | -10.4 |
| | Doubao-Seed-1-6-Thinking | 97.8 | 32.1 | +36.5 | 43.6 | 12.9 | -30.5 |
| | Gemini-2.5-Pro | 98.7 | 30.4 | +23.5 | 68.7 | 21.5 | -8.7 |
| | O3 | 98.4 | 46.9 | +63.2 | 86.9 | 43.5 | +50.9 |

3. Puzzle Solving

Puzzle solving accuracy when legal moves are provided, broken down by puzzle rating range. LLMs perform relatively poorly, with O3 standing out as the strongest.
| Model or Engine | 200-600 | 600-1000 | 1000-1400 | 1400-1800 | 1800-2200 | 2200-2600 | 2600-3000 | Overall |
|-----------------|---------|----------|-----------|-----------|-----------|-----------|-----------|---------|
| Stockfish (Depth=20) | 100.0 | 100.0 | 100.0 | 100.0 | 99.3 | 97.9 | 91.5 | 98.4 |
| Maia-1100 | 98.6 | 97.2 | 91.6 | 82.5 | 72.7 | 51.0 | 28.2 | 74.6 |
| Random Player | 1.4 | 1.4 | 2.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 |
| GPT-4.1 | 18.9 | 14.0 | 8.4 | 4.9 | 1.4 | 2.8 | 0.0 | 7.2 |
| Claude-3.7-Sonnet | 18.2 | 16.1 | 4.9 | 4.2 | 5.6 | 1.4 | 0.0 | 7.2 |
| DeepSeek-V3 | 11.9 | 7.7 | 2.1 | 0.7 | 0.0 | 0.7 | 0.0 | 3.3 |
| DeepSeek-V3.1 | 13.3 | 10.5 | 8.4 | 4.9 | 1.4 | 2.8 | 7.0 | 6.0 |
| Qwen3-235B-A22B | 24.5 | 18.2 | 9.8 | 5.6 | 4.2 | 1.4 | 0.0 | 9.1 |
| Qwen3-8B | 2.8 | 4.9 | 2.1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.4 |
| Qwen3-8B-Chess | 31.5 | 16.8 | 10.5 | 7.0 | 5.6 | 2.1 | 0.0 | 10.5 |
| Seed-Coder-8B-Instruct | 0.0 | 1.4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.4 |
| Seed-Coder-8B-Chess | 23.8 | 8.4 | 4.9 | 3.5 | 4.9 | 2.8 | 0.0 | 6.9 |
| O3 | 97.9 | 90.2 | 79.7 | 62.9 | 46.5 | 10.5 | 1.4 | 55.6 |
| Gemini-2.5-Pro | 37.1 | 24.5 | 18.2 | 9.1 | 4.2 | 3.5 | 1.4 | 14.0 |
| Doubao-Seed-1-6-Thinking | 27.3 | 23.8 | 11.9 | 7.7 | 4.2 | 1.4 | 2.1 | 11.2 |
| DeepSeek-R1 | 23.1 | 20.3 | 7.0 | 4.2 | 2.8 | 0.7 | 0.7 | 8.4 |
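For intuition, multi-step puzzle scoring can be sketched as follows, assuming a Lichess-style convention in which the model must reproduce every solver move while the opponent's replies are forced from the solution line; this convention and the helper are assumptions, not the paper's exact protocol.

```python
import chess

def check_puzzle(fen, solution, model_line):
    """Multi-step puzzle scoring (assumed Lichess-style convention):
    the model must play every solver move in `solution`, while the
    opponent's replies are forced from the same line."""
    board = chess.Board(fen)
    solver = board.turn
    want = iter(model_line)
    for uci in solution:
        if board.turn == solver and next(want, None) != uci:
            return False  # deviated from the solution line
        board.push(chess.Move.from_uci(uci))
    return True

# One-move back-rank mate: White to play, Rd8#.
fen = "6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1"
print(check_puzzle(fen, ["d1d8"], ["d1d8"]))  # True
```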

♟️ Chess Post-training Pipeline

To establish a strong baseline within ChessArena and validate the efficacy of our testbed, we implemented a comprehensive post-training pipeline. We distilled the reasoning data generated by high-performing models—specifically GPT-4.1 and Doubao-Seed-1-6-Thinking.

This distilled data was first used for Supervised Fine-Tuning (SFT). Subsequently, we applied Reinforcement Learning (RL) based on the GRPO method to further enhance the models' strategic reasoning and playing strength.
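As a sketch of the RL stage: GRPO scores a group of sampled completions for the same position and normalizes each reward against the group's mean and standard deviation. The group-relative advantage below is the standard GRPO formulation, while the chess reward shaping shown is hypothetical rather than the paper's exact recipe.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage, as in GRPO: normalize each sampled
    completion's reward against its group's mean and std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard: all-equal group
    return [(r - mu) / sigma for r in rewards]

def chess_reward(is_legal, eval_gain_cp):
    """Hypothetical reward shaping (NOT the paper's exact recipe):
    penalize illegal moves, otherwise squash the engine-evaluated
    centipawn gain into [-1, 1]."""
    if not is_legal:
        return -1.0
    return max(-1.0, min(1.0, eval_gain_cp / 100.0))

# Three sampled moves for one position: a decent move, a blunder,
# and an illegal move.
group = [chess_reward(True, 35.0), chess_reward(True, -120.0), chess_reward(False, 0.0)]
print(grpo_advantages(group))  # the legal, decent move gets a positive advantage
```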

This process yielded two strong specialized models: Qwen3-8B-Chess (based on Qwen3-8B) and Seed-Coder-8B-Chess (based on Seed-Coder-8B-Instruct). As evidenced by the results in the Chess Competitions and Fine-Grained Evaluation sections above, these models demonstrate a significant enhancement in chess reasoning capabilities, substantially closing the gap with much larger state-of-the-art models.

We have open-sourced our data and models, and you can click the link at the top of the page to view them.

🚀 Generalization Experiments

To test the broader impact of chess post-training, we evaluated the models' generalization ability on established Code, Math, and Reasoning benchmarks. This helps determine if the strategic reasoning enhanced in the chess domain transfers to other cognitive tasks.

Performance of Qwen3-8B and its Post-Trained Variants on External Benchmarks.
| Model Variant | LiveCodeBench | AIME2025 | ZebraLogic | BigCodeBench | CruxEval | DROP |
|---------------|---------------|----------|------------|--------------|----------|------|
| Qwen3-8B (Baseline) | 25.19 | 18.61 | 25.90 | 41.32 | 73.25 | 85.15 |
| **I. Chess Training Only** | | | | | | |
| Qwen3-8B-Chess-SFT-Stage2 | 27.48 | 15.43 | 30.40 | 41.40 | 68.00 | 82.83 |
| Qwen3-8B-Chess (SFT+RL) | 25.19 | 19.30 | 48.00 | 39.82 | 72.25 | 83.58 |
| **II. Multi-Task RL (with Chess-SFT)** | | | | | | |
| +Math-RL | 25.19 | 19.70 | 42.40 | 39.39 | 73.25 | 84.12 |
| +Math+Chess-RL | 26.72 | 19.32 | 39.00 | 40.09 | 72.09 | 84.01 |
| +Math+Code-RL | 28.24 | 18.61 | 38.30 | 40.26 | 71.25 | 83.34 |
| +Math+Code+Chess-RL | 30.53 | 17.96 | 38.00 | 40.30 | 72.22 | 83.84 |
| **III. Single-Task RL (without Chess-SFT)** | | | | | | |
| +Math-RL | 25.19 | 21.30 | 28.00 | 42.37 | 77.47 | 85.53 |

Key Findings

  1. Models fine-tuned with chess Supervised Fine-Tuning (SFT) consistently show a degree of generalization to the logical reasoning benchmark ZebraLogic. Notably, models that underwent chess SFT improve their ZebraLogic scores substantially after the RL phase (up to 48.00 for Qwen3-8B-Chess (SFT+RL)) compared to the baseline (25.90).
  2. Incorporating chess data into the Reinforcement Learning (RL) mixture helps code capabilities generalize, particularly on LiveCodeBench: comparisons such as Math+Code+Chess-RL vs. Math+Code-RL show that RL models augmented with chess data achieve higher LiveCodeBench scores.
  3. However, chess SFT can degrade performance on other benchmarks (e.g., BigCodeBench, CruxEval), and mixing chess data into RL can slightly hurt mathematical tasks (i.e., AIME2025).

Qualitative Analysis of Reasoning Improvement

In our paper, we include additional case studies (e.g., in Appendix G.3) that illustrate how the model's reasoning process changes. Across all puzzles, the model's reasoning before chess training was often superficial, frequently skipping steps or forgetting known conditions.

After chess post-training, the model's reasoning is not confined to a specific format but rigorously follows the known conditions step by step, with demonstrable self-correction when errors occur (Puzzle 3). This indicates that chess post-training makes the model's reasoning process more rigorous, leading to higher-quality answers. The relevant case studies can be found in our paper.

Paper Page

If you are interested in ChessArena, please contact us at jinchengliu@smail.nju.edu.cn. You can also refer to our paper at https://arxiv.org/abs/2509.24239 for more information.

BibTeX

@article{liu2025chessarena,
  title={ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models},
  author={Liu, Jincheng and He, Sijun and Wu, Jingjing and Wang, Xiangsen and Chen, Yang and Kuang, Zhaoqi and Bao, Siqi and Yao, Yuan},
  journal={arXiv preprint arXiv:2509.24239},
  year={2025}
}