Abstract
Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills, particularly complex strategic reasoning, or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents ChessArena, a chess testbed for evaluating the strategic reasoning capabilities of LLMs. Chess demands complex strategic reasoning, including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework in which LLMs play against each other under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard, and can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs are evaluated in ChessArena under different modes, playing more than 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), while some even fail to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
Overview
ChessArena is a competitive arena for Large Language Models (LLMs) built on model-versus-model gameplay. By having LLMs play against each other as humans do, we can evaluate their strategic reasoning, instruction compliance, and multi-turn conversational memory, capabilities that are crucial for today's LLMs. Following Lichess, we offer four distinct play modes, allowing models to engage in diverse matches. Furthermore, to assess models' gameplay abilities more comprehensively, we designed three fine-grained evaluation tasks: Basic Understanding, Move Selection, and Puzzle Solving. We found that many models cannot even defeat a random player, and no model has surpassed Maia-1100 (a human-like chess engine), which indicates significant room for improvement in the strategic reasoning of LLMs. By distilling data from powerful models in ChessArena and applying post-training (SFT+RL) to Qwen3-8B, we established a strong baseline model within ChessArena. In summary, we have launched the ChessArena benchmark, a chess platform that supports LLMs competing against each other under various settings. To submit your model, please use the interface below.
Key Features
♟️ Competition Sampling
Automated matchmaking that uses Glicko ratings to pair players with similar skill levels, accelerating rating convergence.
📊 Glicko-1 Rating System
Robust rating calculation for comprehensive model evaluation and comparison across different configurations (see the Glicko-1 sketch after this feature list).
🎯 Fine-grained Evaluation
Three specialized tasks: basic understanding, move selection, and puzzle solving to identify specific failure modes.
🏋️ Training Pipeline
Complete training framework with SFT and RL stages, including trained models and datasets on HuggingFace.
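To make the rating and matchmaking mechanics concrete, below is a minimal sketch of the standard Glicko-1 update together with a closest-rating pairing rule. This is an illustrative reimplementation of the published Glicko-1 formulas, not ChessArena's actual code; the `pick_opponent` heuristic is our simplification of competition sampling.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant (~0.00575646)

def g(rd: float) -> float:
    """Attenuation factor: discounts results against uncertain opponents."""
    return 1 / math.sqrt(1 + 3 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

def expected_score(r: float, r_j: float, rd_j: float) -> float:
    """Expected score of a player rated r against opponent (r_j, rd_j)."""
    return 1 / (1 + 10 ** (-g(rd_j) * (r - r_j) / 400))

def glicko1_update(r: float, rd: float, results: list[tuple[float, float, float]]):
    """One rating period. `results` holds (r_j, rd_j, score) triples with
    score in {0, 0.5, 1}. Returns the updated (rating, RD)."""
    d2_inv, delta = 0.0, 0.0
    for r_j, rd_j, s in results:
        e = expected_score(r, r_j, rd_j)
        d2_inv += (Q ** 2) * (g(rd_j) ** 2) * e * (1 - e)
        delta += Q * g(rd_j) * (s - e)
    denom = 1 / rd ** 2 + d2_inv
    return r + delta / denom, math.sqrt(1 / denom)

def pick_opponent(rating: float, pool: list[dict]) -> dict:
    """Competition sampling, simplified: pair with the closest-rated player."""
    return min(pool, key=lambda p: abs(p["rating"] - rating))
```

Because the update shrinks RD with every game, pairing similarly rated players maximizes the information each result carries and tightens the confidence intervals reported on the leaderboard below.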
Four Game Modes
We introduce four distinct game modes with different evaluation focuses. Each mode presents unique challenges and is suitable for different types of models:
⚡ Bullet Mode
Lightning Fast
Models receive the current board state (FEN) and must output moves (UCI or SAN) directly without any thinking process.
Suitable for: Non-thinking models
⚡⚡ Blitz Mode
Quick Thinking
Models receive the current board state and are allowed to think before outputting moves.
Suitable for: Non-thinking models
🎯 Standard Mode
Deep Thinking
Models receive the current board state and can perform long Chain-of-Thought (CoT) reasoning before outputting moves.
Suitable for: Thinking models
🎭 Blindfold Mode
Most Challenging
Move history is provided through multi-turn dialogue. Models must reconstruct the board from move history and think before outputting moves.
Suitable for: Both Thinking and non-thinking models
These four game modes have different evaluation focuses, with Blindfold being the most difficult, as it requires models to maintain a mental representation of the board.
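To illustrate concretely what the modes ask of a model, the sketch below uses the python-chess library to rebuild a position from a Blindfold-style move history and to validate a reply given in UCI or SAN; the helper names are ours, not part of the ChessArena codebase.

```python
import chess  # pip install python-chess

def board_from_history(moves_san: list[str]) -> chess.Board:
    """Blindfold mode: rebuild the position from a SAN move history."""
    board = chess.Board()  # standard starting position
    for san in moves_san:
        board.push_san(san)
    return board

def is_legal_reply(fen: str, reply: str) -> bool:
    """Bullet/Blitz/Standard modes: check a model's move (UCI or SAN)
    against the position given by a FEN string."""
    board = chess.Board(fen)
    try:
        move = chess.Move.from_uci(reply)  # try UCI first, e.g. "e2e4"
    except ValueError:
        try:
            move = board.parse_san(reply)  # fall back to SAN, e.g. "Nf3"
        except ValueError:
            return False
    return move in board.legal_moves

# Example: validate a reply to the starting position
print(is_legal_reply(chess.STARTING_FEN, "e2e4"))  # True
```

In Blindfold mode the FEN is never shown, so an illegal reply usually signals that the model's mental board state has drifted from the true position.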
Leaderboard Results
Through over 1,000 games, we established a comprehensive leaderboard evaluating various LLMs. Key findings include:
- Thinking models generally outperform non-thinking models
- All models are inferior to Maia-1100 (a chess-specific model)
- In ChessArena, our Qwen3-8B-Chess is a relatively strong baseline.
| Rank | Model | Mode | Legal Moves | Rating | RD | 95% CI | Games |
|---|---|---|---|---|---|---|---|
| 1 | Maia-1100 | - | ✗ | 2220 | 82 | (2058, 2382) | 44 |
| 2 | O3 | Standard | ✗ | 1948 | 78 | (1793, 2101) | 28 |
| 3 | Doubao-Seed-1-6-Thinking | Standard | ✓ | 1830 | 50 | (1729, 1929) | 60 |
| 4 | Gemini-2.5-Pro | Standard | ✓ | 1819 | 81 | (1659, 1979) | 18 |
| 5 | Qwen3-8B-Chess (baseline) | Blitz | ✓ | 1776 | 93 | (1593, 1959) | 16 |
| 6 | Doubao-Seed-1-6-Thinking | Standard | ✗ | 1743 | 66 | (1612, 1873) | 36 |
| 7 | GPT-4.1 | Blindfold | ✓ | 1699 | 50 | (1601, 1797) | 60 |
| 8 | Doubao-Seed-1-6-Thinking | Blindfold | ✓ | 1687 | 73 | (1542, 1831) | 24 |
| 9 | GPT-4.1 | Blitz | ✓ | 1686 | 50 | (1588, 1784) | 182 |
| 10 | Claude-3.7-Sonnet | Blitz | ✓ | 1654 | 50 | (1555, 1751) | 74 |
| 11 | Claude-3.7-Sonnet | Blindfold | ✓ | 1625 | 66 | (1493, 1756) | 30 |
| 12 | GPT-4.1 | Blitz | ✗ | 1623 | 50 | (1525, 1721) | 106 |
| 13 | Gemini-2.5-Pro | Standard | ✗ | 1616 | 74 | (1469, 1762) | 28 |
| 14 | Seed-Coder-8B-Chess | Blitz | ✓ | 1614 | 63 | (1490, 1738) | 30 |
| 15 | Qwen3-8B-SFT-Stage2 (Ours) | Blitz | ✓ | 1612 | 56 | (1501, 1721) | 40 |
| 16 | Claude-3.7-Sonnet | Blindfold | ✗ | 1588 | 72 | (1445, 1729) | 28 |
| 17 | GPT-4.1 | Bullet | ✓ | 1583 | 50 | (1485, 1681) | 54 |
| 18 | DeepSeek-V3 | Blitz | ✓ | 1553 | 50 | (1454, 1650) | 174 |
| 19 | Random Player (Weak baseline) | - | ✓ | 1524 | 50 | (1425, 1621) | 284 |
| 20 | Qwen3-235B-A22B | Blitz | ✓ | 1483 | 50 | (1385, 1581) | 146 |
| 21 | DeepSeek-V3 | Blitz | ✗ | 1482 | 58 | (1367, 1597) | 48 |
| 22 | DeepSeek-V3 | Blindfold | ✓ | 1437 | 75 | (1290, 1584) | 24 |
| 23 | DeepSeek-V3 | Bullet | ✓ | 1382 | 80 | (1224, 1540) | 22 |
| 24 | Qwen3-235B-A22B | Bullet | ✓ | 1369 | 54 | (1261, 1476) | 46 |
| 25 | Qwen3-8B | Blitz | ✓ | 1335 | 65 | (1205, 1463) | 32 |
| 26 | Seed-Coder-8B-Instruct | Blitz | ✓ | 1009 | 106 | (800, 1218) | 30 |
Note: RD = Rating Deviation; 95% CI = 95% Confidence Interval; ✓ = Legal moves provided; ✗ = No legal moves provided
Distribution of terminations
This figure shows the distribution of game termination conditions, including checkmate, forfeit, move-limit draw, and others, giving a clearer picture of the models' playing behavior.
🎯 Fine-Grained Evaluation
Observing the issues models encountered in the main chess competitions, we designed three fine-grained evaluation tasks to analyze specific weaknesses:
- Basic Understanding: Assesses the model's fundamental comprehension of the chessboard state.
- Move Selection: Evaluates the model's single-step move choice ability.
- Puzzle Solving: Tests the model's capability to solve multi-step chess puzzles (tactical reasoning).
The specific results for these tasks are detailed below.
1. Basic Understanding
| Model | PMA (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| GPT-4.1 | 98.0 | 89.3 | 92.1 |
| O3 | 98.5 | 98.5 | 98.5 |
| DeepSeek-V3 | 97.0 | 81.8 | 75.3 |
| DeepSeek-V3.1 | 89.0 | 87.5 | 87.4 |
| DeepSeek-R1 | 100.0 | 99.2 | 98.4 |
| Doubao-1-5-Pro-32k | 76.0 | 50.6 | 56.2 |
| Doubao-1-5-Lite-32k | 51.5 | 33.3 | 30.3 |
| Doubao-1-5-Thinking-Pro | 99.5 | 98.0 | 98.0 |
| Doubao-Seed-1-6-Thinking | 100.0 | 99.9 | 99.9 |
| Qwen3-235B-A22B | 80.5 | 50.7 | 49.3 |
| Claude-3.7-Sonnet | 98.0 | 87.6 | 87.3 |
| Gemini-2.5-Pro | 100.0 | 98.5 | 96.7 |
| Qwen3-8B | 36.0 | 14.1 | 18.8 |
| Qwen3-8B-Chess-SFT-Stage1 | 63.5 (+31.5) | 20.6 (+5.9) | 29.5 (+14.3) |
| Qwen3-8B-Chess-SFT-Stage2 | 70.5 (+7.0) | 51.9 (+31.3) | 45.3 (+15.8) |
| Qwen3-8B-Chess (SFT+RL) | 79.0 (+8.5) | 52.6 (+0.7) | 50.1 (+4.8) |
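The exact prompt format and metric definitions for this task are given in the paper. Purely as an illustration of set-based scoring, the sketch below compares a hypothetical predicted set of piece placements against the ground truth extracted from a FEN; treating the task as piece-placement listing is our assumption.

```python
import chess

def pieces_from_fen(fen: str) -> set[tuple[str, str]]:
    """Ground-truth board state as (square, piece) pairs, e.g. ("e1", "K")."""
    board = chess.Board(fen)
    return {(chess.square_name(sq), piece.symbol())
            for sq, piece in board.piece_map().items()}

def precision_recall(predicted: set, truth: set) -> tuple[float, float]:
    # Standard set precision/recall; illustrative only, not necessarily
    # the paper's exact definition of the metrics above.
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall
```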
2. Move Selection
| Mode | Model or Engine | LR (%) ✓ | TR (%) ✓ | MAR (%) ✓ | LR (%) ✗ | TR (%) ✗ | MAR (%) ✗ |
|---|---|---|---|---|---|---|---|
| Blitz | Random Player | 100.0 | 14.8 | -1.1 | / | / | / |
| Blitz | Maia-1100 | / | / | / | 100.0 | 78.3 | +107.6 |
| Blitz | GPT-4.1 | 97.5 | 25.9 | +20.5 | 71.6 | 29.3 | +6.2 |
| Blitz | Claude-3.7-Sonnet | 99.6 | 26.1 | +25.6 | 68.4 | 18.2 | -17.7 |
| Blitz | DeepSeek-V3 | 99.1 | 18.5 | +10.7 | 64.5 | 12.9 | -27.7 |
| Blitz | DeepSeek-V3.1 | 93.4 | 26.7 | +18.6 | 63.7 | 16.9 | -23.6 |
| Blitz | Qwen3-235B-A22B | 89.8 | 24.9 | +29.0 | 64.2 | 17.0 | -25.3 |
| Blitz | Qwen3-8B | 96.2 | 13.4 | +1.8 | 9.8 | 2.1 | -79.5 |
| Blitz | Qwen3-8B-Chess-SFT-Stage1 | 86.8 | 13.6 | -9.6 | 15.1 | 2.6 | -74.9 |
| Blitz | Qwen3-8B-Chess-SFT-Stage2 | 96.9 | 23.4 | +15.1 | 66.3 | 13.3 | -22.1 |
| Blitz | Qwen3-8B-Chess (SFT+RL) | 92.9 | 40.2 | +41.1 | 87.6 | 20.2 | -1.2 |
| Blitz | Seed-Coder-8B-Instruct | 59.3 | 8.5 | -36.1 | 4.5 | 1.0 | -85.4 |
| Blitz | Seed-Coder-8B-Chess (SFT+RL) | 99.5 | 29.5 | +35.7 | 85.1 | 12.4 | -9.0 |
| Bullet | GPT-4.1 | 98.7 | 25.0 | +20.8 | 74.0 | 28.7 | +5.7 |
| Bullet | Claude-3.7-Sonnet | 98.6 | 22.5 | +16.8 | 75.2 | 17.9 | -9.4 |
| Bullet | DeepSeek-V3 | 98.9 | 18.8 | +11.3 | 66.2 | 13.3 | -21.8 |
| Bullet | DeepSeek-V3.1 | 80.6 | 16.1 | -8.0 | 56.3 | 12.7 | -35.7 |
| Bullet | Qwen3-235B-A22B | 95.9 | 17.8 | +4.5 | 69.1 | 15.9 | -18.5 |
| Standard | DeepSeek-R1 | 100.0 | 32.7 | +34.7 | 82.5 | 23.7 | -1.0 |
| Standard | Doubao-1-5-Thinking-Pro | 99.7 | 32.9 | +35.4 | 78.0 | 24.8 | +3.0 |
| Standard | Doubao-Seed-1-6-Thinking | 99.8 | 39.1 | +53.7 | 90.7 | 36.0 | +32.0 |
| Standard | Gemini-2.5-Pro | 99.4 | 37.6 | +46.5 | 85.5 | 40.5 | +36.5 |
| Standard | O3 | 99.6 | 58.7 | +80.1 | 98.0 | 62.0 | +80.2 |
| Blindfold | GPT-4.1 | 96.8 | 20.1 | +12.7 | 72.7 | 20.2 | +1.2 |
| Blindfold | Claude-3.7-Sonnet | 98.2 | 23.9 | +21.5 | 77.3 | 18.9 | -9.1 |
| Blindfold | DeepSeek-V3 | 95.1 | 19.2 | +16.2 | 78.5 | 14.9 | -7.8 |
| Blindfold | DeepSeek-V3.1 | 96.5 | 26.0 | +27.2 | 66.0 | 13.7 | -18.0 |
| Blindfold | DeepSeek-R1 | 94.7 | 22.7 | +14.0 | 44.6 | 10.9 | -36.9 |
| Blindfold | Qwen3-235B-A22B | 96.1 | 19.9 | +17.4 | 75.3 | 17.2 | -10.4 |
| Blindfold | Doubao-Seed-1-6-Thinking | 97.8 | 32.1 | +36.5 | 43.6 | 12.9 | -30.5 |
| Blindfold | Gemini-2.5-Pro | 98.7 | 30.4 | +23.5 | 68.7 | 21.5 | -8.7 |
| Blindfold | O3 | 98.4 | 46.9 | +63.2 | 86.9 | 43.5 | +50.9 |

Note: ✓ = legal moves provided in the prompt; ✗ = no legal moves provided. LR, TR, and MAR are defined in the paper.
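As one plausible way to score a selected move, the sketch below measures the centipawn swing a move causes according to Stockfish via python-chess; the search depth, mate score, and perspective convention are our assumptions, not the paper's MAR definition.

```python
import chess
import chess.engine

def move_swing_cp(fen: str, uci_move: str, depth: int = 20) -> int:
    """Centipawn change caused by a move, judged from the mover's point of
    view by Stockfish. Illustrative scoring only, not the paper's MAR."""
    board = chess.Board(fen)
    mover = board.turn
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # assumes stockfish is on PATH
    try:
        before = engine.analyse(board, chess.engine.Limit(depth=depth))
        score_before = before["score"].pov(mover).score(mate_score=10000)
        board.push(chess.Move.from_uci(uci_move))
        after = engine.analyse(board, chess.engine.Limit(depth=depth))
        score_after = after["score"].pov(mover).score(mate_score=10000)
    finally:
        engine.quit()
    return score_after - score_before
```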
3. Puzzle Solving
Puzzle solving accuracy (%) by puzzle rating range:

| Model or Engine | 200-600 | 600-1000 | 1000-1400 | 1400-1800 | 1800-2200 | 2200-2600 | 2600-3000 | Overall |
|---|---|---|---|---|---|---|---|---|
| Stockfish (Depth=20) | 100.0 | 100.0 | 100.0 | 100.0 | 99.3 | 97.9 | 91.5 | 98.4 |
| Maia-1100 | 98.6 | 97.2 | 91.6 | 82.5 | 72.7 | 51.0 | 28.2 | 74.6 |
| Random Player | 1.4 | 1.4 | 2.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 |
| GPT-4.1 | 18.9 | 14.0 | 8.4 | 4.9 | 1.4 | 2.8 | 0.0 | 7.2 |
| Claude-3.7-Sonnet | 18.2 | 16.1 | 4.9 | 4.2 | 5.6 | 1.4 | 0.0 | 7.2 |
| DeepSeek-V3 | 11.9 | 7.7 | 2.1 | 0.7 | 0.0 | 0.7 | 0.0 | 3.3 |
| DeepSeek-V3.1 | 13.3 | 10.5 | 8.4 | 4.9 | 1.4 | 2.8 | 7.0 | 6.0 |
| Qwen3-235B-A22B | 24.5 | 18.2 | 9.8 | 5.6 | 4.2 | 1.4 | 0.0 | 9.1 |
| Qwen3-8B | 2.8 | 4.9 | 2.1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.4 |
| Qwen3-8B-Chess | 31.5 | 16.8 | 10.5 | 7.0 | 5.6 | 2.1 | 0.0 | 10.5 |
| Seed-Coder-8B-Instruct | 0.0 | 1.4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.4 |
| Seed-Coder-8B-Chess | 23.8 | 8.4 | 4.9 | 3.5 | 4.9 | 2.8 | 0.0 | 6.9 |
| O3 | 97.9 | 90.2 | 79.7 | 62.9 | 46.5 | 10.5 | 1.4 | 55.6 |
| Gemini-2.5-Pro | 37.1 | 24.5 | 18.2 | 9.1 | 4.2 | 3.5 | 1.4 | 14.0 |
| Doubao-Seed-1-6-Thinking | 27.3 | 23.8 | 11.9 | 7.7 | 4.2 | 1.4 | 2.1 | 11.2 |
| DeepSeek-R1 | 23.1 | 20.3 | 7.0 | 4.2 | 2.8 | 0.7 | 0.7 | 8.4 |
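Puzzle evaluation naturally follows the Lichess convention of a starting FEN plus a forced solution line. The harness sketch below counts a puzzle as solved only if the model reproduces every one of its moves in the solution; whether the model or the opponent moves first, and the exact protocol, are assumptions here (the paper specifies the real one).

```python
import chess

def solve_puzzle(fen: str, solution: list[str], ask_model) -> bool:
    """True iff the model reproduces the full solution line. `ask_model(board)`
    stands in for an LLM call returning a UCI move; the opponent's replies are
    played from the solution. Protocol details are assumptions."""
    board = chess.Board(fen)
    for i, expected in enumerate(solution):
        if i % 2 == 0 and ask_model(board) != expected:  # model to move
            return False
        board.push_uci(expected)
    return True

def accuracy(puzzles, ask_model) -> float:
    """Fraction of (fen, solution) puzzles fully solved."""
    return sum(solve_puzzle(f, s, ask_model) for f, s in puzzles) / len(puzzles)
```

Under this harness, the Random Player row corresponds to `ask_model = lambda b: random.choice([m.uci() for m in b.legal_moves])`, which almost never reproduces a full solution line.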
♟️ Chess Post-training Pipeline
To establish a strong baseline within ChessArena and validate the efficacy of our testbed, we implemented a comprehensive post-training pipeline. We distilled reasoning data generated by high-performing models, specifically GPT-4.1 and Doubao-Seed-1-6-Thinking.
This distilled data was first used for Supervised Fine-Tuning (SFT). Subsequently, we applied Reinforcement Learning (RL) based on the GRPO method to further enhance the models' strategic reasoning and playing strength.
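GRPO needs no value network: each prompt is answered by a group of sampled responses, and every response's advantage is its reward normalized against the group. The sketch below shows that normalization with a hypothetical chess reward; the actual reward design used in our RL stage may differ.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO advantage: reward normalized against the group's mean and std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def chess_reward(is_legal: bool, engine_score: float) -> float:
    # Hypothetical shaping: penalize illegal moves outright, otherwise use
    # an engine-based score in [0, 1] (our assumption, not the paper's).
    return -1.0 if not is_legal else engine_score

# One group of four sampled moves for the same position:
print(grpo_advantages([chess_reward(True, 0.6), chess_reward(True, 0.1),
                       chess_reward(False, 0.0), chess_reward(True, 0.3)]))
```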
This process yielded two strong specialized models: Qwen3-8B-Chess (based on Qwen3-8B) and Seed-Coder-8B-Chess (based on Seed-Coder-8B-Instruct). As evidenced by the results in the Chess Competitions and Fine-Grained Evaluation sections above, these models demonstrate a significant enhancement in chess reasoning capabilities, substantially closing the gap with much larger state-of-the-art models.
We have open-sourced our data and models, and you can click the link at the top of the page to view them.
🚀 Generalization Experiments
To test the broader impact of chess post-training, we evaluated the models' generalization ability on established Code, Math, and Reasoning benchmarks. This helps determine if the strategic reasoning enhanced in the chess domain transfers to other cognitive tasks.
| Model Variant | LiveCodeBench | AIME2025 | ZebraLogic | BigCodeBench | CruxEval | DROP |
|---|---|---|---|---|---|---|
| Qwen3-8B (Baseline) | 25.19 | 18.61 | 25.90 | 41.32 | 73.25 | 85.15 |
| **I. Chess Training Only** | | | | | | |
| Qwen3-8B-Chess-SFT-Stage2 | 27.48 | 15.43 | 30.40 | 41.40 | 68.00 | 82.83 |
| Qwen3-8B-Chess (SFT+RL) | 25.19 | 19.30 | 48.00 | 39.82 | 72.25 | 83.58 |
| **II. Multi-Task RL (with Chess-SFT)** | | | | | | |
| +Math-RL | 25.19 | 19.70 | 42.40 | 39.39 | 73.25 | 84.12 |
| +Math+Chess-RL | 26.72 | 19.32 | 39.00 | 40.09 | 72.09 | 84.01 |
| +Math+Code-RL | 28.24 | 18.61 | 38.30 | 40.26 | 71.25 | 83.34 |
| +Math+Code+Chess-RL | 30.53 | 17.96 | 38.00 | 40.30 | 72.22 | 83.84 |
| **III. Single-Task RL (without Chess-SFT)** | | | | | | |
| +Math-RL | 25.19 | 21.30 | 28.00 | 42.37 | 77.47 | 85.53 |
Key Findings
- Models fine-tuned via Chess Supervised Fine-Tuning (SFT) consistently show some generalization to the logical reasoning benchmark ZebraLogic. Notably, after the RL phase, chess-SFT models improve their ZebraLogic scores markedly over the baseline (25.90), reaching 48.00 for Qwen3-8B-Chess (SFT+RL).
- Incorporating chess data into the Reinforcement Learning (RL) mixture aids the generalization of code capabilities, particularly on LiveCodeBench: Math+Code+Chess-RL outperforms Math+Code-RL (30.53 vs. 28.24).
- However, Chess SFT can degrade performance on other benchmarks (e.g., BigCodeBench, CruxEval), and mixing chess data into RL can cause a slight decline on mathematical tasks (i.e., AIME2025).
Qualitative Analysis of Reasoning Improvement
Our paper includes case studies (e.g., in Appendix G.3) that illustrate how the model's reasoning process changes. Across all puzzles, the model's reasoning before chess training was often superficial, frequently skipping steps or forgetting known conditions.
After chess post-training, the model's reasoning is not confined to a specific format but rigorously follows the known conditions step by step, and it demonstrably adjusts course when errors occur (Puzzle 3). This indicates that chess post-training makes the model's reasoning process more rigorous, leading to higher-quality answers. The relevant case studies can be found in our paper.
Paper Page
If you are interested in ChessArena, please contact us at jinchengliu@smail.nju.edu.cn. You can also refer to our paper at https://arxiv.org/abs/2509.24239 for more information.
BibTeX
@article{liu2025chessarena,
title={ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models},
author={Liu, Jincheng and He, Sijun and Wu, Jingjing and Wang, Xiangsen and Chen, Yang and Kuang, Zhaoqi and Bao, Siqi and Yao, Yuan},
journal={arXiv preprint arXiv:2509.24239},
year={2025}
}