Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Technical Report, 2025

1Tencent AI Lab Seattle, 2University of Maryland, College Park, 3Carnegie Mellon University,
4University of North Carolina at Chapel Hill, 5City University of Hong Kong, 6Washington University in St. Louis
An overview of the Parallel-R1 framework

Parallel-R1 teaches Large Language Models to learn parallel thinking through reinforcement learning, turning a single line of thought into a multi-thread reasoning strategy.

Abstract

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization.

In contrast, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex, real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem of training parallel thinking with RL: we first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems.

Highlights

  • Presents Parallel-R1, the first RL framework to instill parallel thinking, achieving an 8.4% accuracy improvement over sequential thinking models trained with RL.
  • Discovers a clear strategic evolution in how the model uses parallel thinking, shifting from early-stage computational exploration to late-stage multi-perspective verification as training progresses.
  • Establishes the value of parallel thinking as a mid-training exploration scaffold, a temporary phase that unlocks a higher performance ceiling and yields a 42.9% improvement over the baseline.

Challenges in Teaching Parallel Thinking

While parallel thinking is powerful, teaching LLMs to use it via reinforcement learning is non-trivial and presents several core challenges that our work aims to address.

1. The RL Cold-Start & Data Bottleneck

Because current LLMs have not seen parallel thinking behavior during pre-training or SFT, they cannot generate such trajectories during RL exploration, leaving the model nothing to learn from. Cold-start training therefore becomes crucial. The goal of this stage is to teach the model the basic format without harming its general ability too much, which requires a small-scale, high-quality dataset. However, high-quality parallel thinking data for complex, real-world problems is extremely rare in natural text and difficult to synthesize.

2. The Reward Design Dilemma

The ideal reward function is unclear. If we only reward final accuracy, the model might learn to "cheat" and abandon complex parallel thinking. If we only reward the use of parallel structures, performance on the actual task might suffer.

3. The "Black Box" Strategy

Even if a model learns this skill, its strategic role and underlying mechanisms are a "black box". How does the model's strategy evolve during training? Without understanding this dynamic, it's impossible to fully unlock the potential of parallel thinking.

Our Contributions

1. The Parallel-R1 Framework

To solve the cold-start and reward design challenges, we propose a complete framework featuring a progressive curriculum and dedicated reward design.

An illustration of our training recipe.
  • Progressive Curriculum: We leverage a key finding that it is easy to generate parallel thinking data for simple tasks (such as GSM8K). We create the Parallel-GSM8K dataset to first teach the model the *format* of parallel thinking via SFT, before using RL to generalize the skill to harder problems (a minimal format sketch follows after this list).
  • Dedicated Reward Design: We explore multiple reward schemes to effectively stimulate parallel thinking behaviors and propose an alternating reward strategy that switches between an accuracy-only reward and a tiered reward that gives a bonus to correct answers generated with parallel thinking.
  • Model Architecture Support: We explore two variants of Parallel-R1, Seen and Unseen, which utilize novel architectural components such as path-window attention and multiverse position IDs (interpreted in the sketch after the diagram below).
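
To make the target behavior concrete, the snippet below sketches what a parallel-thinking trajectory in the SFT data might look like and how its format could be checked. The `<Parallel>`, `<Path>`, and `<Summary>` tags and the helper function are illustrative assumptions for this sketch, not necessarily the exact format tokens used in Parallel-GSM8K.

```python
import re

# An illustrative parallel-thinking trajectory on a GSM8K-style problem.
# The <Parallel>/<Path>/<Summary> tag names are assumptions for this sketch;
# the exact format tokens may differ.
EXAMPLE_TRAJECTORY = """\
Question: Natalia sold 48 clips in April and half as many in May. How many in total?
<Parallel>
<Path> May sales are 48 / 2 = 24, so the total is 48 + 24 = 72. </Path>
<Path> Total = 48 * (1 + 1/2) = 48 * 1.5 = 72. </Path>
</Parallel>
<Summary> Both paths agree on 72. </Summary>
The answer is 72.
"""

PATH_PATTERN = re.compile(r"<Path>(.*?)</Path>", re.DOTALL)

def count_parallel_paths(trajectory: str) -> int:
    """Count the reasoning paths inside the parallel block of a trajectory."""
    return len(PATH_PATTERN.findall(trajectory))

print(count_parallel_paths(EXAMPLE_TRAJECTORY))  # -> 2
```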

Diagram illustrating Path-Window attention and Multiverse position ID
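
The sketch below gives one plausible reading of the diagram above: tokens in each parallel path attend to the shared prefix and to their own path only, and each path restarts its position counter right after the prefix. The layout, function name, and details are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def path_window_mask_and_positions(prefix_len: int, path_lens: list[int]):
    """Build an attention mask and position IDs for one parallel block.

    Assumed layout: [shared prefix][path 1][path 2]... in a single sequence.
    Path-window attention: each path token attends to the shared prefix and,
    causally, to its own path only. Multiverse position ID: each path restarts
    its position counter right after the prefix, so every path looks like a
    direct continuation of the prefix.
    """
    total = prefix_len + sum(path_lens)
    mask = np.zeros((total, total), dtype=bool)  # mask[i, j]: may token i attend to token j?
    pos = np.zeros(total, dtype=int)

    # Shared prefix: ordinary causal attention, positions 0 .. prefix_len-1.
    for i in range(prefix_len):
        mask[i, : i + 1] = True
        pos[i] = i

    start = prefix_len
    for plen in path_lens:
        for j in range(plen):
            i = start + j
            mask[i, :prefix_len] = True      # every path sees the shared prefix
            mask[i, start : i + 1] = True    # causal attention within its own path
            pos[i] = prefix_len + j          # positions continue from the prefix
        start += plen
    return mask, pos

# Tiny example: a 4-token prefix followed by two 3-token paths.
mask, pos = path_window_mask_and_positions(prefix_len=4, path_lens=[3, 3])
print(pos)  # [0 1 2 3 4 5 6 4 5 6]
```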

2. Ablation on Rewards: Stimulating Parallel Thinking

To find the best reward design for stimulating parallel thinking, we conducted an ablation study on three different reward schemes:

  • Accuracy-Only Reward: This baseline scheme provides a reward of 1 for a correct final answer and 0 for an incorrect one.
  • Parallel Reward: This scheme introduces a bonus for correct answers that also demonstrate parallel thinking, encouraging the model to not only get the right answer but to do so in the desired format.
  • Alternating Accuracy/Parallel Reward (Our Method): Our proposed approach alternates between the two schemes. It first uses the simple accuracy reward to anchor the model on task performance, then switches to the parallel reward to teach the specific thinking style.
| Training Configuration | Parallel Ratio (%) | AIME 25 | AIME 24 | AMC 23 | MATH |
|---|---|---|---|---|---|
| Accuracy | 13.6 | 17.7 | 18.3 | 69.7 | 82.6 |
| Parallel | 80.3 | 17.7 | 15.2 | 59.4 | 81.7 |
| Alternating Acc./Parallel | 63.0 | 19.0 | 16.3 | 67.5 | 84.5 |

Findings: The results show a clear trade-off. The Accuracy-only model performs well on benchmarks but fails to learn the parallel format (13.6% ratio). The Parallel-only model masters the format (80.3% ratio) but at the cost of accuracy. Our Alternating approach strikes the best balance, achieving a high parallel ratio (63.0%) while also attaining the highest scores on AIME 25 and MATH, demonstrating its effectiveness.
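
For concreteness, here is a minimal sketch of such an alternating schedule. The switching period, bonus magnitude, and the `<Parallel>` tag used to detect parallel thinking are illustrative placeholders rather than the exact hyperparameters and format tokens used in training.

```python
import re

# Hypothetical tag marking a parallel-thinking block; the actual format tokens may differ.
PARALLEL_BLOCK = re.compile(r"<Parallel>.*?</Parallel>", re.DOTALL)

def has_parallel_block(response: str) -> bool:
    return bool(PARALLEL_BLOCK.search(response))

def alternating_reward(response: str, is_correct: bool, step: int,
                       switch_every: int = 50, parallel_bonus: float = 0.1) -> float:
    """Alternate between an accuracy-only reward and a tiered parallel reward.

    The switching period, tier values, and bonus size are illustrative
    placeholders, not the exact settings used in training.
    """
    in_accuracy_phase = (step // switch_every) % 2 == 0
    if in_accuracy_phase:
        return 1.0 if is_correct else 0.0
    # Tiered phase: correct answers that also use parallel thinking earn a bonus.
    if is_correct:
        return 1.0 + parallel_bonus if has_parallel_block(response) else 1.0
    return 0.0

# Example: at step 75 the tiered reward is active, so a correct, parallel answer earns the bonus.
print(alternating_reward("<Parallel>...</Parallel> The answer is 72.", True, step=75))  # 1.1
```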

3. Uncovering Learning Dynamics

To open the "black box", we provide the first empirical evidence of how an LLM's reasoning strategy with parallel thinking evolves. Our analysis reveals a clear strategic shift: the model initially leverages parallel paths for exploration, but as it gains proficiency, it shifts towards using them for verification.

Graph showing the relative position of the parallel block during RL training.
  • Early Stage: Exploration (case study of an early-stage model).
  • Late Stage: Verification (case study of a late-stage model).

4. Parallel Thinking as a Mid-Training Scaffold

We conceptualize and validate that parallel thinking can serve as a structured exploration "scaffold". By temporarily forcing the model to explore with parallel paths, we guide it toward more robust policy spaces. This temporary phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline and reaching a peak accuracy of 25.6% on AIME25.
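
As a rough illustration of the scaffold idea, the sketch below switches from a parallel-exploration reward to plain accuracy-only RL after a fixed window; the window length and reward names are placeholders, not the actual training schedule.

```python
def scaffold_schedule(step: int, scaffold_steps: int = 300) -> str:
    """Treat parallel thinking as a temporary exploration scaffold.

    Illustrative assumption: an initial window uses a reward that encourages
    parallel-path exploration; afterwards training reverts to accuracy-only RL
    so the model can exploit the better policy region it found. The window
    length is a placeholder, not the paper's setting.
    """
    return "parallel_exploration_reward" if step < scaffold_steps else "accuracy_only_reward"

# The scaffold is active early in training, then removed.
print([scaffold_schedule(s) for s in (0, 299, 300, 1000)])
```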

Graph showing two-stage training with parallel reasoning as a scaffold

Main Results at a Glance

Our full framework leads to an **8.4% average accuracy improvement** over the sequential thinking model trained directly on challenging tasks with RL. The table below provides a detailed breakdown of performance across all benchmarks and configurations.
| Method | # Parallel (%) | AIME25 Mean@16 | AIME25 Pass@16 | AIME24 Mean@16 | AIME24 Pass@16 | AMC23 Mean@16 | AMC23 Pass@16 | MATH (Mean@1) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B-Base | 0.0 | 1.3 | 10.2 | 2.9 | 16.5 | 8.1 | 51.2 | 13.9 | 6.6 |
| **SFT + Parallel** | | | | | | | | | |
| Parallel-SFT-Seen | 95.6 | 8.0 | 29.8 | 10.6 | 26.4 | 48.9 | 79.2 | 76.6 | 36.0 |
| Parallel-SFT-Unseen | 95.6 | 5.2 | 20.9 | 8.5 | 26.7 | 41.7 | 80.1 | 71.5 | 31.7 |
| **RL Approach** | | | | | | | | | |
| GRPO (DAPO) | 0.0 | 14.8 | 32.4 | 18.5 | 30.6 | 63.6 | 85.1 | 83.5 | 45.1 |
| + RL on GSM8K | 0.0 | 13.3 | 26.3 | 18.8 | 34.9 | 66.4 | 82.2 | 82.6 | 45.3 |
| Parallel-R1-Seen | 27.3 | 19.2 | 38.9 | 19.4 | 37.1 | 70.5 | 85.0 | 86.7 | 48.9 |
| Parallel-R1-Unseen (S1) | 13.6 | 17.7 | 37.8 | 18.3 | 33.2 | 69.7 | 88.9 | 82.6 | 47.1 |
| Parallel-R1-Unseen (S2) | 63.0 | 19.0 | 42.2 | 16.3 | 31.8 | 67.5 | 91.5 | 84.5 | 46.8 |

BibTeX

@misc{zheng2025parallelr1parallelthinkingreinforcement,
      title={Parallel-R1: Towards Parallel Thinking via Reinforcement Learning}, 
      author={Tong Zheng and Hongming Zhang and Wenhao Yu and Xiaoyang Wang and Xinyu Yang and Runpeng Dai and Rui Liu and Huiwen Bao and Chengsong Huang and Heng Huang and Dong Yu},
      year={2025},
      eprint={2509.07980},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.07980}, 
}