SCOPE

Self-Play via Co-Evolving Policies for Open-Ended Tasks

Wai-Chung Kwan¹ Aryo Pradipta Gema¹ Joshua Ong Jun Leang² Pasquale Minervini¹,³

¹University of Edinburgh · ²Imperial College London · ³Miniml.AI

SCOPE overview: a Challenger creates a document-grounded task, a frozen Judge writes a rubric, a Solver answers through multi-turn retrieval, and the Judge grades the response before a GRPO update.

Self-play can train language models without external supervision, but existing methods need rule-verifiable answers. SCOPE initialises three roles from a single base model: a Challenger that generates document-grounded tasks, a Solver that answers them through multi-turn retrieval, and a frozen self-judge that writes task-specific rubrics from the source document and grades responses against them. Training alternates between Challenger and Solver via GRPO, with no curated prompts or frontier-model supervision.

Main Results

Up to +10.4 points on eight open-ended benchmarks across three 7–8B models (Qwen2.5, Qwen3, OLMo-3), matching or exceeding GRPO trained on ~9K curated prompts, with zero curated data.
Up to +13.8 points on seven held-out short-form QA benchmarks, despite training only on open-ended tasks. SCOPE surpasses GRPO on all three models.

Main Insights

Co-evolving the Challenger is necessary; a frozen Challenger stalls after iteration 1 as tasks drift from the Solver's frontier.
Gains arise from improvements in both retrieval and synthesis, with relative contribution varying by task.
Rubric generation quality is the bottleneck; once criteria are document-grounded, scaling generation or grading yields little.

One model plays three roles

SCOPE is the first self-play method for open-ended tasks, needing no curated data or frontier-model supervision. It initialises a Challenger, a Solver, and a frozen Judge from the same base model, then trains the Challenger and Solver with GRPO.
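As a rough illustration of the group-relative signal that GRPO-style training relies on, the sketch below normalises rewards within one group of rollouts sampled for the same prompt. The function name and the `eps` stabiliser are assumptions made for illustration; this is not SCOPE's training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one rollout group (GRPO-style).

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` (an assumed value)
    avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for the same task, scored by a judge.
advantages = group_relative_advantages([0.0, 1.0, 0.5, 0.5])
```

Rollouts that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed.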

Document grounding creates the information asymmetry needed for sustained self-play: the Challenger and Judge see a source document the Solver never does, so the Solver must recover what it needs through retrieval.

Method	Open-ended	Data-free	Reward source
SPICE	✕	✓	Rule match
Dr. Zero	✕	✓	Rule match
R-Zero	✕	✓	Rule match
Absolute Zero	✕	✓	Code executor
OpenSIR	✕	✓	Rule match
RaR	✓	✕	Rubric (frontier LLM)
DR Tulu	✓	✕	Rubric (frontier LLM)
RPG	✓	✕	Rubric (self-judge)
SCOPE (ours)	✓	✓	Rubric (self-judge)

Challenger

Proposes tasks

Reads a corpus document via multi-turn retrieval and writes an open-ended task targeting the Solver's frontier. Rewarded when the Solver scores near 50%; tasks must pass the Judge's quality gates before entering training.

Solver

Solves tasks

Tackles each task through multi-turn retrieval, searching the corpus and synthesising evidence. Rewarded by the Judge's length-controlled rubric score, format compliance, and search tool usage.

Judge

Writes rubrics & grades responses

Derives task-specific rubrics from the source document, applies quality gates to Challenger-generated tasks, and grades Solver responses with binary verdicts per criterion.

Reward design

Challenger

Difficulty reward peaks at Solver score ≈ 0.5; zeroed for ungrounded or generic tasks.
Format reward checks think / search / task output structure.
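One way to picture the Challenger's difficulty reward is a function that is highest when the Solver's mean score is 0.5 and zero when the task fails the Judge's gates. The triangular shape and the names below are assumptions chosen for illustration; the source only states where the reward peaks and when it is zeroed.

```python
def challenger_difficulty_reward(solver_scores, passes_gates):
    """Hypothetical difficulty reward for one Challenger task.

    solver_scores: rubric scores in [0, 1] from several Solver attempts.
    passes_gates:  False if the Judge flags the task as ungrounded or generic.
    """
    if not passes_gates or not solver_scores:
        return 0.0
    mean_score = sum(solver_scores) / len(solver_scores)
    # Triangular shape (an assumption): 1.0 at a mean score of 0.5,
    # falling linearly to 0.0 at mean scores of 0.0 and 1.0.
    return 1.0 - 2.0 * abs(mean_score - 0.5)
```

Under this sketch, a task the Solver always solves or always fails earns nothing, which pushes the Challenger toward tasks at the Solver's frontier.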

Solver

Rubric reward averages the Judge's pass/fail verdicts across task-specific criteria.
Format reward checks think / tool / answer output structure.
Search reward encourages active use of the search tool during retrieval.
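The three Solver terms above could be combined along the following lines. The weights, the additive combination, and the class and function names are all assumptions for illustration; the length control applied to the rubric score is omitted because its form is not given here.

```python
from dataclasses import dataclass, field

@dataclass
class SolverRollout:
    verdicts: list = field(default_factory=list)  # Judge's pass/fail per criterion
    format_ok: bool = False                       # think / tool / answer structure respected
    num_searches: int = 0                         # search tool calls made

def rubric_reward(verdicts):
    """Fraction of rubric criteria the Judge marked as passed."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

def solver_reward(rollout, w_format=0.1, w_search=0.1):
    """Hypothetical total: rubric score plus small format and search bonuses."""
    total = rubric_reward(rollout.verdicts)
    total += w_format * float(rollout.format_ok)
    total += w_search * float(rollout.num_searches > 0)
    return total

rollout = SolverRollout(verdicts=[True, False, True, True], format_ok=True, num_searches=2)
score = solver_reward(rollout)  # rubric 0.75 plus both assumed bonuses
```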

Matches curated-data training

Across three 7–8B models, SCOPE improves open-ended benchmark scores monotonically across iterations and matches or exceeds GRPO_data, a baseline trained on ~9K curated prompts with frontier-model rubrics, without using any curated prompts.

Self-play builds transferable skills: short-form QA improves by +7.8 to +13.8 points despite zero short-form training.

Open-ended benchmarks: Qwen3-8B

Per-benchmark scores. Base models are instruct variants; suffixes are omitted for brevity.

Average score across iterations

SCOPE matches or outperforms GRPO_data without any curated data.

Extended training to six iterations

Gains stay positive across all six iterations with diminishing per-step returns but no sign of collapse.

What drives SCOPE's gains?

We probe SCOPE from three angles: whether the Challenger must co-evolve, what the Solver actually learns, and what makes the self-judge effective.

The Challenger must keep adapting

A static Challenger cannot sustain open-ended learning. Its tasks become too easy for the improving Solver, and performance gains stall unless both policies co-evolve.

Frozen Challenger caps performance gains

A frozen Challenger yields roughly a quarter of the Solver gains that full SCOPE achieves.

Co-evolution keeps tasks challenging

A frozen Challenger's tasks become too easy as the Solver improves.

The Solver improves both retrieval and synthesis

SCOPE improves both evidence retrieval and answer synthesis, with the larger gain tracking each task's bottleneck: retrieval for multi-hop, synthesis for single-hop and knowledge-mismatched tasks.

Both retrieval and synthesis improve

Controlled replay swaps one Solver component at a time:

Baseline: Iter-1 search + Iter-1 answer.
+ Synthesis: Iter-1 search + Iter-3 answer — measures answer quality on fixed evidence.
+ Retrieval: Iter-3 search + Iter-1 answer — measures evidence-finding quality.
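The three replay configurations above amount to pairing a search policy from one iteration with an answer policy from another. A minimal sketch follows, assuming each checkpoint exposes a search callable and an answer callable; these interfaces and names are hypothetical, not SCOPE's evaluation code.

```python
# Which iteration supplies the search step and which supplies the answer step.
REPLAY_CONFIGS = {
    "baseline":   ("iter1", "iter1"),
    "+synthesis": ("iter1", "iter3"),  # fixed evidence, newer answerer
    "+retrieval": ("iter3", "iter1"),  # newer searcher, fixed answerer
}

def replay(search_fn, answer_fn, task):
    """Run one task with a chosen search policy and a chosen answer policy."""
    evidence = search_fn(task)
    return answer_fn(task, evidence)

def replay_grid(searchers, answerers, task, score_fn):
    """Score a task under every replay configuration.

    searchers / answerers: dicts mapping iteration name to a callable
    (hypothetical interface); score_fn grades a response.
    """
    return {
        name: score_fn(replay(searchers[s], answerers[a], task))
        for name, (s, a) in REPLAY_CONFIGS.items()
    }
```

Comparing "+synthesis" with "baseline" isolates answer quality on identical evidence, while comparing "+retrieval" with "baseline" isolates the effect of better evidence.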

Rubric quality is the self-judge bottleneck

Rubric quality > grading capacity

The ability to write specific rubrics matters more than the capacity to grade against them. Scaling the grader changes results little, whereas switching to a smaller 4B rubric writer collapses performance.

Quality gates prevent reward hacking

Both gates are necessary

Removing either the quality gates (No-QG) or the length penalty (No-LP) collapses training by Iter-3.

Task diversity compounds gains

Every task domain contributes

Each domain targets a distinct skill, letting their benefits compound rather than plateau: the full four-domain mixture pulls ahead of every ablation, widening its lead from +0.2 at Iter-1 to +1.6 at Iter-3. Among the four, long-form QA is the most impactful: removing it drops the average by 3.0 points.

Reference comparison with related methods

Related methods are re-evaluated under SCOPE's single-retrieval protocol. SCOPE is the most balanced and the only method using zero curated prompts and no frontier supervision.

Cite SCOPE

@article{kwan2026scope,
  title   = {SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks},
  author  = {Kwan, Wai-Chung and Gema, Aryo Pradipta and
             Leang, Joshua Ong Jun and Minervini, Pasquale},
  journal = {arXiv preprint arXiv:2605.31433},
  year    = {2026}
}