Data-free self-play · open-ended RL

SCOPE

Self-Play via Co-Evolving Policies for Open-Ended Tasks

Self-play can train language models without external supervision, but existing methods need rule-checkable answers. SCOPE extends data-free self-play to open-ended tasks by co-evolving a Challenger that writes document-grounded tasks and a Solver that answers them through multi-turn retrieval, graded by a frozen self-judge — with no curated prompts and no frontier-model supervision.

Wai-Chung Kwan¹, Aryo Pradipta Gema¹, Joshua Ong Jun Leang², Pasquale Minervini¹,³

¹University of Edinburgh · ²Imperial College London · ³Miniml.AI

Figure: the SCOPE loop. A Challenger creates a document-grounded task, a frozen Judge writes a rubric, a Solver answers through multi-turn retrieval, and the Judge grades the response before a GRPO update.

+10.4: max open-ended gain (Qwen2.5-7B)
0: curated prompts used
15: benchmarks (8 open-ended + 7 QA)
3: model families at 7–8B

How SCOPE works

Three roles from one base model, co-evolving without data

SCOPE initialises a Challenger, a Solver, and a Judge from the same base model. The Challenger and Solver are trained with GRPO; the Judge stays frozen. Document grounding creates an information asymmetry — the Challenger and Judge see a source document the Solver never does, so the Solver must recover what it needs through retrieval.

Challenger

Proposes tasks

Reads a corpus document and, via multi-turn retrieval, writes an open-ended task calibrated to sit right at the Solver's capability frontier — rewarded when the Solver scores near 50%.

Evolves with GRPO

Solver

Answers tasks

Tackles each task through multi-turn retrieval-augmented generation, searching the corpus and synthesising evidence into a response, rewarded by the Judge's rubric score.

Evolves with GRPO

Judge

Writes & grades rubrics

A frozen copy of the base model derives task-specific rubrics from the source document, applies quality gates, and grades Solver responses against each criterion — no frontier model needed.

Frozen at M₀

Figure: detailed SCOPE pipeline. The Challenger creates a task, the Judge writes a rubric, the Solver produces a response, the Judge grades it, and GRPO updates both evolving policies.
Each iteration alternates two GRPO stages: (1) the Challenger trains against the current Solver to target moderate difficulty, then (2) the Solver trains on quality-gated, difficulty-filtered tasks with a length-controlled rubric reward plus format and search bonuses.
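
Both stages rely on GRPO, which scores each sampled response relative to the other samples drawn for the same prompt. The sketch below shows that group-relative advantage in isolation, assuming one scalar reward per response. The function name and the `eps` stabiliser are illustrative choices, not taken from the SCOPE implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of samples for the same prompt.

    `eps` is a hypothetical stabiliser that avoids division by zero when
    every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one task, each with a scalar reward in [0, 1].
advantages = group_relative_advantages([0.0, 0.5, 1.0, 0.5])

# A group with identical rewards yields zero advantage for every sample.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every response in a group gets the same reward, all advantages are zero and the update carries no signal, which is one way to see why tasks that are far too easy or far too hard are of little use for training.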

Create task

Challenger retrieves over a document and emits an open-ended task.

Write rubric

Frozen Judge derives 3–5 document-grounded criteria.

Solve task

Solver answers via up to 5 multi-turn search calls.

Grade response

Judge scores each rubric criterion with a strict binary verdict.

Update policies

GRPO updates Challenger and Solver; iterate.
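
The grading step can be sketched as follows. The page states that verdicts are binary and refers to a mean rubric score, so this sketch assumes the score is simply the fraction of criteria the Judge marks as satisfied. The exact aggregation is an assumption, and `rubric_score` is a hypothetical helper.

```python
def rubric_score(verdicts):
    """Fraction of rubric criteria the Judge marked as satisfied.

    `verdicts` holds one boolean per criterion (assumed aggregation:
    an unweighted mean of the binary verdicts).
    """
    if not verdicts:
        raise ValueError("a rubric needs at least one criterion")
    return sum(bool(v) for v in verdicts) / len(verdicts)


# A four-criterion rubric where the response satisfies three criteria.
score = rubric_score([True, False, True, True])
```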

Difficulty targeting keeps tasks learnable

The Challenger's reward peaks when the Solver's mean rubric score is 0.5 — the point of maximum feedback variance — and tasks outside [0.2, 0.8] are filtered out before Solver training.
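
A minimal sketch of difficulty targeting, under stated assumptions: the page gives only the peak location (0.5) and the filter window [0.2, 0.8], so the triangular reward shape used here is a placeholder, and both function names are hypothetical.

```python
def challenger_reward(mean_solver_score):
    """Hypothetical triangular reward: 1.0 at a score of 0.5, 0.0 at 0 or 1.

    Only the peak at 0.5 comes from the page; the linear fall-off is assumed.
    """
    return 1.0 - 2.0 * abs(mean_solver_score - 0.5)


def keep_for_solver(mean_solver_score, low=0.2, high=0.8):
    """Keep only tasks whose mean Solver score lies inside [low, high]."""
    return low <= mean_solver_score <= high


# Mean rubric score of the current Solver on three illustrative tasks.
tasks = {"too_easy": 0.95, "on_frontier": 0.5, "too_hard": 0.05}
kept = [name for name, s in tasks.items() if keep_for_solver(s)]
```

The choice of 0.5 matches a standard fact about binary outcomes: with success probability p, the variance p(1 - p) is largest at p = 0.5.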

Quality gates & length penalty stop reward hacking

Entity-identifiability and source-relevance gates keep tasks grounded, while a cosine length penalty stops the Solver from inflating answers to exploit the rubric judge.
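
One way a cosine length penalty can work is sketched below. The page does not give the schedule, so everything here is an assumption: the multiplier stays at 1.0 up to a target length and then decays along a half cosine to 0.0 at a maximum length. The thresholds `target` and `max_len` and both function names are hypothetical.

```python
import math


def cosine_length_factor(n_tokens, target=512, max_len=2048):
    """Hypothetical multiplier: 1.0 up to `target`, cosine decay to 0.0 at `max_len`."""
    if n_tokens <= target:
        return 1.0
    if n_tokens >= max_len:
        return 0.0
    progress = (n_tokens - target) / (max_len - target)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def length_controlled_reward(rubric_score, n_tokens):
    """Scale the rubric score by the length factor (assumed combination rule)."""
    return rubric_score * cosine_length_factor(n_tokens)


# The same rubric score earns less as the answer grows past the target length.
short_answer = length_controlled_reward(0.8, 400)
long_answer = length_controlled_reward(0.8, 1800)
```

Under this schedule, padding an answer can only lower its reward once it exceeds the target length, so a longer answer has to earn its extra tokens through a higher rubric score.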

Where SCOPE sits

First to make data-free self-play work on open-ended tasks

Prior data-free self-play needs verifiable answers; rubric-based RL handles open-ended tasks but leans on curated prompts or frontier judges. SCOPE is the first to need neither.

Method          Open-ended  Data-free  Reward source
SPICE           no          yes        Rule match
Dr. Zero        no          yes        Rule match
R-Zero          no          yes        Rule match
Absolute Zero   no          yes        Code executor
OpenSIR         no          yes        Rule match
RaR             yes         no         Rubric (frontier LLM)
DR Tulu         yes         no         Rubric (frontier LLM)
RPG             yes         no         Rubric (self-judge)
SCOPE (ours)    yes         yes        Rubric (self-judge)

Interactive results

Matches curated-data training — with nothing curated

SCOPE improves every model monotonically across iterations and matches or exceeds GRPOdata, a baseline trained on ~9K curated prompts with frontier-model rubrics.

Figure: open-ended benchmarks, Qwen3-8B. Per-benchmark scores; higher is better.

Figure: average score across iterations. SCOPE vs GRPOdata on open-ended benchmarks.

Extended training to six iterations

Qwen3-8B keeps improving (37.7 → 44.8) with diminishing per-step gains.

Largest gains on research-intensive tasks

Deep Research and Scholarly QA gain +11.1 points on average across models, followed by planning (+8.4), creative writing (+4.3), and user assistance (+3.0).

Transfers to held-out short-form QA

Although trained only on open-ended tasks, SCOPE lifts short-form QA by +7.8 to +13.8 points and surpasses GRPOdata on all three model families.

Analysis on Qwen3-8B

What drives the gains, and where they come from

We probe SCOPE from three angles: whether the Challenger must co-evolve, what the Solver actually learns, and what makes the self-judge effective.

Co-evolution is necessary

Freezing the Challenger caps Solver gains at +0.8 vs +3.4 for SCOPE.

Tasks drift away from the frontier

Without co-evolution, rubric score rises to 0.71 — tasks the Solver has outgrown.

Both guards prevent reward hacking

Removing quality gates (No-QG) or the length penalty (No-LP) collapses training by iter-3.

Rubric quality > grading capacity

Scaling the grader barely moves the needle; a 4B rubric writer drops the average by 2.8.

Retrieval and synthesis both improve

Swapping in either the iter-3 search behaviour or the iter-3 answer generation isolates each capability per benchmark.

Every task domain contributes

Leave-one-out: the full four-domain mixture improves fastest; long-form QA matters most.

Training dynamics

Per-step training signal across the three iterations. Rubric reward dips at iteration boundaries as the co-evolved Challenger raises difficulty, then recovers.

Reference comparison with related methods

Re-evaluated under SCOPE's single-retrieval protocol. SCOPE is the most balanced and the only method using zero curated prompts and no frontier supervision.

Related methods differ along several confounded axes (SFT warmup, frontier-model judges, richer tools), so these are reported as a reference evaluation rather than controlled baselines. DR Tulu leads on research-heavy tasks but falls below the base model on Arena-Hard (4.4) and WildBench (36.0); SCOPE wins on 4 of 8 benchmarks.

Key takeaways

Self-improvement beyond the verifiable-answer regime

No supervision ceiling

SCOPE matches ~9K curated prompts with frontier rubrics using nothing but a raw corpus and a frozen self-judge.

Co-evolution is essential

A frozen Challenger stalls after iteration 1; both policies must evolve to keep tasks at the Solver's frontier.

Rubrics are the bottleneck

Once criteria are specific and document-grounded, scaling either rubric generation or grading yields little.

Citation

Cite SCOPE

The paper will appear on arXiv shortly — this entry will be updated with the arXiv identifier.

@article{kwan2026scope,
  title   = {SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks},
  author  = {Kwan, Wai-Chung and Gema, Aryo Pradipta and
             Leang, Joshua Ong Jun and Minervini, Pasquale},
  journal = {arXiv preprint arXiv:ARXIV_ID},
  year    = {2026}
}