Data-free self-play · open-ended RL
Self-Play via Co-Evolving Policies for Open-Ended Tasks
Self-play can train language models without external supervision, but existing methods need rule-checkable answers. SCOPE extends data-free self-play to open-ended tasks by co-evolving a Challenger that writes document-grounded tasks and a Solver that answers them through multi-turn retrieval, graded by a frozen self-judge — with no curated prompts and no frontier-model supervision.
¹University of Edinburgh · ²Imperial College London · ³Miniml.AI
How SCOPE works
SCOPE initialises a Challenger, a Solver, and a Judge from the same base model. The Challenger and Solver are trained with GRPO; the Judge stays frozen. Document grounding creates an information asymmetry — the Challenger and Judge see a source document the Solver never does, so the Solver must recover what it needs through retrieval.
Challenger (proposes tasks): Reads a corpus document and, via multi-turn retrieval, writes an open-ended task calibrated to sit right at the Solver's capability frontier; it is rewarded when the Solver scores near 50%. Evolves with GRPO.
Solver: Tackles each task through multi-turn retrieval-augmented generation, searching the corpus and synthesising evidence into a response; it is rewarded by the Judge's rubric score. Evolves with GRPO.
Judge: A frozen copy of the base model that derives task-specific rubrics from the source document, applies quality gates, and grades Solver responses against each criterion, with no frontier model needed. Frozen at M₀.
1. Challenger retrieves over a document and emits an open-ended task.
2. Frozen Judge derives 3–5 document-grounded criteria.
3. Solver answers via up to 5 multi-turn search calls.
4. Judge scores each rubric criterion with a strict binary verdict.
5. GRPO updates Challenger and Solver; iterate.
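The scoring and update steps above can be sketched as follows. This is a minimal illustration, not the SCOPE implementation: it assumes a response's rubric score is the fraction of criteria the Judge marks as met, and it uses the usual GRPO group normalisation (reward minus the group mean, divided by the group standard deviation). The names `rubric_score` and `grpo_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def rubric_score(verdicts: list[bool]) -> float:
    """Fraction of rubric criteria marked as satisfied (assumed aggregation)."""
    if not verdicts:
        raise ValueError("a rubric needs at least one criterion")
    return sum(verdicts) / len(verdicts)


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: standardise each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled Solver responses to one task, each graded on four binary criteria.
group_verdicts = [
    [True, True, False, False],
    [True, True, True, True],
    [False, False, False, False],
    [True, False, False, True],
]
scores = [rubric_score(v) for v in group_verdicts]
advantages = grpo_advantages(scores)
```

Responses that beat the group average receive positive advantages and the rest negative ones. If every response in a group scores the same, all advantages are zero and the task contributes no gradient.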
The Challenger's reward peaks when the Solver's mean rubric score is 0.5 — the point of maximum feedback variance — and tasks outside [0.2, 0.8] are filtered out before Solver training.
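A small sketch of this calibration, under stated assumptions: the source says only that the reward peaks at a mean score of 0.5 and that tasks outside [0.2, 0.8] are filtered. The tent shape, the inclusive band edges, and the function names below are illustrative choices, not the paper's exact formula. The `bernoulli_variance` helper shows why 0.5 is the point of maximum variance for a pass/fail outcome.

```python
def challenger_reward(mean_score: float) -> float:
    """Tent-shaped reward: 1.0 at a mean score of 0.5, 0.0 at 0 and 1 (assumed shape)."""
    if not 0.0 <= mean_score <= 1.0:
        raise ValueError("mean_score must lie in [0, 1]")
    return 1.0 - 2.0 * abs(mean_score - 0.5)


def keep_for_solver(mean_score: float, low: float = 0.2, high: float = 0.8) -> bool:
    """Keep a task for Solver training only if its mean score is inside [low, high]."""
    return low <= mean_score <= high


def bernoulli_variance(p: float) -> float:
    """Variance of a pass/fail outcome with pass probability p; largest at p = 0.5."""
    return p * (1.0 - p)


# Hypothetical mean Solver scores for four proposed tasks.
tasks = {"too_hard": 0.05, "frontier": 0.5, "moderate": 0.7, "too_easy": 0.95}
kept = [name for name, score in tasks.items() if keep_for_solver(score)]
```

Here only the `frontier` and `moderate` tasks survive the filter. Tasks the Solver almost always fails or almost always passes produce nearly identical scores within a group, and so carry little learning signal.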
Entity-identifiability and source-relevance gates keep tasks grounded, while a cosine length penalty stops the Solver from inflating answers to exploit the rubric judge.
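One way a cosine length penalty could look is sketched below. The source does not give the schedule, so every number here (`free_tokens`, `max_tokens`, `max_penalty`) and both function names are hypothetical. The sketch assumes the penalty is subtracted from the rubric score and ramps from zero to its maximum along a half-cosine.

```python
import math


def cosine_length_penalty(
    n_tokens: int,
    free_tokens: int = 512,    # hypothetical: lengths up to this are not penalised
    max_tokens: int = 2048,    # hypothetical: full penalty at or beyond this length
    max_penalty: float = 0.5,  # hypothetical penalty ceiling
) -> float:
    """Penalty that rises smoothly from 0 to max_penalty along a half-cosine."""
    if max_tokens <= free_tokens:
        raise ValueError("max_tokens must exceed free_tokens")
    if n_tokens <= free_tokens:
        return 0.0
    if n_tokens >= max_tokens:
        return max_penalty
    progress = (n_tokens - free_tokens) / (max_tokens - free_tokens)
    return max_penalty * 0.5 * (1.0 - math.cos(math.pi * progress))


def solver_reward(rubric: float, n_tokens: int) -> float:
    """Rubric score minus the length penalty (assumed way of combining them)."""
    return rubric - cosine_length_penalty(n_tokens)
```

With these illustrative settings, a short answer keeps its full rubric score, while an answer that pads toward the length cap loses up to `max_penalty`. The cosine ramp avoids a sharp cliff at any single length.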
Where SCOPE sits
Prior data-free self-play needs verifiable answers; rubric-based RL handles open-ended tasks but leans on curated prompts or frontier judges. SCOPE is the first to need neither.
| Method | Open-ended | Data-free | Reward source |
|---|---|---|---|
| SPICE | ✕ | ✓ | Rule match |
| Dr. Zero | ✕ | ✓ | Rule match |
| R-Zero | ✕ | ✓ | Rule match |
| Absolute Zero | ✕ | ✓ | Code executor |
| OpenSIR | ✕ | ✓ | Rule match |
| RaR | ✓ | ✕ | Rubric (frontier LLM) |
| DR Tulu | ✓ | ✕ | Rubric (frontier LLM) |
| RPG | ✓ | ✕ | Rubric (self-judge) |
| SCOPE (ours) | ✓ | ✓ | Rubric (self-judge) |
Interactive results
SCOPE improves every model monotonically across iterations and matches or exceeds GRPOdata, a baseline trained on ~9K curated prompts with frontier-model rubrics.
Per-benchmark scores. Higher is better.
SCOPE vs GRPOdata on open-ended benchmarks.
Qwen3-8B keeps improving (37.7 → 44.8) with diminishing per-step gains.
Deep Research and Scholarly QA gain +11.1 points on average across models, followed by planning (+8.4), creative writing (+4.3), and user assistance (+3.0).
Although trained only on open-ended tasks, SCOPE lifts short-form QA by +7.8 to +13.8 points and surpasses GRPOdata on all three model families.
Analysis on Qwen3-8B
We probe SCOPE from three angles: whether the Challenger must co-evolve, what the Solver actually learns, and what makes the self-judge effective.
Freezing the Challenger caps Solver gains at +0.8 vs +3.4 for SCOPE.
Without co-evolution, rubric score rises to 0.71 — tasks the Solver has outgrown.
Removing quality gates (No-QG) or the length penalty (No-LP) collapses training by iter-3.
Scaling the grader barely moves the needle; a 4B rubric writer drops the average by 2.8.
Swapping in iter-3 search vs iter-3 answer isolates each capability per benchmark.
Leave-one-out: the full four-domain mixture improves fastest; long-form QA matters most.
Per-step training signal across the three iterations. Rubric reward dips at iteration boundaries as the co-evolved Challenger raises difficulty, then recovers.
Re-evaluated under SCOPE's single-retrieval protocol. SCOPE is the most balanced and the only method using zero curated prompts and no frontier supervision.
Key takeaways
SCOPE matches ~9K curated prompts with frontier rubrics using nothing but a raw corpus and a frozen self-judge.
A frozen Challenger stalls after iteration 1; both policies must evolve to keep tasks at the Solver's frontier.
Once criteria are specific and document-grounded, scaling either rubric generation or grading yields little.
Citation
The paper will appear on arXiv shortly — this entry will be updated with the arXiv identifier.
@article{kwan2026scope,
  title   = {SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks},
  author  = {Kwan, Wai-Chung and Gema, Aryo Pradipta and
             Leang, Joshua Ong Jun and Minervini, Pasquale},
  journal = {arXiv preprint arXiv:ARXIV_ID},
  year    = {2026}
}