CodeZero
Learn about CodeZero, the cooperative coding environment powering RL Swarm.

Overview
CodeZero is the active RL Swarm environment that transforms distributed reinforcement learning into a cooperative coding ecosystem, a "society of models" where agents collaborate to solve programming challenges.
Unlike traditional RL environments that rely on external verification, CodeZero creates a closed learning loop where models generate problems, solve them, and evaluate solutions, all within the same peer-to-peer network.
What Makes CodeZero Different
CodeZero introduces a new task domain focused on programming challenges evaluated by model-based reward functions.
This represents a shift from the previous Reasoning Gym environment, which focused on math and logic tasks verified by symbolic correctness checks.
In CodeZero, models participate as Proposers, Solvers, and Evaluators, each playing a distinct role in the collective learning process. This multi-agent architecture enables dynamic difficulty adjustment, peer-to-peer knowledge sharing, and continuous improvement through reinforcement learning.
Roles
In CodeZero, Proposers, Solvers, and Evaluators collaborate to create, address, and review programming challenges, forming an ecosystem where models instruct, critique, and enhance one another.
Proposers: Generate coding problems and unit tests, adjusting difficulty dynamically based on solver performance. Proposers create challenges that adapt to the swarm's current capabilities, ensuring continuous learning opportunities.
Solvers: Attempt coding challenges, learn locally through RL, and share rollouts with peers. Solvers exchange solutions to promote diversity and accelerate collective learning across the network.
Evaluators: Frozen models that assess correctness and assign rewards. Evaluators use rule-based assessment to score submissions without executing code, ensuring safety and scalability.
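The interaction between the three roles can be sketched as a single round of the loop. This is a minimal illustration with stand-in classes (the names `Proposer`, `Solver`, `Evaluator`, and `run_round` are hypothetical; in the real swarm each role is a model-backed peer on the network):

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    difficulty: int = 1

class Proposer:
    def __init__(self):
        self.difficulty = 1

    def propose(self):
        # Generate a problem at the current difficulty level
        return Task(prompt=f"level-{self.difficulty} problem",
                    difficulty=self.difficulty)

    def observe(self, mean_reward):
        # Crude difficulty feedback: raise on high success, lower on failure
        if mean_reward > 0.75 and self.difficulty < 5:
            self.difficulty += 1
        elif mean_reward < 0.25 and self.difficulty > 1:
            self.difficulty -= 1

class Solver:
    def solve(self, task):
        # A model would generate candidate code here
        return f"candidate solution for: {task.prompt}"

    def update(self, rollouts, rewards):
        pass  # local RL update (e.g. GRPO) would go here

class Evaluator:
    def score(self, task, rollout):
        # A frozen model would predict correctness here, without executing code
        return 1.0

def run_round(proposer, solver, evaluator, num_attempts=4):
    task = proposer.propose()
    rollouts = [solver.solve(task) for _ in range(num_attempts)]
    rewards = [evaluator.score(task, r) for r in rollouts]
    solver.update(rollouts, rewards)                # Solver learns locally
    proposer.observe(sum(rewards) / len(rewards))   # Proposer adapts difficulty
    return task, rewards
```

The key design point this illustrates: rewards flow in two directions, back to the Solver for policy updates and back to the Proposer as a difficulty signal.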
Training Loop
The CodeZero training cycle follows a structured progression: a Proposer generates a coding problem and unit tests, Solvers attempt it and share rollouts with peers, Evaluators score the submissions and assign rewards, and those rewards drive both the Solvers' local policy updates and the Proposer's difficulty adjustment.
Technical Details
This section outlines the datasets, model architectures, metrics, and optimization strategies that enable CodeZero to learn safely and autonomously across a decentralized network of peers.
Datasets
CodeZero uses two primary datasets for fallback stability and baseline challenges:
MBPP (Mostly Basic Python Problems): A curated set of Python programming problems with test cases.
CodeContests: A collection of competitive programming challenges used for additional task diversity.
These datasets provide a stable foundation when proposer-generated tasks need supplementation.
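The fallback behavior described above can be sketched as a simple task source: prefer proposer-generated tasks, and fall back to a baseline pool drawn from MBPP or CodeContests when none are queued. The function name `next_task` and the queue/pool structure are assumptions for illustration:

```python
import random

def next_task(proposed_queue, baseline_pool, rng=random):
    # Prefer fresh proposer-generated tasks when available...
    if proposed_queue:
        return proposed_queue.pop(0)
    # ...otherwise fall back to a stable baseline dataset (e.g. MBPP)
    return rng.choice(baseline_pool)
```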
Models
CodeZero employs the Qwen model family across different roles:
Qwen 2.5 Coder (0.5B and 1.5B): Used for Solvers, enabling efficient local learning and rollout generation.
Qwen 3 (4B): Used for Proposers and Evaluators, providing stronger generation and assessment capabilities.
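A role-to-model mapping mirroring the choices above might look like the following. The Hugging Face model IDs shown are assumptions (the document names model families and sizes, not exact checkpoints):

```python
# Hypothetical configuration mapping swarm roles to model checkpoints.
# Solvers use the smaller coder models for cheap local rollouts; the
# larger Qwen 3 model backs both proposing and (frozen) evaluation.
ROLE_MODELS = {
    "solver": ["Qwen/Qwen2.5-Coder-0.5B-Instruct",
               "Qwen/Qwen2.5-Coder-1.5B-Instruct"],
    "proposer": "Qwen/Qwen3-4B",
    "evaluator": "Qwen/Qwen3-4B",
}
```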
Metrics
CodeZero tracks performance using two key metrics that provide complementary views of model capability and learning progress:
average@k: Average performance across k attempts, measuring consistency.
pass@k: Probability of at least one correct solution in k attempts, measuring capability.
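Both metrics can be computed from a batch of graded attempts. The sketch below uses the standard unbiased pass@k estimator (from the HumanEval/Codex literature); the document does not specify which estimator CodeZero uses, so treat this as one common choice:

```python
from math import comb

def average_at_k(correct, k):
    # Mean correctness over the first k attempts: measures consistency.
    # `correct` is a list of 0/1 outcomes, one per attempt.
    return sum(correct[:k]) / k

def pass_at_k(n, c, k):
    # Unbiased estimate of the probability that at least one of k samples,
    # drawn from n total attempts of which c are correct, is correct.
    if n - c < k:
        return 1.0  # too few failures to fill a sample of size k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note the contrast: a model that solves a problem once in ten tries scores well on pass@10 but poorly on average@10.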
Evaluation Safety
CodeZero uses a rule-based evaluator that avoids code execution. Instead, evaluators assess submissions based on:
Code structure and formatting
Predicted correctness (via frozen model inference)
Adherence to problem requirements
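One way to check code structure without ever executing the submission is to parse it into an AST. This is a minimal sketch of such a pre-check (the function name `structural_check` and the required-function criterion are illustrative assumptions, not CodeZero's actual evaluator logic):

```python
import ast

def structural_check(source, required_function):
    # Parse the submission without executing it; a syntax error fails fast.
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    # Verify the submission defines the function the problem asked for.
    return any(
        isinstance(node, ast.FunctionDef) and node.name == required_function
        for node in ast.walk(tree)
    )
```

Because `ast.parse` only builds a syntax tree, malicious or buggy submissions can never run on the evaluator's machine, which is what makes this approach safe and cheap to scale.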
Dynamic Difficulty
Proposers maintain a 5-level difficulty system with thresholds for adjustment:
Difficulty levels adapt based on solver success rates
Thresholds trigger automatic adjustment to maintain optimal challenge levels
This ensures the swarm continuously faces appropriately challenging tasks
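A threshold-driven controller over five levels could look like the following. The specific thresholds (0.75 / 0.25) are assumptions for illustration; the document states that thresholds exist but not their values:

```python
def adjust_difficulty(level, success_rate, raise_at=0.75, lower_at=0.25):
    # Raise difficulty when solvers succeed too often, lower it when they
    # fail too often, and stay put in the comfortable middle band.
    if success_rate >= raise_at and level < 5:
        return level + 1
    if success_rate <= lower_at and level > 1:
        return level - 1
    return level
```

Clamping to the 1..5 range keeps the proposer inside the documented 5-level system even under sustained success or failure streaks.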
Policy Optimization
Solvers use GRPO (Group Relative Policy Optimization) for local policy updates:
Incorporates feedback from the swarm's collective experience
Enables efficient learning from peer rollouts
Maintains local autonomy while benefiting from network-wide signals
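The "group relative" part of GRPO refers to how advantages are computed: each rollout's reward is normalized against the other rollouts for the same prompt, rather than against a learned value function. A sketch of that core step (the full algorithm also applies a clipped policy-gradient loss and a KL penalty, omitted here):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward against the group of rollouts for one prompt:
    # above-average rollouts get positive advantage, below-average negative.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group mean, peer rollouts shared over the swarm can enlarge the group and sharpen the advantage estimates without any extra critic model.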
Integration with RL Swarm
CodeZero runs on the same RL Swarm infrastructure as previous environments:
Same network: Uses the existing peer-to-peer gossip protocol
Same identity: Node identities and swarm.pem files work identically
Same setup: No changes required to node installation or configuration
Next Steps
If you're ready to try RL Swarm in its latest environment, CodeZero, check out the Getting Started guide or learn about node management and troubleshooting.