CodeZero

Learn about CodeZero, the cooperative coding environment powering RL Swarm.

Overview

CodeZero is the active RL Swarm environment that transforms distributed reinforcement learning into a cooperative coding ecosystem, a "society of models" where agents collaborate to solve programming challenges.

Unlike traditional RL environments that rely on external verification, CodeZero creates a closed learning loop where models generate problems, solve them, and evaluate solutions, all within the same peer-to-peer network.

What Makes CodeZero Different

CodeZero introduces a new task domain focused on programming challenges evaluated by model-based reward functions.

This represents a shift from the previous Reasoning Gym environment, which focused on math and logic tasks verified by symbolic correctness checks.

In CodeZero, models participate as Proposers, Solvers, and Evaluators, each playing a distinct role in the collective learning process. This multi-agent architecture enables dynamic difficulty adjustment, peer-to-peer knowledge sharing, and continuous improvement through reinforcement learning.

Roles

In CodeZero, Proposers, Solvers, and Evaluators collaborate to create, solve, and review programming challenges, forming an ecosystem where models instruct, critique, and improve one another.

  • Proposers: Generate coding problems and unit tests, adjusting difficulty dynamically based on solver performance. Proposers create challenges that adapt to the swarm's current capabilities, ensuring continuous learning opportunities.

  • Solvers: Attempt coding challenges, learn locally through RL, and share rollouts with peers. Solvers exchange solutions to promote diversity and accelerate collective learning across the network.

  • Evaluators: Frozen models that assess correctness and assign rewards. Evaluators use rule-based assessment to score submissions without executing code, ensuring safety and scalability.
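
The three roles can be thought of as minimal interfaces. The sketch below is illustrative only; the class names and method signatures are assumptions for exposition, not the actual RL Swarm API:

```python
from typing import Protocol, runtime_checkable

# Hypothetical interfaces for the three CodeZero roles.
# Names and signatures are illustrative, not the real RL Swarm API.

@runtime_checkable
class Proposer(Protocol):
    def propose(self, solver_success_rate: float) -> dict:
        """Generate a coding task (prompt plus unit tests)."""
        ...

@runtime_checkable
class Solver(Protocol):
    def solve(self, task: dict) -> str:
        """Return a candidate solution for the task."""
        ...

@runtime_checkable
class Evaluator(Protocol):
    def score(self, task: dict, submission: str) -> float:
        """Score a submission without executing it."""
        ...
```

Structural typing keeps the roles decoupled: any peer that implements the right method can fill a role, which matches the swarm's open, peer-to-peer design.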

Training Loop

The CodeZero training cycle follows a structured progression:

1. Question Generation: Proposers create coding tasks and tests, drawing from their learned patterns and difficulty adjustment logic.

2. Sampling: Solvers draw tasks from proposers or from small external datasets (MBPP, CodeContests) for fallback stability.

3. Rollout Sharing: Solvers exchange solutions with peers to promote diversity and accelerate learning across the swarm.

4. Evaluation: Evaluators score rollouts using a frozen model (no code execution) to assess structure, formatting, and predicted correctness.

5. Reward Assignment: Scoring combines structure, formatting, and predicted correctness into a composite reward signal.

6. Difficulty Adjustment: Proposers adjust challenge levels based on solver success rates, maintaining an optimal learning curve.

7. Policy Update: Solvers optimize locally via GRPO (Group Relative Policy Optimization), incorporating feedback from the swarm's collective experience.
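
One round of this cycle can be sketched with toy stand-ins. Every function and threshold below is an illustrative assumption, not the real implementation:

```python
import random

# Toy stand-ins for the real components; all names are illustrative.

def propose_task(level: int) -> dict:          # 1. question generation
    return {"prompt": f"toy-task@level-{level}", "level": level}

def solve(task: dict) -> dict:                 # 2-3. sampling + rollout sharing
    # A real solver samples code from a policy model; here we fake an
    # outcome so the loop's control flow is visible.
    return {"task": task, "code": "def f(): pass",
            "ok": random.random() < 0.6}

def evaluate(rollout: dict) -> float:          # 4-5. frozen-model scoring
    return 1.0 if rollout["ok"] else 0.0       # no code execution

def adjust(level: int, success_rate: float) -> int:  # 6. difficulty adjustment
    if success_rate >= 0.8:                    # thresholds are assumptions
        return min(level + 1, 5)
    if success_rate <= 0.3:
        return max(level - 1, 1)
    return level

def train_round(level: int, group_size: int = 8) -> int:
    task = propose_task(level)
    rollouts = [solve(task) for _ in range(group_size)]
    rewards = [evaluate(r) for r in rollouts]
    success_rate = sum(rewards) / len(rewards)
    # 7. a real policy update would consume group-relative advantages
    #    (GRPO) here; omitted in this sketch.
    return adjust(level, success_rate)
```

The sketch shows only the control flow of one round; the model calls, gossip-based rollout sharing, and the GRPO update itself are elided.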

Technical Details

This section outlines the datasets, model architectures, metrics, and optimization strategies that enable CodeZero to learn safely and autonomously across a decentralized network of peers.

Datasets

CodeZero uses two primary datasets for fallback stability and baseline challenges:

  • MBPP (Mostly Basic Python Problems): A curated set of Python programming problems with test cases.

  • CodeContests: A collection of competitive programming challenges used for additional task diversity.

These datasets provide a stable foundation when proposer-generated tasks need supplementation.

Models

CodeZero employs the Qwen model family across different roles:

  • Qwen 2.5 Coder (0.5B and 1.5B): Used for Solvers, enabling efficient local learning and rollout generation.

  • Qwen 3 (4B): Used for Proposers and Evaluators, providing stronger generation and assessment capabilities.

Metrics

CodeZero tracks performance using two key metrics that provide complementary views of capability and consistency:

  • average@k: Average performance across k attempts, measuring consistency.

  • pass@k: Probability of at least one correct solution in k attempts, measuring capability.
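
Both metrics can be computed from n sampled attempts with c successes. The unbiased pass@k estimator below follows the standard formulation from the Codex evaluation literature; average@k is simply the mean score over attempts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts,
    drawn without replacement from n attempts with c correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def average_at_k(scores: list[float]) -> float:
    """average@k: mean score over the k attempts, measuring consistency."""
    return sum(scores) / len(scores)
```

For example, with 10 samples of which 3 are correct, pass@1 reduces to 3/10, while pass@10 is 1.0 because at least one of the 10 draws must be a correct sample.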

Evaluation Safety

CodeZero uses a rule-based evaluator that avoids code execution. Instead, evaluators assess submissions based on:

  • Code structure and formatting

  • Predicted correctness (via frozen model inference)

  • Adherence to problem requirements
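
A static evaluator in this spirit can be sketched with Python's ast module. The specific checks and weights below are assumptions for illustration, not CodeZero's actual scoring rules; the point is that everything is assessed without ever running the submission:

```python
import ast

def score_submission(code: str, required_name: str) -> float:
    """Illustrative static scoring: parseability, presence of the
    required function, and documentation. Weights are assumptions."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0                       # unparseable code scores zero
    score = 0.4                          # well-formed structure
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if any(f.name == required_name for f in funcs):
        score += 0.4                     # required function is defined
    if any(ast.get_docstring(f) for f in funcs):
        score += 0.2                     # at least one docstring present
    return score
```

Because scoring relies on parsing rather than execution, a malicious or runaway submission cannot affect the evaluator, which is what makes this approach safe to scale across untrusted peers.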

Dynamic Difficulty

Proposers maintain a 5-level difficulty system with thresholds for adjustment:

  • Difficulty levels adapt based on solver success rates

  • Thresholds trigger automatic adjustment to maintain optimal challenge levels

  • This ensures the swarm continuously faces appropriately challenging tasks
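
The document does not publish the exact thresholds, so the values below are placeholders; the sketch only shows the shape of a 5-level adjustment rule driven by solver success rates:

```python
def adjust_difficulty(level: int, success_rate: float,
                      raise_at: float = 0.8, lower_at: float = 0.3) -> int:
    """Move within a 5-level difficulty ladder.

    Thresholds (raise_at, lower_at) are illustrative assumptions:
    solvers doing too well get harder tasks, solvers struggling
    get easier ones, keeping the swarm in a productive range.
    """
    if success_rate >= raise_at and level < 5:
        return level + 1
    if success_rate <= lower_at and level > 1:
        return level - 1
    return level
```

Keeping a dead band between the two thresholds prevents the difficulty from oscillating every round on noisy success rates.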

Policy Optimization

Solvers use GRPO (Group Relative Policy Optimization) for local policy updates:

  • Incorporates feedback from the swarm's collective experience

  • Enables efficient learning from peer rollouts

  • Maintains local autonomy while benefiting from network-wide signals
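
GRPO's key property is that it needs no learned critic: each rollout's reward is normalized against the statistics of its sampling group. A minimal sketch of that advantage computation (the full GRPO objective also includes a clipped policy ratio and a KL penalty, omitted here):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward by the
    mean and (population) std of its sampling group, replacing a critic."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    if std == 0:
        # All rewards equal: this group carries no learning signal.
        return [0.0] * len(group_rewards)
    return [(r - mean) / std for r in group_rewards]
```

Rollouts that beat their group's average get positive advantages and are reinforced; below-average rollouts are discouraged, all from rewards alone.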

Integration with RL Swarm

CodeZero runs on the same RL Swarm infrastructure as previous environments:

  • Same network: Uses the existing peer-to-peer gossip protocol

  • Same identity: Node identities and swarm.pem files work identically

  • Same setup: No changes required to node installation or configuration

Next Steps

If you're ready to try RL Swarm in its latest environment, CodeZero, check out the Getting Started guide or learn about node management and troubleshooting.
