# CodeZero

<div data-with-frame="true"><figure><img src="https://1034405018-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FIcazOdplbOmP4R0T7sG8%2Fuploads%2FKoddnpsOr2BZVDl31SYJ%2FCopy%20of%20CodeZero-still.png?alt=media&#x26;token=83cdeb4b-3df6-48a0-8a9f-d42a6b9fc274" alt=""><figcaption></figcaption></figure></div>

## Overview

CodeZero is the active RL Swarm environment that transforms distributed reinforcement learning into a cooperative coding ecosystem: a "society of models" in which agents collaborate to solve programming challenges.

Unlike traditional RL environments that rely on external verification, CodeZero creates a closed learning loop where models generate problems, solve them, and evaluate solutions, all within the same peer-to-peer network.

#### What Makes CodeZero Different

CodeZero introduces a new task domain focused on programming challenges evaluated by model-based reward functions.

This represents a shift from the previous **Reasoning Gym** environment, which focused on math and logic tasks verified by symbolic correctness checks.

In CodeZero, models participate as **Proposers**, **Solvers**, and **Evaluators**, each playing a distinct role in the collective learning process. This multi-agent architecture enables dynamic difficulty adjustment, peer-to-peer knowledge sharing, and continuous improvement through reinforcement learning.

### Roles

In CodeZero, **Proposers**, **Solvers**, and **Evaluators** collaborate to create, solve, and review programming challenges, forming an ecosystem where models instruct, critique, and improve one another. A minimal interface sketch follows the list below.

* **Proposers:** Generate coding problems and unit tests, adjusting difficulty dynamically based on solver performance. Proposers create challenges that adapt to the swarm's current capabilities, ensuring continuous learning opportunities.
* **Solvers:** Attempt coding challenges, learn locally through RL, and share rollouts with peers. Solvers exchange solutions to promote diversity and accelerate collective learning across the network.
* **Evaluators:** Frozen models that assess correctness and assign rewards. Evaluators use rule-based assessment to score submissions without executing code, ensuring safety and scalability.
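
As a rough illustration, the sketch below models the three roles as Python interfaces. The class and method names (`Proposer.propose`, `Solver.solve`, `Evaluator.score`) are illustrative assumptions, not the actual RL Swarm API.

```python
# Hypothetical sketch of the three CodeZero roles; names are illustrative,
# not the actual RL Swarm API.
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str             # problem statement generated by a Proposer
    unit_tests: list[str]   # accompanying tests
    difficulty: int         # 1-5, see "Dynamic Difficulty" below


class Proposer:
    def propose(self, success_rate: float) -> Task:
        """Generate a coding task, scaling difficulty to recent solver success."""
        ...


class Solver:
    def solve(self, task: Task, k: int = 4) -> list[str]:
        """Sample k candidate solutions (rollouts) for a task."""
        ...


class Evaluator:
    def score(self, task: Task, rollout: str) -> float:
        """Assign a reward without executing code (frozen-model inference)."""
        ...
```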

#### Training Loop

The CodeZero training cycle follows a structured progression:

{% stepper %}
{% step %}

### Question Generation

Proposers create coding tasks and tests, drawing from their learned patterns and difficulty adjustment logic.
{% endstep %}

{% step %}

### Sampling

Solvers draw tasks from proposers or from small external datasets (MBPP, CodeContests) for fallback stability.
{% endstep %}

{% step %}

### Rollout Sharing

Solvers exchange solutions with peers to promote diversity and accelerate learning across the swarm.
{% endstep %}

{% step %}

### Evaluation

Evaluators score rollouts using a frozen model (no code execution) to assess structure, formatting, and predicted correctness.
{% endstep %}

{% step %}

### Reward Assignment

Scoring combines structure, formatting, and predicted correctness into a composite reward signal.
{% endstep %}

{% step %}

### Difficulty Adjustment

Proposers adjust challenge levels based on solver success rates, maintaining an optimal learning curve.
{% endstep %}

{% step %}

### Policy Update

Solvers optimize locally via GRPO (Group Relative Policy Optimization), incorporating feedback from the swarm's collective experience.
{% endstep %}
{% endstepper %}
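
The seven steps above can be condensed into a single round of pseudocode. This is a hedged sketch using the same assumed interfaces as the role sketch earlier; the function names, the 0.5 success threshold, and the group size are illustrative, not the shipped implementation.

```python
# Minimal sketch of one CodeZero round; all names and thresholds are illustrative.
def training_round(proposer, solvers, evaluator, success_rate, group_size=4):
    # 1. Question generation: the proposer emits a task at the current difficulty.
    task = proposer.propose(success_rate)

    # 2-3. Sampling + rollout sharing: each solver samples a group of rollouts
    #      and gossips them to peers (peer exchange omitted here).
    rollouts = {s: s.solve(task, k=group_size) for s in solvers}

    # 4-5. Evaluation + reward assignment: the frozen evaluator scores every rollout.
    rewards = {s: [evaluator.score(task, r) for r in rs] for s, rs in rollouts.items()}

    # 6. Difficulty adjustment: update the running success rate used next round.
    flat = [r for rs in rewards.values() for r in rs]
    success_rate = sum(r > 0.5 for r in flat) / len(flat)

    # 7. Policy update: each solver runs a local GRPO step on its own group.
    for s in solvers:
        s.grpo_update(task, rollouts[s], rewards[s])

    return success_rate
```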

### Technical Details

This section outlines the datasets, model architectures, metrics, and optimization strategies that enable CodeZero to learn safely and autonomously across a decentralized network of peers.

#### Datasets

CodeZero uses two primary datasets for fallback stability and baseline challenges:

* **MBPP** **(Mostly Basic Python Problems)**: A curated set of Python programming problems with test cases.
* **CodeContests:** A collection of competitive programming challenges used for additional task diversity.

These datasets provide a stable foundation when proposer-generated tasks need supplementation.
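
For reference, both datasets are publicly available on the Hugging Face Hub. The snippet below is a hedged example of loading them with the `datasets` library; the Hub IDs shown are the public releases, not necessarily the exact paths CodeZero pulls from.

```python
# Hedged example: loading the two public fallback datasets from the Hugging Face Hub.
from datasets import load_dataset

# MBPP: short Python problems, each with a natural-language prompt and unit tests.
mbpp = load_dataset("mbpp", split="test")
print(mbpp[0]["text"])        # problem statement
print(mbpp[0]["test_list"])   # accompanying unit tests

# CodeContests: competitive-programming problems; stream to avoid a large download.
code_contests = load_dataset("deepmind/code_contests", split="train", streaming=True)
print(next(iter(code_contests))["name"])
```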

#### Models

CodeZero employs the Qwen model family across different roles:

* **Qwen 2.5 Coder (0.5B and 1.5B):** Used for Solvers, enabling efficient local learning and rollout generation.
* **Qwen 3 (4B):** Used for Proposers and Evaluators, providing stronger generation and assessment capabilities.
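
The split above can be illustrated with the public Qwen checkpoints on the Hugging Face Hub. The model IDs below are assumptions based on the published releases and may differ from the exact checkpoints RL Swarm ships.

```python
# Hedged example: loading the model family with transformers. The Hub IDs are
# the public Qwen releases, not necessarily the checkpoints used in production.
from transformers import AutoModelForCausalLM, AutoTokenizer

SOLVER_ID = "Qwen/Qwen2.5-Coder-0.5B-Instruct"   # or the 1.5B variant
PROPOSER_EVALUATOR_ID = "Qwen/Qwen3-4B"

solver_tok = AutoTokenizer.from_pretrained(SOLVER_ID)
solver = AutoModelForCausalLM.from_pretrained(SOLVER_ID, torch_dtype="auto")

judge_tok = AutoTokenizer.from_pretrained(PROPOSER_EVALUATOR_ID)
judge = AutoModelForCausalLM.from_pretrained(PROPOSER_EVALUATOR_ID, torch_dtype="auto")
judge.eval()  # evaluator weights stay frozen; only solvers are trained
```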

#### Metrics

CodeZero tracks performance using two key metrics that provide complementary views of model capability and learning progress; a minimal computation of each is sketched after the list.

* **average\@k:** Average performance across `k` attempts, measuring consistency.
* **pass\@k:** Probability of at least one correct solution in `k` attempts, measuring capability.
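
The snippet below computes both metrics for binary (correct/incorrect) outcomes, using the standard unbiased estimator for `pass@k`; the exact aggregation CodeZero applies may differ.

```python
# Minimal implementations of the two metrics over binary outcomes. pass@k uses the
# standard unbiased estimator 1 - C(n-c, k) / C(n, k) over n attempts with c correct.
from math import comb


def average_at_k(outcomes: list[bool], k: int) -> float:
    """Mean success over the first k attempts (consistency)."""
    return sum(outcomes[:k]) / k


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n attempts is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 attempts, 3 correct
print(pass_at_k(n=10, c=3, k=5))   # ≈ 0.92
```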

#### Evaluation Safety

CodeZero uses a **rule-based evaluator** that avoids code execution. Instead, evaluators assess submissions based on:

* Code structure and formatting
* Predicted correctness (via frozen model inference)
* Adherence to problem requirements
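
A toy example of how those signals might be combined into a single score is shown below. The weights and signal names are assumptions for illustration, not the shipped reward function.

```python
# Illustrative composite reward over the three evaluation signals; weights and
# signal names are assumptions, not the actual CodeZero reward function.
def composite_reward(structure: float, formatting: float, predicted_correctness: float,
                     weights=(0.2, 0.1, 0.7)) -> float:
    """Weighted sum of structure, formatting, and frozen-model correctness scores,
    each assumed to be normalized to [0, 1]."""
    w_s, w_f, w_c = weights
    return w_s * structure + w_f * formatting + w_c * predicted_correctness
```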

#### Dynamic Difficulty

Proposers maintain a **5-level difficulty system** with thresholds for adjustment:

* Difficulty levels adapt based on solver success rates
* Thresholds trigger automatic adjustment to maintain optimal challenge levels
* This ensures the swarm continuously faces appropriately challenging tasks
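
The sketch below shows one way such a controller could work. The threshold values and function name are illustrative assumptions; the actual levels and thresholds are configured inside CodeZero.

```python
# Hypothetical 5-level difficulty controller; thresholds are illustrative only.
def adjust_difficulty(level: int, success_rate: float,
                      raise_above: float = 0.8, lower_below: float = 0.3) -> int:
    """Step the proposer's difficulty up or down based on recent solver success."""
    if success_rate > raise_above and level < 5:
        return level + 1
    if success_rate < lower_below and level > 1:
        return level - 1
    return level
```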

#### Policy Optimization

Solvers use **GRPO (Group Relative Policy Optimization)** for local policy updates:

* Incorporates feedback from the swarm's collective experience
* Enables efficient learning from peer rollouts
* Maintains local autonomy while benefiting from network-wide signals
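
The defining step of GRPO is a group-relative advantage: each rollout's reward is normalized against the mean and standard deviation of its own group, removing the need for a learned value function. A minimal sketch of that step (not the full clipped-objective update) is shown below.

```python
# Group-relative advantage at the heart of GRPO: rewards within one group of
# rollouts are normalized against that group's own mean and standard deviation.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,) for rollouts of the same task."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Example: four rollouts of one task scored by the evaluator.
adv = group_relative_advantages(torch.tensor([0.9, 0.2, 0.4, 0.7]))
# Each advantage then weights the token log-probabilities in a clipped
# policy-gradient loss, as in PPO but without a critic.
```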

### Integration with RL Swarm

CodeZero runs on the same RL Swarm infrastructure as previous environments:

* **Same network:** Uses the existing peer-to-peer gossip protocol
* **Same identity:** Node identities and `swarm.pem` files work identically
* **Same setup:** No changes required to node installation or configuration

### Next Steps

If you're ready to try out RL Swarm in its latest environment, CodeZero, check out the [Getting Started](https://docs.gensyn.ai/testnet/rl-swarm/getting-started) guide, or learn about [node management](https://docs.gensyn.ai/testnet/rl-swarm/node-management) and [troubleshooting](https://docs.gensyn.ai/testnet/rl-swarm/troubleshooting).
