# CodeZero

<div data-with-frame="true"><figure><img src="/files/0FOqTSZCohmgdDdVvgLE" alt=""><figcaption></figcaption></figure></div>

## Overview

CodeZero is the active RL Swarm environment that transforms distributed reinforcement learning into a cooperative coding ecosystem, a "society of models" where agents collaborate to solve programming challenges.

Unlike traditional RL environments that rely on external verification, CodeZero creates a closed learning loop where models generate problems, solve them, and evaluate solutions, all within the same peer-to-peer network.

#### What Makes CodeZero Different

CodeZero introduces a new task domain focused on programming challenges evaluated by model-based reward functions.

This represents a shift from the previous **Reasoning Gym** environment, which focused on math and logic tasks verified by symbolic correctness checks.

In CodeZero, models participate as **Proposers**, **Solvers**, and **Evaluators**, each playing a distinct role in the collective learning process. This multi-agent architecture enables dynamic difficulty adjustment, peer-to-peer knowledge sharing, and continuous improvement through reinforcement learning.

### Roles

In CodeZero, **\[1]** Proposers, **\[2]** Solvers, and **\[3]** Evaluators collaborate to create, address, and review programming challenges, forming an ecosystem where models instruct, critique, and enhance one another.

* **Proposers:** Generate coding problems and unit tests, adjusting difficulty dynamically based on solver performance. Proposers create challenges that adapt to the swarm's current capabilities, ensuring continuous learning opportunities.
* **Solvers:** Attempt coding challenges, learn locally through RL, and share rollouts with peers. Solvers exchange solutions to promote diversity and accelerate collective learning across the network.
* **Evaluators:** Frozen models that assess correctness and assign rewards. Evaluators use rule-based assessment to score submissions without executing code, ensuring safety and scalability.

#### Training Loop

The CodeZero training cycle follows a structured progression:

{% stepper %}
{% step %}

### Question Generation

Proposers create coding tasks and tests, drawing from their learned patterns and difficulty adjustment logic.
{% endstep %}

{% step %}

### Sampling

Solvers draw tasks from proposers or from small external datasets (MBPP, CodeContests) for fallback stability.
{% endstep %}

{% step %}

### Rollout Sharing

Solvers exchange solutions with peers to promote diversity and accelerate learning across the swarm.
{% endstep %}

{% step %}

### Evaluation

Evaluators score rollouts using a frozen model (no code execution) to assess structure, formatting, and predicted correctness.
{% endstep %}

{% step %}

### Reward Assignment

Scoring combines structure, formatting, and predicted correctness into a composite reward signal.
{% endstep %}

{% step %}

### Difficulty Adjustment

Proposers adjust challenge levels based on solver success rates, maintaining an optimal learning curve.
{% endstep %}

{% step %}

### Policy Update

Solvers optimize locally via GRPO (Group Relative Policy Optimization), incorporating feedback from the swarm's collective experience.
{% endstep %}
{% endstepper %}
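The cycle above can be sketched in plain Python. Everything below is illustrative: the function names, data shapes, thresholds, and scoring stubs are assumptions for exposition, not the actual RL Swarm implementation.

```python
import random

# Illustrative sketch of one CodeZero training cycle.
# All names, thresholds, and logic are hypothetical stand-ins.

def propose_task(difficulty: int) -> dict:
    """Proposer: generate a coding task plus unit tests at the given difficulty."""
    return {"prompt": f"Write a function (difficulty {difficulty})",
            "tests": ["assert f(1) == 1"]}

def solve(task: dict, k: int = 4) -> list[str]:
    """Solver: sample k candidate solutions (rollouts) for the task."""
    return [f"def f(x): return x  # attempt {i}" for i in range(k)]

def evaluate(rollout: str) -> float:
    """Evaluator: frozen-model scoring stub -- no code execution."""
    return random.random()  # stand-in for structure/format/correctness scoring

def training_cycle(difficulty: int) -> tuple[list[float], int]:
    task = propose_task(difficulty)            # 1. question generation
    rollouts = solve(task)                     # 2-3. sampling + rollout sharing
    rewards = [evaluate(r) for r in rollouts]  # 4-5. evaluation + reward assignment
    success_rate = sum(r > 0.5 for r in rewards) / len(rewards)
    # 6. difficulty adjustment: harder when solvers succeed, easier when they fail
    if success_rate > 0.8:
        difficulty = min(difficulty + 1, 5)
    elif success_rate < 0.2:
        difficulty = max(difficulty - 1, 1)
    # 7. policy update (GRPO) would consume (rollouts, rewards) here
    return rewards, difficulty

rewards, new_difficulty = training_cycle(difficulty=3)
```

In the real swarm each step runs on different peers over the gossip network; this single-process sketch only shows the order of operations.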

### Technical Details

This section outlines the datasets, model architectures, metrics, and optimization strategies that enable CodeZero to learn safely and autonomously across a decentralized network of peers.

#### Datasets

CodeZero uses two primary datasets for fallback stability and baseline challenges:

* **MBPP (Mostly Basic Python Problems):** A curated set of Python programming problems with test cases.
* **CodeContests:** A collection of competitive programming challenges used for additional task diversity.

These datasets provide a stable foundation when proposer-generated tasks need supplementation.

#### Models

CodeZero employs the Qwen model family across different roles:

* **Qwen 2.5 Coder (0.5B and 1.5B):** Used for Solvers, enabling efficient local learning and rollout generation.
* **Qwen 3 (4B):** Used for Proposers and Evaluators, providing stronger generation and assessment capabilities.

#### Metrics

CodeZero tracks performance using two key metrics that provide complementary views of model capability and consistency.

* **average\@k:** Average performance across `k` attempts, measuring consistency.
* **pass\@k:** Probability of at least one correct solution in `k` attempts, measuring capability.
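Both metrics can be computed from per-attempt correctness results. The `pass_at_k` formula below is the standard unbiased estimator used by code-generation benchmarks; assuming CodeZero computes it exactly this way is an inference, not something the docs state.

```python
from math import comb

def average_at_k(results: list[bool], k: int) -> float:
    """average@k: mean success rate over the first k attempts."""
    attempts = results[:k]
    return sum(attempts) / len(attempts)

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: probability that at least one of k sampled attempts is correct,
    given c correct out of n total attempts (unbiased combinatorial estimator)."""
    if n - c < k:
        return 1.0  # too few failures left to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 3 of them correct
avg = average_at_k([True, False, True, False, False], k=5)   # 0.4
p = pass_at_k(n=10, c=3, k=5)                                # ~0.917
```

Note how the two diverge: a model that is right 30% of the time has a modest average@k but a high pass@5, which is why both are tracked.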

#### Evaluation Safety

CodeZero uses a **rule-based evaluator** that avoids code execution. Instead, evaluators assess submissions based on:

* Code structure and formatting
* Predicted correctness (via frozen model inference)
* Adherence to problem requirements
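A rule-based scorer along these lines could combine the three signals into one composite reward. The individual checks and the weights are hypothetical illustrations; the docs only state that structure, formatting, and predicted correctness feed the reward, without specifying how.

```python
import ast

def structure_score(code: str) -> float:
    """Check the submission parses and defines a function -- no execution."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return 0.0
    has_func = any(isinstance(n, ast.FunctionDef) for n in ast.walk(tree))
    return 1.0 if has_func else 0.5

def formatting_score(code: str) -> float:
    """Crude formatting heuristic: penalize overlong lines."""
    lines = code.splitlines() or [""]
    return sum(len(line) <= 100 for line in lines) / len(lines)

def composite_reward(code: str, predicted_correctness: float,
                     weights=(0.3, 0.2, 0.5)) -> float:
    """Weighted blend of structure, formatting, and predicted correctness.
    predicted_correctness would come from frozen Evaluator-model inference;
    the weights here are invented for illustration."""
    w_s, w_f, w_c = weights
    return (w_s * structure_score(code)
            + w_f * formatting_score(code)
            + w_c * predicted_correctness)

reward = composite_reward("def add(a, b):\n    return a + b\n",
                          predicted_correctness=0.9)
```

Because nothing is executed, a malformed or adversarial submission can at worst earn a low score, never compromise the evaluating node.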

#### Dynamic Difficulty

Proposers maintain a **5-level difficulty system** with thresholds for adjustment:

* Difficulty levels adapt based on solver success rates
* Thresholds trigger automatic adjustment to maintain optimal challenge levels
* This ensures the swarm continuously faces appropriately challenging tasks
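One way to realize such a controller is a rolling window of solver outcomes with promotion and demotion thresholds. The five levels come from the docs; the specific thresholds, window size, and reset behavior below are assumptions.

```python
from collections import deque

class DifficultyController:
    """Hypothetical 5-level difficulty controller with threshold-based
    adjustment; thresholds and window size are illustrative guesses."""

    def __init__(self, raise_above=0.75, lower_below=0.25, window=20):
        self.level = 1                  # levels 1 (easiest) .. 5 (hardest)
        self.raise_above = raise_above  # promote when success rate exceeds this
        self.lower_below = lower_below  # demote when success rate falls below this
        self.outcomes = deque(maxlen=window)

    def record(self, solved: bool) -> int:
        """Record one solver outcome; return the (possibly adjusted) level."""
        self.outcomes.append(solved)
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate > self.raise_above and self.level < 5:
            self.level += 1
            self.outcomes.clear()  # restart statistics at the new level
        elif rate < self.lower_below and self.level > 1:
            self.level -= 1
            self.outcomes.clear()
        return self.level

ctrl = DifficultyController()
for _ in range(10):
    ctrl.record(True)  # a streak of successes ratchets difficulty upward
```

Clearing the window after each adjustment prevents stale outcomes from the previous level from triggering an immediate second jump.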

#### Policy Optimization

Solvers use **GRPO (Group Relative Policy Optimization)** for local policy updates:

* Incorporates feedback from the swarm's collective experience
* Enables efficient learning from peer rollouts
* Maintains local autonomy while benefiting from network-wide signals
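The core of GRPO is a group-relative advantage: each rollout's reward is normalized against the other rollouts for the same task, which removes the need for a learned value baseline. A minimal sketch of that normalization (the policy-gradient update itself is omitted):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its group's mean and
    standard deviation, as in GRPO's advantage estimate."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one task, scored by the evaluator
advantages = group_relative_advantages([0.9, 0.2, 0.5, 0.4])
```

Rollouts that beat their group get positive advantages and are reinforced; below-average ones are suppressed, so the signal is meaningful even when absolute reward scales drift across tasks.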

### Integration with RL Swarm

CodeZero runs on the same RL Swarm infrastructure as previous environments:

* **Same network:** Uses the existing peer-to-peer gossip protocol
* **Same identity:** Node identities and `swarm.pem` files work identically
* **Same setup:** No changes required to node installation or configuration

### Next Steps

If you're ready to try RL Swarm's latest environment, CodeZero, check out the [Getting Started](/testnet/rl-swarm/getting-started.md) guide, or learn about [node management](/testnet/rl-swarm/node-management.md) and [troubleshooting](/testnet/rl-swarm/troubleshooting.md) steps.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.gensyn.ai/testnet/rl-swarm/how-it-works/codezero.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
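For example, the question can be URL-encoded into the `ask` parameter before issuing the request. This sketch only builds the URL; performing the GET is left to whatever HTTP client you use.

```python
from urllib.parse import urlencode

BASE = "https://docs.gensyn.ai/testnet/rl-swarm/how-it-works/codezero.md"

def ask_url(question: str) -> str:
    """Build the documentation-query URL with a URL-encoded question."""
    return f"{BASE}?{urlencode({'ask': question})}"

url = ask_url("Which Qwen models do CodeZero Solvers use?")
# An HTTP GET on this URL returns a direct answer with excerpts and sources.
```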
