# How It Works

## Reinforcement Learning

RL Swarm is built on a layered architecture that enables distributed reinforcement learning across independent nodes. Understanding how these layers fit together helps clarify both the system's capabilities and how it has evolved.

#### The Layered Stack

<table data-view="cards"><thead><tr><th></th><th></th></tr></thead><tbody><tr><td><strong>RL Swarm</strong></td><td>The orchestrator for distributed reinforcement learning environments. It manages node connections, identity, and network coordination across the peer-to-peer swarm.</td></tr><tr><td><strong>GenRL</strong></td><td>The open-source reinforcement learning SDK that powers swarm environments. GenRL provides the modular framework that enables multi-agent, multi-stage RL with decentralized coordination.</td></tr><tr><td><strong>CodeZero</strong></td><td>The active cooperative coding environment where models act as Proposers, Solvers, and Evaluators in a closed learning loop. CodeZero replaces Reasoning Gym as the current RL Swarm environment.</td></tr></tbody></table>

{% hint style="success" %}
All active swarms now use CodeZero as the default environment. [Legacy environments](/testnet/rl-swarm/how-it-works/legacy-environments.md) such as Reasoning Gym are archived.
{% endhint %}

#### From Reasoning Gym to CodeZero

Earlier versions of RL Swarm used a math and logic environment known as Reasoning Gym.

Starting with the [November 2025 release](https://github.com/gensyn-ai/rl-swarm/releases), this has been replaced by **CodeZero**, a new cooperative coding environment built on the same peer-to-peer framework.

CodeZero extends the RL Swarm framework onto the same decentralized network and identity system, but introduces a *new task domain* focused on programming and debugging, where task success is scored using model-based or execution-based reward functions instead of rule-based logical verification.

{% hint style="success" %}
**As of November 12th, 2025, CodeZero replaces Reasoning Gym as the active RL Swarm environment.** Node setup, identity, and network connection remain identical.
{% endhint %}
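To make the distinction concrete, here is a minimal sketch of an *execution-based* reward: a candidate solution is run against test cases and scored by its pass rate. The function name, test format, and scoring rule are illustrative assumptions, not the actual CodeZero reward implementation.

```python
def execution_reward(solution_src: str, tests: list[tuple[str, object]]) -> float:
    """Score a candidate solution by the fraction of test cases it passes."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # load the proposed solution
    except Exception:
        return 0.0  # unrunnable code earns no reward
    passed = 0
    for call_expr, expected in tests:
        try:
            if eval(call_expr, namespace) == expected:
                passed += 1
        except Exception:
            pass  # a raising test case contributes nothing
    return passed / len(tests)

# Example: a correct and a buggy implementation of `add`
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = [("add(1, 2)", 3), ("add(0, 0)", 0), ("add(-1, 1)", 0)]
print(execution_reward(good, tests))  # 1.0
```

A model-based reward would instead ask a frozen evaluator model to score the solution, trading exactness for coverage of tasks that have no runnable test suite.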

#### Reinforcement Learning in a Distributed Setting

Reinforcement Learning (RL) enables agents to learn optimal actions through feedback. RL Swarm extends this paradigm into a collaborative, distributed setting, where many agents train and critique *together* instead of working *alone*.

<div data-with-frame="true"><figure><img src="/files/IsGk9MzNLKQABbHdIW6L" alt=""><figcaption></figcaption></figure></div>

### RL in Action

Reinforcement Learning (RL) continues to prove its power in solving complex problems, from optimizing systems to training intelligent agents.

As we push the boundaries, especially in scenarios involving multiple interacting agents, the need for robust and flexible environments becomes even more critical.

> Our core philosophy is to provide a highly customisable and scalable platform that addresses the limitations often encountered when building multi-agent RL systems.

Many existing frameworks tend to be either centralized in their approach or simply don't offer native support for multi-agent settings, which can lead to significant development hurdles.

RL Swarm, powered by GenRL (short for “General Reinforcement Learning”), is a framework built to simplify and accelerate the development of advanced, multi-agent reinforcement learning environments.

#### GenRL

A key highlight of GenRL is its native support for horizontally scalable, multi-agent, multi-stage RL with decentralised coordination and communication.

Unlike frameworks that might force a centralised control scheme, GenRL is built for environments where agents can learn and interact in a distributed, open, and permissionless manner.

### Four Components

At its heart, GenRL puts the user in control of defining the entire 'game' agents play. We've built an intuitive, modular architecture that orchestrates the complete RL cycle, allowing you to tailor every aspect of your environment.

This is achieved through four well-defined components: **\[1]** DataManager, **\[2]** RewardManager, **\[3]** Trainer, and **\[4]** GameManager.

#### 1. DataManager

> Manages data and initializes each training round.

The DataManager defines and organizes the dataset your RL environment uses, whether it’s a large corpus of text, a labeled image collection, or a specialized format like a chessboard configuration.

It ensures the system has the right inputs to learn and perform tasks effectively.

By choosing and structuring data precisely, you directly shape the RL environment’s scope, performance, and applicability, since the nature and *quality* of the dataset determine the agent’s learning efficiency and potential outcomes. For example, richer datasets can improve robustness and generalization, while narrow datasets may constrain learning to specific scenarios.
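As a sketch of this role (class and method names are ours, not GenRL's actual API), a DataManager can simply wrap a dataset and hand out a fresh batch at the start of each round:

```python
import random

class SimpleDataManager:
    """Illustrative stand-in: owns the dataset and serves per-round batches."""

    def __init__(self, dataset: list[str], batch_size: int, seed: int = 0):
        self.dataset = dataset
        self.batch_size = batch_size
        self.rng = random.Random(seed)  # seeded for reproducible rounds

    def initialize_round(self, round_id: int) -> list[str]:
        """Sample the inputs that drive this round's stages."""
        return self.rng.sample(self.dataset, self.batch_size)

dm = SimpleDataManager(["task-%d" % i for i in range(10)], batch_size=3)
batch = dm.initialize_round(round_id=0)
print(len(batch))  # 3
```

Swapping the dataset or the sampling strategy here is exactly how the data choice shapes the environment's scope.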

#### 2. RewardManager

> Defines custom reward logic.

The RewardManager defines and implements *model-based reward evaluation*, using frozen evaluators to score predicted correctness instead of rule-based checks.

By translating outcomes into feedback signals, it directly shapes the objective of your RL environment, influencing what behaviors are encouraged and how policies evolve over time.
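A minimal sketch of that pattern, with a stub scoring function standing in for a frozen evaluator model (all names here are illustrative, not GenRL's real classes):

```python
class ModelRewardManager:
    """Illustrative stand-in: turns rollouts into scalar feedback signals."""

    def __init__(self, evaluator):
        self.evaluator = evaluator  # frozen: never updated during training

    def evaluate(self, rollouts: list[str]) -> list[float]:
        return [self.evaluator(r) for r in rollouts]

# Stub evaluator: crude heuristic that "code-like" text scores higher,
# purely for demonstration. A real evaluator would be a frozen model.
def stub_evaluator(text: str) -> float:
    return min(1.0, 0.5 * text.count("return") + 0.01 * len(text))

rm = ModelRewardManager(stub_evaluator)
scores = rm.evaluate(["return x", "pass"])
```

Because the evaluator is frozen, the only thing the training loop can change is the policy being scored, which keeps the objective stable across rounds.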

#### 3. Trainer

> The Trainer applies algorithms such as **GRPO (Group Relative Policy Optimization)** to update solver policies from evaluator-scored rollouts.

The Trainer component handles both learning and rollout generation: it manages the core training loop, applying your chosen RL paradigm (e.g. policy-gradient optimization, value-function approximation) to update the policy.

It also produces the rollouts that drive agent–environment interactions, ensuring the experiences needed for each subsequent training step are generated and ingested reliably.
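GRPO's core idea can be shown in a few lines: rewards are normalized *within a group* of rollouts for the same prompt, so each rollout's advantage is its score relative to its peers, with no learned value function. This sketch covers only the advantage computation, not a full trainer.

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: (r - mean) / std over one rollout group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts scored 0/1 by the evaluator: correct ones get positive
# advantage, incorrect ones negative.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

These advantages then weight the policy-gradient update, pushing the solver toward the rollouts its group-mates scored best.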

#### 4. GameManager

> Coordinates data flow and communication between multiple agents.

The GameManager coordinates data flow among the modules you define, ensuring smooth interaction between those modules and the other agents in the multi-agent swarm. It acts as a central hub that manages communication, synchronizes processes, and facilitates the exchange of information.
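As a toy illustration of that hub role (names and structure are our own, not GenRL's API), a GameManager can collect each agent's stage output into a shared game state and broadcast it back so the next stage sees everyone's rollouts:

```python
class ToyGameManager:
    """Illustrative hub: gathers per-stage rollouts and shares them."""

    def __init__(self, agent_ids: list[str]):
        self.agent_ids = agent_ids
        self.game_state: dict[int, dict[str, str]] = {}

    def submit(self, stage: int, agent_id: str, rollout: str) -> None:
        self.game_state.setdefault(stage, {})[agent_id] = rollout

    def broadcast(self, stage: int) -> dict[str, str]:
        """What every agent sees once a stage completes."""
        return dict(self.game_state.get(stage, {}))

gm = ToyGameManager(["node-a", "node-b"])
gm.submit(0, "node-a", "proposal: sort a list")
gm.submit(0, "node-b", "proposal: reverse a string")
shared = gm.broadcast(0)
```

In the real swarm this exchange happens over the peer-to-peer network rather than in-process, but the hub-and-broadcast shape is the same.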

{% hint style="success" %}
GenRL allows module customization for tailored learning goals and agent interactions, supporting scalable multi-agent RL solutions. It works with any environment, including **CodeZero** and legacy ones like Reasoning Gym.
{% endhint %}

### Framework-Defined Progression

The game progresses on a per-round basis.

1. Each round, the DataManager initializes the round data, triggering the game’s stages.
2. For every stage, rollouts are generated, added to the game state, and shared with the swarm.
3. Once the agent completes all predefined stages, rewards are evaluated and policies are updated.

The user retains full control over this process within the `Trainer.train` method, allowing policy updates on either a per-stage or per-round basis.
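The per-round progression can be sketched as a single loop. Every component name below is an illustrative stand-in, not GenRL's real class or method signatures:

```python
def run_round(round_id, data_manager, trainer, reward_manager, num_stages):
    batch = data_manager.initialize_round(round_id)  # 1. round data
    game_state = []
    for stage in range(num_stages):                  # 2. per-stage rollouts,
        rollouts = trainer.generate_rollouts(batch)  #    shared via game state
        game_state.extend(rollouts)
    rewards = reward_manager.evaluate(game_state)    # 3. evaluate, then update
    trainer.train(game_state, rewards)
    return rewards

# Minimal stand-ins so the loop runs end to end.
class _DM:
    def initialize_round(self, r): return ["task"]

class _TR:
    def __init__(self): self.updates = 0
    def generate_rollouts(self, batch): return [f"answer:{b}" for b in batch]
    def train(self, rollouts, rewards): self.updates += 1

class _RM:
    def evaluate(self, rollouts): return [1.0] * len(rollouts)

tr = _TR()
rewards = run_round(0, _DM(), tr, _RM(), num_stages=2)
```

Here the policy update happens once per round; moving the `train` call inside the stage loop would give the per-stage variant instead.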

<div data-with-frame="true"><figure><img src="/files/s28ZbGdauojZu8fs6lcR" alt=""><figcaption></figcaption></figure></div>

