How It Works
Learn how RL Swarm functions under the hood, from GenRL’s modular architecture to multi-agent learning, coordination, and reward cycles.
RL Swarm is built on a layered architecture that enables distributed reinforcement learning across independent nodes. Understanding how these layers fit together helps clarify both the system's capabilities and how it has evolved.
The Layered Stack
RL Swarm
The orchestrator for distributed reinforcement learning environments. It manages node connections, identity, and network coordination across the peer-to-peer swarm.
GenRL
The open-source reinforcement learning SDK that powers swarm environments. GenRL provides the modular framework that enables multi-agent, multi-stage RL with decentralized coordination.
CodeZero
The active cooperative coding environment where models act as Proposers, Solvers, and Evaluators in a closed learning loop. CodeZero replaces Reasoning Gym as the current RL Swarm environment.
All active swarms now use CodeZero as the default environment. Legacy environments such as Reasoning Gym are archived.
From Reasoning Gym to CodeZero
Earlier versions of RL Swarm used a math and logic environment known as Reasoning Gym.
Starting with the November 2025 release, this has been replaced by CodeZero, a new cooperative coding environment built on the same peer-to-peer framework.
CodeZero extends the RL Swarm framework on the same decentralized network and identity system, but introduces a new task domain focused on programming and debugging. Task success is scored using model-based or execution-based reward functions instead of rule-based logical verification.
As of November 12th, 2025, CodeZero replaces Reasoning Gym as the active RL Swarm environment. Node setup, identity, and network connection remain identical.
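To make the cooperative loop concrete, here is a minimal sketch of one CodeZero-style round in Python. It is purely illustrative: the Proposer, Solver, and Evaluator objects and their methods (generate_task, generate_solution, score) are hypothetical stand-ins, not the actual CodeZero API.

```python
# Illustrative sketch of one cooperative coding round.
# Objects and method names are hypothetical stand-ins, not the CodeZero API.

def run_round(proposer, solvers, evaluator):
    # 1. The Proposer emits a coding task (e.g. a prompt plus test cases).
    task = proposer.generate_task()

    # 2. Each Solver attempts the task and produces a candidate solution.
    candidates = [solver.generate_solution(task) for solver in solvers]

    # 3. The Evaluator scores every candidate, either model-based (a frozen
    #    judge model) or execution-based (running the task's tests).
    rewards = [evaluator.score(task, code) for code in candidates]

    # 4. The scored rollouts feed the next policy update.
    return list(zip(candidates, rewards))
```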
Reinforcement Learning in a Distributed Setting
Reinforcement Learning (RL) enables agents to learn optimal actions through feedback. RL Swarm extends this paradigm into a collaborative, distributed setting, where many agents train and critique together instead of working alone.

RL in Action
Reinforcement Learning (RL) continues to prove its power in solving complex problems, from optimizing systems to training intelligent agents.
As we push the boundaries, especially in scenarios involving multiple interacting agents, the need for robust and flexible environments becomes even more critical.
Our core philosophy is to provide a highly customizable and scalable platform that addresses the limitations often encountered when building multi-agent RL systems.
Many existing frameworks tend to be either centralized in their approach or simply don't offer native support for multi-agent settings, which can lead to significant development hurdles.
RL Swarm, powered by GenRL (short for “General Reinforcement Learning”), is a framework built to simplify and accelerate the development of advanced, multi-agent reinforcement learning environments.
GenRL
A key highlight of GenRL is its native support for horizontally scalable, multi-agent, multi-stage RL with decentralized coordination and communication.
Unlike frameworks that might force a centralized control scheme, GenRL is built for environments where agents can learn and interact in a distributed, open, and permissionless manner.
Four Components
At its heart, GenRL puts the user in control of defining the entire 'game' agents play. We've built an intuitive, modular architecture that orchestrates the complete RL cycle, allowing you to tailor every aspect of your environment.
This is achieved through four well-defined components: [1] DataManager, [2] RewardManager, [3] Trainer, and [4] GameManager.
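Before looking at each component in turn, the runnable skeleton below shows how the four pieces fit together in a single round. The class names mirror the components for readability, but every method and signature here is invented for the sketch; it is not the actual GenRL interface.

```python
# Minimal, self-contained skeleton of the four-component cycle.
# Method names and signatures are illustrative, not the GenRL API.

class DataManager:
    def get_round_data(self, round_idx):
        return [{"prompt": f"task-{round_idx}"}]                 # supply this round's inputs

class RewardManager:
    def score(self, rollouts):
        return [1.0 if r["answer"] else 0.0 for r in rollouts]  # custom reward logic

class Trainer:
    def generate_rollouts(self, batch):
        return [{"prompt": b["prompt"], "answer": "stub"} for b in batch]
    def train(self, rollouts, rewards):
        pass                                                     # apply the RL update here

class GameManager:
    def __init__(self, data, rewards, trainer):
        self.data, self.rewards, self.trainer = data, rewards, trainer
    def run_round(self, round_idx):
        batch = self.data.get_round_data(round_idx)              # [1] DataManager
        rollouts = self.trainer.generate_rollouts(batch)         # [3] Trainer (rollouts)
        scores = self.rewards.score(rollouts)                    # [2] RewardManager
        self.trainer.train(rollouts, scores)                     # [3] Trainer (update)

GameManager(DataManager(), RewardManager(), Trainer()).run_round(0)
```

The GameManager drives the cycle: it pulls inputs from the DataManager, asks the Trainer for rollouts, hands them to the RewardManager for scoring, and then returns the scored rollouts to the Trainer for the policy update.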
1. DataManager
Manages data and initializes each training round.
The DataManager defines and organizes the dataset your RL environment uses, whether that's a large text corpus, a labeled image collection, or a specialized format such as a chessboard configuration.
It ensures the system has the right inputs to learn from and perform its tasks effectively.
By choosing and structuring data precisely, you directly shape the RL environment’s scope, performance, and applicability, since the nature and quality of the dataset determine the agent’s learning efficiency and potential outcomes. For example, richer datasets can improve robustness and generalization, while narrow datasets may constrain learning to specific scenarios.
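As a concrete illustration, the sketch below loads tasks from a JSON Lines file and serves a batch per round. The class and its methods are hypothetical; the actual GenRL DataManager interface may differ.

```python
# Hypothetical DataManager that serves a batch of tasks per round.
import json
import random

class JsonlDataManager:
    """Loads tasks from a JSON Lines file and samples a per-round batch."""

    def __init__(self, path, batch_size=8, seed=0):
        with open(path) as f:
            self.tasks = [json.loads(line) for line in f]   # e.g. {"prompt": ..., "tests": [...]}
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def get_round_data(self, round_idx):
        # The sampling strategy chosen here directly shapes what the
        # agents get to learn from in each round.
        return self.rng.sample(self.tasks, k=min(self.batch_size, len(self.tasks)))
```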
2. RewardManager
Defines custom reward logic.
The RewardManager defines and implements the reward logic for your environment. In CodeZero, that means model-based reward evaluation: frozen evaluators score predicted correctness instead of applying rule-based checks.
By translating outcomes into feedback signals, it directly shapes the objective of your RL environment, influencing what behaviors are encouraged and how policies evolve over time.
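For example, a model-based reward might look like the sketch below, where a frozen evaluator turns each candidate answer into a scalar reward. The frozen_evaluator object and its judge method are placeholders, not a real GenRL or CodeZero interface.

```python
# Hypothetical model-based RewardManager using a frozen evaluator.

class ModelBasedRewardManager:
    def __init__(self, frozen_evaluator):
        self.evaluator = frozen_evaluator        # evaluator weights are never updated

    def score(self, task, candidate):
        # The frozen evaluator estimates how likely the candidate is correct
        # and returns a scalar in [0, 1] that becomes the training signal.
        return float(self.evaluator.judge(prompt=task["prompt"], answer=candidate))

    def score_batch(self, task, candidates):
        return [self.score(task, c) for c in candidates]
```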
3. Trainer
Applies the learning algorithm and generates rollouts.
The Trainer handles both learning and rollout generation. It manages the core training loop, applying your chosen RL paradigm (e.g. policy-gradient optimization or value-function approximation) to update the policy; in RL Swarm, algorithms such as GRPO (Group Relative Policy Optimization) update solver policies from evaluator-scored rollouts.
It also produces the rollouts that drive agent–environment interactions, ensuring the experiences needed for each subsequent training step are generated and ingested reliably.
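To illustrate the group-relative part of GRPO, the snippet below normalizes each rollout's reward against its group's mean and standard deviation; this normalized value is the advantage GRPO uses to weight the policy update. It is a simplified sketch of the idea, not the GenRL Trainer implementation.

```python
# Group-relative advantages, the core signal in GRPO (simplified sketch).
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    # Rollouts are grouped per task; each reward is normalized against
    # the group's mean and standard deviation.
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four solutions to the same task, scored by the evaluator:
print(grpo_advantages([1.0, 0.0, 0.5, 0.5]))
# -> approximately [1.41, -1.41, 0.0, 0.0]; above-average rollouts get a positive advantage
```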
4. GameManager
Coordinates data flow and communication between multiple agents.
The GameManager coordinates data flow among the modules you define, ensuring smooth interaction between those modules and the other agents in the multi-agent swarm. It acts as the central hub that manages communication, synchronizes processes, and facilitates the exchange of information.
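A simplified view of that hub role is sketched below: in each stage, the GameManager gathers local rollouts, shares them with peers, and merges what the swarm sends back before scoring and training. The swarm object and its broadcast and collect methods are placeholders for the peer-to-peer layer, not actual GenRL calls.

```python
# Hypothetical GameManager acting as the swarm's communication hub.

class SwarmGameManager:
    def __init__(self, data_manager, reward_manager, trainer, swarm):
        self.data = data_manager
        self.rewards = reward_manager
        self.trainer = trainer
        self.swarm = swarm                                   # placeholder p2p layer

    def run_round(self, round_idx, num_stages):
        state = {"inputs": self.data.get_round_data(round_idx), "rollouts": []}
        for stage in range(num_stages):
            local = self.trainer.generate_rollouts(state)    # this node's rollouts
            self.swarm.broadcast(stage, local)               # share with peers
            remote = self.swarm.collect(stage)               # rollouts from peers
            state["rollouts"].extend(local + remote)         # synchronized game state
        self.trainer.train(state, self.rewards.score(state["rollouts"]))
```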
GenRL allows module customization for tailored learning goals and agent interactions, supporting scalable multi-agent RL solutions. It works with any environment, including CodeZero and legacy ones like Reasoning Gym.
Framework-Defined Progression
The game progresses on a per-round basis.
Each round, the DataManager initializes the round data, triggering the game’s stages.
For every stage, rollouts are generated, added to the game state, and shared with the swarm.
Once the agent completes all predefined stages, rewards are evaluated and policies are updated.
The user retains full control over this process within the Trainer.train method, allowing policy updates on either a per-stage or per-round basis.
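The sketch below captures that choice: the same round loop can call Trainer.train after every stage or once per round, depending on a flag. The loop, the objects it takes, and the flag are illustrative, not the actual GenRL control flow.

```python
# Illustrative round loop showing per-stage versus per-round policy updates.

def play_rounds(game, trainer, rewards, num_rounds, num_stages, update_every_stage=False):
    for round_idx in range(num_rounds):
        state = game.init_round(round_idx)                      # DataManager seeds the round
        for stage in range(num_stages):
            rollouts = trainer.generate_rollouts(state, stage)  # generate rollouts
            game.add_to_state(state, stage, rollouts)           # record and share with the swarm
            if update_every_stage:
                trainer.train(state, rewards.score(rollouts))   # per-stage update
        if not update_every_stage:
            trainer.train(state, rewards.score(state))          # per-round update after all stages
```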
