How it Works

RL Swarm is a multi-stage, collaborative training system. Below is a brief summary of how it works, including what happens at each stage.

Stage 1 (Generate)

  • What the model does: The model receives a question from the dataset and writes several possible answers, picking the best one based on feedback from the reward function.
  • What it shares with the swarm: Its best answer.
  • How it learns: A reward function scores every draft answer for accuracy and format. The score of the best answer is used to update the model’s weights.

Stage 2 (Critique)

  • What the model does: The model reviews the question plus the peer answers from Stage 1. It writes several critiques, each naming which peer answer it thinks is best, then picks its own best critique based on feedback from the reward function.
  • What it shares with the swarm: Its best critique and the peer answer it thinks is best.
  • How it learns: A reward function scores every critique for clarity and correct formatting. The score of the best critique is used to update the weights, so the model gets better at spotting strong answers and explaining why.

Stage 3 (Vote)

  • What the model does: The model reviews the question, peer answers, and peer critiques, predicts the answer the swarm will prefer, and revises its own answer accordingly.
  • What it shares with the swarm: Nothing; outputs in this stage remain local.
  • How it learns: The model’s revised answer is graded for correctness and format using the same reward function as Stage 1. If its prediction of the winning peer answer matches the swarm’s actual majority choice, it earns a consensus bonus. The answer score and consensus bonus are summed and used as the reward for updating the model’s weights.
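To make the learning signal at each stage concrete, here is a minimal Python sketch of the reward flow described above. All names (ToyModel, best_of, answer_reward, and so on) are hypothetical placeholders rather than the actual RL Swarm API; the point is only to show which score drives the weight update at each stage.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyModel:
    """Stand-in for the language model being trained in the swarm."""
    rewards_seen: List[float] = field(default_factory=list)

    def update_weights(self, reward: float) -> None:
        # Placeholder for a real RL update step (e.g. a policy-gradient update).
        self.rewards_seen.append(reward)


def best_of(candidates: List[str], score: Callable[[str], float]):
    """Return (best_score, best_candidate) under the given reward function."""
    return max((score(c), c) for c in candidates)


def stage_1_generate(model: ToyModel, drafts: List[str],
                     answer_reward: Callable[[str], float]) -> str:
    # Every draft is scored for accuracy and format; the best score drives the update.
    best_score, best_answer = best_of(drafts, answer_reward)
    model.update_weights(best_score)
    return best_answer  # shared with the swarm


def stage_2_critique(model: ToyModel, critiques: List[str],
                     critique_reward: Callable[[str], float]) -> str:
    # Every critique is scored for clarity and formatting; the best score drives the update.
    best_score, best_critique = best_of(critiques, critique_reward)
    model.update_weights(best_score)
    return best_critique  # shared, along with the peer answer it names as best


def stage_3_vote(model: ToyModel, revised_answer: str, predicted_winner: str,
                 majority_winner: str, answer_reward: Callable[[str], float]) -> None:
    # Same answer reward as Stage 1, plus a consensus bonus when the model's
    # prediction matches the swarm's actual majority choice. Nothing is shared.
    reward = answer_reward(revised_answer)
    if predicted_winner == majority_winner:
        reward += 1.0  # consensus bonus
    model.update_weights(reward)
```

In the real system these rewards would feed an actual reinforcement-learning update rather than being appended to a list; the sketch only mirrors the structure of the three stages.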

Throughout this process, RL Swarm teaches the model to answer accurately, critique peers constructively, and align with the swarm consensus. As the swarm grows in size and diversity, training becomes more effective. To understand why, we encourage you to read the following papers:

  • Report: Collaborative Post Training with RL Swarm
  • Improving Factuality and Reasoning in Language Models through Multiagent Debate
  • Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks