How it Works
RL Swarm is a multi-stage, collaborative training system. Below is a brief summary of how it works, including what happens at each stage.
| Stage | What the Model Does | What the Model Shares with the Swarm | How the Model Learns |
| --- | --- | --- | --- |
| 1 (Generate) | The model receives a question from the dataset and writes several possible answers. It picks the best one based on feedback from the reward function. | The model shares its best answer. | A reward function scores every draft answer for accuracy and format. The score of the best answer is used to update the model's weights (a reward sketch follows the table). |
| 2 (Critique) | The model reviews the question plus peer answers from Stage 1. It writes several critiques, each naming the peer answer it thinks is best. It then picks its own best critique based on feedback from the reward function. | The model shares its best critique and the peer answer it thinks is best. | A reward function scores every critique for clarity and correct formatting. The score of the best critique is used to update the weights, so the model gets better at spotting strong answers and explaining why. |
| 3 (Vote) | The model reviews the question, peer answers, and peer critiques, then predicts the answer the swarm will prefer. It also revises its own answer accordingly. | Nothing - outputs in this stage remain local. | The model's revised answer is graded for correctness and proper format, using the same reward function as Stage 1. If the model's prediction of the winning peer answer matches the actual majority choice, it earns an extra bonus point. The answer score and the consensus bonus are added together and used as the reward for updating the model's weights (a consensus-bonus sketch follows the table). |
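To make the scoring in Stages 1 and 2 concrete, here is a minimal Python sketch of how a per-draft reward might combine accuracy and format, and how the best draft is selected. The function names, the `<answer>` template, and the 0.5 format weight are illustrative assumptions, not RL Swarm's actual reward implementation; Stage 2 follows the same pattern, scoring critiques for clarity and formatting instead of answers.

```python
# Minimal sketch of the per-draft scoring described for Stage 1 (Stage 2 follows
# the same pattern, scoring critiques for clarity and formatting instead).
# All names, the <answer> template, and the 0.5 format weight are illustrative
# assumptions, not RL Swarm's actual reward implementation.
import re
from typing import List, Tuple


def format_score(draft: str) -> float:
    """1.0 if the draft follows the expected answer template, else 0.0."""
    return 1.0 if re.search(r"<answer>.*?</answer>", draft, re.DOTALL) else 0.0


def accuracy_score(draft: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference answer, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", draft, re.DOTALL)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0


def draft_reward(draft: str, reference: str) -> float:
    """Combine accuracy and format into a single scalar reward."""
    return accuracy_score(draft, reference) + 0.5 * format_score(draft)


def best_draft(drafts: List[str], reference: str) -> Tuple[str, float]:
    """Pick the highest-scoring draft; its reward drives the weight update."""
    scored = [(draft, draft_reward(draft, reference)) for draft in drafts]
    return max(scored, key=lambda pair: pair[1])


# Example: the first draft scores 1.5 (correct and well-formatted) and is the
# one shared with the swarm; the second scores 0.0 because it ignores the template.
answer, reward = best_draft(["<answer>42</answer>", "It is 42."], "42")
```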
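Stage 3's reward can be sketched the same way: the revised answer is scored with the same function as Stage 1, and a bonus is added when the model's predicted winner matches the swarm's actual majority vote. Again, the names and the size of the bonus are assumptions for illustration only.

```python
# Minimal sketch of the Stage 3 reward: answer score (graded as in Stage 1)
# plus a bonus when the model's predicted winner matches the swarm's actual
# majority choice. Names and the bonus value are illustrative assumptions.
from collections import Counter
from typing import List


def majority_choice(peer_votes: List[str]) -> str:
    """The peer answer chosen most often across the swarm."""
    return Counter(peer_votes).most_common(1)[0][0]


def stage3_reward(
    answer_score: float,           # score of the revised answer, from the Stage 1 reward function
    predicted_winner: str,         # peer answer this model predicted the swarm would prefer
    peer_votes: List[str],         # peer answers actually chosen by other swarm members
    consensus_bonus: float = 1.0,  # extra point for a correct prediction (assumed value)
) -> float:
    """Answer score plus consensus bonus; this sum is the reward used to update the weights."""
    bonus = consensus_bonus if predicted_winner == majority_choice(peer_votes) else 0.0
    return answer_score + bonus
```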
Throughout this process, RL Swarm teaches the model to answer accurately, critique peers constructively, and align with the swarm consensus. As the swarm grows in size and diversity, training becomes more effective. To understand why, we encourage you to read the following papers: