Troubleshooting

Stuck? Get unblocked with RL Swarm on Windows (WSL 2), Linux, or macOS, or reach out to support for more help.

Overview

This troubleshooting guide provides a complete reference for diagnosing and resolving known issues when installing, running, or maintaining an RL Swarm node across Windows (WSL 2), Linux, and macOS.

Installation and Dependency Issues

This section covers:

  • "Command not found" errors for Python, Docker, and Git

  • Build failures and/or missing libraries

  • Permission issues (denials) when running Docker/Python scripts

Update your Package Manager

On Debian or Ubuntu (including WSL 2), run the following command to refresh the package index and upgrade installed packages:

sudo apt update && sudo apt upgrade -y

Install Missing Packages

You may be missing dependencies. To install any that are missing, run this command (packages that are already installed are left as they are):

sudo apt install -y python3 python3-venv python3-pip curl wget git docker.io build-essential

Verify your Python Version

RL Swarm requires Python 3.10 or higher.

Check your version by running python3 --version. If it reports anything older than 3.10, upgrade through your package manager or pyenv.
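
The version check above can be scripted. The following is a minimal sketch, assuming python3 prints its version in the form "Python X.Y.Z"; the helper name python_version_ok is hypothetical and is not part of RL Swarm.

```shell
# Return 0 if a "major.minor[.patch]" version string is at least 3.10.
python_version_ok() {
  pv_major=${1%%.*}        # text before the first dot
  pv_rest=${1#*.}          # text after the first dot
  pv_minor=${pv_rest%%.*}  # minor version only
  [ "$pv_major" -gt 3 ] || { [ "$pv_major" -eq 3 ] && [ "$pv_minor" -ge 10 ]; }
}

if command -v python3 >/dev/null 2>&1; then
  # Keep only the version number from the "Python X.Y.Z" output.
  found=$(python3 --version 2>&1 | awk '{print $2}')
  if python_version_ok "$found"; then
    echo "Python $found is new enough"
  else
    echo "Python $found is too old; RL Swarm needs 3.10 or higher"
  fi
else
  echo "python3 not found on PATH"
fi
```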

Configuring Docker

Many Docker-related issues arise from memory allocation constraints or ports which are already in use.

Start the Docker Daemon

Run the following command to enable the Docker daemon at boot and start it now:

sudo systemctl enable docker && sudo systemctl start docker

sudo may prompt you for your password.

Test Docker

The command docker run hello-world should print “Hello from Docker!”.

If it doesn't, reinstall Docker (Windows [WSL 2] and Linux) or restart Docker Desktop (macOS).

Memory Allocation

Increase container memory in Docker Desktop under Settings > Resources > Advanced > Memory. Set it as high as your machine allows; at least 16 GB is recommended.
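
To confirm how much memory Docker can currently use, you can read the value back from the Docker CLI. This is a minimal sketch built on docker info and its MemTotal field; the helper name bytes_to_gib is hypothetical.

```shell
# Convert a byte count into GiB with one decimal place.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1073741824 }'
}

# docker info reports the total memory visible to the Docker engine, in bytes.
if mem_bytes=$(docker info --format '{{.MemTotal}}' 2>/dev/null) && [ -n "$mem_bytes" ]; then
  echo "Docker can use about $(bytes_to_gib "$mem_bytes") GiB of memory"
else
  echo "Could not query Docker; is the daemon running?"
fi
```

The reported figure is usually slightly below the nominal size (a 16 GB allocation may show as 15.x GiB), so treat it as an approximate check rather than an exact threshold.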

Docker and Virtualization Issues

Builds sometimes hang or crash, and the Docker daemon can become unreachable.

This section covers failures to connect to the Docker daemon and docker-compose syntax issues:

  • Try the alternate syntax for modern Docker: docker compose (no hyphen). If that fails, fall back to docker-compose.

  • Ensure virtualization is enabled in BIOS / Firmware.

  • WSL 2 users: enable WSL integration inside Docker Desktop Settings > Resources > WSL Integration, then select your distribution.

  • Linux GPU users: verify that your NVIDIA drivers are up to date and the CUDA toolkit is installed. The nvidia-smi command must show a running driver.

  • macOS users: RL Swarm runs in CPU-only mode on macOS. GPU mode is not supported.
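
The Compose fallback in the first bullet above can be automated. This is a minimal sketch that prefers docker compose and falls back to docker-compose; the function name compose_cmd is hypothetical.

```shell
# Print the Compose command available on this machine, preferring the plugin form.
compose_cmd() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    return 1
  fi
}

if cmd=$(compose_cmd); then
  echo "Use: $cmd"
else
  echo "No Docker Compose found; install Docker Desktop or the Compose plugin"
fi
```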

Login and Identity Issues

If you're experiencing issues logging in, this section provides quick fixes for login modal issues, peer identity issues, and more.

Issue
Fix

Browser window never opens for login

  1. Manually open the login URL by typing http://localhost:3000 into your browser.

  2. If you're using a VM/VPS, add the port-forwarding flag -L 3000:localhost:3000 to your ssh command when connecting.

Login modal fails to load, or OTP not sent to email

  1. Upgrade viem to version 2.25.0 inside modal-login/package.json.

  2. Run cd modal-login && yarn upgrade && yarn add next@latest viem@latest.

Login works, but training fails after re-login

Delete the old peer identity with sudo rm swarm.pem, then re-run RL Swarm and log in again with the same email.

Lost swarm.pem identity

You must generate a new one using the same email to retain your on-chain account.

Running multiple nodes

Use the same email login for each node. Each node has its own peer ID, but all of them share the same EOA (externally owned account).

VPS login fallbacks

If port 3000 is blocked, you can use a temporary tunnel such as Cloudflare or ngrok, provided you are comfortable with networking tools.
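
For the VM/VPS port-forwarding fix listed above, the full SSH invocation might look like the sketch below. The function name open_login_tunnel and the DRY_RUN switch are hypothetical, and the SSH target (for example user@your-vps) is a placeholder you must replace with your own.

```shell
# Connect to the VPS while forwarding local port 3000 to port 3000 on the remote side.
# $1 is the SSH target, e.g. user@your-vps (placeholder).
# Set DRY_RUN=1 to print the command instead of running it.
open_login_tunnel() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "ssh -L 3000:localhost:3000 $1"
  else
    ssh -L 3000:localhost:3000 "$1"
  fi
}

# Example: DRY_RUN=1 open_login_tunnel user@your-vps
```

While that session stays open, opening http://localhost:3000 in the browser on your local machine reaches the login server running on the VPS.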

Training and Performance Issues

Some commonly reported training issues are false alarms, whereas others require manual intervention.

Symptoms we've seen:

  • Training appears stuck, or isn't progressing: Consumer-grade CPUs, especially in MacBooks, can take 20 minutes or more per training cycle. Please be patient!

  • "Skipped round" messages: This is normal. It means your machine was slower than the swarm round pace.

  • OOM (Out of Memory) errors: Try closing other applications and increasing the Docker memory allocation as mentioned above.

  • High CPU usage and/or thermal throttling: This is normal if you're training in CPU-only mode. If your device has a supported GPU, try switching to GPU mode.

  • "GPU not detected" warnings: Confirm that your drivers are correctly installed and recognized, and that the container is launched using swarm-gpu.

To force CPU-only mode explicitly, use the swarm-cpu command.
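
Before choosing between swarm-gpu and swarm-cpu, you can check whether a working NVIDIA driver is visible. This is a minimal sketch: the function name suggest_swarm_container is hypothetical, and it only prints a suggestion rather than launching anything.

```shell
# Print the container name that matches this machine's hardware.
# $1 is the GPU probe command; it defaults to nvidia-smi.
suggest_swarm_container() {
  probe="${1:-nvidia-smi}"
  if [ "$(uname -s)" = "Darwin" ]; then
    echo "swarm-cpu"   # macOS runs CPU-only
  elif command -v "$probe" >/dev/null 2>&1 && "$probe" >/dev/null 2>&1; then
    echo "swarm-gpu"   # probe exists and reports a running driver
  else
    echo "swarm-cpu"
  fi
}

echo "Suggested container: $(suggest_swarm_container)"
```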

Network and Connectivity Issues

Your firewall may need to be configured to allow outbound traffic from Docker, or you may be in a region where RL Swarm is currently unavailable.

Common connection issues:

  • Node doesn't appear on the dashboard: Check your internet connection and make sure the firewall allows outbound traffic from Docker. Also, visit the Gensyn Dashboard and confirm that your node is visible under RL Swarm.

  • Prediction Market bets are not visible: Make sure you answered 'Y' when asked to join the Prediction Market. Rerun the script if necessary.

  • VPS connection drops: If the SSH tunnel breaks and you see “broken pipe” errors, press Ctrl+C to kill the script, then restart RL Swarm. It should re-initialize cleanly.

Logs and Diagnostics

Browse the table below to find the most useful log types and locations inside the rl-swarm repository.

Location
Type

/logs/yarn.log

Modal login server activity.

/logs/swarm.log

The main application log.

/logs/wandb/

Training logs and debug.log for Weights & Biases (if this is enabled).

/logs/prg_record.txt and swarm_launcher.log

Prediction Market details.

How to Interpret Logs

Many warnings (e.g., Protobuf "yanked version") are benign and can be safely ignored.

When scanning your logs, search for lines containing ERROR, RuntimeError, or Traceback to locate the actual failure points.

When posting to Discord or GitHub, please attach the relevant section of swarm.log as well as your system info.
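
The search described above can be done with grep. This is a minimal sketch, assuming you run it from the rl-swarm repository root so that the main log is at logs/swarm.log; the function name find_failures and the SWARM_LOG override are hypothetical.

```shell
# Print lines that contain a failure marker, prefixed with their line numbers.
# grep exits with status 1 when nothing matches.
find_failures() {
  grep -nE 'ERROR|RuntimeError|Traceback' "$1"
}

log="${SWARM_LOG:-logs/swarm.log}"
if [ ! -f "$log" ]; then
  echo "$log not found; run this from the rl-swarm repository root"
elif ! find_failures "$log"; then
  echo "No ERROR, RuntimeError, or Traceback lines in $log"
fi
```

The line numbers in the output make it easier to copy the surrounding section of the log into a bug report.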

Advanced and Recovery Scenarios

Below are some specific scenarios you may run into when running multiple nodes, or nodes on different machines.

  • Moving to a new machine: Back up your swarm.pem from the repo's root, then copy it into the same directory on the new machine before launching RL Swarm.

  • Running multiple GPUs or peers: Install RL Swarm separately for each GPU, and expose each peer under a different port.

  • Clean rebuilds: Stop all containers and processes with Ctrl+C, then run docker system prune -a to remove stopped containers and unused images. Delete .venv and re-clone the repository if necessary.

  • Using tunneling tools in cases where the login port is blocked: This is only for advanced users who are comfortable with network tools. Use the simplest tool that works, since the local login method is recommended for security and reliability. Tools include Cloudflare, ngrok, or localtunnel.
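
For the "moving to a new machine" scenario above, the backup step could be scripted as follows. This is a minimal sketch: the function name backup_identity and the backup directory are hypothetical, and restricting the copy to owner-only permissions is a precaution added here, not an RL Swarm requirement.

```shell
# Copy the identity file into a backup directory and restrict its permissions.
# $1: path to swarm.pem, $2: backup directory (both supplied by the caller).
backup_identity() {
  if [ ! -f "$1" ]; then
    echo "No identity file found at $1" >&2
    return 1
  fi
  mkdir -p "$2" &&
    cp -p "$1" "$2/swarm.pem" &&
    chmod 600 "$2/swarm.pem"
}

# Example, run from the repository root:
#   backup_identity ./swarm.pem "$HOME/swarm-backup"
```

If swarm.pem is owned by root, the copy may need sudo. Once backed up, transfer the file to the repository root on the new machine with whatever file-copy tool you normally use.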

When to Escalate

If you're experiencing an issue that none of the above steps are able to resolve, we're here to help.

  1. Check the GitHub Issues page to see if your issue has already been reported.

  2. If you open a new issue or ask for help on Discord, please include the operating system and version, CPU and GPU model, amount of RAM, and as much context on the error(s) as possible.
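
The system details requested in step 2 can be gathered in one go. This is a minimal sketch using standard Linux and macOS tools; the function name collect_system_info is hypothetical, and the GPU line only covers NVIDIA cards visible to nvidia-smi.

```shell
# Print OS, CPU, RAM, and GPU details suitable for pasting into a bug report.
collect_system_info() {
  echo "OS: $(uname -srm)"
  case "$(uname -s)" in
    Linux)
      if [ -r /etc/os-release ]; then
        echo "Distro: $(grep '^PRETTY_NAME=' /etc/os-release | cut -d= -f2- | tr -d '"')"
      fi
      echo "CPU: $(grep -m1 'model name' /proc/cpuinfo 2>/dev/null | cut -d: -f2- | sed 's/^ *//')"
      echo "RAM: $(grep MemTotal /proc/meminfo 2>/dev/null | awk '{print $2, $3}')"
      ;;
    Darwin)
      echo "macOS: $(sw_vers -productVersion)"
      echo "CPU: $(sysctl -n machdep.cpu.brand_string)"
      echo "RAM bytes: $(sysctl -n hw.memsize)"
      ;;
  esac
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "GPU: $(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null)"
  else
    echo "GPU: none detected via nvidia-smi"
  fi
}

collect_system_info
```

On WSL 2, run this inside your Linux distribution so the output reflects the environment RL Swarm actually runs in.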
