Troubleshooting

Common errors, their causes, and how to fix them.

Common Errors

Quick fixes for Docker, CLI, and compilation issues you might hit while using REE.

| Error | Cause | Fix |
| --- | --- | --- |
| `PermissionError: [Errno 13] Permission denied: '/gensyn'` | The container runs as the non-root `gensyn` user and can't write to root-owned paths | Use `/tmp/` paths for ephemeral runs, or mount a volume with `-v` |
| `one of the arguments --tasks-root --task-dir is required` | Missing required output directory argument | Add `--task-dir /tmp/task` or `--tasks-root /tmp/tasks` |
| `one of the arguments --prompt-text --prompt-file is required` | Missing prompt input | Add `--prompt-text "your prompt"` or `--prompt-file path.jsonl` |
| `argument command: invalid choice: 'bash'` | Trying to launch a shell while the entrypoint is locked to `gensyn-sdk` | Override the entrypoint: `docker run -it --entrypoint bash ree` |
| Gibberish / nonsensical output | `hf-internal-testing/tiny-random-LlamaForCausalLM` has random, untrained weights | Expected behavior for test models; use a real model for meaningful output |
| Shell hangs after pasting a command | Trailing `\` on the last line of the command | Remove the backslash from the final line |
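For instance, the `PermissionError` in the first row can be sidestepped in either of two ways (a sketch reusing the `ree` image and flags shown on this page; `<model>` is a placeholder for your model name):

```shell
# Ephemeral: write task artifacts under /tmp inside the container
docker run ree run \
  --task-dir /tmp/task \
  --model-name <model> \
  --prompt-text "your prompt"

# Persistent: mount a writable host directory over /gensyn
docker run -v ~/.cache/gensyn:/gensyn ree run \
  --tasks-root /gensyn/tasks \
  --model-name <model> \
  --prompt-text "your prompt"
```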

Compiler Trace Warnings

When running REE, you may see trace warnings and verbose compiler output in your terminal. These are expected and can be safely ignored. They originate from the ONNX export and MLIR compilation stages.

`--tasks-root` vs. `--task-dir`

If you see errors about missing artifacts, make sure you're using consistent location flags. With `--tasks-root`, the task directory is derived automatically from the model name. With `--task-dir`, you point to an explicit directory and must pass the same path to every operation that touches the same task.
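A sketch of consistent usage, reusing the `ree run` flags shown elsewhere on this page (`<model>` is a placeholder):

```shell
# Option A: --tasks-root derives the task directory from the model name
docker run -v ~/.cache/gensyn:/gensyn ree run \
  --tasks-root /gensyn/tasks \
  --model-name <model> \
  --prompt-text "..."

# Option B: --task-dir pins an explicit directory; pass the SAME path
# to every later operation that reads this task's artifacts
docker run -v ~/.cache/gensyn:/gensyn ree run \
  --task-dir /gensyn/tasks/my-task \
  --model-name <model> \
  --prompt-text "..."
```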

CUDA Not Available

If you're running on a machine with a GPU but REE doesn't detect it, ensure you're passing the `--gpus all` flag to Docker:

```shell
docker run --gpus all -v ~/.cache/gensyn:/gensyn ree run \
  --tasks-root /gensyn/tasks \
  --model-name <model> \
  --prompt-text "..." \
  --operation-set reproducible
```

Use `--cpu-only` to explicitly force CPU execution when a GPU is not available or not desired.
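For example (a sketch; assumes `--cpu-only` is accepted alongside the `run` flags shown above):

```shell
docker run -v ~/.cache/gensyn:/gensyn ree run \
  --cpu-only \
  --tasks-root /gensyn/tasks \
  --model-name <model> \
  --prompt-text "..."
```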

Out-of-Memory (OOM) Issues

If you are using Docker Desktop, you may need to raise its memory limit. Otherwise, attempting to run larger models (those with a higher parameter count) can fail during model loading or the checkpoint "sharding" phase.

This typically shows up as a `run:failed` status with exit code 137, and the logs will show the process dying partway through "Loading checkpoint shards."
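Exit code 137 is the shell's encoding of 128 + 9, meaning the process was killed by SIGKILL, which is the signal the kernel's OOM killer sends. Plain Python (nothing REE-specific) can decode such codes:

```python
import signal

def decode_exit_code(code: int) -> str:
    """Explain a process exit code: values above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited normally with status {code}"

print(decode_exit_code(137))  # killed by SIGKILL (signal 9)
```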

NaN Errors & Crashes with Certain Models

Some FP16 models, particularly certain Qwen 2.5 Instruct variants, may produce NaN (Not a Number) errors and crash when run in default or deterministic mode. This is a numerical stability issue: attention score calculations can overflow the FP16 value range during inference.

If you encounter this, try switching to reproducible mode (`--operation-set reproducible`), which handles these edge cases more gracefully. Note that even in reproducible mode, some affected models may still produce degraded output quality (repetitive text or unexpected tokens).


This is a known limitation related to the ONNX export pipeline's use of FP16 precision and is being actively addressed in future releases.
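The failure mode can be reproduced outside REE with a few lines of NumPy: a naive FP16 softmax overflows to infinity and yields NaN, while the standard max-subtraction trick stays finite. This is an illustrative sketch, not REE's actual attention code:

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)            # exp(20) ~ 4.9e8 exceeds FP16's max of 65504 -> inf
    return e / e.sum()       # inf / inf -> NaN

def stable_softmax(x):
    e = np.exp(x - x.max())  # shift so the largest exponent is exp(0) = 1
    return e / e.sum()

# Moderately large attention-style logits are enough to break FP16
scores = np.array([20.0, 20.0, 20.0], dtype=np.float16)

print(np.isnan(naive_softmax(scores)).any())  # True: FP16 overflow produced NaN
print(stable_softmax(scores))                 # each entry ~ 1/3, no NaN
```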