Troubleshooting

Common errors, their causes, and how to fix them.

Common Errors

Quick fixes for Docker, CLI, and compilation issues you might hit while using REE.

| Error | Cause | Fix |
| --- | --- | --- |
| `PermissionError: [Errno 13] Permission denied: '/gensyn'` | The container runs as the non-root `gensyn` user and can't write to root-owned paths | Use `/tmp/` paths for ephemeral runs, or mount a volume with `-v` |
| `one of the arguments --tasks-root --task-dir is required` | Missing required output directory argument | Add `--task-dir /tmp/task` or `--tasks-root /tmp/tasks` |
| `one of the arguments --prompt-text --prompt-file is required` | Missing prompt input | Add `--prompt-text "your prompt"` or `--prompt-file path.jsonl` |
| `argument command: invalid choice: 'bash'` | Trying to launch a shell while the entrypoint is locked to `gensyn-sdk` | Override the entrypoint: `docker run -it --entrypoint bash ree` |
| Gibberish / nonsensical output | `hf-internal-testing/tiny-random-LlamaForCausalLM` has random, untrained weights | Expected behavior for test models; use a real model for meaningful output |
| Shell hangs after pasting a command | Trailing `\` on the last line of the command | Remove the backslash from the final line |
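For instance, the `PermissionError` in the first row can be sidestepped in either of two ways (a sketch reusing the `ree` image and flags shown on this page; `<model>` is a placeholder for your model name):

```shell
# Ephemeral: write task artifacts under /tmp inside the container
docker run ree run \
  --task-dir /tmp/task \
  --model-name <model> \
  --prompt-text "your prompt"

# Persistent: mount a writable host directory over /gensyn
docker run -v ~/.cache/gensyn:/gensyn ree run \
  --tasks-root /gensyn/tasks \
  --model-name <model> \
  --prompt-text "your prompt"
```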

Compiler Trace Warnings

When running REE, you may see trace warnings and verbose compiler output in your terminal. These are expected and can be safely ignored. They originate from the ONNX export and MLIR compilation stages.

`--tasks-root` vs. `--task-dir`

If you see errors about missing artifacts, make sure you're using consistent location flags. With `--tasks-root`, the task directory is derived automatically from the model name. With `--task-dir`, you point to an explicit directory and must pass the same path to every operation that touches the same task.
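A sketch of consistent usage, reusing the `ree run` flags shown elsewhere on this page (`<model>` is a placeholder):

```shell
# Option A: --tasks-root derives the task directory from the model name
docker run -v ~/.cache/gensyn:/gensyn ree run \
  --tasks-root /gensyn/tasks \
  --model-name <model> \
  --prompt-text "..."

# Option B: --task-dir pins an explicit directory; pass the SAME path
# to every later operation that reads this task's artifacts
docker run -v ~/.cache/gensyn:/gensyn ree run \
  --task-dir /gensyn/tasks/my-task \
  --model-name <model> \
  --prompt-text "..."
```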

CUDA Not Available

If you're running on a machine with a GPU but REE doesn't detect it, ensure you're passing the `--gpus all` flag to Docker:

```shell
docker run --gpus all -v ~/.cache/gensyn:/gensyn ree run \
  --tasks-root /gensyn/tasks \
  --model-name <model> \
  --prompt-text "..." \
  --operation-set reproducible
```

Use `--cpu-only` to explicitly force CPU execution when a GPU is not available or not desired.
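For example (a sketch; assumes `--cpu-only` is accepted alongside the `run` flags shown above):

```shell
docker run -v ~/.cache/gensyn:/gensyn ree run \
  --cpu-only \
  --tasks-root /gensyn/tasks \
  --model-name <model> \
  --prompt-text "..."
```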

Out-of-Memory (OOM) Issues

If you are using Docker Desktop, you may need to raise its memory limit. Otherwise, attempting to run larger models (those with a higher parameter count) can fail during model loading or the checkpoint "sharding" phase.

This typically shows up as a `run:failed` status with exit code 137, and the logs will show the process dying partway through "Loading checkpoint shards."
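Exit code 137 is the shell's encoding of 128 + 9, meaning the process was killed by SIGKILL, which is the signal the kernel's OOM killer sends. Plain Python (nothing REE-specific) can decode such codes:

```python
import signal

def decode_exit_code(code: int) -> str:
    """Explain a process exit code: values above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited normally with status {code}"

print(decode_exit_code(137))  # killed by SIGKILL (signal 9)
```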

NaN Errors & Crashes with Certain Models

Some FP16 models, particularly certain Qwen 2.5 Instruct variants, may produce NaN (Not a Number) errors and crash when run in default or deterministic mode. This is a numerical stability issue: attention score calculations can overflow the FP16 value range during inference.

If you encounter this, try switching to reproducible mode (`--operation-set reproducible`), which handles these edge cases more gracefully. Note that even in reproducible mode, some affected models may still produce degraded output quality (repetitive text or unexpected tokens).


This is a known limitation related to the ONNX export pipeline's use of FP16 precision and is being actively addressed in future releases.
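The failure mode can be reproduced outside REE with a few lines of NumPy: a naive FP16 softmax overflows to infinity and yields NaN, while the standard max-subtraction trick stays finite. This is an illustrative sketch, not REE's actual attention code:

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)            # exp(20) ~ 4.9e8 exceeds FP16's max of 65504 -> inf
    return e / e.sum()       # inf / inf -> NaN

def stable_softmax(x):
    e = np.exp(x - x.max())  # shift so the largest exponent is exp(0) = 1
    return e / e.sum()

# Moderately large attention-style logits are enough to break FP16
scores = np.array([20.0, 20.0, 20.0], dtype=np.float16)

print(np.isnan(naive_softmax(scores)).any())  # True: FP16 overflow produced NaN
print(stable_softmax(scores))                 # each entry ~ 1/3, no NaN
```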