Advanced Usage & CLI Reference

Skip the TUI and work directly with the SDK CLI, pipeline stages, and container configuration.

This section covers the internals of REE for power users, integrators, and anyone who wants to understand what happens under the hood or drive REE directly from the command line.

CLI Reference

While the TUI is the recommended interface, REE can also be driven directly via the gensyn-sdk CLI inside the container, or via the ree.sh shell script included in the repository.

This is useful for scripting, CI pipelines, or when you need fine-grained control over the pipeline.

Global Flags

--verbose is a global flag and must appear before the subcommand:

gensyn-sdk --verbose run ...
| Flag | Description |
| --- | --- |
| --verbose | Enable debug-level logging. |

Location Flags

Every command requires exactly one of the following (they are mutually exclusive):

| Flag | Description |
| --- | --- |
| --tasks-root <path> | Root directory for tasks. The task directory is derived as <tasks-root>/<sanitized-model-name>. This is the recommended default. |
| --task-dir <path> | Use this exact directory for inputs and outputs. Use when you want to pin artifacts to a specific path. |

Note: When using --tasks-root, the SDK auto-creates a subdirectory named after the model using path-safe characters. For example, --tasks-root /tmp/tasks with model Qwen/Qwen3-0.6B creates /tmp/tasks/Qwen--Qwen3-0.6B/.
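The derivation can be sketched in shell, assuming the sanitization simply replaces / with -- (the SDK may rewrite other unsafe characters as well):

```shell
# Derive the task directory the way --tasks-root does, assuming the
# only transformation is replacing '/' with '--' (an assumption; the
# SDK may sanitize other characters too).
model="Qwen/Qwen3-0.6B"
safe=$(printf '%s' "$model" | sed 's#/#--#g')
echo "/tmp/tasks/$safe/"   # → /tmp/tasks/Qwen--Qwen3-0.6B/
```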

Run (Subcommand)

Runs the full pipeline: [1] prepare, [2] generate, [3] receipt, and [4] decode.

Required Flags:

| Flag | Description |
| --- | --- |
| --model-name / -m | Hugging Face model ID (e.g., Qwen/Qwen3-0.6B). |
| --prompt-text or --prompt-file | The prompt to run. Exactly one of the two is required; they are mutually exclusive. |

Note: --operation-set is not a required flag; it defaults to reproducible mode. Pass it only if you want to switch to deterministic mode.

Optional Flags:

| Flag | Default | Description |
| --- | --- | --- |
| --model-revision | main | Specific Hugging Face model revision. |
| --max-new-tokens | 300 | Maximum number of tokens to generate. |
| --cpu-only | false | Force CPU execution even if CUDA is available. |
| --force-model-export | false | Re-export the ONNX model even if one exists. |
| --disable-kv-cache | false | Disable the KV cache. |
| --short-circuit-length | (none) | Generation index at which to inject the short-circuit token. |
| --short-circuit-token | (none) | Token ID to inject when short-circuiting. |
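Putting the flags together, a typical invocation looks like this (the model ID, prompt, and paths are illustrative):

```shell
# Full pipeline run: prepare, generate, receipt, decode.
# Model ID, prompt, and tasks root are illustrative values.
gensyn-sdk run \
  --model-name Qwen/Qwen3-0.6B \
  --prompt-text "What is 2 + 2?" \
  --tasks-root /tmp/tasks \
  --max-new-tokens 300
```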

Validate (Subcommand)

Checks that a receipt is structurally valid by recomputing hashes and comparing them against the stored values. This does not re-run inference.

Required Flags:

| Flag | Description |
| --- | --- |
| --receipt-path | Path to the receipt JSON file to validate. |

Note: No location flags (--tasks-root / --task-dir) are needed, because validate only inspects the receipt file itself.
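For example (the receipt path is illustrative; the <timestamp> placeholder stands for the actual filename suffix):

```shell
# Structural validation only: recomputes hashes against stored values,
# does not re-run inference. Path is illustrative.
gensyn-sdk validate \
  --receipt-path /tmp/tasks/Qwen--Qwen3-0.6B/metadata/receipt_<timestamp>.json
```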

Verify (Subcommand)

Re-runs the full inference pipeline described in a receipt and compares the output against the receipt's claimed results. This is the strongest form of verification: it proves the result is reproducible on your hardware.

Required Flags:

| Flag | Description |
| --- | --- |
| --receipt-path | Path to the receipt JSON file to verify. |
| --tasks-root or --task-dir | Where to store re-execution artifacts. |

Optional Flags:

| Flag | Default | Description |
| --- | --- | --- |
| --cpu-only | false | Force CPU execution during verification. |

Note: verify needs both a receipt path (what to verify) and a location (where to put the re-run workspace). When using the TUI, --tasks-root is passed automatically.
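A typical invocation on the CLI (paths and the <timestamp> placeholder are illustrative):

```shell
# Re-run the full pipeline described in the receipt and compare results.
# Receipt path and tasks root are illustrative.
gensyn-sdk verify \
  --receipt-path /tmp/tasks/Qwen--Qwen3-0.6B/metadata/receipt_<timestamp>.json \
  --tasks-root /tmp/verify-tasks \
  --cpu-only
```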

Sampling Flags

These flags control the sampling behavior during generation.

In the TUI, they are passed via Extra Args. On the CLI, they are passed directly.

| Flag | Default | Description |
| --- | --- | --- |
| --do-sample / --no-do-sample | Enabled | Enable or disable stochastic sampling. |
| --temperature | 1.0 | Sampling temperature. Higher values are more random. |
| --top-k | 50 | Top-k sampling cutoff. |
| --top-p | 1.0 | Nucleus sampling threshold. |
| --min-p | Disabled | Min-p sampling threshold. |
| --repetition-penalty | 1.0 | Repetition penalty multiplier. |
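On the CLI, sampling flags are simply appended to the run command. An illustrative example (values chosen arbitrarily):

```shell
# Run with tighter sampling: lower temperature, nucleus cutoff, and a
# mild repetition penalty. All values here are illustrative.
gensyn-sdk run \
  --model-name Qwen/Qwen3-0.6B \
  --prompt-text "Explain deterministic inference in one sentence." \
  --tasks-root /tmp/tasks \
  --temperature 0.7 \
  --top-p 0.9 \
  --repetition-penalty 1.1
```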

Prompt Format (JSONL)

For CLI usage, prompts can be provided via --prompt-file using JSONL format. Each line must be either:

  • A JSON string: "What is 2 + 2?"

  • A JSON object with a prompt field: {"prompt": "Explain deterministic inference in one sentence."}

Here's an example prompts.jsonl file:
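```
"What is 2 + 2?"
{"prompt": "Explain deterministic inference in one sentence."}
```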

The Pipeline

Under the hood, REE's run command (whether triggered from the TUI or CLI) executes a four-stage pipeline: [1] prepare, [2] generate, [3] receipt, and [4] decode.

1. Prepare

Downloads the model from Hugging Face, exports it to ONNX format, tokenizes the prompt, and writes a task configuration file. All artifacts are written to the task directory.

Artifacts produced:

  • model/model.onnx: The exported ONNX model

  • model/tensors.binary: Serialized model weights

  • config.json: Task configuration (sampling settings, token limits, etc.)

  • prompt_tokens.parquet: Tokenized prompt

  • metadata/prepare.json: Prepare-stage metadata (model name, commit hash, config hash)

If model/model.onnx already exists in the task directory, prepare skips re-export and reuses it. Use --force-model-export to override this.

2. Generate

Loads the prepared ONNX model, compiles it through the Gensyn Compiler (applying RepOp kernels when --operation-set is reproducible), and runs the inference loop.

Artifacts produced:

  • output_tokens.parquet: Generated token IDs

  • metadata/generate.json: Generate-stage metadata (finish reasons, device info, operation set, seed)

  • compiled-artifacts-*: Compiler output directories

3. Receipt

Assembles a cryptographically hashed receipt from the prepare and generate metadata, config, and output tokens.

Artifacts produced:

  • metadata/receipt_<timestamp>.json: Hashed receipt for full replication and verification.
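To make the hashing concrete, here is a minimal sketch of how a validator can recompute an artifact hash and compare it to a stored value. SHA-256, the file name, and the receipt layout are illustrative assumptions, not the SDK's actual schema:

```shell
# Recompute a hash over an artifact and compare it to the value a
# receipt would store. SHA-256 and the file name are assumptions for
# illustration, not the SDK's actual schema.
printf '1 2 3\n' > output_tokens.example              # stand-in artifact
stored=$(sha256sum output_tokens.example | cut -d' ' -f1)    # value stored at receipt time
# ...later, validation recomputes the same hash over the same bytes:
recomputed=$(sha256sum output_tokens.example | cut -d' ' -f1)
if [ "$stored" = "$recomputed" ]; then echo "receipt hash matches"; fi
```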

4. Decode

Reads output_tokens.parquet, decodes the token IDs back into text using the model's tokenizer, and prints the result.

Persisting Data & Caching

The REE container mounts your host's ~/.cache directory into the container automatically. This persists both Hugging Face model downloads (~/.cache/huggingface) and SDK artifacts like ONNX exports, compiled models, and receipts (~/.cache/gensyn).

This means subsequent runs of the same model will skip the download and export steps automatically. No additional volume mounts or configuration are needed.

Note: When using the TUI, this caching is handled for you. The details above apply if you're running the container directly via the CLI.

EULA

Use of REE and its components (Gensyn SDK, Gensyn Compiler, RepOp kernels) is subject to the Gensyn End User License Agreement. Please review the EULA before use.
