# Internals

### Gensyn Compiler

The Gensyn Compiler converts ONNX-serialized ML models into PyTorch modules, optionally with reproducible RepOp kernels replacing standard operations.

It is an MLIR-based, multi-stage compiler with a Python execution layer.

#### How It Works

The compiler uses MLIR dialects to reason about the incoming model:

1. A dialect that determines which operations need to be lowered to RepOp kernels (rather than standard PyTorch kernels).
2. A dialect that generates the final PyTorch module from a given set of operations.
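
The dialects themselves are internal to the compiler. As a rough conceptual sketch in plain Python (not the actual MLIR implementation; all names below are illustrative), the two stages amount to:

```python
import torch

# Conceptual sketch only -- the real compiler implements these stages as
# MLIR dialects and passes, not Python functions.

def plan_kernels(op_names: list[str], requires_reproducibility: bool) -> dict[str, str]:
    # Stage 1: decide, per operation, whether it lowers to a RepOp kernel
    # or stays on the standard PyTorch kernel.
    kind = "repop" if requires_reproducibility else "torch"
    return {name: kind for name in op_names}

def emit_module(plan: dict[str, str]) -> torch.nn.Module:
    # Stage 2: generate the final PyTorch module from the chosen kernels.
    # (A placeholder here; the real stage emits the compiled module.)
    return torch.nn.Identity()

module = emit_module(plan_kernels(["MatMul", "Softmax"], requires_reproducibility=True))
```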

#### Python API

```python
from gensyn_mjolnir import convert, CompileOptions

# Convert an ONNX model to a PyTorch module with reproducible kernels
module = convert(
    "model/model.onnx",
    options=CompileOptions(requires_reproducibility=True)
)
```

#### convert()

`convert()` is the core function of the Gensyn Compiler. It takes an ONNX model and converts it into a PyTorch module that can be used for inference.

When `requires_reproducibility` is enabled (which it is by default), the compiler replaces standard PyTorch operations with RepOp kernels that guarantee bitwise-identical results across hardware.

It has two parameters:

* `onnx_model_or_path`: A `str`, `Path`, or `ModelProto`. Either a file path to an ONNX model on disk or an in-memory `ModelProto` object.
* `options`: A `CompileOptions` instance that configures the compilation process (described below). Defaults to reproducible mode.

```python
def convert(
    onnx_model_or_path: str | Path | ModelProto,
    options: CompileOptions = CompileOptions(),
) -> torch.nn.Module
```
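
Per the signature, `convert()` also accepts an in-memory `ModelProto`. A minimal sketch, assuming the same `model/model.onnx` file as in the earlier example:

```python
import onnx
from gensyn_mjolnir import convert, CompileOptions

# Load the ONNX graph first, then hand the in-memory ModelProto to
# convert() instead of a file path.
model_proto = onnx.load("model/model.onnx")

# Opt out of RepOp kernels: faster, but results may vary across hardware.
module = convert(
    model_proto,
    options=CompileOptions(requires_reproducibility=False),
)
```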

#### CompileOptions

`CompileOptions` controls how the compiler processes the model. In most cases the defaults are what you want: reproducible mode with symlinked tensors and a temporary artifacts directory.

The fields are:

* `artifacts_dir`: A `str` or `None` (the default). The directory where the compiler writes intermediate artifacts. If not set, a temporary directory is used (preserved when the `MJOLNIR_DEBUG` environment variable is set, which is useful for inspecting compiler output during debugging).
* `colocate_tensors`: A `bool` that is `False` by default. When set to `True`, it copies external tensor files into the artifacts directory. When `False`, it creates symbolic links instead.

{% hint style="info" %}
Symlinking is faster and saves disk space, but copying may be needed if you plan to move the artifacts directory to another location.
{% endhint %}

* `requires_reproducibility`: Also a `bool` but set to `True` by default. When `True`, the compiler replaces standard PyTorch operations with RepOp kernels for cross-hardware reproducibility. When `False`, it uses standard PyTorch kernels which are faster but not reproducible across different hardware.

```python
@dataclass(frozen=True, slots=True, kw_only=True)
class CompileOptions:
    artifacts_dir: str | None = None
    # Directory for compiler artifacts. Uses a temp directory if not specified
    # (preserved when MJOLNIR_DEBUG env var is set).

    colocate_tensors: bool = False
    # When True, copies external tensor files into the artifacts directory.
    # When False, creates symbolic links instead.

    requires_reproducibility: bool = True
    # When True, compiles with RepOp kernels for cross-hardware reproducibility.
    # When False, uses standard PyTorch kernels.
```
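
For instance, a minimal sketch that pins artifacts to a known directory and copies tensors into it (useful when the directory will be moved later, per the hint above; the directory name is a placeholder):

```python
from gensyn_mjolnir import convert, CompileOptions

# Write artifacts to a fixed location and copy external tensor files into
# it, so the whole directory can be relocated afterwards.
options = CompileOptions(
    artifacts_dir="./compile-artifacts",
    colocate_tensors=True,
)
module = convert("model/model.onnx", options=options)
```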

### RepOps

RepOps (Reproducible Operators) are purpose-built GPU kernels that guarantee bitwise-identical outputs regardless of hardware architecture. They cover the full set of operators needed for neural network inference and training.

For a standalone demo of RepOp kernels, see the [RepOps Demo repository](https://github.com/gensyn-ai/repops-demo).

#### How RepOps Achieve Cross-Hardware Reproducibility

* **Fixed reduction ordering:** Every kernel accumulates values in a single canonical order. The reduction tile size is fixed across all GPU architectures. All accumulation is in FP32 using fused multiply-add instructions.
* **Correctly rounded transcendentals:** Custom implementations of `exp`, `sin`, `tanh`, etc. that produce identical results on every CUDA-capable GPU.
* **Extended-precision arithmetic:** Operations like the error function (used in GELU) use extended-precision fixed-point arithmetic for cross-hardware consistency.
* **Architecture-adaptive output tiling:** Kernels adapt output tile dimensions to different GPU architectures (using available shared memory), but never change the reduction dimension, so reproducibility is preserved.
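
To see why a fixed accumulation order is necessary, here is a small illustration in plain Python (not RepOps code): floating-point addition is not associative, so summing the same values in a different order can change the result.

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

# Same values, two accumulation orders: the sums usually differ in the
# last bits because floating-point addition is not associative.
print(sum(xs) == sum(reversed(xs)))  # typically False
```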

### Pipeline Parallelism

Large models often exceed the memory of a single GPU. Pipeline parallelism splits the model's layers into sequential *partitions* (also called stages), each placed on a different GPU.

{% hint style="danger" %}
Pipeline parallelism is only possible on multi-GPU hosts. Enabling it via `--n-partitions` on a single-GPU host will result in a failed run.
{% endhint %}

A forward pass walks the input through partition 1, then passes its activations to partition 2, and so on, with each GPU holding only its own slice of the weights. This trades a single large memory footprint for several smaller ones, making it possible to run models that would otherwise not fit on one device.
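
As a toy PyTorch illustration of the idea (not REE's implementation; the layer sizes and device names are placeholders for a two-GPU host):

```python
import torch

# Eight layers split into two partitions, one per GPU.
layers = [torch.nn.Linear(256, 256) for _ in range(8)]
part1 = torch.nn.Sequential(*layers[:4]).to("cuda:0")
part2 = torch.nn.Sequential(*layers[4:]).to("cuda:1")

x = torch.randn(1, 256, device="cuda:0")
h = part1(x)                # activations computed on GPU 0
y = part2(h.to("cuda:1"))   # handed off to GPU 1 for the second partition
```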

{% hint style="info" %}
Pipeline parallelism is orthogonal to the operation set. You can combine `--n-partitions` with `default`, `deterministic`, or `reproducible` mode. Only `reproducible` mode guarantees bitwise identity across different hardware.
{% endhint %}

REE exposes this through the `--n-partitions` flag, which controls how many partitions the model is divided into. On a host with enough aggregate GPU memory, this lets REE run models up to 72B parameters while preserving the reproducibility guarantees provided by [RepOps](#repops).

Partition boundaries are deterministic for a given model and partition count, so splitting a model introduces no new sources of numerical drift. The same `--n-partitions` value on different supported hardware produces bitwise-identical output, and runs of the same model with different partition counts also match when using `--operation-set reproducible`.
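
A hedged sketch of combining these flags (the image name and volume mount follow the interactive-mode example below; model-selection arguments are omitted and depend on your setup):

```bash
# Run a model split across 4 GPUs with bitwise-reproducible kernels.
# Model-selection arguments are omitted -- they depend on your setup.
docker run --gpus all -v ~/.cache/gensyn:/gensyn ree \
  --operation-set reproducible \
  --n-partitions 4
```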

### Container Details

| Property     | Value                     |
| ------------ | ------------------------- |
| Base OS      | Ubuntu 24.04.1 LTS        |
| Python       | 3.11.14                   |
| PyTorch      | 2.9.1                     |
| Transformers | 4.51.0                    |
| ONNX         | 1.16.1                    |
| SDK Version  | gensyn-sdk 0.1.0          |
| Entrypoint   | `/runtime/bin/gensyn-sdk` |
| User         | `gensyn` (non-root)       |
| Working Dir  | `/home/gensyn`            |

### Interactive Mode

To explore REE's components directly (SDK, Compiler), start the container in interactive mode:

```bash
docker run -it --entrypoint bash -v ~/.cache/gensyn:/gensyn ree
```

From inside the container, you can run `gensyn-sdk` commands directly and inspect intermediate artifacts.
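
For example (assuming the CLI follows the usual `--help` convention; that flag is not documented on this page):

```bash
gensyn-sdk --help   # assumes a standard --help flag
ls /gensyn          # inspect the cache directory mounted from the host
```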

