# Internals

### Gensyn Compiler

The Gensyn Compiler converts ONNX-serialized ML models into PyTorch modules, optionally with reproducible RepOp kernels replacing standard operations.

It is an MLIR-based, multi-stage compiler with a Python execution layer.

#### How It Works

The compiler uses MLIR dialects to reason about the incoming model:

1. A dialect that determines which operations need to be lowered to RepOp kernels (rather than standard PyTorch kernels).
2. A dialect that generates the final PyTorch module from a given set of operations.
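
As a rough conceptual picture of that two-stage flow (not the actual implementation, which lives in MLIR; every name below is hypothetical), the first stage classifies operations and the second generates the module:

```python
import torch

# Hypothetical stand-ins for illustration only; the real stages are MLIR dialects.
REPOP_COVERED = {"MatMul", "Softmax", "Gelu"}

def choose_kernel(op_type: str, requires_reproducibility: bool) -> str:
    # Stage 1: decide whether an operation is lowered to a RepOp kernel
    # or left on the standard PyTorch kernel.
    if requires_reproducibility and op_type in REPOP_COVERED:
        return "repop"
    return "torch"

def emit_module(lowered: list[tuple[str, str]]) -> torch.nn.Module:
    # Stage 2: generate the final PyTorch module from the chosen kernels.
    # (Placeholder body; real code generation happens inside the compiler.)
    return torch.nn.Module()

lowered = [(op, choose_kernel(op, True)) for op in ("MatMul", "Relu", "Softmax")]
```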

#### Python API

```python
from gensyn_mjolnir import convert, CompileOptions

# Convert an ONNX model to a PyTorch module with reproducible kernels
module = convert(
    "model/model.onnx",
    options=CompileOptions(requires_reproducibility=True)
)
```
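
The returned object is a standard `torch.nn.Module` and can be called like any other module. A minimal sketch, assuming a model with a single image-shaped float input (the actual input shapes and dtypes depend on your ONNX model):

```python
import torch

# Hypothetical input; match your model's actual input shapes and dtypes.
x = torch.randn(1, 3, 224, 224)

module.eval()
with torch.no_grad():
    output = module(x)
```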

#### convert()

`convert()` is the core function of the Gensyn Compiler. It takes an ONNX model and converts it into a PyTorch module that can be used for inference.

When `requires_reproducibility` is enabled (which it is by default), the compiler replaces standard PyTorch operations with RepOp kernels that guarantee bitwise-identical results across hardware.

It has two parameters:

* `onnx_model_or_path`: A `str`, `Path`, or `ModelProto`. Either a file path to an ONNX model on disk or an in-memory `ModelProto` object.
* `options`: A `CompileOptions` instance that configures the compilation process (see below). Defaults to reproducible mode.

```python
def convert(
    onnx_model_or_path: str | Path | ModelProto,
    options: CompileOptions = CompileOptions(),
) -> torch.nn.Module
```
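
Because `onnx_model_or_path` also accepts an in-memory `ModelProto`, a model loaded (or constructed) with the `onnx` package can be passed directly. A sketch, reusing the example path from above:

```python
import onnx
from gensyn_mjolnir import convert, CompileOptions

# Load the model into memory, then hand the ModelProto straight to convert().
model_proto = onnx.load("model/model.onnx")
module = convert(model_proto, options=CompileOptions(requires_reproducibility=True))
```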

#### CompileOptions

`CompileOptions` controls how the compiler processes the model. In most cases the defaults are what you want: reproducible mode with symlinked tensors and a temporary artifacts directory.

The fields are:

* `artifacts_dir`: A `str` or `None` (the default). The directory where the compiler writes intermediate artifacts. If unset, a temporary directory is used (and preserved when the `MJOLNIR_DEBUG` environment variable is set, which is useful for inspecting compiler output during debugging).
* `colocate_tensors`: A `bool` that is `False` by default. When `True`, external tensor files are copied into the artifacts directory; when `False`, symbolic links are created instead.

{% hint style="info" %}
Symlinking is faster and saves disk space, but copying may be needed if you plan to move the artifacts directory to another location.
{% endhint %}

* `requires_reproducibility`: Also a `bool`, `True` by default. When `True`, the compiler replaces standard PyTorch operations with RepOp kernels for cross-hardware reproducibility. When `False`, it uses standard PyTorch kernels, which are faster but not reproducible across different hardware.

```python
@dataclass(frozen=True, slots=True, kw_only=True)
class CompileOptions:
    artifacts_dir: str | None = None
    # Directory for compiler artifacts. Uses a temp directory if not specified
    # (preserved when MJOLNIR_DEBUG env var is set).

    colocate_tensors: bool = False
    # When True, copies external tensor files into the artifacts directory.
    # When False, creates symbolic links instead.

    requires_reproducibility: bool = True
    # When True, compiles with RepOp kernels for cross-hardware reproducibility.
    # When False, uses standard PyTorch kernels.
```
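
For example, a sketch that keeps the intermediate artifacts in a known directory and copies (rather than symlinks) external tensors so the directory can be moved elsewhere (the paths are illustrative):

```python
from gensyn_mjolnir import convert, CompileOptions

options = CompileOptions(
    artifacts_dir="./compiler-artifacts",  # keep artifacts instead of a temp dir
    colocate_tensors=True,                 # copy tensors so the directory is portable
)
module = convert("model/model.onnx", options=options)
```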

### RepOps

RepOps (Reproducible Operators) are purpose-built GPU kernels that guarantee bitwise-identical outputs regardless of hardware architecture. They cover the full set of operators needed for neural network inference and training.

For a standalone demo of RepOp kernels, see the [RepOps Demo repository](https://github.com/gensyn-ai/repops-demo).

#### How RepOps Achieve Cross-Hardware Reproducibility

* **Fixed reduction ordering:** Every kernel accumulates values in a single canonical order. The reduction tile size is fixed across all GPU architectures. All accumulation is in FP32 using fused multiply-add instructions.
* **Correctly rounded transcendentals:** Custom implementations of `exp`, `sin`, `tanh`, etc. that produce identical results on every CUDA-capable GPU.
* **Extended-precision arithmetic:** Operations like the error function (used in GELU) use extended-precision fixed-point arithmetic for cross-hardware consistency.
* **Architecture-adaptive output tiling:** Kernels adapt output tile dimensions to different GPU architectures (using available shared memory), but never change the reduction dimension, so reproducibility is preserved.
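
The fixed reduction ordering matters because floating-point addition is not associative: accumulating the same values in a different order can change the low-order bits of the result. An illustrative (non-RepOps) sketch:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000, dtype=torch.float32)

# Accumulate the same values in two different orders.
left_to_right = torch.tensor(0.0)
for v in x:
    left_to_right += v

right_to_left = torch.tensor(0.0)
for v in x.flip(0):
    right_to_left += v

# The last bits typically differ, which is exactly what a single
# canonical reduction order prevents.
print(left_to_right.item(), right_to_left.item())
```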

### Container Details

| Property     | Value                     |
| ------------ | ------------------------- |
| Base OS      | Ubuntu 24.04.1 LTS        |
| Python       | 3.11.14                   |
| PyTorch      | 2.9.1                     |
| Transformers | 4.51.0                    |
| ONNX         | 1.16.1                    |
| SDK Version  | gensyn-sdk 0.1.0          |
| Entrypoint   | `/runtime/bin/gensyn-sdk` |
| User         | `gensyn` (non-root)       |
| Working Dir  | `/home/gensyn`            |

### Interactive Mode

To explore REE's components directly (SDK, Compiler), start the container in interactive mode:

```bash
docker run -it --entrypoint bash -v ~/.cache/gensyn:/gensyn ree
```

From inside the container, you can run `gensyn-sdk` commands directly and inspect intermediate artifacts.
