Internals

How the Gensyn Compiler and RepOp kernels achieve bitwise reproducibility under the hood.

Gensyn Compiler

The Gensyn Compiler converts ONNX-serialized ML models into PyTorch modules, optionally with reproducible RepOp kernels replacing standard operations.

It is an MLIR-based, multi-stage compiler with a Python execution layer.

How It Works

The compiler uses MLIR dialects to reason about the incoming model:

  1. A dialect that determines which operations need to be lowered to RepOp kernels (rather than standard PyTorch kernels).

  2. A dialect that generates the final PyTorch module from a given set of operations.
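Conceptually, the first dialect's job can be sketched as a predicate over incoming operation types. The names below are illustrative assumptions, not the real dialect's op set or API:

```python
# Hypothetical sketch of the lowering decision: which ops get RepOp kernels
# versus standard PyTorch kernels. The op set here is illustrative only.
REPOP_LOWERED = {"MatMul", "Gemm", "Softmax", "Erf", "Tanh"}

def kernel_for(op_type: str) -> str:
    """Pick a kernel backend for one ONNX op type (names are assumptions)."""
    return "repop" if op_type in REPOP_LOWERED else "pytorch"

print(kernel_for("MatMul"))  # repop
print(kernel_for("Relu"))    # pytorch
```

The second dialect then emits a PyTorch module whose forward pass dispatches each operation to its chosen backend.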

Python API

from gensyn_mjolnir import convert, CompileOptions

# Convert an ONNX model to a PyTorch module with reproducible kernels
module = convert(
    "model/model.onnx",
    options=CompileOptions(requires_reproducibility=True)
)

convert()

convert() is the core function of the Gensyn Compiler. It takes an ONNX model and converts it into a PyTorch module that can be used for inference.

When requires_reproducibility is enabled (which it is by default), the compiler replaces standard PyTorch operations with RepOp kernels that guarantee bitwise-identical results across hardware.

It has two parameters:

  • onnx_model_or_path: A str, Path, or ModelProto. Either a file path to an ONNX model on disk or an in-memory ModelProto object.

  • options: A CompileOptions instance that configures the compilation process (described below). Defaults to reproducible mode.

CompileOptions

CompileOptions controls how the compiler processes the model. In most cases the defaults are what you want: reproducible mode with symlinked tensors and a temporary artifacts directory.

The fields are:

  • artifacts_dir: A str or None. The directory where the compiler writes intermediate artifacts. If not set, a temporary directory is used (and preserved when the MJOLNIR_DEBUG environment variable is set, which is useful for inspecting compiler output during debugging).

  • colocate_tensors: A bool that defaults to False. When True, the compiler copies external tensor files into the artifacts directory; when False, it creates symbolic links instead.

Symlinking is faster and saves disk space, but copying may be needed if you plan to move the artifacts directory to another location.

  • requires_reproducibility: A bool that defaults to True. When True, the compiler replaces standard PyTorch operations with RepOp kernels for cross-hardware reproducibility. When False, it uses standard PyTorch kernels, which are faster but not reproducible across different hardware.
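Putting the fields together, a non-default configuration might look like the sketch below. This is a configuration example, not runnable without the Gensyn SDK installed; the directory path is arbitrary:

```python
from gensyn_mjolnir import convert, CompileOptions

# Write artifacts to a persistent directory, copy tensor files instead of
# symlinking (so the directory can be relocated), and keep RepOp kernels on.
options = CompileOptions(
    artifacts_dir="build/compiler-artifacts",
    colocate_tensors=True,
    requires_reproducibility=True,
)
module = convert("model/model.onnx", options=options)
```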

RepOps

RepOps (Reproducible Operators) are purpose-built GPU kernels that guarantee bitwise-identical outputs regardless of hardware architecture. They cover the full set of operators needed for neural network inference and training.

For a standalone demo of RepOp kernels, see the RepOps Demo repository.

How RepOps Achieve Cross-Hardware Reproducibility

  • Fixed reduction ordering: Every kernel accumulates values in a single canonical order. The reduction tile size is fixed across all GPU architectures. All accumulation is in FP32 using fused multiply-add instructions.

  • Correctly rounded transcendentals: Custom implementations of exp, sin, tanh, etc. that produce identical results on every CUDA-capable GPU.

  • Extended-precision arithmetic: Operations like the error function (used in GELU) use extended-precision fixed-point arithmetic for cross-hardware consistency.

  • Architecture-adaptive output tiling: Kernels adapt output tile dimensions to different GPU architectures (using available shared memory), but never change the reduction dimension, so reproducibility is preserved.
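The importance of a fixed reduction order can be illustrated in plain NumPy. This is a conceptual sketch, not the RepOp implementation: float32 addition is not associative, so only a fixed accumulation order guarantees bitwise-identical results.

```python
import numpy as np

def fixed_order_sum(values):
    """Accumulate left-to-right in a single canonical order (conceptual sketch)."""
    acc = np.float32(0.0)
    for v in values:
        acc = np.float32(acc + v)
    return acc

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

a = fixed_order_sum(x)        # canonical order
b = fixed_order_sum(x[::-1])  # different order: may differ in the last bits
c = fixed_order_sum(x)        # same order: bitwise identical to `a`

print(a == c)  # True: a fixed ordering is bitwise reproducible
print(a == b)  # often False: float32 addition is not associative
```

RepOp kernels apply the same principle on the GPU: the reduction tile size and accumulation order are identical on every architecture, so the rounding error is identical too.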

Container Details

| Property         | Value                   |
| ---------------- | ----------------------- |
| Base OS          | Ubuntu 24.04.1 LTS      |
| Python           | 3.11.14                 |
| PyTorch          | 2.9.1                   |
| Transformers     | 4.51.0                  |
| ONNX             | 1.16.1                  |
| SDK Version      | gensyn-sdk 0.1.0        |
| Entrypoint       | /runtime/bin/gensyn-sdk |
| User             | gensyn (non-root)       |
| Working Dir      | /home/gensyn            |

Interactive Mode

To explore REE's components directly (SDK, Compiler), start the container in interactive mode:
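A typical invocation might look like the following. The image name is a placeholder (substitute the tag you actually pulled); only the entrypoint path comes from the table above:

```shell
# Override the default entrypoint (/runtime/bin/gensyn-sdk) with a shell.
# NOTE: "gensyn/ree:latest" is a placeholder image name, not an official tag.
docker run -it --entrypoint /bin/bash gensyn/ree:latest
```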

From inside the container, you can run gensyn-sdk commands directly and inspect intermediate artifacts.