Simulator Architecture
Celox is an engine that generates JIT-compiled native code from Veryl RTL and executes cycle-based simulation.
Design Philosophy and Target
This simulator is designed with the goal of maximizing verification efficiency for modern synchronous circuit designs (RTL).
- RTL-focused: Physical timing reproduction that trades off against simulation speed -- such as gate-level delays (# delays) and detailed delta-cycle behavior -- is intentionally simplified by restricting the design scope to RTL-level logic verification.
- Performance-first: Rather than interpreter-style emulation, the simulator compiles from SIR (Simulator IR) to native machine code to achieve execution throughput close to hand-written C.
- Consistency as a design goal: Mechanisms such as "multi-phase evaluation" and "cascade clock detection" are implemented to keep results consistent in situations that arise in real RTL designs, such as multiple clock domains and zero-delay clock trees. Race conditions can still occur under certain conditions, which is a current limitation.
Compilation Pipeline
The transformation from Veryl source code to execution consists of the following three major phases.
Frontend (Parser/Analyzer):
- Parses Veryl source and generates the analyzer IR.
- parser::parse_ir takes this as input and converts each module into a SimModule (a struct containing SLT (logic expressions) and SIR (instruction sequences)).
Middle-end (Flattening/Scheduling/Optimization):
- Flattening: Flattens the instance hierarchy and converts module-local VarIds into global AbsoluteAddrs. Port connections are converted into LogicPaths.
- Atomization: Splits LogicPaths at bit boundaries (atoms) to analyze dependencies at bit-level precision.
- Scheduling: Topologically sorts the split atoms to determine the execution order of combinational logic. Detects SCCs via Tarjan's algorithm and handles cycles with static unrolling or dynamic convergence loops.
- SIR Optimization: Applies per-pass optimization (store-load forwarding, commit sinking, dead store elimination, instruction scheduling, etc.) controlled by OptimizeOptions.
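The atomization step can be illustrated with a minimal sketch. The `BitRange` type and `atomize` function below are hypothetical names chosen for this example and are not taken from the Celox source. The sketch assumes that every access to a variable is described by a half-open bit range, and it cuts the variable at each access boundary so that no access partially overlaps an atom.

```rust
use std::collections::BTreeSet;

/// A half-open bit range `[lo, hi)` within one variable (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BitRange {
    pub lo: u32,
    pub hi: u32,
}

/// Splits a variable of `width` bits into atoms: consecutive ranges whose
/// boundaries are exactly the start and end points of the given accesses.
pub fn atomize(width: u32, accesses: &[BitRange]) -> Vec<BitRange> {
    // Every access start and end is a cut point, as are the variable's ends.
    let mut cuts: BTreeSet<u32> = BTreeSet::new();
    cuts.insert(0);
    cuts.insert(width);
    for a in accesses {
        // Clamp to the variable width so malformed ranges cannot escape it.
        cuts.insert(a.lo.min(width));
        cuts.insert(a.hi.min(width));
    }
    // Each pair of adjacent cut points delimits one atom.
    let cuts: Vec<u32> = cuts.into_iter().collect();
    cuts.windows(2)
        .map(|w| BitRange { lo: w[0], hi: w[1] })
        .collect()
}
```

For an 8-bit variable accessed as bits [0, 4) and [2, 8), this yields the atoms [0, 2), [2, 4), and [4, 8). Dependencies can then be tracked per atom instead of per whole variable.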
Backend (Code Generation):
- Memory Layout: Determines memory offsets for all variables and places them on a single memory buffer with Stable, Working, Triggered-bits, and Scratch regions. Layout is pre-computed in Program after optimization, before backend codegen, so all backends share the same layout.
- Code Generation: Compiles SIR into executable machine code via one of the available backends.
- Runtime: Manages compiled function pointers as event handles and executes the simulation.
- Testbench VM (optional): A stack-based bytecode VM that executes Veryl initial blocks and testbench functions. Opcodes include ConstU64, ConstWide, LoadU64, LoadWide, BinOp, UnaryOp, Ternary, LoadIndexed, LoadBitSelect, StoreU64, supporting both narrow (≤64-bit) and wide signals.
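To make the stack-based execution model of the testbench VM concrete, here is a toy interpreter for a narrow-signal subset of the opcodes. Only the opcode names ConstU64, LoadU64, BinOp, and StoreU64 come from the list above. The operand shapes, the `BinOpKind` variants, the flat `u64` memory, and the `run` function are assumptions made for this sketch and may differ from the real VM.

```rust
/// Binary operators supported by this toy VM (assumed set).
#[derive(Debug, Clone, Copy)]
pub enum BinOpKind {
    Add,
    And,
    Xor,
}

/// Toy subset of the opcodes; the operand shapes are assumptions.
#[derive(Debug, Clone, Copy)]
pub enum Op {
    ConstU64(u64),
    LoadU64 { slot: usize },
    BinOp(BinOpKind),
    StoreU64 { slot: usize },
}

/// Runs `program` against `memory`, using an operand stack for intermediates.
pub fn run(program: &[Op], memory: &mut [u64]) -> Result<(), String> {
    let mut stack: Vec<u64> = Vec::new();
    for op in program {
        match *op {
            // Push an immediate value.
            Op::ConstU64(v) => stack.push(v),
            // Push the value stored in a memory slot.
            Op::LoadU64 { slot } => {
                let v = *memory.get(slot).ok_or("load out of range")?;
                stack.push(v);
            }
            // Pop two operands (right-hand side first) and push the result.
            Op::BinOp(kind) => {
                let rhs = stack.pop().ok_or("stack underflow")?;
                let lhs = stack.pop().ok_or("stack underflow")?;
                stack.push(match kind {
                    BinOpKind::Add => lhs.wrapping_add(rhs),
                    BinOpKind::And => lhs & rhs,
                    BinOpKind::Xor => lhs ^ rhs,
                });
            }
            // Pop the top of the stack into a memory slot.
            Op::StoreU64 { slot } => {
                let v = stack.pop().ok_or("stack underflow")?;
                *memory.get_mut(slot).ok_or("store out of range")? = v;
            }
        }
    }
    Ok(())
}
```

A statement such as `c = (a + b) ^ 1` would compile, under these assumptions, to two loads, an Add, a constant push, an Xor, and a store.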
Backends
Celox supports multiple compilation backends, selected at build time based on the target architecture.
Native x86-64 Backend (Default)
The self-hosted native backend is the default on x86-64 platforms. It compiles SIR through a dedicated pipeline:
SIR (bit-level)
→ ISel (Instruction Selection)
→ MIR (word-level SSA with VRegs)
→ mir_opt (MIR optimization passes)
→ regalloc (Braun & Hack MIN algorithm)
→ emit (x86-64 machine code via iced-x86)
Key features of the native backend:
- MIR: A word-level SSA IR with virtual registers (VReg). Instructions operate on 64-bit values; bit-level access information is preserved in SpillDesc side-tables for cost-aware spill decisions.
- MIR Optimization: Constant folding, copy propagation, algebraic simplification, GVN, DCE, if-conversion (Branch → Select/cmov), CFG simplification, PEXT fusion for XOR chains, and more. An adaptive pipeline runs the full pass set iteratively for high-pressure functions (VRegs > 40) and a lightweight variant for small functions.
- Register Allocator: A unified single-pass allocator based on the Braun & Hack (2009) extended MIN algorithm. Performs simultaneous spilling and assignment in one forward pass with cost-aware eviction. Supports three spill kinds: SimState (reload from simulation memory), Stack, and Remat (rematerialize immediates).
- EU Merge: Multiple execution units are merged into a single function with shared prologue/epilogue and jmp-linked boundaries, reducing call overhead.
- Cmp+Branch Fusion: When a comparison result only feeds a branch, the setcc+movzx+test sequence is replaced by a direct cmp+jcc.
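The Cmp+Branch fusion idea can be sketched as a peephole pass over a toy instruction list. The `Inst` enum and `fuse_cmp_branch` below are hypothetical and far simpler than the real MIR. The sketch assumes SSA form (each register is defined once), that `target` is a block label rather than an instruction index, and that the comparison immediately precedes the branch.

```rust
/// Hypothetical toy instructions; `u32` values are virtual register ids.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Inst {
    /// dst = (a == b) as 0/1; would lower to cmp + setcc + movzx.
    CmpEq { dst: u32, a: u32, b: u32 },
    /// Jump to `target` if `cond` is non-zero; would lower to test + jcc.
    BranchIf { cond: u32, target: usize },
    /// Fused form: jump to `target` if a == b; would lower to cmp + jcc.
    CmpEqBranch { a: u32, b: u32, target: usize },
    /// Any other instruction, reduced to the registers it reads.
    Other { uses: Vec<u32> },
}

/// Registers read by an instruction.
fn reads(inst: &Inst) -> Vec<u32> {
    match inst {
        Inst::CmpEq { a, b, .. } | Inst::CmpEqBranch { a, b, .. } => vec![*a, *b],
        Inst::BranchIf { cond, .. } => vec![*cond],
        Inst::Other { uses } => uses.clone(),
    }
}

/// Fuses `CmpEq` + `BranchIf` pairs when the branch is the only reader
/// of the comparison result.
pub fn fuse_cmp_branch(insts: &[Inst]) -> Vec<Inst> {
    let mut out = Vec::with_capacity(insts.len());
    let mut i = 0;
    while i < insts.len() {
        if let (Inst::CmpEq { dst, a, b }, Some(Inst::BranchIf { cond, target })) =
            (&insts[i], insts.get(i + 1))
        {
            // Count every read of the comparison result in the whole list.
            let use_count = insts.iter().flat_map(reads).filter(|r| r == dst).count();
            if cond == dst && use_count == 1 {
                out.push(Inst::CmpEqBranch { a: *a, b: *b, target: *target });
                i += 2; // both original instructions are consumed
                continue;
            }
        }
        out.push(insts[i].clone());
        i += 1;
    }
    out
}
```

If any other instruction also reads the comparison result, the pair is left untouched, because the 0/1 value still has to be materialized in a register.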
Cranelift Backend (Fallback)
The Cranelift-based JIT backend (JitBackend) remains available for non-x86-64 targets and as a fallback. It compiles SIR directly to native code via Cranelift. Cranelift-specific options (CraneliftOptLevel, RegallocAlgorithm, enable_alias_analysis, enable_verifier) are configured through CraneliftOptions.
WASM Backend
A WebAssembly backend (wasm_codegen) generates WASM bytecode for browser-based simulation (Playground). It compiles SIR to WASM and executes via wasmtime.
Backend Trait
All backends implement the SimBackend trait, which provides a unified interface for:
- Combinational evaluation (eval_comb)
- Merged FF + comb evaluation (eval_apply_ff_and_comb): single-call fast path when only one domain fires
- Single-phase FF evaluation (eval_apply_ff_at)
- Split-phase FF evaluation (eval_only_ff_at, apply_ff_at): for cascade clock consistency
- Signal access (resolve_signal, resolve_event)
- Get/set operations (get, set, set_wide, get_four_state, set_four_state)
- Triggered-bits management (clear_triggered_bits, mark_triggered_bit, get_triggered_bits)
Memory Model
The simulator employs a multi-region model on a single memory buffer.
- Stable region: Holds the current committed values. Combinational logic inputs and outputs reference this region.
- Working region: Temporarily holds the next state of flip-flops. Only variables that are actually written have Working region slots allocated.
- Triggered-bits region: One bit per event, used for cascade/gated clock trigger detection. After a Store instruction, the backend compares old and new values and sets the corresponding trigger bit if changed.
- Scratch region: Used by the tail-call splitting pass for inter-chunk register value spilling.
- SignalRef: A handle that caches offsets and metadata, enabling direct memory access without going through a HashMap.
- Address Aliases: The IdentityStoreBypass optimization detects variables that are identity copies (Store→Load roundtrips) and registers them as aliases in Program::address_aliases. Aliased variables share physical memory, eliminating redundant copies.
For 4-state variables, each variable occupies 2 × ceil(width/8) bytes (value + mask pair).
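The size rule can be written out as a small helper. The 4-state case follows the formula stated above. The 2-state size of ceil(width/8) bytes and the tightly packed region total are assumptions of this sketch, since the actual layout may add alignment or padding. The function names are hypothetical.

```rust
/// Bytes needed to store one variable: ceil(width / 8) bytes for the value,
/// doubled for 4-state variables (value + mask pair).
/// The 2-state size is an assumption; only the 4-state rule is documented.
pub fn storage_bytes(width_bits: usize, four_state: bool) -> usize {
    let value_bytes = (width_bits + 7) / 8; // ceil(width / 8)
    if four_state {
        2 * value_bytes
    } else {
        value_bytes
    }
}

/// Total bytes for a list of `(width_bits, four_state)` variables,
/// assuming they are packed back to back with no alignment padding.
pub fn packed_region_bytes(vars: &[(usize, bool)]) -> usize {
    vars.iter().map(|&(w, fs)| storage_bytes(w, fs)).sum()
}
```

For example, a 9-bit 4-state variable needs 2 × ceil(9/8) = 2 × 2 = 4 bytes, and a 64-bit 4-state variable needs 16 bytes.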
Execution Control Logic
Simulation::step advances the simulation time by one step using the following flow.
- Event extraction: Retrieves all events occurring at the current time (such as clock changes) from the scheduler.
- Clock edge detection:
- Previous values are retained in a BitSet and compared with the updated values to determine posedge/negedge.
- Based on DomainKind, checks whether the target flip-flop groups have been triggered.
- Silent edge skipping: When a signal value has changed but the flip-flop trigger condition is not met (e.g., a falling edge when a rising edge is specified), unnecessary flip-flop evaluation is skipped.
- Multi-phase evaluation:
- When multiple domains are triggered simultaneously, next-state computation via eval_only is first performed across all domains to maintain consistency as an event-driven model. Then, after all computations are complete, values are written to the Stable region all at once via apply. This avoids value inconsistencies between simultaneously occurring events.
- Cascade clock detection:
- To handle cases where a flip-flop output serves as the clock for another flip-flop (zero-delay clock tree), clock signal changes are re-scanned after domain evaluation, and evaluation is repeated until the state stabilizes.
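The edge detection, silent edge skipping, and two-phase update described in the steps above can be sketched together on a toy model. All names here (`Edge`, `Ff`, `detect_edge`, `step_domains`) are hypothetical, and each flip-flop is assumed to simply copy one Stable slot into another. The point of the sketch is only the ordering: every next-state value is computed from Stable before any value is committed.

```rust
/// Direction of a clock transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Edge {
    Pos,
    Neg,
}

/// Classifies a transition from the previous and current 1-bit clock value.
pub fn detect_edge(prev: bool, now: bool) -> Option<Edge> {
    match (prev, now) {
        (false, true) => Some(Edge::Pos),
        (true, false) => Some(Edge::Neg),
        _ => None, // no change, so no edge
    }
}

/// A toy flip-flop that copies slot `src` into slot `dst` on `trigger`.
pub struct Ff {
    pub src: usize,
    pub dst: usize,
    pub trigger: Edge,
}

/// Two-phase update of all flip-flops for one detected edge.
pub fn step_domains(stable: &mut [u64], ffs: &[Ff], edge: Option<Edge>) {
    // Phase 1 (eval only): read Stable and buffer next-state values.
    // Flip-flops whose trigger does not match the edge are skipped,
    // which models silent edge skipping.
    let working: Vec<(usize, u64)> = ffs
        .iter()
        .filter(|ff| Some(ff.trigger) == edge)
        .map(|ff| (ff.dst, stable[ff.src]))
        .collect();
    // Phase 2 (apply): commit all buffered values to Stable at once.
    for (dst, value) in working {
        stable[dst] = value;
    }
}
```

With two flip-flops that copy each other's output on the same rising edge, this ordering swaps the two values. A single-phase loop that wrote each result immediately would instead leave both slots holding the same value.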
Related Components
- SimBackend: Trait abstracting over compilation backends. NativeBackend (x86-64), JitBackend (Cranelift), and WasmBackend (wasmtime) implement this trait.
- Scheduler: Manages events using a BinaryHeap and dispatches them in chronological order with deterministic ordering (time → event ID → signal).
- VcdWriter: Records signal changes during simulation in VCD format.
- MemoryLayout: Pre-computed offset map shared by all backends. Contains stable/working region offsets, variable widths, 4-state flags, triggered-bits region, and scratch region for inter-chunk spilling.
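One way to obtain the deterministic (time → event ID → signal) ordering from the standard library's BinaryHeap is sketched below. The `Event` and `EventQueue` types are hypothetical and do not reflect the actual Scheduler fields. The sketch relies on two standard-library facts: a derived `Ord` compares struct fields in declaration order, and wrapping entries in `Reverse` turns the max-heap into a min-heap.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical event key. The derived `Ord` compares `time` first,
/// then `event_id`, then `signal`, because of the field order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Event {
    pub time: u64,
    pub event_id: u32,
    pub signal: u32,
}

/// Min-ordered event queue built on `BinaryHeap` (a max-heap by default).
#[derive(Default)]
pub struct EventQueue {
    heap: BinaryHeap<Reverse<Event>>,
}

impl EventQueue {
    pub fn push(&mut self, event: Event) {
        self.heap.push(Reverse(event));
    }

    /// Removes and returns every event scheduled at the earliest pending
    /// time, in deterministic (event_id, signal) order.
    pub fn pop_current(&mut self) -> Vec<Event> {
        let mut out = Vec::new();
        let Some(Reverse(first)) = self.heap.pop() else {
            return out; // queue is empty
        };
        out.push(first);
        // Keep popping while the next event shares the same timestamp.
        while let Some(Reverse(next)) = self.heap.peek().copied() {
            if next.time != first.time {
                break;
            }
            self.heap.pop();
            out.push(next);
        }
        out
    }
}
```

Because the full key takes part in the comparison, two runs that push the same events in a different order still pop them in the same sequence.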