RTL Sampled Simulation

Vighnesh Iyer, Borivoje Nikolić

+

Rust-y RISC-V

An Experimental ISS + Baremetal Environment + Benchmark Construction Methodology

Safin Singh, Ansh Maroo, Connor Chang, Pramath Krishna, Vighnesh Iyer, Joonho Whangbo

SLICE Winter Retreat
Monday, January 13th, 2025

Overview

RTL sampled simulation today and its problems
What does Rust have to do with this?
An experimental RISC-V instruction set simulator (ISS)
RISC-V Rust baremetal environment
Baremetal benchmark generation strategy

What Are We Trying to Solve?

RTL-first design and evaluation methodology
- Can't afford to build both performance models and implementations
- Build high-level models/simulators and then pipe straight down to implementation
- Iterate directly on the RTL
- Extract collateral at the RTL abstraction (e.g. waveforms, power traces)
Existing RTL simulation techniques have unfavorable tradeoffs
- Software RTL simulation (e.g. VCS, Verilator): fast startup, low throughput
- FPGA-based emulation (e.g. ZeBu, FireSim): slow startup, high throughput

Our Proposal

Sampled simulation using software RTL simulation
- Short sampling units with functional uArch warmup (a la SMARTs)
- Representative sampling (a la Simpoints)
Custom uArch (RTL) state injection
- L1 i/d cache functional warmup model to RTL state injection
- Can extend to any long-lived functional unit

Enables high throughput and accurate simulation of long workloads

Sampled Simulation

Don't run the full workload in RTL simulation

Use an instruction set simulator (ISS) and pick samples to run in RTL simulation

The full workload is represented by a selection of sampling units.

Sampling unit length: trade off runtime vs resolution
Sampling unit selection: how sampling units are selected and used for extrapolation

Functional Warmup

A sampling unit is defined by an architectural checkpoint.

The microarchitectural state of the RTL simulation starts at the reset state!

Solution: inject reconstructed uArch state at the start of each sampling unit

This process is called functional warmup

RTL Sampled Simulation Flow

Functional Warmup Flow

Full run of the binary on spike + sampling unit embedding + clustering
Re-run spike to capture arch checkpoints at the start of sampling units
Reconstruct L1d cache state for each arch checkpoint
Inject sampling units into RTL sim and extrapolate

IPC Trace Reconstruction - wikisort

wikisort benchmark from embench, $N = 10000$, $C = 18$, $n_{\text{detailed}} = 2000$

$MAPE_{IPC} = 12.3\% \rightarrow 4.5\%$

IPC Trace Reconstruction - huffbench

huffbench benchmark from embench, $N = 10000$, $C = 18$, $n_{\text{detailed}} = 2000$

L1d functional warmup prevents gross IPC underprediction in most cases
$MAPE_{IPC} = 6.6\% \rightarrow 4.1\%$

Uninteresting Benchmarks

It's just doing the same thing over and over again!
Pretty much all baremetal benchmarks look like this

Problem 1: Existing baremetal benchmarks (e.g. Embench, Coremark, etc.) are not interesting.

Ad-Hoc Checkpointing / Injection

IPC trace reconstruction of Coremark (3M instructions visible)

Some workloads are way off, even with warmup
A bug in checkpointing, arch / uArch state injection, sampling, or extrapolation

Problem 2: No systematic methodology for complete checkpointing and injection.

What Does Rust Have to do With Anything?

What's so Great About Rust?

First-class algebraic data types (ADTs) with typeclass derivation
build.rs for programmatic code generation
A comprehensive package library (crates.io)
RISC-V is a first-class target
Baremetal support is good
- Stdlib runs baremetal (using alloc and collections)
- Large number of no_std crates

Heavily used libraries in the wild + baremetal support

An Experimental RISC-V Instruction Set Simulator (ISS)

Why?

Spike already exists. What's wrong?
- The 'golden model' of RISC-V
- Extensive set of ISA extensions
- Reasonably fast: 50+ MIPS (much slower when tracing)
Non-unified testbench/IO models between spike and Chipyard/FireSim makes injection tricky
Deficiencies of spike
- Hard to create custom tops (e.g. for live sampled simulation)
- Ad-hoc arch state checkpointing

Features

This is a purely experimental project

Support for rv64imfd_Zicsr (no privileged ISA)
Exact diff testing with spike's commit log
Runs RISC-V ISA tests (and more) for supported extensions cleanly
Leverages riscv-opcodes for instruction encodings

WIP on Github

Codegen-Based ISS

Swap out C++ macro system for direct Rust code generation
- Programatically read riscv-opcodes → emit Rust for instruction decoding, immediate extraction, and switch table
- It's just code! No restrictions
- Can also generate code for CSRs with semantic bitfields (this is done manually in spike)
Device tree from a Chipyard generator → ISS generation
- The goal is exact SoC modeling
- Modeled and serializable state includes testbench components and IO models

Simple State (De)Serialization

The ISS is generated after the RTL generator is run
All SoC state is contained within a single Rust struct
Typeclass derivation makes it easy to derive serdes on any struct


#[derive(Serialize)]
pub struct Cpu {
    pub regs: [u64; 32],
    pub pc: u64,
    pub csrs: Csrs
}

#[derive(Serialize)]
pub struct System {
  pub cpus: Vec<Cpu>,
  bus: Bus
}

Disentangling state and updates seems obvious, but is not easy with spike

Big Problems

How can anyone trust a new ISS is precisely emulating the RISC-V spec?
How can we make the ISS performant?
How can we exactly model an SoC generated from Chipyard? We can't build a point-design ISS.

RISC-V Baremetal Environment + Benchmarks

Why Baremetal Software?

Easy to run in execution driven simulation (RTL simulation)
Fewer edge cases for successful arch state checkpoint + inject
Focus on common-case userspace code
But, baremetal programs of course cannot
- Run kernel code / syscalls and stress the privileged ISA
- Perform userspace/kernel interactions and witness cache pollution
- Witness preemptive multithreading

Baremetal Support in Rust

Leverage upstream rust-embedded/riscv support
- target = "riscv64gc-unknown-none-elf": that's all it takes to pull in a cross compiler!
- Include and use any no_std dependency with a custom allocator
- Build baremetal projects with cargo build like any other Rust project!
Github
- Implementation of target-side HTIF
- 1:1 port of some benchmarks from riscv-tests/benchmarks
- WIP: semantic port of embench
- Very little code → lots of value

Leveraging `no_std` Crates

There are many no_std crates on crates.io that are very popular
- Data structures: stdlib, hashbrown, btrees, bigint, petgraph, yada
- Strings: regex, nom
- (De)Serialization: serde, json, yaml, bincode
- Compilers/JIT: cranelift, wasmtime, revm-interpreter
- Hashing / crypto: hmac, aes, rsa, rustls
- Numerics: nalgebra, faer-rs, rust-num, ndarray
Unlike other languages / environments, these crates are easy to use baremetal out-of-the-box

The missing piece: stimulus

Extracting Stimuli from Applications

Tiers of crates
- Base libraries (e.g. data structures)
- Application-level libraries (e.g. HTTP servers)
- Deployed applications (e.g. ripgrep, Alacritty, moka, Servo, Meilisearch)
Run real applications to derive library-level stimulus
- We can instrument any crate with function argument capturing
- Cargo always compiles from source → adding patched versions of dependencies is easy

A path towards representative, high quality baremetal benchmarks

Conclusion

All these components tie into a robust sampled simulation flow

Architectural Description Languages (ADLs)

ADLs Broadly

Formal definition of arch state and update rules at the ISA-level
Instruction and state encodings
Execution semantics
Methods to resolve ambiguities in a spec
- At uArch defition time (e.g. handling of misaligned memory loads/stores, FP exceptions, unimplemented CSR access handling)
- At uArch runtime (e.g. interrupt handling)

Existing Work

spike is considered a 'golden model', but it isn't a spec in itself
- Undefined behaviors are concretized
sail-riscv is the official 'formal model' for RISC-V
- Custom DSL (Sail) for specifying execution semantics
- Primarily intended for formal methods (although ISS generators exist - Pydrofoil)
Others in-the-wild: Vienna ADL, Codasip's CodAL, Qualcomm's riscv-unified-db
- Each one creates a new DSL for describing state, encodings, and update rules
Instruction-Level Abstraction (ILA)
- C++ eDSL for describing state + update rules
- C++ framework that describes allowable behaviors + bindings to a Verilog implementation
- Designed for accelerator specification
ARM's Architecture Specification Language (explainer blog post)
- A custom DSL for encoding the formal semantics of the ARM ISA, used mostly for bug hunting against Verilog implementations

Leveraging Chisel

ADLs don't need a new language! It's hardware after all.

Use Scala/Chisel (with some augmentation) for state, encoding, update rules
But an ADL doesn't merely describe a single-cycle processor
- Formalize the notion of SoC components (cores, interrupt controllers, IO devices, host tethers, etc.) and how they interact with each other
- Define exactly when architectural state advances
- Specify undefined behaviors and how they should be concretized for modeling a given implementation
WIP: A Chisel-based ADL embedded in Scala

Generating an ISS from an ADL

If an ADL simply described a single-cycle SoC, then isn't RTL simulation sufficient?
- Naively doing this will produce a low throughput ISS (e.g. the default C backend of sail)
- Leveraging DBT gives better performance (e.g. Pydrofoil's use of PyPy)
Key feature: user control over the compilation of an ADL
Separate the execution semantics of the ADL from codegen optimizations
- Memory block element (scratchpad, cache bank, DRAM) → implementation as a fixed length array, a resizable vector, as a paged hash table, ...
- Naive serial instruction decode, page table translations → implementation with a cache + automatic flushing before state serialization
- "Halide for ISS generation"

Extra Slides

Existing Sampling Techniques

SimPoint

Workloads can be split into phases that exhibit similar μArch behavior
SimPoint-style representative sampling
- Compute an embedding for each program interval (e.g. blocks of 100M instructions)
- Cluster interval embeddings using k-means
- Choose representative intervals from each cluster as sampling units

SMARTS

If we sample from a population, we can estimate the population mean
SMARTS-style random sampling
- Pick a large number of samples to take before program execution
- If the sample variance is too high after simulation, then collect more sampling units
- Use CLT to derive a confidence bound for the aggregate performance metric

Our proposal: Combine SimPoint-style representative sampling with SMARTS-style small intervals

Implementation Details For TidalSim

Basic block identification
- BB identification from spike commit log or from static ELF analysis
Basic block embedding of intervals
Clustering and checkpointing
- k-means, PCA-based n-clusters
- spike-based checkpoints
RTL simulation and performance metric extraction
- Custom force-based RTL state injection, out-of-band IPC measurement
Extrapolation
- Estimate IPC of each interval based on its embedding and distances to RTL-simulated intervals

Memory Timestamp Record

Construct MTR table from a memory trace, save MTR tables at checkpoint times
Given a cache with n sets, group block addresses by set index
Given a cache with k ways, pick the k most recently accessed addresses from each set
Knowing every resident cache line, fetch the data from the DRAM dump

Overview

Tidalsim provides fast, accurate, low-latency RTL-sim-based sampled simulation
Ongoing work to leverage Google workload traces for sampling investigation
TraceKit is a unified trace analysis framework that will be merged with Tidalsim for a multicore live sampling flow

RTL Sampled Simulation

+

Rust-y RISC-V

An Experimental ISS + Baremetal Environment + Benchmark Construction Methodology

SLICE Winter Retreat Monday, January 13th, 2025

Overview

What Are We Trying to Solve?

Our Proposal

Sampled Simulation

Functional Warmup

RTL Sampled Simulation Flow

Functional Warmup Flow

IPC Trace Reconstruction - wikisort

IPC Trace Reconstruction - huffbench

Uninteresting Benchmarks

Ad-Hoc Checkpointing / Injection

What Does Rust Have to do With Anything?

What's so Great About Rust?

An Experimental RISC-V Instruction Set Simulator (ISS)

Why?

Features

Codegen-Based ISS

Simple State (De)Serialization

Big Problems

RISC-V Baremetal Environment + Benchmarks

Why Baremetal Software?

Baremetal Support in Rust

Leveraging no_std Crates

Extracting Stimuli from Applications

Conclusion

Architectural Description Languages (ADLs)

ADLs Broadly

Existing Work

Leveraging Chisel

Generating an ISS from an ADL

Extra Slides

Existing Sampling Techniques

SimPoint

SMARTS

Implementation Details For TidalSim

Memory Timestamp Record

Overview

SLICE Winter Retreat
Monday, January 13th, 2025

Leveraging `no_std` Crates