ATHLETE Update

RTL Sampled Simulation

Vighnesh Iyer, Borivoje Nikolić

+

Rust-y RISC-V

An Experimental ISS + Baremetal Environment + Benchmark Construction Methodology

Safin Singh, Ansh Maroo, Connor Chang, Pramath Krishna, Vighnesh Iyer, Joonho Whangbo

SLICE Winter Retreat
Monday, January 13th, 2025

Overview

  1. RTL sampled simulation today and its problems
  2. What does Rust have to do with this?
  3. An experimental RISC-V instruction set simulator (ISS)
  4. Architectural description languages + ISS generation + RTL injection
  5. RISC-V Rust baremetal environment
  6. Baremetal benchmark generation strategy
  7. Live sampled simulation leveraging the Rust ISS + benchmarks

What Are We Trying to Solve?

  • RTL-first design and evaluation methodology
    • Can't afford to build both performance models and implementations
    • Build high-level models/simulators and then pipe straight down to implementation
    • Iterate directly on the RTL
    • Extract collateral at the RTL abstraction (e.g. waveforms, power traces)
  • Existing RTL simulation techniques have unfavorable tradeoffs
    • Software RTL simulation (e.g. VCS, Verilator): fast startup, low throughput
    • FPGA-based emulation (e.g. ZeBu, FireSim): slow startup, high throughput

Our Proposal

  • Sampled simulation using software RTL simulation
    • Short sampling units with functional uArch warmup (à la SMARTS)
    • Representative sampling (à la SimPoint)
  • Custom uArch (RTL) state injection
    • L1 i/d cache functional warmup model to RTL state injection
    • Can extend to any long-lived functional unit

Enables high-throughput, accurate simulation of long workloads

Sampled Simulation

Don't run the full workload in RTL simulation

Use an instruction set simulator (ISS) and pick samples to run in RTL simulation

The full workload is represented by a selection of sampling units.

  1. Sampling unit length: trade off runtime vs resolution
  2. Sampling unit selection: how sampling units are selected and used for extrapolation

Functional Warmup

A sampling unit is defined by an architectural checkpoint.

The microarchitectural state of the RTL simulation starts at the reset state!

  • Solution: inject reconstructed uArch state at the start of each sampling unit

This process is called functional warmup

RTL Sampled Simulation Flow

Functional Warmup Flow

  1. Full run of the binary on spike + sampling unit embedding + clustering
  2. Re-run spike to capture arch checkpoints at the start of sampling units
  3. Reconstruct L1d cache state for each arch checkpoint
  4. Inject sampling units into RTL sim and extrapolate

IPC Trace Reconstruction - wikisort

wikisort benchmark from embench, $N = 10000$, $C = 18$, $n_{\text{detailed}} = 2000$

$MAPE_{IPC} = 12.3\% \rightarrow 4.5\%$

IPC Trace Reconstruction - huffbench

huffbench benchmark from embench, $N = 10000$, $C = 18$, $n_{\text{detailed}} = 2000$
  • L1d functional warmup prevents gross IPC underprediction in most cases
  • $MAPE_{IPC} = 6.6\% \rightarrow 4.1\%$
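
For reference, the MAPE figures above compare a reconstructed IPC trace against a reference trace, interval by interval. A minimal sketch, assuming the usual definition of mean absolute percentage error (the function name `mape` is illustrative and not taken from the tooling behind these results):

```rust
/// Mean absolute percentage error, in percent, between a reference IPC
/// trace and a reconstructed one. Assumes equal-length, non-empty traces
/// and nonzero reference values.
fn mape(reference: &[f64], predicted: &[f64]) -> f64 {
    assert_eq!(reference.len(), predicted.len());
    assert!(!reference.is_empty());
    let sum: f64 = reference
        .iter()
        .zip(predicted)
        .map(|(r, p)| ((r - p) / r).abs())
        .sum();
    100.0 * sum / reference.len() as f64
}
```

Calling `mape(&reference_ipc, &reconstructed_ipc)` with one entry per interval gives a single percentage; lower is better.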

Uninteresting Benchmarks

  • It's just doing the same thing over and over again!
  • Pretty much all baremetal benchmarks look like this

Problem 1: Existing baremetal benchmarks (e.g. Embench, Coremark, etc.) are not interesting.

Ad-Hoc Checkpointing / Injection

IPC trace reconstruction of Coremark (3M instructions visible)
  • Some workloads are way off, even with warmup
  • A bug in checkpointing, arch / uArch state injection, sampling, or extrapolation

Problem 2: No systematic methodology for complete checkpointing and injection.

What Does Rust Have to do With Anything?

What's so Great About Rust?

  • First-class algebraic data types (ADTs) with typeclass derivation
  • build.rs for programmatic code generation
  • A comprehensive package library (crates.io)
  • RISC-V is a first-class target
  • Baremetal support is good
    • Much of the stdlib runs baremetal (core, plus alloc for heap-backed collections)
    • Large number of no_std crates

Heavily used libraries in the wild + baremetal support

An Experimental RISC-V Instruction Set Simulator (ISS)

Why?

  • Spike already exists. What's wrong?
    • The 'golden model' of RISC-V
    • Extensive set of ISA extensions
    • Reasonably fast: 50+ MIPS (much slower when tracing)
  • Non-unified testbench/IO models between spike and Chipyard/FireSim make injection tricky
  • Deficiencies of spike
    • Hard to create custom tops (e.g. for live sampled simulation)
    • Ad-hoc arch state checkpointing

Features

This is a purely experimental project

  • Support for rv64imfd_Zicsr (no privileged ISA)
  • Exact diff testing with spike's commit log
  • Runs RISC-V ISA tests (and more) for supported extensions cleanly
  • Leverages riscv-opcodes for instruction encodings

WIP on Github

Codegen-Based ISS

  • Swap out C++ macro system for direct Rust code generation
    • Programmatically read riscv-opcodes and emit Rust for instruction decoding, immediate extraction, and the switch table
    • It's just code! No restrictions
    • Can also generate code for CSRs with semantic bitfields (this is done manually in spike)
  • Device tree from a Chipyard generator → ISS generation
    • The goal is exact SoC modeling
    • Modeled and serializable state includes testbench components and IO models
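
As an illustration of the kind of decoder such codegen could emit, here is a minimal sketch. It assumes the mask/match convention of riscv-opcodes (an instruction matches when `(bits & mask) == match`). The two table rows are hand-copied for ADDI and ADD rather than generated, and the names `Inst`, `Kind`, `TABLE`, and `decode` are hypothetical, not the project's actual output.

```rust
/// Decoded instruction with its operands already extracted.
#[derive(Debug, PartialEq, Eq)]
enum Inst {
    Addi { rd: u8, rs1: u8, imm: i64 },
    Add { rd: u8, rs1: u8, rs2: u8 },
}

#[derive(Clone, Copy)]
enum Kind {
    Addi,
    Add,
}

/// (mask, match, kind) rows; a generator would emit one row per instruction.
const TABLE: [(u32, u32, Kind); 2] = [
    (0x0000_707f, 0x0000_0013, Kind::Addi), // opcode + funct3
    (0xfe00_707f, 0x0000_0033, Kind::Add),  // opcode + funct3 + funct7
];

/// Extract a 5-bit register index starting at bit `lo`.
fn reg(bits: u32, lo: u32) -> u8 {
    ((bits >> lo) & 0x1f) as u8
}

/// I-type immediate: bits 31:20, sign-extended via an arithmetic shift.
fn imm_i(bits: u32) -> i64 {
    ((bits as i32) >> 20) as i64
}

/// Naive serial decode: scan the table for the first matching row.
fn decode(bits: u32) -> Option<Inst> {
    let (_, _, kind) = TABLE
        .iter()
        .copied()
        .find(|(mask, pattern, _)| (bits & mask) == *pattern)?;
    Some(match kind {
        Kind::Addi => Inst::Addi { rd: reg(bits, 7), rs1: reg(bits, 15), imm: imm_i(bits) },
        Kind::Add => Inst::Add { rd: reg(bits, 7), rs1: reg(bits, 15), rs2: reg(bits, 20) },
    })
}
```

For example, `decode(0x0050_0093)` (the encoding of `addi x1, x0, 5`) yields `Inst::Addi { rd: 1, rs1: 0, imm: 5 }`. Because the table is plain data, the same generator could emit a faster dispatch structure without touching the per-instruction semantics.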

Simple State (De)Serialization

  • The ISS is generated after the RTL generator is run
  • All SoC state is contained within a single Rust struct
  • Typeclass derivation makes it easy to derive serdes on any struct

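use serde::{Deserialize, Serialize};

// Architectural state of one hart.
// Csrs must also derive Serialize/Deserialize.
#[derive(Serialize, Deserialize)]
pub struct Cpu {
    pub regs: [u64; 32],
    pub pc: u64,
    pub csrs: Csrs,
}

// All SoC state lives in this one struct.
// Bus must also derive Serialize/Deserialize.
#[derive(Serialize, Deserialize)]
pub struct System {
    pub cpus: Vec<Cpu>,
    bus: Bus,
}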

Disentangling state and updates seems obvious, but is not easy with spike
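
A dependency-free sketch of why this separation matters for checkpointing: when all state lives in one plain struct and the update rule is a function over it, a checkpoint is just a copy of the struct. The `CpuState` type and the toy `step` rule below are illustrative stand-ins, not the ISS's real types or semantics.

```rust
/// All architectural state in one plain value.
#[derive(Clone, Debug, PartialEq, Eq)]
struct CpuState {
    regs: [u64; 32],
    pc: u64,
}

/// Toy update rule standing in for "execute one instruction":
/// accumulate the pc into x1, then advance the pc by 4.
fn step(s: &mut CpuState) {
    s.regs[1] = s.regs[1].wrapping_add(s.pc);
    s.pc = s.pc.wrapping_add(4);
}

/// Advance the state by `n` steps.
fn run(s: &mut CpuState, n: usize) {
    for _ in 0..n {
        step(s);
    }
}
```

Cloning the state after 10 steps and running 5 more from the clone lands in the same state as running 15 steps straight through, which is the property arch checkpoint capture and restore rely on.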

Big Problems

  1. How can anyone trust a new ISS is precisely emulating the RISC-V spec?
  2. How can we make the ISS performant?
  3. How can we exactly model an SoC generated from Chipyard? We can't build a point-design ISS.

RISC-V Baremetal Environment + Benchmarks

Why Baremetal Software?

  • Easy to run in execution-driven simulation (RTL simulation)
  • Fewer edge cases for successful arch state checkpoint + inject
  • Focus on common-case userspace code
  • But baremetal programs cannot:
    • Run kernel code / syscalls and stress the privileged ISA
    • Perform userspace/kernel interactions and witness cache pollution
    • Witness preemptive multithreading

Baremetal Support in Rust

  • Leverage upstream rust-embedded/riscv support
    • target = "riscv64gc-unknown-none-elf": that's all it takes to pull in a cross compiler!
    • Include and use any no_std dependency with a custom allocator
    • Build baremetal projects with cargo build like any other Rust project!
  • Github
    • Implementation of target-side HTIF
    • 1:1 port of some benchmarks from riscv-tests/benchmarks
    • WIP: semantic port of embench
    • Very little code, lots of value

Leveraging no_std Crates

  • There are many no_std crates on crates.io that are very popular
    • Data structures: stdlib, hashbrown, btrees, bigint, petgraph, yada
    • Strings: regex, nom
    • (De)Serialization: serde, json, yaml, bincode
    • Compilers/JIT: cranelift, wasmtime, revm-interpreter
    • Hashing / crypto: hmac, aes, rsa, rustls
    • Numerics: nalgebra, faer-rs, rust-num, ndarray
  • Unlike other languages / environments, these crates are easy to use baremetal out-of-the-box

The missing piece: stimulus

Extracting Stimuli from Applications

  • Tiers of crates
    • Base libraries (e.g. data structures)
    • Application-level libraries (e.g. HTTP servers)
    • Deployed applications (e.g. ripgrep, Alacritty, moka, Servo, Meilisearch)
  • Run real applications to derive library-level stimulus
    • We can instrument any crate with function argument capturing
    • Cargo always compiles from source, so adding patched versions of dependencies is easy
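
A minimal sketch of function argument capturing, assuming a hand-written wrapper around a stand-in library function. The names `find_all`, `Call`, and `Recorder` are hypothetical; a real flow would presumably patch the crate itself and serialize the captured calls for replay on the baremetal target.

```rust
/// Stand-in for a library function whose real-world inputs we want.
fn find_all(haystack: &str, needle: &str) -> Vec<usize> {
    haystack.match_indices(needle).map(|(i, _)| i).collect()
}

/// One captured call: owned copies of the arguments.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Call {
    haystack: String,
    needle: String,
}

/// Records every call made through the instrumented wrapper.
#[derive(Default)]
struct Recorder {
    calls: Vec<Call>,
}

impl Recorder {
    /// Instrumented wrapper: log the arguments, then forward to the real function.
    fn find_all(&mut self, haystack: &str, needle: &str) -> Vec<usize> {
        self.calls.push(Call {
            haystack: haystack.to_string(),
            needle: needle.to_string(),
        });
        find_all(haystack, needle)
    }

    /// Replay the captured stimulus against the library function and
    /// return the total number of matches found.
    fn replay(&self) -> usize {
        self.calls
            .iter()
            .map(|c| find_all(&c.haystack, &c.needle).len())
            .sum()
    }
}
```

The application calls `Recorder::find_all` in place of the library function during a normal run; the recorded `calls` then become the benchmark's input set.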

A path towards representative, high quality baremetal benchmarks

Conclusion

All these components tie into a robust sampled simulation flow

Architectural Description Languages (ADLs)

ADLs Broadly

  • Formal definition of arch state and update rules at the ISA-level
  • Instruction and state encodings
  • Execution semantics
  • Methods to resolve ambiguities in a spec
    • At uArch definition time (e.g. handling of misaligned memory loads/stores, FP exceptions, unimplemented CSR access handling)
    • At uArch runtime (e.g. interrupt handling)

Existing Work

Leveraging Chisel

ADLs don't need a new language! It's hardware after all.

  • Use Scala/Chisel (with some augmentation) for state, encoding, update rules
  • But an ADL doesn't merely describe a single-cycle processor
    • Formalize the notion of SoC components (cores, interrupt controllers, IO devices, host tethers, etc.) and how they interact with each other
    • Define exactly when architectural state advances
    • Specify undefined behaviors and how they should be concretized for modeling a given implementation
  • WIP: A Chisel-based ADL embedded in Scala

Generating an ISS from an ADL

  • If an ADL simply described a single-cycle SoC, then isn't RTL simulation sufficient?
    • Naively doing this will produce a low-throughput ISS (e.g. the default C backend of sail)
    • Leveraging DBT gives better performance (e.g. Pydrofoil's use of PyPy)
  • Key feature: user control over the compilation of an ADL
  • Separate the execution semantics of the ADL from codegen optimizations
    • Memory block element (scratchpad, cache bank, DRAM) implementation as a fixed length array, a resizable vector, as a paged hash table, ...
    • Instruction decode and page table translation implemented naively (serially), or with a cache + automatic flushing before state serialization
    • "Halide for ISS generation"
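
One way to picture the split between execution semantics and codegen choices is a trait with interchangeable backends. This is a hand-written sketch, not output of the WIP ADL; the names `Memory`, `FlatMemory`, `PagedMemory`, and `fill_and_sum` are hypothetical.

```rust
use std::collections::HashMap;

/// The semantics the ADL would pin down: byte loads and stores.
trait Memory {
    fn load(&self, addr: u64) -> u8;
    fn store(&mut self, addr: u64, value: u8);
}

/// Backend 1: one contiguous array starting at `base`.
/// Accesses outside the array panic.
struct FlatMemory {
    base: u64,
    bytes: Vec<u8>,
}

impl Memory for FlatMemory {
    fn load(&self, addr: u64) -> u8 {
        self.bytes[(addr - self.base) as usize]
    }
    fn store(&mut self, addr: u64, value: u8) {
        self.bytes[(addr - self.base) as usize] = value;
    }
}

const PAGE_BITS: u32 = 12;
const PAGE_SIZE: usize = 1 << PAGE_BITS;

/// Backend 2: sparse 4 KiB pages allocated on first store.
/// Untouched addresses read as zero.
#[derive(Default)]
struct PagedMemory {
    pages: HashMap<u64, Box<[u8; PAGE_SIZE]>>,
}

impl Memory for PagedMemory {
    fn load(&self, addr: u64) -> u8 {
        let offset = (addr & (PAGE_SIZE as u64 - 1)) as usize;
        self.pages.get(&(addr >> PAGE_BITS)).map_or(0, |page| page[offset])
    }
    fn store(&mut self, addr: u64, value: u8) {
        let offset = (addr & (PAGE_SIZE as u64 - 1)) as usize;
        let page = self
            .pages
            .entry(addr >> PAGE_BITS)
            .or_insert_with(|| Box::new([0u8; PAGE_SIZE]));
        page[offset] = value;
    }
}

/// Code written against the trait is independent of the backend.
fn fill_and_sum<M: Memory>(mem: &mut M, base: u64, n: u64) -> u64 {
    for i in 0..n {
        mem.store(base + i, i as u8);
    }
    (0..n).map(|i| mem.load(base + i) as u64).sum()
}
```

The flat backend suits a small scratchpad; the paged one suits a large, sparsely touched DRAM. Both give the same results for `fill_and_sum`, so the choice can be left to the person compiling the ADL.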

Extra Slides

Existing Sampling Techniques

SimPoint

  • Workloads can be split into phases that exhibit similar μArch behavior
  • SimPoint-style representative sampling
    • Compute an embedding for each program interval (e.g. blocks of 100M instructions)
    • Cluster interval embeddings using k-means
    • Choose representative intervals from each cluster as sampling units

SMARTS

  • If we sample from a population, we can estimate the population mean
  • SMARTS-style random sampling
    • Pick a large number of samples to take before program execution
    • If the sample variance is too high after simulation, then collect more sampling units
    • Use CLT to derive a confidence bound for the aggregate performance metric
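
A minimal sketch of the CLT-based bound, assuming a normal approximation with z = 1.96 for a 95% confidence level. The function names and the relative-error stopping rule are illustrative rather than SMARTS's exact procedure.

```rust
/// Sample mean and 95% confidence half-width for the population mean,
/// using the normal approximation justified by the CLT.
fn confidence_interval(samples: &[f64]) -> (f64, f64) {
    assert!(samples.len() >= 2);
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Unbiased sample variance (divide by n - 1).
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let half_width = 1.96 * (var / n).sqrt();
    (mean, half_width)
}

/// True when the interval is wider than `rel_target` (e.g. 0.03 for +/-3%)
/// relative to the mean, i.e. more sampling units should be collected.
fn needs_more_samples(samples: &[f64], rel_target: f64) -> bool {
    let (mean, half_width) = confidence_interval(samples);
    half_width / mean.abs() > rel_target
}
```

Here `samples` holds the per-sampling-unit metric (for example CPI); if `needs_more_samples` returns true, the run is repeated with a larger sample.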

Our proposal: Combine SimPoint-style representative sampling with SMARTS-style small intervals

Implementation Details For TidalSim

  • Basic block identification
    • BB identification from spike commit log or from static ELF analysis
  • Basic block embedding of intervals
  • Clustering and checkpointing
    • k-means, PCA-based n-clusters
    • spike-based checkpoints
  • RTL simulation and performance metric extraction
    • Custom force-based RTL state injection, out-of-band IPC measurement
  • Extrapolation
    • Estimate IPC of each interval based on its embedding and distances to RTL-simulated intervals
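
A sketch of distance-based extrapolation. The slides do not spell out the weighting scheme, so this assumes inverse-distance weighting over the RTL-simulated intervals; `estimate_ipc` and its signature are hypothetical.

```rust
/// Euclidean distance between two embeddings.
fn dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f64>()
        .sqrt()
}

/// Estimate the IPC of an interval from `simulated`, a list of
/// (embedding, measured IPC) pairs for the intervals run in RTL simulation.
/// Closer simulated intervals get proportionally more weight.
fn estimate_ipc(embedding: &[f64], simulated: &[(Vec<f64>, f64)]) -> f64 {
    assert!(!simulated.is_empty());
    let mut weighted_sum = 0.0;
    let mut weight_total = 0.0;
    for (e, ipc) in simulated {
        let d = dist(embedding, e);
        if d == 0.0 {
            // Identical embedding: use the measured value directly.
            return *ipc;
        }
        let w = 1.0 / d;
        weighted_sum += w * ipc;
        weight_total += w;
    }
    weighted_sum / weight_total
}
```

Applying `estimate_ipc` to every interval's embedding yields a full IPC trace from a small number of RTL-simulated intervals.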

Memory Timestamp Record

  • Construct MTR table from a memory trace, save MTR tables at checkpoint times
  • Given a cache with n sets, group block addresses by set index
  • Given a cache with k ways, pick the k most recently accessed addresses from each set
  • Knowing every resident cache line, fetch the data from the DRAM dump
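
The first three steps above can be sketched as follows. This simplified version stores one timestamp per block address in a hash map and assumes an LRU-like replacement policy, power-of-two block sizes, and set index = block address mod number of sets. The `Mtr` type is illustrative, not TidalSim's implementation, and the final data fetch from the DRAM dump is omitted.

```rust
use std::collections::HashMap;

/// Simplified memory timestamp record: last access time per block address.
struct Mtr {
    block_bits: u32,
    n_sets: u64,
    last_access: HashMap<u64, u64>, // block address -> timestamp
}

impl Mtr {
    fn new(block_bytes: u64, n_sets: u64) -> Self {
        assert!(block_bytes.is_power_of_two() && n_sets > 0);
        Mtr {
            block_bits: block_bytes.trailing_zeros(),
            n_sets,
            last_access: HashMap::new(),
        }
    }

    /// Record one entry of the memory trace.
    fn access(&mut self, addr: u64, time: u64) {
        self.last_access.insert(addr >> self.block_bits, time);
    }

    /// For each set, the block addresses of the (up to) `ways` most recently
    /// accessed blocks, most recent first.
    fn resident_blocks(&self, ways: usize) -> Vec<Vec<u64>> {
        let mut sets: Vec<Vec<(u64, u64)>> = vec![Vec::new(); self.n_sets as usize];
        for (&block, &time) in &self.last_access {
            sets[(block % self.n_sets) as usize].push((time, block));
        }
        sets.into_iter()
            .map(|mut entries| {
                entries.sort_by(|a, b| b.cmp(a)); // newest first
                entries.truncate(ways);
                entries.into_iter().map(|(_, block)| block).collect::<Vec<u64>>()
            })
            .collect()
    }
}
```

Because the record is independent of the number of ways, one saved table can be reused to reconstruct caches of different associativity at the same checkpoint.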

Overview

  • TidalSim provides fast, accurate, low-latency RTL-sim-based sampled simulation
  • Ongoing work to leverage Google workload traces for sampling investigation
  • TraceKit is a unified trace analysis framework that will be merged with TidalSim for a multicore live sampling flow