ATHLETE (Jan 2024)

TidalSim: Fast and Accurate Microarchitectural Simulation via Sampled RTL Simulation

Vighnesh Iyer, Raghav Gupta, Dhruv Vaish, Charles Hong, Sophia Shao, Bora Nikolic

ATHLETE Meeting

Monday, January 22nd, 2024

Talk Outline

  1. Motivation
  2. Our proposal
  3. Background and prior work in microarchitectural simulation and sampling
  4. Implementation of TidalSim v0.1
  5. Results for IPC trace reconstruction
  6. Next steps towards TidalSim v1
  7. Leveraging TidalSim for coverpoint synthesis

Motivation

The New Era of Domain-Specialized Heterogeneous SoCs

  • Two trends in SoC design:
    • Heterogeneous cores targeting different power/perf curves + workloads
    • Domain-specific accelerators
  • Need a pre-silicon evaluation strategy for rapid, PPA optimal design of these units
    • Limited time per design cycle → limited time per evaluation
    • More evaluations = more opportunities for optimization

The Microarchitectural Iteration Loop (Industry)

An idealized iteration loop for microarchitectural design. The 'Evaluator' starts off as a performance simulator and transitions to RTL as the design is iterated.
During RTL implementation we need performance validation against the model.
Existing techniques for RTL performance validation

Rapid RTL performance validation that can be used in the RTL design cycle is valuable.

The Microarchitectural Iteration Loop (Academia)

The typical manner in which microarchitectural ideas are evaluated in academia.

Academics rarely write RTL, partly due to the difficulty of evaluation, and instead opt for uArch simulators.

Academia needs rapid RTL evaluation as a part of an RTL-first research methodology

Limitations of Existing Evaluators

  • ISA simulation: no accuracy
  • Trace/Cycle uArch simulation: low accuracy
  • RTL simulation: low throughput
  • FPGA prototyping: high startup latency
  • HW emulators: high cost

We will propose a simulation methodology that can deliver on all axes (accuracy, throughput, startup latency, cost).

Our Proposal

TidalSim Overview

TidalSim: a fast, accurate, low latency, low cost microarchitectural simulation methodology that produces RTL-level collateral for performance estimation and verification on real workloads.

TidalSim Components

Overview of the components of TidalSim.

TidalSim is not a new simulator. It is a simulation methodology that combines the strengths of architectural simulators, uArch models, and RTL simulators.

TidalSim Execution

TidalSim moves simulation execution back and forth between architectural, uArch, and RTL simulators based on dynamic workload analysis.

What Could Our Proposal Enable?

  • Industry
    • Today: RTL performance validation is too costly.
    • With TidalSim: rapid RTL performance validation becomes viable.
  • Academia
    • Today: academics resort to inaccurate uArch simulators.
    • With TidalSim: an RTL-first evaluation strategy becomes viable.

TidalSim enables new design methodologies for industry, academia, and lean chip design teams.

Background and Prior Work

Simulator Metrics

Simulation techniques span the gamut on various axes. Each simulation technique assumes a particular hardware abstraction.

  • Throughput
    • How many instructions can be simulated per real second? (MIPS = millions of instructions per second)
  • Accuracy
    • Do the output metrics of the simulator match those of the modeled SoC in its real environment?
  • Startup latency
    • How long does it take from the moment the simulator's parameters/inputs are modified to when the first instruction is executed?
  • Cost
    • What hardware platform does the simulator run on?
    • How much does it cost to run a simulation?

Existing Hardware Simulation Techniques

| Technique | Examples | Throughput | Startup latency | Accuracy | Cost |
|---|---|---|---|---|---|
| JIT-based Simulators / VMs | qemu, KVM, VMware Fusion | 1-3 GIPS | <1 second | None | Minimal |
| Architectural Simulators | spike, dromajo | 10-100+ MIPS | <1 second | None | Minimal |
| General-purpose μArch Simulators | gem5, Sniper, ZSim, SST | 100 KIPS (gem5) - 100 MIPS (Sniper) | <1 minute | 10-50% IPC error | Minimal |
| Bespoke μArch Simulators | Industry performance models | ≈ 0.1-1 MIPS | <1 minute | Close | $1M+ |
| RTL Simulators | Verilator, VCS, Xcelium | 1-10 KIPS | 2-10 minutes | Cycle-exact | Minimal |
| FPGA-Based Emulators | FireSim | ≈ 10 MIPS | 2-6 hours | Cycle-exact | $10k+ |
| ASIC-Based Emulators | Palladium, Veloce | ≈ 0.5-10 MIPS | <1 hour | Cycle-exact | $10M+ |
| Multi-level Sampled Simulation | TidalSim | 10+ MIPS | <1 minute | <1% IPC error | Minimal |

TidalSim combines the strengths of each technique to produce a meta-simulator that achieves high throughput, low latency, high accuracy, and low cost.
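To make the throughput gap concrete, here is a rough worked example. The workload size of $10^9$ instructions is an assumption chosen for illustration. The throughputs are representative values from the table (10 KIPS for RTL simulation, 100 MIPS for architectural simulation, 10 MIPS as the TidalSim target). Wall-clock time is the instruction count divided by throughput:

```latex
\begin{align*}
t_{\text{RTL sim}} &= \frac{10^{9}\ \text{instructions}}{10^{4}\ \text{instructions/s}} = 10^{5}\ \text{s} \approx 28\ \text{hours} \\
t_{\text{arch sim}} &= \frac{10^{9}\ \text{instructions}}{10^{8}\ \text{instructions/s}} = 10\ \text{s} \\
t_{10\ \text{MIPS}} &= \frac{10^{9}\ \text{instructions}}{10^{7}\ \text{instructions/s}} = 100\ \text{s}
\end{align*}
```

These figures exclude startup latency, which the table lists separately.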

Accuracy of Microarchitectural Simulators

Raw IPC errors on 64-bit workloads vs real Haswell[1]. Microarchitectural simulators have substantial errors exceeding 20%.
Impact of using a bimodal branch predictor vs the Haswell BP[1]. Simulators disagree with each other! The sensitivity of each simulator is wildly different!

Trends aren't enough[2]. Note the sensitivity differences - gradients are critical!

uArch simulators are not accurate enough for microarchitectural evaluation.


[1]: Akram, A. and Sawalha, L., 2019. A survey of computer architecture simulation techniques and tools. IEEE Access.
[2]: Nowatzki, T., Menon, J., Ho, C.H. and Sankaralingam, K., 2015. Architectural simulators considered harmful. IEEE Micro.

Sampled Simulation

Instead of running the entire program in uArch simulation, run the entire program in functional simulation and run only the samples in uArch simulation.

The full workload is represented by a selection of sampling units.

  1. How should sampling units be selected?
  2. How can we accurately estimate the performance of a sampling unit?
  3. How can we estimate errors when extrapolating from sampling units?

Existing Sampling Techniques

SimPoint

  • Program execution traces aren’t random
    • They execute the same code again and again
    • Workload execution traces can be split into phases that exhibit similar μArch behavior
  • SimPoint-style representative sampling
    • Compute an embedding for each program interval (e.g. blocks of 100M instructions)
    • Cluster interval embeddings using k-means
    • Choose representative intervals from each cluster as sampling units
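
The clustering and selection steps above can be sketched as follows. This is a minimal illustration in plain Scala, not the SimPoint tool itself. The names (`SimPointSketch`, `kMeans`, `representatives`) are hypothetical, and seeding the centroids with the first k embeddings is a simplifying assumption (real tools typically use random or k-means++ seeding).

```scala
object SimPointSketch {
  type Vec = Vector[Double]

  // Squared Euclidean distance between two embeddings.
  def sqDist(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the smallest value (the first one wins on ties).
  def argMin(xs: Seq[Double]): Int =
    xs.indices.reduceLeft((best, i) => if (xs(i) < xs(best)) i else best)

  // Index of the centroid closest to point p.
  def nearest(p: Vec, centroids: Vector[Vec]): Int =
    argMin(centroids.map(sqDist(p, _)))

  // Component-wise mean of a non-empty set of points.
  def mean(points: Vector[Vec]): Vec =
    points.transpose.map(col => col.sum / col.size)

  // Plain Lloyd's k-means with a fixed iteration count.
  // Assumption: centroids are seeded with the first k embeddings.
  def kMeans(embeddings: Vector[Vec], k: Int, iters: Int = 20): Vector[Vec] = {
    var centroids = embeddings.take(k)
    for (_ <- 0 until iters) {
      val groups = embeddings.groupBy(nearest(_, centroids))
      centroids = centroids.indices.toVector.map { c =>
        groups.get(c).map(mean).getOrElse(centroids(c)) // empty clusters stay put
      }
    }
    centroids
  }

  // For each non-empty cluster: (index of the interval closest to the centroid,
  // fraction of all intervals that fall in that cluster).
  def representatives(embeddings: Vector[Vec], centroids: Vector[Vec]): Vector[(Int, Double)] = {
    val labels = embeddings.map(nearest(_, centroids))
    centroids.indices.toVector.flatMap { c =>
      val members = embeddings.indices.filter(labels(_) == c).toVector
      if (members.isEmpty) None
      else {
        val rep = members(argMin(members.map(i => sqDist(embeddings(i), centroids(c)))))
        Some((rep, members.size.toDouble / embeddings.size))
      }
    }
  }
}
```

The returned interval indices are the sampling units to simulate in detail, and the fractions are the weights used when extrapolating to the full workload.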

SMARTS

  • Rigorous statistical sampling enables computation of confidence bounds
    • Use random sampling on a full execution trace to derive a population sample
    • Central limit theorem provides confidence bounds
  • SMARTS-style random sampling
    • Pick a large number of samples to take before program execution
    • If the sample variance is too high after simulation, then collect more sampling units
    • Use CLT to derive a confidence bound for the aggregate performance metric
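
The CLT-based bound and the "collect more sampling units" check can be sketched as below. This is an illustrative sketch rather than the SMARTS implementation. The object and function names are hypothetical, z = 1.96 assumes a 95% confidence level, and CPI is used as the metric because it averages linearly across sampling units of equal instruction count.

```scala
object SmartsSketch {
  // Sample mean and unbiased sample variance of per-unit CPI measurements.
  def meanAndVariance(xs: Seq[Double]): (Double, Double) = {
    require(xs.size >= 2, "need at least two sampling units")
    val mean = xs.sum / xs.size
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)
    (mean, variance)
  }

  // Half-width of the CLT confidence interval around the mean CPI.
  def halfWidth(xs: Seq[Double], z: Double = 1.96): Double = {
    val (_, variance) = meanAndVariance(xs)
    z * math.sqrt(variance / xs.size)
  }

  // Number of sampling units needed so that the half-width is at most
  // relErr * mean, i.e. n >= (z * stddev / (relErr * mean))^2.
  def unitsNeeded(xs: Seq[Double], relErr: Double, z: Double = 1.96): Int = {
    val (mean, variance) = meanAndVariance(xs)
    math.ceil(z * z * variance / (relErr * relErr * mean * mean)).toInt
  }
}
```

If `unitsNeeded` exceeds the number of units already simulated, the sample variance is too high for the target error and more sampling units are collected.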

Functional Warmup

The state from a sampling unit checkpoint is only architectural state. The microarchitectural state of the uArch simulator starts at the reset state!

  • We need to seed long-lived uArch state at the beginning of each sampling unit
  • This process is called functional warmup
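
As a minimal sketch of the idea, functional warmup for a cache can replay the memory addresses seen during functional simulation through a tag-only model. The resulting tags are then the state to seed at the start of the sampling unit. The direct-mapped organization, the allocate-on-every-access policy, and all names here are assumptions made for illustration.

```scala
object WarmupSketch {
  // Geometry of a direct-mapped cache (hypothetical, for illustration only).
  final case class CacheGeometry(nSets: Int, blockBytes: Int)

  // Replay an address stream and return the tag held by each set afterwards.
  // None means the set was never touched and stays invalid.
  def warmTags(addresses: Seq[Long], g: CacheGeometry): Vector[Option[Long]] = {
    val tags = Array.fill[Option[Long]](g.nSets)(None)
    for (addr <- addresses) {
      val block = addr / g.blockBytes
      val set = (block % g.nSets).toInt
      tags(set) = Some(block / g.nSets) // allocate on every access; last access wins
    }
    tags.toVector
  }
}
```

No timing is modeled here. Only the long-lived state is tracked, which is what keeps functional warmup cheap compared to detailed simulation.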

Importance of Functional Warmup

Long-lived microarchitectural state (caches, branch predictors, prefetchers, TLBs) has a substantial impact on the performance of a sampling unit

AMAT Error vs # of detailed warmup instructions [1]
MPKI vs warmup vs sampling unit length for different branch predictors[2]

[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Eeckhout, L., 2008. Sampled processor simulation: A survey. Advances in Computers. Elsevier.

Why RTL-Level Sampled Simulation?

  • Eliminate modeling errors
    • Remaining errors can be handled via statistical techniques
  • No need to correlate performance model and RTL
    • Let the RTL serve as the source of truth
  • Can produce RTL-level collateral
    • Leverage for applications in verification and power modeling

This RTL-first evaluation flow is enabled by highly parameterized RTL generators and SoC design frameworks (e.g. Chipyard).

Implementation of TidalSim v0.1

Overview of the TidalSim v0.1 Flow

Implementation Details For TidalSim v0.1

  • Basic block identification
    • BB identification from spike commit log or from static ELF analysis
  • Basic block embedding of intervals
  • Clustering and checkpointing
    • k-means, PCA-based n-clusters
    • spike-based checkpoints
  • RTL simulation and performance metric extraction
    • Custom force-based RTL state injection, out-of-band IPC measurement
  • Extrapolation
    • Estimate IPC of each interval based on its embedding and distances to RTL-simulated intervals
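
The embedding and extrapolation steps can be sketched as follows. This is not the TidalSim code. The names are hypothetical, the embedding is assumed to be a normalized basic-block frequency vector, and inverse-distance weighting is one plausible reading of "distances to RTL-simulated intervals".

```scala
object TidalSimSketch {
  type Vec = Vector[Double]

  // Embed each interval as a normalized basic-block frequency vector.
  // bbTrace holds one (basic block id, instruction count) pair per executed block.
  def embedIntervals(bbTrace: Seq[(Int, Int)], nBlocks: Int, intervalLen: Int): Vector[Vec] = {
    val out = Vector.newBuilder[Vec]
    var counts = Array.fill(nBlocks)(0.0)
    var insts = 0
    for ((bb, len) <- bbTrace) {
      counts(bb) += len
      insts += len
      if (insts >= intervalLen) { // interval complete: emit and reset
        out += counts.toVector.map(_ / insts)
        counts = Array.fill(nBlocks)(0.0)
        insts = 0
      }
    }
    out.result() // a trailing partial interval is dropped in this sketch
  }

  // Euclidean distance between two embeddings.
  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Inverse-distance-weighted IPC estimate from the RTL-simulated intervals,
  // given as (embedding, measured IPC) pairs. eps avoids division by zero.
  def estimateIpc(embedding: Vec, simulated: Seq[(Vec, Double)], eps: Double = 1e-9): Double = {
    val weights = simulated.map { case (e, _) => 1.0 / (dist(embedding, e) + eps) }
    simulated.map(_._2).zip(weights).map { case (ipc, w) => ipc * w }.sum / weights.sum
  }
}
```

An interval whose embedding coincides with a simulated interval receives essentially that interval's IPC, while intervals between clusters receive a blend.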

Results for IPC Trace Reconstruction

IPC Trace Prediction: huffbench

  • Huffman compression from Embench (huffbench)
  • N=10000 (instructions per interval), C=18 (clusters)
  • Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
  • Large IPC variance

IPC Trace Prediction: wikisort

  • Merge sort benchmark from Embench (wikisort)
  • N=10000 (instructions per interval), C=18 (clusters)
  • Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
  • Can capture general trends and time-domain IPC variation

Aggregate IPC Prediction for Embench Suite

Typical IPC error (without functional warmup and with fine time-domain precision of 10k instructions) is < 5%
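
One point worth spelling out is how per-interval IPC estimates combine into an aggregate. The slides do not specify this, so the following assumes the aggregate is total instructions over total cycles. With $K$ intervals of $N$ instructions each, the aggregate is then the harmonic mean of the per-interval IPCs, not the arithmetic mean:

```latex
\[
\mathrm{IPC}_{\text{agg}}
  = \frac{K \cdot N}{\sum_{i=1}^{K} N / \mathrm{IPC}_i}
  = \frac{K}{\sum_{i=1}^{K} 1 / \mathrm{IPC}_i}
\]
% Example with K = 2, IPC_1 = 0.5, IPC_2 = 1.0:
\[
\mathrm{IPC}_{\text{agg}} = \frac{2}{1/0.5 + 1/1.0} = \frac{2}{3} \approx 0.667
\quad \text{(the arithmetic mean would give } 0.75\text{)}
\]
```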

WIP: TidalSim v1

From TidalSim v0.1 to v1

  • Functional L1-only cache warmup
  • Functional branch predictor warmup
  • Use robust checkpointing fork of spike
    • Better arch state initialization technique (via program snippet + selective forces depending on bits that can't be set via ISA)
  • Characterization on other baremetal workloads
    • dhrystone, coremark, riscv-tests benchmarks, MiBench
  • Explore more sophisticated clustering and extrapolation techniques
    • Binary and PC-agnostic interval embeddings

Demonstrate we can hit <1% IPC error

Leveraging HDLs for TidalSim Methodology

  • HW DSE with TidalSim requires an RTL injection harness
  • Automatic harness generation using high-level HDLs
    • Chisel API to semantically mark arch and uArch state
    • FIRRTL pass to generate a state-injecting test harness

class RegFile(n: Int, w: Int, zero: Boolean = false) {
  val rf = Mem(n, UInt(w.W))
  // Mark each register as architectural state (GPR i) for the injection harness
  (0 until n).foreach { i => archStateAnnotation(rf(i), Riscv.I.GPR(i)) }
  // ...
}
  

class L1MetadataArray[T <: L1Metadata] extends L1HellaCacheModule()(p) {
  // ...
  val tag_array = SyncReadMem(nSets, Vec(nWays, UInt(metabits.W)))
  // Mark every (set, way) entry as L1 I-cache tag state (all pairs, not a zip)
  for (set <- 0 until nSets; way <- 0 until nWays) {
    uArchStateAnnotation(tag_array.read(set)(way), Uarch.L1.tag(set, way, cacheType=I))
  }
}
  

WIP: TidalSim for Coverpoint Synthesis

TidalSim for Verification

  • Property synthesis techniques require waveforms for analysis
    • Specification mining for invariant synthesis or RTL bug localization
    • Coverpoint synthesis for tuning stimulus generators towards bugs

TidalSim provides a way to extract many small, unique, RTL waveforms from large workloads with low latency

Past Work on Specification Mining

  • Take waveforms from RTL simulation and attempt to mine unfalsified specifications involving 2+ RTL signals[1]
  • Specifications are constructed from LTL templates
    • Until: $ \mathbf{G}\, (a \rightarrow \mathbf{X}\, (a\, \mathbf{U}\, b)) $
    • Next: $ \mathbf{G}\, (a \rightarrow \mathbf{X}\, b) $
    • Eventual: $ \mathbf{G}\, (a \rightarrow \mathbf{X F}\, b) $
    • $a$ and $b$ are atomic propositions constructed from signals in the RTL design
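
Checking a candidate property against a waveform can be sketched as below, with one Boolean sample per clock cycle for each atomic proposition. This is not the miner from [1]. The names and the finite-trace conventions are assumptions: only definite violations are reported for Next and Until (so $a\, \mathbf{U}\, b$ is treated as a weak until), `eventualUnmet` reports obligations still open at the end of the trace (a finite prefix can never definitively falsify a liveness property), and $a$ at the final cycle is ignored since no next cycle was observed.

```scala
object LtlTemplateSketch {
  type Trace = IndexedSeq[Boolean]

  // Next: G (a -> X b). Returns the first cycle t where a held at t-1 but b fails at t.
  def nextViolation(a: Trace, b: Trace): Option[Int] =
    (1 until a.length).find(t => a(t - 1) && !b(t))

  // Until: G (a -> X (a U b)). Under the weak-until convention this reduces to
  // "whenever a held in the previous cycle, a or b must hold in this one",
  // because a holding again re-arms the same obligation.
  def untilViolation(a: Trace, b: Trace): Option[Int] =
    (1 until a.length).find(t => a(t - 1) && !a(t) && !b(t))

  // Eventual: G (a -> X F b). Returns the first cycle where a holds but b
  // never holds at any later cycle of the trace.
  def eventualUnmet(a: Trace, b: Trace): Option[Int] = {
    val lastB = b.lastIndexOf(true) // -1 if b never holds
    (0 until a.length - 1).find(t => a(t) && t >= lastB)
  }
}
```

A property that returns `None` on every available waveform remains unfalsified and is kept as a mined specification.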

[1]: Iyer, Vighnesh, et al., 2019. RTL bug localization through LTL specification mining. MEMOCODE.

Specification Mining Used for RTL Bug Localization

Introduce a bug in the riscv-mini cache


-  hit := v(idx_reg) && rmeta.tag === tag_reg
+  hit := v(idx_reg) && rmeta.tag =/= tag_reg
  
  • This bug does not affect most ISA tests, but a multiply benchmark hangs
  • Checking the VCD against the mined properties gives these violations:
| Template | $a$ | $b$ | Violated at time |
|---|---|---|---|
| Until | Tile.arb_io_dcache_r_ready | Tile.dcache.hit | 418 |
| Until | Tile.dcache_io_nasti_r_valid | Tile.dcache.hit | 418 |
| Until | Tile.dcache.is_alloc | Tile.dcache.hit | 418 |
| Until | Tile.arb.io_dcache_ar_ready | Tile.arb_io_nasti_r_ready | 640 |

The violated properties point to an anomaly with the hit signal and localize the bug

Coverpoint Synthesis as Complement of Spec Mining

  • Coverpoint synthesis is an alternative take on spec mining where we synthesize μArch properties that we want to see more of
    • Instead of monitoring properties just for falsification, we also monitor them for completion
    • Properties that are falsified or completed, but not too often, are good candidates for coverpoints
  • Evaluation
    • Synthesize coverpoints on Rocket using waveforms from TidalSim and regular RTL sim with the same compute budget
    • Demonstrate we can synthesize more, and more interesting, coverpoints using TidalSim data
    • Evaluate off-the-shelf RISC-V instgen on synthesized coverpoints vs structural coverage
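
The "falsified or completed, but not too often" criterion can be sketched for the Next template $ \mathbf{G}\, (a \rightarrow \mathbf{X}\, b) $ as follows. This illustrates the selection idea only. The names and the `maxRate` threshold are hypothetical, and the slides do not state how "not too often" is quantified.

```scala
object CoverpointSketch {
  type Trace = IndexedSeq[Boolean]

  // How often a property was triggered (a held), completed (b followed),
  // and falsified (b did not follow) over one waveform.
  final case class Counts(triggered: Int, completed: Int, falsified: Int)

  // Tally outcomes of the Next template over a trace with one sample per cycle.
  def countNext(a: Trace, b: Trace): Counts = {
    val outcomes = (1 until a.length).filter(t => a(t - 1)).map(t => b(t))
    Counts(outcomes.size, outcomes.count(identity), outcomes.count(!_))
  }

  // Candidate if it completes or is falsified at least once, but in no more
  // than maxRate of all cycles (maxRate is an assumed tuning knob).
  def isCandidate(c: Counts, cycles: Int, maxRate: Double): Boolean = {
    def rare(n: Int): Boolean = n > 0 && n.toDouble / cycles <= maxRate
    rare(c.completed) || rare(c.falsified)
  }
}
```

Properties that never trigger carry no information, and properties that complete constantly are already well covered, so the rare-but-observed ones are the candidates to steer a stimulus generator towards.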

Conclusion

  • Rapid and accurate microarchitectural evaluation is important
  • We propose TidalSim, a simulation methodology based on sampled RTL simulation
  • We demonstrate its utility in IPC trace reconstruction with under 5% error
  • We examine coverpoint synthesis as an application for RTL-level collateral

TidalSim (github.com/euphoric-hardware/tidalsim): forks of spike, chipyard, and testchipip, plus a top-level runner

TidalSim: a fast, accurate, low latency, low cost simulation methodology that produces RTL-level collateral for performance and power estimation and verification on real workloads.