TidalSim: Fast and Accurate Microarchitectural Simulation via Sampled RTL Simulation

Vighnesh Iyer, Raghav Gupta, Dhruv Vaish, Charles Hong, Sophia Shao, Bora Nikolic

ATHLETE Meeting

Monday, January 22nd, 2024

Talk Outline

Motivation
Our proposal
Background and prior work in microarchitectural simulation and sampling
Implementation of TidalSim v0.1
Results for IPC trace reconstruction
Next steps towards TidalSim v1
Leveraging TidalSim for coverpoint synthesis

Motivation

The New Era of Domain-Specialized Heterogeneous SoCs

@Frederic_Orange Twitter: A17 Pro Die Analysis

@highyieldYT Twitter: M3 Max Die Analysis

Two trends in SoC design:
- Heterogeneous cores targeting different power/perf curves + workloads
- Domain-specific accelerators
Need a pre-silicon evaluation strategy for rapid, PPA-optimal design of these units
- Limited time per design cycle → limited time per evaluation
- More evaluations = more opportunities for optimization

The Microarchitectural Iteration Loop (Industry)

An idealized iteration loop for microarchitectural design. The 'Evaluator' starts off as a performance simulator and transitions to RTL as the design is iterated.

During RTL implementation we need performance validation against the model.

Existing techniques for RTL performance validation

Rapid RTL performance validation that can be used in the RTL design cycle is valuable.

The Microarchitectural Iteration Loop (Academia)

The typical manner in which microarchitectural ideas are evaluated in academia.

Academics rarely write RTL, partly because it is difficult to evaluate, and opt for uArch simulators instead.

Academia needs rapid RTL evaluation as a part of an RTL-first research methodology

Limitations of Existing Evaluators

ISA simulation: no accuracy
Trace/Cycle uArch simulation: low accuracy
RTL simulation: low throughput
FPGA prototyping: high startup latency
HW emulators: high cost

We will propose a simulation methodology that can deliver on all axes (accuracy, throughput, startup latency, cost).

Our Proposal

TidalSim Overview

TidalSim: a fast, accurate, low latency, low cost microarchitectural simulation methodology that produces RTL-level collateral for performance estimation and verification on real workloads.

TidalSim Components

Overview of the components of TidalSim.

TidalSim is not a new simulator. It is a simulation methodology that combines the strengths of architectural simulators, uArch models, and RTL simulators.

TidalSim Execution

TidalSim moves simulation execution back and forth between architectural, uArch, and RTL simulators based on dynamic workload analysis.

What Could Our Proposal Enable?

Industry
- Today: RTL performance validation is too costly.
- With TidalSim: rapid RTL performance validation becomes viable.

Academia
- Today: academics resort to inaccurate uArch simulators.
- With TidalSim: an RTL-first evaluation strategy becomes viable.

TidalSim enables new design methodologies for industry, academia, and lean chip design teams.

Background and Prior Work

Simulator Metrics

Simulation techniques span the gamut on various axes. Each simulation technique assumes a particular hardware abstraction.

Throughput
- How many instructions can be simulated per real second? (MIPS = millions of instructions per second)
Accuracy
- Do the output metrics of the simulator match those of the modeled SoC in its real environment?
Startup latency
- How long does it take from the moment the simulator's parameters/inputs are modified to when the first instruction is executed?
Cost
- What hardware platform does the simulator run on?
- How much does it cost to run a simulation?

Existing Hardware Simulation Techniques

Technique	Examples	Throughput	Startup latency	Accuracy	Cost
JIT-based Simulators / VMs	QEMU, KVM, VMware Fusion	1-3 GIPS	<1 second	None	Minimal
Architectural Simulators	spike, dromajo	10-100+ MIPS	<1 second	None	Minimal
General-purpose μArch Simulators	gem5, Sniper, ZSim, SST	100 KIPS (gem5) - 100 MIPS (Sniper)	<1 minute	10-50% IPC error	Minimal
Bespoke μArch Simulators	Industry performance models	≈ 0.1-1 MIPS	<1 minute	Close	$1M+
RTL Simulators	Verilator, VCS, Xcelium	1-10 KIPS	2-10 minutes	Cycle-exact	Minimal
FPGA-Based Emulators	FireSim	≈ 10 MIPS	2-6 hours	Cycle-exact	$10k+
ASIC-Based Emulators	Palladium, Veloce	≈ 0.5-10 MIPS	<1 hour	Cycle-exact	$10M+
Multi-level Sampled Simulation	TidalSim	10+ MIPS	<1 minute	<1% IPC error	Minimal

TidalSim combines the strengths of each technique to produce a meta-simulator that achieves high throughput, low latency, high accuracy, and low cost.
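As a rough illustration of why combining techniques can pay off, the runtime of a sampled run can be modeled as one full functional pass plus detailed simulation of only the sampling units. The symbols ($N$ instructions in the workload, $k$ sampling units of $L$ instructions each, throughputs $R_{\text{arch}}$ and $R_{\text{RTL}}$) and all numbers below are assumptions picked from the ranges in the table above, not measurements. The model ignores checkpointing, warmup, and simulator startup.

$$ T_{\text{sampled}} \approx \frac{N}{R_{\text{arch}}} + \frac{k L}{R_{\text{RTL}}} $$

With $N = 10^9$, $R_{\text{arch}} = 10^8$ instructions/s (100 MIPS), $k = 30$, $L = 10^4$, and $R_{\text{RTL}} = 5 \times 10^3$ instructions/s (5 KIPS):

$$ T_{\text{sampled}} \approx \frac{10^9}{10^8} + \frac{30 \cdot 10^4}{5 \times 10^3} = 10\,\text{s} + 60\,\text{s} = 70\,\text{s}, \qquad T_{\text{RTL}} = \frac{10^9}{5 \times 10^3} = 2 \times 10^5\,\text{s} \approx 56\,\text{hours} $$

Under these assumptions the effective throughput is $10^9 / 70 \approx 14$ MIPS, while every detailed sample still comes from cycle-exact RTL simulation.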

Accuracy of Microarchitectural Simulators

Raw IPC errors on 64-bit workloads vs real Haswell^[1]. Microarchitectural simulators have substantial errors exceeding 20%.

Impact of using a bimodal branch predictor vs the Haswell BP^[1]. Simulators disagree with each other! The sensitivity of each simulator is wildly different!

Trends aren't enough^[2]. Note the sensitivity differences - gradients are critical!

uArch simulators are not accurate enough for microarchitectural evaluation.

[1]: Akram, A. and Sawalha, L., 2019. A survey of computer architecture simulation techniques and tools. IEEE Access.
[2]: Nowatzki, T., Menon, J., Ho, C.H. and Sankaralingam, K., 2015. Architectural simulators considered harmful. IEEE Micro.

Sampled Simulation

Instead of running the entire program in uArch simulation, run it in functional simulation and run only selected samples in uArch simulation.

The full workload is represented by a selection of sampling units.

How should sampling units be selected?
How can we accurately estimate the performance of a sampling unit?
How can we estimate errors when extrapolating from sampling units?
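One way to make the extrapolation step concrete is a weighted average over sampling units. The sketch below is illustrative only: `SamplingUnit`, its fields, and the example numbers are hypothetical, and it assumes each unit's weight is the fraction of the workload's instructions it represents. It averages CPI rather than IPC, because cycles add linearly over instructions.

```python
from dataclasses import dataclass


@dataclass
class SamplingUnit:
    weight: float      # fraction of the workload's instructions this unit represents
    instructions: int  # instructions executed in detailed simulation
    cycles: int        # cycles measured in detailed simulation


def estimate_ipc(units):
    """Estimate whole-program IPC from weighted sampling units."""
    total_weight = sum(u.weight for u in units)
    # Weighted mean of per-unit CPI; inverting it gives the aggregate IPC.
    mean_cpi = sum(u.weight * u.cycles / u.instructions for u in units) / total_weight
    return 1.0 / mean_cpi


# Two hypothetical phases of equal weight: one at CPI 1.0, one at CPI 3.0.
units = [
    SamplingUnit(weight=0.5, instructions=10_000, cycles=10_000),
    SamplingUnit(weight=0.5, instructions=10_000, cycles=30_000),
]
estimated_ipc = estimate_ipc(units)
print(estimated_ipc)
```

Averaging the two per-unit IPC values directly (1.0 and about 0.33) would overestimate the aggregate, which is why the sketch works in CPI.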

Existing Sampling Techniques

SimPoint

Program execution traces aren’t random
- They execute the same code again and again
- Workload execution traces can be split into phases that exhibit similar μArch behavior
SimPoint-style representative sampling
- Compute an embedding for each program interval (e.g. blocks of 100M instructions)
- Cluster interval embeddings using k-means
- Choose representative intervals from each cluster as sampling units
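The three steps above can be sketched in a few lines. This is a minimal, pure-Python illustration and not SimPoint's actual implementation: the embedding is assumed to be a normalized basic-block frequency vector, the k-means loop is deliberately simple, and the interval data is made up.

```python
import random


def embed(interval, num_blocks):
    """Normalized basic-block frequency vector for one interval.

    `interval` is a list of (basic_block_id, instructions_executed) pairs.
    """
    vec = [0.0] * num_blocks
    for bb, n in interval:
        vec[bb] += n
    total = sum(vec)
    return [v / total for v in vec]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm with k initial centroids drawn from the points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representatives(points, labels, centroids):
    """For each cluster, the index of the interval closest to its centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            reps[c] = min(members, key=lambda i: sq_dist(points[i], centroid))
    return reps


# Four made-up intervals over three basic blocks: two phases are visible.
intervals = [
    [(0, 90), (1, 10)],
    [(0, 85), (1, 15)],
    [(2, 100)],
    [(2, 95), (1, 5)],
]
points = [embed(iv, num_blocks=3) for iv in intervals]
labels, centroids = kmeans(points, k=2)
reps = representatives(points, labels, centroids)
print(labels, reps)
```

Only the representative intervals need detailed simulation; each one stands in for every interval in its cluster.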

SMARTS

Rigorous statistical sampling enables computation of confidence bounds
- Use random sampling on a full execution trace to derive a population sample
- Central limit theorem provides confidence bounds
SMARTS-style random sampling
- Pick a large number of samples to take before program execution
- If the sample variance is too high after simulation, then collect more sampling units
- Use CLT to derive a confidence bound for the aggregate performance metric
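The confidence-bound computation can be illustrated with the standard library alone. The sketch below assumes a 95% confidence level (`z = 1.96`) and a 3% relative error target, and uses synthetic CPI values in place of real detailed-simulation results. The function names are hypothetical.

```python
import math
import random
import statistics


def cpi_confidence_interval(cpi_samples, z=1.96):
    """Mean CPI and the half-width of its confidence interval under the CLT."""
    n = len(cpi_samples)
    mean = statistics.fmean(cpi_samples)
    std_err = statistics.stdev(cpi_samples) / math.sqrt(n)
    return mean, z * std_err


def required_samples(cpi_samples, rel_error=0.03, z=1.96):
    """Sampling units needed for a +/- rel_error bound at the given z."""
    mean = statistics.fmean(cpi_samples)
    cv = statistics.stdev(cpi_samples) / mean  # coefficient of variation
    return math.ceil((z * cv / rel_error) ** 2)


# Synthetic per-unit CPI measurements standing in for detailed simulation results.
rng = random.Random(0)
samples = [rng.gauss(1.5, 0.3) for _ in range(400)]
mean, half_width = cpi_confidence_interval(samples)
needed = required_samples(samples)
print(f"CPI = {mean:.3f} +/- {half_width:.3f}; units needed for 3% error: {needed}")
```

If `needed` exceeds the number of units already simulated, more sampling units are collected and the bound is recomputed.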

Functional Warmup

The state from a sampling unit checkpoint is only architectural state. The microarchitectural state of the uArch simulator starts at the reset state!

We need to seed long-lived uArch state at the beginning of each sampling unit
This process is called functional warmup
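As a minimal sketch of what functional warmup computes, the model below replays a memory address stream through a timing-free, set-associative tag model with LRU replacement. It then dumps the resident tags so they can be used to seed the detailed simulator. The class name, the LRU policy, and the cache geometry are assumptions for illustration, not the configuration of any particular core.

```python
from collections import OrderedDict


class CacheTagModel:
    """Functional (timing-free) set-associative cache with LRU replacement."""

    def __init__(self, n_sets, n_ways, block_bytes):
        self.n_sets, self.n_ways, self.block_bytes = n_sets, n_ways, block_bytes
        # One OrderedDict per set: keys are tags, ordered least to most recently used.
        self.sets = [OrderedDict() for _ in range(n_sets)]

    def access(self, addr):
        """Update the tag state for one access and report whether it hit."""
        block = addr // self.block_bytes
        index, tag = block % self.n_sets, block // self.n_sets
        ways = self.sets[index]
        hit = tag in ways
        if hit:
            ways.move_to_end(tag)          # mark as most recently used
        else:
            if len(ways) == self.n_ways:
                ways.popitem(last=False)   # evict the least recently used tag
            ways[tag] = True
        return hit

    def snapshot(self):
        """Tags resident in each set, in LRU-to-MRU order."""
        return [list(ways) for ways in self.sets]


# Replay a short made-up address stream up to the start of a sampling unit.
cache = CacheTagModel(n_sets=2, n_ways=2, block_bytes=64)
for addr in [0x000, 0x040, 0x080, 0x100, 0x000]:
    cache.access(addr)
state = cache.snapshot()
print(state)
```

The snapshot plays the role of the long-lived uArch state that would otherwise start at its reset value.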

Importance of Functional Warmup

Long-lived microarchitectural state (caches, branch predictors, prefetchers, TLBs) has a substantial impact on the performance of a sampling unit

AMAT error vs # of detailed warmup instructions^[1]

MPKI vs warmup vs sampling unit length for different branch predictors^[2]

[1]: Hassani, S., et al., 2016. LiveSim: Going live with microarchitecture simulation. HPCA.
[2]: Eeckhout, L., 2008. Sampled processor simulation: A survey. Advances in Computers. Elsevier.

Why RTL-Level Sampled Simulation?

Eliminate modeling errors
- Remaining errors can be handled via statistical techniques
No need to correlate performance model and RTL
- Let the RTL serve as the source of truth
Can produce RTL-level collateral
- Leverage for applications in verification and power modeling

This RTL-first evaluation flow is enabled by highly parameterized RTL generators and SoC design frameworks (e.g. Chipyard).

Implementation of TidalSim v0.1

Overview of the TidalSim v0.1 Flow

Implementation Details For TidalSim v0.1

Basic block identification
- BB identification from spike commit log or from static ELF analysis
Basic block embedding of intervals
Clustering and checkpointing
- k-means, PCA-based n-clusters
- spike-based checkpoints
RTL simulation and performance metric extraction
- Custom force-based RTL state injection, out-of-band IPC measurement
Extrapolation
- Estimate IPC of each interval based on its embedding and distances to RTL-simulated intervals
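The extrapolation step can be sketched as follows. The slides do not specify the weighting function, so this example assumes inverse-distance weighting over Euclidean distance. The function name and the two-dimensional embeddings are hypothetical.

```python
import math


def estimate_interval_ipc(embedding, simulated, eps=1e-9):
    """Inverse-distance-weighted IPC estimate for one interval.

    `simulated` is a list of (embedding, measured_ipc) pairs for the
    intervals that were actually run in RTL simulation.
    """
    weighted = []
    for ref_embedding, ipc in simulated:
        d = math.dist(embedding, ref_embedding)
        if d < eps:
            return ipc  # this interval was itself simulated
        weighted.append((1.0 / d, ipc))
    total = sum(w for w, _ in weighted)
    return sum(w * ipc for w, ipc in weighted) / total


# Two RTL-simulated intervals with made-up embeddings and IPC values.
simulated = [([1.0, 0.0], 0.8), ([0.0, 1.0], 0.4)]
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ipc_trace = [estimate_interval_ipc(e, simulated) for e in embeddings]
print(ipc_trace)
```

Applying the estimate to every interval in program order yields a reconstructed IPC trace for the whole workload.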

Results for IPC Trace Reconstruction

IPC Trace Prediction: huffbench

Huffman compression from Embench (huffbench)
N=10000 instructions per interval, C=18 clusters
Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
Large IPC variance

IPC Trace Prediction: wikisort

Merge sort benchmark from Embench (wikisort)
N=10000 instructions per interval, C=18 clusters
Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
Can capture general trends and time-domain IPC variation

Aggregate IPC Prediction for Embench Suite

Typical IPC error (without functional warmup and with fine time-domain precision of 10k instructions) is < 5%

WIP: TidalSim v1

From TidalSim v0.1 to v1

Functional L1-only cache warmup
Functional branch predictor warmup
Use robust checkpointing fork of spike
- Better arch state initialization technique (via program snippet + selective forces depending on bits that can't be set via ISA)
Characterization on other baremetal workloads
- dhrystone, coremark, riscv-tests benchmarks, MiBench
Explore more sophisticated clustering and extrapolation techniques
- Binary and PC-agnostic interval embeddings

Demonstrate we can hit <1% IPC error

Leveraging HDLs for TidalSim Methodology

HW DSE with TidalSim requires an RTL injection harness
Automatic harness generation using high-level HDLs
- Chisel API to semantically mark arch and uArch state
- FIRRTL pass to generate a state-injecting test harness


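class RegFile(n: Int, w: Int, zero: Boolean = false) {
  val rf = Mem(n, UInt(w.W))
  // Mark each register as the architectural GPR it implements.
  // archStateAnnotation is the proposed API, not an existing Chisel call.
  (0 until n).foreach { i => archStateAnnotation(rf(i), Riscv.I.GPR(i)) }
  // ...
}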


class L1MetadataArray[T <: L1Metadata] extends L1HellaCacheModule()(p) {
  // ...
  val tag_array = SyncReadMem(nSets, Vec(nWays, UInt(metabits.W)))
  // Mark every (set, way) tag entry; uArchStateAnnotation is the proposed API.
  for (set <- 0 until nSets; way <- 0 until nWays) {
    uArchStateAnnotation(tag_array.read(set)(way), Uarch.L1.tag(set, way, cacheType=I))
  }
}

WIP: TidalSim for Coverpoint Synthesis

TidalSim for Verification

Property synthesis techniques require waveforms for analysis
- Specification mining for invariant synthesis or RTL bug localization
- Coverpoint synthesis for tuning stimulus generators towards bugs

TidalSim provides a way to extract many small, unique, RTL waveforms from large workloads with low latency

Past Work on Specification Mining

Take waveforms from RTL simulation and attempt to mine unfalsified specifications involving 2+ RTL signals^[1]

Specifications are constructed from LTL templates
- Until: $ \mathbf{G}\, (a \rightarrow \mathbf{X}\, (a\, \mathbf{U}\, b)) $
- Next: $ \mathbf{G}\, (a \rightarrow \mathbf{X}\, b) $
- Eventual: $ \mathbf{G}\, (a \rightarrow \mathbf{X F}\, b) $
- $a$ and $b$ are atomic propositions constructed from signals in the RTL design
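To make the templates concrete, the sketch below checks each one against a finite trace of two Boolean signals and returns the first time step at which it is falsified. It is an illustrative checker, not the mining tool from the cited work. Because LTL is defined over infinite traces, it makes two assumptions: an Until or Next obligation still pending when the trace ends is not a violation, while an Eventual obligation never met before the trace ends is reported at its trigger.

```python
def first_violation(template, a, b):
    """First time step at which the template is falsified, or None.

    `a` and `b` are equal-length lists of booleans sampled once per cycle.
    """
    n = len(a)
    if template == "next":            # G (a -> X b)
        for t in range(n - 1):
            if a[t] and not b[t + 1]:
                return t + 1
    elif template == "until":         # G (a -> X (a U b))
        for t in range(n - 1):
            if not a[t]:
                continue
            for s in range(t + 1, n):
                if b[s]:
                    break             # obligation discharged by b
                if not a[s]:
                    return s          # a dropped before b was seen
    elif template == "eventual":      # G (a -> X F b)
        # Assumption: b must appear before the end of the finite trace.
        for t in range(n - 1):
            if a[t] and not any(b[t + 1:]):
                return t
    else:
        raise ValueError(f"unknown template: {template}")
    return None


T, F = True, False
holds = first_violation("until", [T, T, T, F, F], [F, F, F, T, F])
broken = first_violation("until", [T, T, F, F], [F, F, F, T])
print(holds, broken)
```

A property is kept as a mined specification only while `first_violation` returns `None` on every available trace.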

[1]: Iyer, V., et al., 2019. RTL bug localization through LTL specification mining. MEMOCODE.

Specification Mining Used for RTL Bug Localization

Introduce a bug in the riscv-mini cache


-  hit := v(idx_reg) && rmeta.tag === tag_reg
+  hit := v(idx_reg) && rmeta.tag =/= tag_reg

This bug does not affect most ISA tests, but a multiply benchmark fails by hanging.
Checking the VCD against the mined properties gives these violations:

Template	$\textbf{a}$	$\textbf{b}$	Violated at Time
Until	`Tile.arb_io_dcache_r_ready`	`Tile.dcache.hit`	418
Until	`Tile.dcache_io_nasti_r_valid`	`Tile.dcache.hit`	418
Until	`Tile.dcache.is_alloc`	`Tile.dcache.hit`	418
Until	`Tile.arb.io_dcache_ar_ready`	`Tile.arb_io_nasti_r_ready`	640

The violated properties point to an anomaly with the hit signal and localize the bug

Coverpoint Synthesis as Complement of Spec Mining

Coverpoint synthesis is an alternative take on spec mining where we synthesize μArch properties that we want to see more of
- Instead of monitoring properties just for falsification, we also monitor them for completion
- Properties that are falsified or completed, but not too often, are good candidates for coverpoints
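The selection rule in the last bullet can be sketched as a simple filter over per-property counters. The thresholds (`min_hits`, `max_rate`), the counter layout, and the property names are all hypothetical. The sketch assumes each monitored property records how often its antecedent triggered, and how often it then completed or was falsified.

```python
def select_coverpoints(stats, min_hits=1, max_rate=0.05):
    """Pick properties that complete or falsify rarely, but not never.

    `stats` maps a property name to (completions, falsifications, triggers).
    """
    candidates = []
    for name, (completed, falsified, triggers) in stats.items():
        if triggers == 0:
            continue  # the antecedent never fired, so there is nothing to judge
        for kind, count in (("completed", completed), ("falsified", falsified)):
            if count >= min_hits and count / triggers <= max_rate:
                candidates.append((name, kind, count))
    return candidates


# Made-up monitoring results for three mined properties.
stats = {
    "a U b": (990, 10, 1000),      # rarely falsified: candidate
    "c -> X d": (1000, 0, 1000),   # always completes: not interesting
    "e -> XF f": (3, 0, 400),      # rarely completes: candidate
}
coverpoints = select_coverpoints(stats)
print(coverpoints)
```

Properties that always or never complete carry little information for steering a stimulus generator, which is why both ends are filtered out.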
Evaluation
- Synthesize coverpoints on Rocket using waveforms from TidalSim and regular RTL sim with the same compute budget
- Demonstrate we can synthesize more coverpoints, and more interesting ones, using TidalSim data
- Evaluate off-the-shelf RISC-V instgen on synthesized coverpoints vs structural coverage

Conclusion

Rapid and accurate microarchitectural evaluation is important
We propose TidalSim, a simulation methodology based on sampled RTL simulation
We demonstrate its utility in IPC trace reconstruction with under 5% error
We examine coverpoint synthesis as an application for RTL-level collateral

TidalSim (github.com/euphoric-hardware/tidalsim): forks of spike, chipyard, and testchipip, plus a top-level runner

TidalSim: a fast, accurate, low latency, low cost simulation methodology that produces RTL-level collateral for performance and power estimation and verification on real workloads.