TidalSim: Multi-Level Microarchitecture Simulation

Vighnesh Iyer, Raghav Gupta, Dhruv Vaish, Young-Jin Park, Bora Nikolic, Sophia Shao

CS294 Project Presentation

Tuesday, December 12th, 2023

Motivation and Background

Problem Overview

Fast RTL-level μArch simulation and performance trace estimation

enables

Rapid RTL iteration with performance evaluation on real workloads

How can we achieve high throughput, high fidelity, low latency μArch simulation?

Existing μArch Evaluation Strategies

	Throughput	Latency	Fidelity
ISA Simulation	10-100+ MIPS	<1 second	None
μArch Perf Sim	100 KIPS (gem5)	5-10 seconds	5-10% avg IPC error
RTL Simulation	1-10 KIPS	5-10 minutes	cycle-exact
FireSim (FPGA)	1-50 MIPS	2-6 hours	cycle-exact
TidalSim	10 MIPS (unoptimized)	<1 minute	<5% error, 10k intervals

Combine the strengths of ISA, μArch, and RTL simulators
- Multi-level simulation

Phase Behavior of Programs

Program execution traces aren’t random
- They execute the same code again and again
- Application execution traces can be split into phases that exhibit similar μArch behavior
Prior work: SimPoint
- Identify basic blocks executed in a given interval (e.g. 1M instruction intervals)
- Embed each interval using its ‘basic block vector’
- Cluster intervals using k-means
Similar intervals → similar μArch behaviors
- Only execute unique intervals in low-level RTL simulation!
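The embedding step above can be sketched in a few lines. This is a minimal illustration, not the TidalSim implementation: the function name `basic_block_vectors`, its arguments, and the choice of L1 normalization are assumptions made for the example. It counts the instructions executed in each basic block per interval and normalizes the counts so that intervals can be compared by a clustering algorithm such as k-means.

```python
from bisect import bisect_right
from collections import Counter


def basic_block_vectors(pcs, bb_starts, interval_len):
    """Return one L1-normalized basic block vector per full interval.

    pcs:          committed instruction PCs in program order (hypothetical input)
    bb_starts:    sorted start addresses of the identified basic blocks
    interval_len: number of instructions per interval
    A trailing partial interval is dropped.
    """
    vectors = []
    for begin in range(0, len(pcs) - interval_len + 1, interval_len):
        counts = Counter()
        for pc in pcs[begin:begin + interval_len]:
            # The block containing pc is the one with the largest start <= pc.
            idx = bisect_right(bb_starts, pc) - 1
            if idx >= 0:
                counts[idx] += 1
        total = sum(counts.values())
        vectors.append(
            [counts[i] / total if total else 0.0 for i in range(len(bb_starts))]
        )
    return vectors
```

Each returned vector has one entry per basic block, so two intervals that spend their time in the same blocks end up close together in the embedding space.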

Prior Work

Sampled simulation techniques have been used in μArch simulators for decades
- SimPoint-style sampling (interval clustering, large intervals (1-100M))
- SMARTS-style sampling (reservoir sampling, small intervals (100k-1M))
- Implemented in gem5, Sniper, ZSim, SST
LiveSim proposed 2-level simulation (ISA → μArch sim) for rapid iteration of μArch parameters
- Functional warmup was used for the cache and branch predictor models

What's New

What makes RTL-level sampled simulation interesting?

No need to perform correlation between perf model and RTL
- Error is introduced by sampling, but it can be understood/bounded with statistical methods
- Additional error comes from modeling RTL constructs (which is often done poorly and can't be bounded)
Possible to derive accurate PPA numbers
- Real frequency and area numbers from synthesis
- Can extrapolate up to full power traces
Leverage special collateral (waveforms) from RTL simulation
- Power macromodel construction and training
- Coverpoint synthesis, bootstrapping RTL fuzzing

Multi-level simulation with RTL-level injection hasn't been done before. So we should try!

The TidalSim Flow

Overview

Components of the Flow

Basic block identification
- BB identification from spike commit log or from static ELF analysis
Basic block embedding of program intervals
Clustering and checkpointing
- k-means, PCA-based n-clusters, spike-based checkpoints
RTL simulation and performance metric extraction
- Custom force-based RTL state injection, out-of-band IPC measurement
Extrapolation

Arch Snapshotting Details

For each cluster, take the sample that is closest to its centroid
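A minimal sketch of this selection, assuming the interval embeddings and their cluster labels are already available as plain Python lists. The function name `pick_representatives` is hypothetical, and the centroid is recomputed here as the mean of each cluster's members rather than taken from the clustering library.

```python
import math


def pick_representatives(embeddings, labels):
    """Map each cluster label to the index of the interval nearest its centroid."""
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    representatives = {}
    for label, idxs in members.items():
        dim = len(embeddings[idxs[0]])
        # Centroid = per-dimension mean of the cluster's member embeddings.
        centroid = [
            sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dim)
        ]
        # Euclidean distance, matching the usual k-means objective.
        representatives[label] = min(
            idxs, key=lambda i: math.dist(embeddings[i], centroid)
        )
    return representatives
```

The returned indices identify the intervals whose starting points need an arch checkpoint.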

Capture arch checkpoints at the start of each chosen sample


pc = 0x0000000080000332
priv = M
fcsr = 0x0000000000000000
mtvec = 0x8000000a00006000
...
x1 = 0x000000008000024a
x2 = 0x0000000080028fa0
...

An arch checkpoint = arch state + raw memory contents
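For illustration, a register dump in the `name = value` layout shown above could be loaded into a dictionary as follows. This is a sketch under the assumption that the checkpoint's arch state is stored as plain text in exactly that layout; the function name `parse_arch_state` is hypothetical, and the raw memory contents are not handled here.

```python
def parse_arch_state(text):
    """Parse 'name = value' lines into a dict; hex values become ints."""
    state = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and '...' elisions
        name, value = (part.strip() for part in line.split("=", 1))
        # Non-hex fields such as the privilege level ('M') stay as strings.
        state[name] = int(value, 16) if value.startswith("0x") else value
    return state
```

A test harness could then walk this dictionary and force each value onto the matching RTL register, subject to the caveats about state that does not map 1:1.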

RTL Simulation and Arch-State Injection

Arch checkpoints are run in parallel in RTL simulation for N instructions
Injection with force/release via custom test harness
RTL state injection caveats
- Not all arch state maps 1:1 with an RTL-level register
- e.g. fflags in fcsr are FP exception bits from FPU μArch state
- e.g. FPRs in Rocket are stored in recoded 65-bit format (not IEEE floats)
Performance metrics extracted from RTL simulation


cycles,instret
1219,100
125,100
126,100
123,100
114,100
250,100
113,100
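A short sketch of turning such a log into an IPC trace, assuming each row reports the cycles and retired instructions of one measurement window (not cumulative counters). The function name `ipc_per_window` is hypothetical.

```python
import csv
import io


def ipc_per_window(csv_text):
    """Return instret / cycles for every row of a 'cycles,instret' CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [int(row["instret"]) / int(row["cycles"]) for row in reader]
```

Under that assumption, the first row above (1219 cycles for 100 instructions) has a much lower IPC than the rows after it, which is the kind of pattern a cold start after state injection would produce.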

Extrapolation

Performance metrics for one sample in a cluster are representative of all samples in that cluster

Extrapolate on the entire execution trace to get a full IPC trace
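The extrapolation can be sketched as a table lookup: every interval inherits the IPC measured for its cluster's representative. The names `extrapolate_ipc` and `overall_ipc` are hypothetical, and equal-length intervals are assumed.

```python
def extrapolate_ipc(labels, cluster_ipc):
    """Assign every interval the IPC measured for its cluster's representative."""
    return [cluster_ipc[label] for label in labels]


def overall_ipc(ipc_trace, interval_len):
    """Whole-program IPC: total instructions over total estimated cycles."""
    total_cycles = sum(interval_len / ipc for ipc in ipc_trace)
    return len(ipc_trace) * interval_len / total_cycles
```

Note that the whole-program figure sums estimated cycles rather than averaging the per-interval IPC values, since a plain average of IPC would overweight the fast intervals.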

Results

Clustering on Embench Benchmarks

Cluster centroids indicate which basic blocks are traversed most frequently in each cluster

We observe that most clusters capture unique traversal patterns

IPC Trace Prediction

Montgomery multiplication from Embench (aha-mont64)
N=1000, C=12
Full RTL sim takes 10 minutes, TidalSim runs in 10 seconds
Predicted IPC is correlated with the full RTL simulation IPC (mean error <5%); very weak correlation between a sample's distance to its cluster centroid and its error

IPC Trace Prediction

Huffman compression from Embench (huffbench)
N=10000, C=18
Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
Larger IPC variance

Work in Progress

Functional Cache Warmup

Each checkpoint is run in RTL simulation with a cold cache → inaccurate IPC due to incomplete cache warming during detailed warmup
WIP: "Memory Timestamp Record"^[2] based cache model and RTL cache state injection

AMAT Error vs # of Functional Warmup Instructions (from LiveSim^[1])

[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Barr, Kenneth C., et al. "Accelerating multiprocessor simulation with a memory timestamp record." ISPASS 2005.

Dealing with Long-Lived μArch State

Caches aren't the only long-lived CPU structures
A general warmup methodology ingests a subset of a functional simulation log
Each unit needs a custom model, injection logic, and perf metric extraction

Conclusion

We want rapid RTL iteration with PPA evaluation
We need fast RTL-level simulation
We propose TidalSim, a multi-level simulation methodology to enable rapid HW iteration

TidalSim (github.com/euphoric-hardware/tidalsim): forks of spike, chipyard, testchipip + top-level runner