CS294 (Hyperscale Class)

TidalSim: Multi-Level Microarchitecture Simulation

Vighnesh Iyer, Raghav Gupta, Dhruv Vaish, Young-Jin Park, Bora Nikolic, Sophia Shao

CS294 Project Presentation

Tuesday, December 12th, 2023

Motivation and Background

Problem Overview

Fast RTL-level μArch simulation and performance trace estimation

enables

Rapid RTL iteration with performance evaluation on real workloads


How can we achieve high throughput, high fidelity, low latency μArch simulation?

Existing μArch Evaluation Strategies

Strategy          Throughput             Latency       Fidelity
ISA Simulation    10-100+ MIPS           <1 second     None
μArch Perf Sim    100 KIPS (gem5)        5-10 seconds  5-10% avg IPC error
RTL Simulation    1-10 KIPS              5-10 minutes  cycle-exact
FireSim (FPGA)    1-50 MIPS              2-6 hours     cycle-exact
TidalSim          10 MIPS (unoptimized)  <1 minute     <5% error, 10k intervals

  • Combine the strengths of ISA, μArch, and RTL simulators
    • Multi-level simulation

Phase Behavior of Programs

  • Program execution traces aren’t random
    • They execute the same code again-and-again
    • Application execution traces can be split into phases that exhibit similar μArch behavior
  • Prior work: SimPoint
    • Identify basic blocks executed in a given interval (e.g. 1M instruction intervals)
    • Embed each interval using its ‘basic block vector’
    • Cluster intervals using k-means
  • Similar intervals → similar μArch behaviors
    • Only execute unique intervals in low-level RTL simulation!
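
The embedding step above can be sketched as follows. This is a minimal illustration, not the TidalSim code: the function name, the `(block_id, n_insts)` trace format, and the choice to weight each block by its instruction count (as SimPoint does) are assumptions made for the example.

```python
def basic_block_vectors(block_trace, interval_len, n_blocks):
    """Build one normalized basic block vector (BBV) per interval.

    block_trace: iterable of (block_id, n_insts) pairs in execution order
                 (hypothetical format; block ids are 0..n_blocks-1).
    interval_len: target number of instructions per interval.
    """
    vectors = []
    counts = [0.0] * n_blocks
    insts_in_interval = 0

    def flush():
        # Normalize so that each vector sums to 1.
        total = sum(counts)
        vectors.append([c / total for c in counts])

    for block_id, n_insts in block_trace:
        # Weight each executed block by the instructions it contributes.
        counts[block_id] += n_insts
        insts_in_interval += n_insts
        if insts_in_interval >= interval_len:
            flush()
            counts = [0.0] * n_blocks
            insts_in_interval = 0

    # Keep a trailing partial interval, if any.
    if insts_in_interval > 0:
        flush()
    return vectors


if __name__ == "__main__":
    trace = [(0, 4), (1, 6), (0, 4), (0, 4), (2, 2)]
    for vec in basic_block_vectors(trace, interval_len=10, n_blocks=3):
        print(vec)
```

Intervals that spend their time in the same blocks end up with nearby vectors, which is what makes clustering them meaningful.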

Prior Work

  • Sampled simulation techniques have been used in μArch simulators for decades
    • SimPoint-style sampling (interval clustering, large intervals (1-100M))
    • SMARTS-style sampling (reservoir sampling, small intervals (100k-1M))
    • Implemented in gem5, Sniper, ZSim, SST
  • LiveSim proposed 2-level simulation (ISA → μArch sim) for rapid iteration of μArch parameters
    • Functional warmup was used for the cache and branch predictor models

What's New

What makes RTL-level sampled simulation interesting?

  • No need to perform correlation between perf model and RTL
    • Error is introduced by sampling, but it can be understood/bounded with statistical methods
    • A separate perf model adds further error from modeling RTL constructs (often done poorly, and that error can't be bounded)
  • Possible to derive accurate PPA numbers
    • Real frequency and area numbers from synthesis
    • Can extrapolate up to full power traces
  • Leverage special collateral (waveforms) from RTL simulation
    • Power macromodel construction and training
    • Coverpoint synthesis, bootstrapping RTL fuzzing

Multi-level simulation with RTL-level injection hasn't been done before. So we should try!

The TidalSim Flow

Overview

Components of the Flow

  • Basic block identification
    • BB identification from spike commit log or from static ELF analysis
  • Basic block embedding of program intervals
  • Clustering and checkpointing
    • k-means, PCA-based n-clusters, spike-based checkpoints
  • RTL simulation and performance metric extraction
    • Custom force-based RTL state injection, out-of-band IPC measurement
  • Extrapolation

Arch Snapshotting Details

For each cluster, take the sample that is closest to its centroid
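
A minimal sketch of this selection step, using a plain Lloyd's k-means in pure Python. The function names, the deterministic "first k points" initialization, and the fixed iteration count are assumptions for illustration; the actual flow may use a library implementation and a different initialization.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, n_iters=20):
    """Plain Lloyd's algorithm; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign(points, centroids), centroids


def representatives(points, labels, centroids):
    """For each non-empty cluster, the index of the interval nearest its centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            reps[c] = min(members, key=lambda i: math.dist(points[i], centroid))
    return reps


if __name__ == "__main__":
    bbvs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
    labels, centroids = kmeans(bbvs, k=2)
    print(labels, representatives(bbvs, labels, centroids))
```

Only the returned representative intervals need to be simulated in RTL; every other interval inherits its cluster's measurement.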

Capture arch checkpoints at the start of each chosen sample


pc = 0x0000000080000332
priv = M
fcsr = 0x0000000000000000
mtvec = 0x8000000a00006000
...
x1 = 0x000000008000024a
x2 = 0x0000000080028fa0
...
    

An arch checkpoint = arch state + raw memory contents
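
As an illustration, the register dump shown above could be read into a dictionary like this. The `name = value` layout is taken from the excerpt on this slide; the function name is hypothetical, and the real checkpoint files may contain fields or formatting not covered here.

```python
def parse_arch_state(text):
    """Parse 'name = value' lines into a dict.

    Hex values (prefixed with 0x) become ints; anything else,
    such as the privilege mode 'M', is kept as a string.
    """
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, elisions, and anything that is not an assignment.
        if not line or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        state[name] = int(value, 16) if value.startswith("0x") else value
    return state


if __name__ == "__main__":
    dump = """
    pc = 0x0000000080000332
    priv = M
    fcsr = 0x0000000000000000
    ...
    x1 = 0x000000008000024a
    """
    state = parse_arch_state(dump)
    print(hex(state["pc"]), state["priv"])
```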

RTL Simulation and Arch-State Injection

  • Arch checkpoints are run in parallel in RTL simulation for N instructions
  • Injection with force/release via custom test harness
  • RTL state injection caveats
    • Not all arch state maps 1:1 with an RTL-level register
    • e.g. fflags in fcsr are FP exception bits from FPU μArch state
    • e.g. FPRs in Rocket are stored in recoded 65-bit format (not IEEE floats)
  • Performance metrics extracted from RTL simulation

cycles,instret
1219,100
125,100
126,100
123,100
114,100
250,100
113,100
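
A small sketch of turning a `cycles,instret` log like the one above into IPC values. The function names are hypothetical, and the optional `skip` parameter (dropping leading windows as warmup) is an assumption for illustration, not a documented part of the flow.

```python
import csv
import io


def read_windows(csv_text):
    """Parse 'cycles,instret' rows into (cycles, instret) integer pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(row["cycles"]), int(row["instret"])) for row in reader]


def window_ipc(windows):
    """IPC of each window: instructions retired divided by cycles."""
    return [instret / cycles for cycles, instret in windows]


def aggregate_ipc(windows, skip=0):
    """IPC over all windows after the first `skip`, weighted by cycles."""
    kept = windows[skip:]
    return sum(i for _, i in kept) / sum(c for c, _ in kept)


if __name__ == "__main__":
    log = "cycles,instret\n1219,100\n125,100\n126,100\n123,100\n"
    windows = read_windows(log)
    print(window_ipc(windows))
    print(aggregate_ipc(windows, skip=1))
```

Summing instructions and cycles before dividing avoids the bias of averaging per-window IPC values directly.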
    

Extrapolation

Performance metrics for one sample in a cluster are representative of all samples in that cluster

Extrapolate on the entire execution trace to get a full IPC trace
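
The extrapolation can be sketched as a lookup from each interval's cluster label to the IPC measured on that cluster's representative. The names below are hypothetical, and the whole-program estimate assumes every interval contains the same number of instructions.

```python
def extrapolate_ipc(labels, cluster_ipc):
    """Full IPC trace: each interval inherits its cluster's measured IPC.

    labels[i] is the cluster of interval i; cluster_ipc maps a cluster id
    to the IPC measured on that cluster's representative sample.
    """
    return [cluster_ipc[label] for label in labels]


def overall_ipc(ipc_trace, interval_len):
    """Whole-program IPC, assuming equal-length intervals.

    Cycles per interval are interval_len / IPC, so the result is the
    harmonic mean of the per-interval IPC values.
    """
    total_insts = interval_len * len(ipc_trace)
    total_cycles = sum(interval_len / ipc for ipc in ipc_trace)
    return total_insts / total_cycles


if __name__ == "__main__":
    trace = extrapolate_ipc([0, 1, 1, 0], {0: 0.5, 1: 1.0})
    print(trace, overall_ipc(trace, interval_len=100))
```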

Results

Clustering on Embench Benchmarks

  • Cluster centroids indicate which basic blocks are traversed most frequently in each cluster
  • We observe that most clusters capture unique traversal patterns

IPC Trace Prediction

  • Montgomery multiplication from Embench (aha-mont64)
  • N=1000 (instructions per interval), C=12 (clusters)
  • Full RTL sim takes 10 minutes, TidalSim runs in 10 seconds
  • Predicted IPC tracks the full RTL simulation (mean error <5%); distance to the cluster centroid correlates only very weakly with error

IPC Trace Prediction

  • Huffman compression from Embench (huffbench)
  • N=10000 (instructions per interval), C=18 (clusters)
  • Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
  • Larger IPC variance

Work in Progress

Functional Cache Warmup

  • Each checkpoint is run in RTL simulation with a cold cache → inaccurate IPC due to incomplete cache warming during detailed warmup
  • WIP: "Memory Timestamp Record"[2] based cache model and RTL cache state injection

Figure: AMAT error vs. number of functional warmup instructions (from LiveSim [1])

[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Barr, Kenneth C., et al. "Accelerating multiprocessor simulation with a memory timestamp record." ISPASS 2005.

Dealing with Long-Lived μArch State

  • Caches aren't the only long-lived CPU structures
  • A general warmup methodology ingests a subset of a functional simulation log
  • Each unit needs a custom model, injection logic, and perf metric extraction
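
As one example of a per-unit functional model, here is a minimal sketch of a set-associative cache tag model that is warmed by replaying memory addresses from a functional simulation log. The class name, the LRU replacement policy, and the geometry parameters are assumptions for illustration; this is not the Memory Timestamp Record scheme, and the replacement policy of the real RTL cache may differ.

```python
from collections import OrderedDict


class FunctionalCache:
    """Tag-only set-associative cache with LRU replacement (assumed policy)."""

    def __init__(self, n_sets, n_ways, block_bytes):
        self.n_sets = n_sets
        self.n_ways = n_ways
        self.block_bytes = block_bytes
        # One ordered dict per set: keys are tags, oldest entry first.
        self.sets = [OrderedDict() for _ in range(n_sets)]

    def access(self, addr):
        """Replay one memory access; return True on a hit."""
        block = addr // self.block_bytes
        index = block % self.n_sets
        tag = block // self.n_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(lines) >= self.n_ways:
            lines.popitem(last=False)  # evict the least recently used tag
        lines[tag] = True
        return False

    def tags(self):
        """Resident tags per set, LRU first: the state one would inject."""
        return [list(s.keys()) for s in self.sets]


if __name__ == "__main__":
    cache = FunctionalCache(n_sets=2, n_ways=2, block_bytes=64)
    for addr in [0x0, 0x80, 0x0, 0x100]:
        cache.access(addr)
    print(cache.tags())
```

The resident tags at the start of a sample are what a harness would translate into RTL cache state; that translation is specific to each design and is not shown here.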

Conclusion

  • We want rapid RTL iteration with PPA evaluation
  • We need fast RTL-level simulation
  • We propose TidalSim, a multi-level simulation methodology to enable rapid HW iteration

TidalSim (github.com/euphoric-hardware/tidalsim): forks of spike, chipyard, and testchipip, plus a top-level runner