ATHLETE Quarterly Review (Dec 2023)

TidalSim: Multi-Level Microarchitecture Simulation and Applications in Verification

Vighnesh Iyer, Raghav Gupta, Dhruv Vaish, Bora Nikolic, Sophia Shao

Monday, December 4th, 2023

Motivation and Background

Motivation

  • We want fast design iteration and evaluation of PPA + verification given real workloads
    • Performance estimation: Impact of μArch optimizations / HW parameters on real workloads
    • Power macromodeling: Identification of important netlist nodes in power model + traces for training
    • Verification: Bootstrapping fuzzing loops + coverpoint synthesis
  • The enablers are: fast and accurate μArch simulation and a way to identify unique execution fragments
    • Performance estimation: Performance metric extraction from fast RTL simulation
    • Power macromodeling: Extraction of interesting program traces for clustering/training
    • Verification: Extraction of traces for coverpoint/specification synthesis + state seeding for fuzzing

Problem Overview

Fast RTL-level μArch simulation and performance metric / interesting trace extraction enable rapid RTL iteration with performance, power modeling, and verification evaluation on real workloads


How can we achieve high throughput, high fidelity, low latency μArch simulation with RTL-level interesting trace extraction?

Existing μArch Evaluation Strategies

Strategy          Throughput         Latency         Fidelity
ISA Simulation    10-100+ MIPS       <1 second       None
μArch Perf Sim    100 KIPS (gem5)    5-10 seconds    5-10% avg IPC error
RTL Simulation    1-10 KIPS          5-10 minutes    cycle-exact
FireSim (FPGA)    1-50 MIPS          2-6 hours       cycle-exact
TidalSim          10 MIPS            <1 minute       <5% error, 10k intervals
  • Combine the strengths of ISA, μArch, and RTL simulators
    • Multi-level simulation

Phase Behavior of Programs

  • Program execution traces aren’t random
    • They execute the same code again and again
    • Application execution traces can be split into phases that exhibit similar μArch behavior
  • Prior work: SimPoint
    • Identify basic blocks executed in a given interval (e.g. 1M instruction intervals)
    • Embed each interval using its ‘basic block vector’
    • Cluster intervals using k-means
  • Similar intervals → similar μArch behaviors
    • Only execute unique intervals in low-level RTL simulation!

Prior Work

  • Sampled simulation techniques have been used in μArch simulators for decades
    • SimPoint-style sampling (interval clustering, large intervals)
    • SMARTS-style sampling (reservoir sampling, small intervals)
    • Implemented in gem5, Sniper, ZSim, SST
  • LiveSim proposed 2-level simulation (ISA → μArch sim) for rapid iteration of μArch parameters
    • Functional warmup was used for the cache and branch predictor models

What's New

  • No prior work on ISA ↔ μArch models ↔ RTL multi-level simulation with functional warmup
  • No substantial work on binary-agnostic interval embeddings
  • No one has leveraged the special collateral (waveforms + high-fidelity performance metrics) you can only get from RTL simulation

Overview of Multi-Level Simulation Flow

Details of TidalSim Flow

Basic Block Identification

Basic blocks are extracted from the dynamic commit log emitted by spike


core   0: >>>>  memchr
core   0: 0x00000000800012f6 (0x0ff5f593) andi    a1, a1, 255
core   0: 0x00000000800012fa (0x0000962a) c.add   a2, a0
core   0: 0x00000000800012fc (0x00c51463) bne     a0, a2, pc + 8
core   0: 0x0000000080001304 (0x00054783) lbu     a5, 0(a0)
core   0: 0x0000000080001308 (0xfeb78de3) beq     a5, a1, pc - 6
    
  • Control flow instructions mark the end of a basic block
  • Previously identified basic blocks can be split if a new entry point is found
  • Basic blocks identified in the log above (first PC ⮕ last PC):
    • 0: 0x8000_12f6 ⮕ 0x8000_12fc
    • 1: 0x8000_1304 ⮕ 0x8000_1308
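
A minimal sketch of this step, assuming the commit log has exactly the line format shown in the excerpt. The names `LINE_RE`, `CONTROL_FLOW`, `parse`, and `basic_blocks` are hypothetical, the mnemonic set is illustrative and incomplete, and traps and interrupts are ignored. A block is split whenever a later part of the trace jumps into its middle, because the jump target becomes a new entry point.

```python
import re

# Matches instruction lines of the commit log excerpt above, e.g.
#   core   0: 0x00000000800012f6 (0x0ff5f593) andi    a1, a1, 255
# Symbol lines such as "core   0: >>>>  memchr" do not match and are skipped.
LINE_RE = re.compile(r"core\s+\d+: 0x([0-9a-f]+) \(0x[0-9a-f]+\)\s+(\S+)")

# Assumption: an illustrative, incomplete set of control-flow mnemonics.
CONTROL_FLOW = {
    "beq", "bne", "blt", "bge", "bltu", "bgeu", "beqz", "bnez",
    "jal", "jalr", "j", "jr", "ret",
    "c.j", "c.jal", "c.jr", "c.jalr", "c.beqz", "c.bnez",
}


def parse(log):
    """Yield (pc, mnemonic) for each instruction line in a commit log."""
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            yield int(match.group(1), 16), match.group(2)


def basic_blocks(log):
    """Return sorted (first_pc, last_pc) pairs for the executed basic blocks."""
    executed, leaders, terminators = set(), set(), set()
    starts_block = True  # the first instruction in the log enters a block
    for pc, mnemonic in parse(log):
        executed.add(pc)
        if starts_block:
            leaders.add(pc)
        # The instruction after a control-flow instruction enters a block.
        starts_block = mnemonic in CONTROL_FLOW
        if starts_block:
            terminators.add(pc)

    # Walk executed PCs in address order, cutting at entry and exit points.
    blocks, first, prev = [], None, None
    for pc in sorted(executed):
        if first is None:
            first = pc
        elif pc in leaders or prev in terminators:
            blocks.append((first, prev))
            first = pc
        prev = pc
    if first is not None:
        blocks.append((first, prev))
    return blocks
```

On the five-instruction excerpt above, `basic_blocks` returns the two blocks listed on this slide.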

Program Intervals

An execution trace is captured from ISA-level simulation


core   0: >>>>  memchr
core   0: 0x00000000800012f6 (0x0ff5f593) andi    a1, a1, 255
core   0: 0x00000000800012fa (0x0000962a) c.add   a2, a0
core   0: 0x00000000800012fc (0x00c51463) bne     a0, a2, pc + 8
core   0: 0x0000000080001304 (0x00054783) lbu     a5, 0(a0)
core   0: 0x0000000080001308 (0xfeb78de3) beq     a5, a1, pc - 6
core   0: 0x000000008000130c (0x00000505) c.addi  a0, 1
core   0: 0x000000008000130e (0x0000b7fd) c.j     pc - 18
core   0: 0x00000000800012fc (0x00c51463) bne     a0, a2, pc + 8
    

The trace is grouped into intervals of N instructions

Typical N for SimPoint samples is 1M

Typical N for SMARTS samples is 10-100k
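
Grouping the trace into intervals can be sketched as below. The function name `intervals` is hypothetical, and the trace is assumed to be any iterable of program counters, so a long commit log can be streamed rather than held in memory.

```python
from itertools import islice


def intervals(pcs, n):
    """Yield consecutive lists of n program counters from a trace.

    The final interval may be shorter than n if the trace length is not
    a multiple of n.
    """
    it = iter(pcs)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk
```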

Interval Embedding and Clustering

Embed each interval as a vector counting how often it traversed each identified basic block

Interval index    Interval length    Embedding
n                 100                [40, 50, 0, 10]
n+1               100                [0, 50, 10, 40]
n+2               100                [0, 20, 20, 80]

Intervals are clustered using k-means clustering on their embeddings
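
A sketch of embedding and clustering in pure Python, under two assumptions that the slides do not confirm: each embedding entry counts the instructions the interval executed inside that basic block, and clustering is plain Lloyd's k-means with random initialization. All function names are hypothetical, and in practice a library k-means implementation would likely be used instead.

```python
import bisect
import random


def embed(interval_pcs, blocks):
    """Count the instructions an interval executed in each basic block.

    blocks is a sorted list of non-overlapping (first_pc, last_pc) pairs.
    """
    firsts = [first for first, _ in blocks]
    vector = [0] * len(blocks)
    for pc in interval_pcs:
        i = bisect.bisect_right(firsts, pc) - 1
        if i >= 0 and pc <= blocks[i][1]:
            vector[i] += 1
    return vector


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # An empty cluster keeps its previous centroid.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids
```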

Arch Snapshotting

For each cluster, take the sample that is closest to its centroid
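
Selecting the representative sample can be sketched as follows, assuming squared Euclidean distance in the embedding space. The function name and argument layout are hypothetical.

```python
def representatives(embeddings, labels, centroids):
    """Map each cluster id to the index of the interval nearest its centroid.

    embeddings[i] is the embedding of interval i, labels[i] its cluster id,
    and centroids[c] the centroid of cluster c.
    """
    best = {}  # cluster id -> (squared distance, interval index)
    for i, (vector, cluster) in enumerate(zip(embeddings, labels)):
        dist = sum((x - y) ** 2 for x, y in zip(vector, centroids[cluster]))
        if cluster not in best or dist < best[cluster][0]:
            best[cluster] = (dist, i)
    return {cluster: index for cluster, (_, index) in best.items()}
```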

Capture arch checkpoints at the start of each chosen sample


pc = 0x0000000080000332
priv = M
fcsr = 0x0000000000000000
mtvec = 0x8000000a00006000
...
x1 = 0x000000008000024a
x2 = 0x0000000080028fa0
...
    

An arch checkpoint = arch state + raw memory contents

RTL Simulation and Arch-State Injection

  • Arch checkpoints are run in parallel in RTL simulation for N instructions
  • RTL state injection caveats
    • Not all arch state maps 1:1 with an RTL-level register
    • e.g. fflags in fcsr are FP exception bits from FPU μArch state
    • e.g. FPRs in Rocket are stored in recoded 65-bit format (not IEEE floats)
  • Performance metrics extracted from RTL simulation

cycles,instret
1219,100
125,100
126,100
123,100
114,100
250,100
113,100
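
Per-sample IPC follows directly from the two columns above. This is a small sketch that assumes the metrics arrive as CSV text with a `cycles,instret` header, as in the excerpt; the function name is hypothetical.

```python
import csv
import io


def ipc_per_sample(csv_text):
    """Return instret / cycles for each row of a cycles,instret CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [int(row["instret"]) / int(row["cycles"]) for row in reader]
```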
    

Extrapolation

Performance metrics for one sample in a cluster are representative of all samples in that cluster

Extrapolate on the entire execution trace to get a full IPC trace
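
A sketch of the extrapolation, assuming every interval has the same instruction count and that each cluster's IPC was measured on its representative sample. The function names are hypothetical.

```python
def extrapolate_ipc(labels, rep_ipc):
    """Return the predicted IPC trace, one value per interval.

    labels[i] is the cluster of interval i; rep_ipc[c] is the IPC measured
    in RTL simulation for the representative sample of cluster c.
    """
    return [rep_ipc[cluster] for cluster in labels]


def estimate_total_cycles(labels, rep_ipc, interval_len):
    """Estimate whole-program cycles from per-cluster IPC."""
    return sum(interval_len / rep_ipc[cluster] for cluster in labels)
```

Whole-program IPC is then the total instruction count divided by the estimated total cycles, which weights each interval by the cycles it consumes rather than averaging the per-interval IPC values.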

Early Results

Clustering on Embench Benchmarks

  • Cluster centroids indicate which basic blocks are traversed most frequently in each cluster
  • We observe that most clusters capture unique traversal patterns

IPC Trace Prediction

  • Montgomery multiplication from Embench (aha-mont64)
  • N=1000, C=12
  • Full RTL sim takes 10 minutes, TidalSim runs in 10 seconds
  • Predicted IPC tracks the full RTL simulation (mean error <5%), and there is a mild correlation between an interval's distance from its cluster centroid and its IPC error
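
The slides do not state how the mean error is computed. One common choice, shown here purely as an assumption, is the mean absolute relative error between the predicted IPC trace and the reference trace from full RTL simulation.

```python
def mean_abs_relative_error(predicted, reference):
    """Mean of |predicted - reference| / reference over all intervals."""
    errors = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors)
```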

Work in Progress

Functional Cache Warmup

  • Each checkpoint is run in RTL simulation with a cold cache → inaccurate IPC due to incomplete cache warming during detailed warmup
  • WIP: "Memory Timestamp Record"[2] based cache model and RTL cache state injection
Figure: AMAT error vs. number of functional warmup instructions (from LiveSim[1])
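
The idea behind timestamp-based warmup can be sketched as follows. This is a heavily simplified, single-core illustration and not the structure from [2] or the TidalSim implementation: it assumes an LRU cache in which every access allocates, a fixed block size, and it omits the multiprocessor aspects. The class and method names are hypothetical. Functional simulation only records when each block was last touched, and the contents for a chosen cache geometry are reconstructed afterwards.

```python
from collections import defaultdict


class TimestampRecord:
    """Records the most recent access time of every memory block."""

    def __init__(self, block_bytes=64):
        self.block_bytes = block_bytes
        self.last_access = {}  # block number -> time of most recent access
        self.now = 0

    def access(self, addr):
        """Call once per load or store during functional simulation."""
        self.now += 1
        self.last_access[addr // self.block_bytes] = self.now

    def reconstruct(self, num_sets, ways):
        """Return {set index: [block numbers]}, most recently used first.

        For an LRU cache where every access allocates, each set holds the
        `ways` most recently accessed blocks that map to it.
        """
        per_set = defaultdict(list)
        for block, time in self.last_access.items():
            per_set[block % num_sets].append((time, block))
        return {
            index: [block for _, block in sorted(entries, reverse=True)[:ways]]
            for index, entries in per_set.items()
        }
```

Because the record is independent of the number of sets and ways, one functional run could in principle warm several cache configurations; the reconstructed contents would then be injected into the RTL cache arrays.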

[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Barr, Kenneth C., et al. "Accelerating multiprocessor simulation with a memory timestamp record." ISPASS 2005.

Performance Optimizations

  • Currently two runs of the binary through spike are needed
    • One to get a commit log for basic block extraction, embedding, and clustering
    • One more to dump arch checkpoints for chosen samples
  • We can take regular checkpoints during the first execution to make arch checkpoint collection fast
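
With regular checkpoints available, collecting the checkpoint for a chosen sample reduces to resuming from the nearest earlier checkpoint and running the remaining instructions. A sketch, assuming checkpoints are indexed by the dynamic instruction count at which they were taken (the function name is hypothetical):

```python
import bisect


def nearest_checkpoint(checkpoint_insts, target_inst):
    """Return (checkpoint instruction count, instructions left to run).

    checkpoint_insts is a sorted list of instruction counts at which regular
    checkpoints were taken; target_inst is where the chosen sample starts.
    """
    i = bisect.bisect_right(checkpoint_insts, target_inst) - 1
    if i < 0:
        raise ValueError("no checkpoint at or before the target")
    base = checkpoint_insts[i]
    return base, target_inst - base
```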

Validation of State Injection

  • There is no arch state comparison at the end of a checkpoint run in RTL simulation
  • We will standardize an arch state schema and dump a reference state from spike to check against
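
Such a check could look like the sketch below. It assumes both the reference and the RTL state are dumped in the `name = value` text form shown in the checkpoint excerpt earlier; the eventual schema may differ, and the function names are hypothetical.

```python
def parse_state(text):
    """Parse 'name = value' lines into a dict; hex values become ints."""
    state = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines without '=', such as '...'
            value = value.strip()
            state[name.strip()] = (
                int(value, 16) if value.startswith("0x") else value
            )
    return state


def diff_state(reference, actual):
    """Return {register: (reference value, actual value)} for mismatches."""
    names = sorted(reference.keys() | actual.keys())
    return {
        name: (reference.get(name), actual.get(name))
        for name in names
        if reference.get(name) != actual.get(name)
    }
```

An empty result from `diff_state` means the RTL run ended in the same architectural state as the reference.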

Handling Large Interval Lengths

  • Real programs will use large intervals (1-10M instructions)
  • Selected intervals can't be run in their entirety in RTL simulation
  • Sub-sampling intervals with random sampling is required
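
One way to sub-sample a large interval, shown as an assumption rather than the planned design, is to divide it into fixed-size slots and draw a few slots uniformly at random without replacement. The function name and parameters are hypothetical.

```python
import random


def subsample(interval_start, interval_len, sample_len, num_samples, seed=0):
    """Pick non-overlapping windows inside one interval.

    Returns sorted (start instruction, length) pairs. Windows are aligned to
    multiples of sample_len so that they can never overlap.
    """
    slots = interval_len // sample_len
    rng = random.Random(seed)
    # random.sample raises ValueError if num_samples exceeds slots.
    chosen = sorted(rng.sample(range(slots), num_samples))
    return [(interval_start + slot * sample_len, sample_len) for slot in chosen]
```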

Applications

Performance and Power Evaluation

  • Fast, low-latency evaluation of HW parameters on long running workloads
    • Cache sizing between L1d vs L1i
    • Balancing of 2-level cache hierarchies
    • Unified vs separate i/d L2 caches
  • Trace extraction for power model construction
    • Currently power macromodels are built + trained only on workloads that can run in RTL simulation
    • TidalSim enables extraction of unique, short traces from full workloads
    • Potential to improve signal selection and uncover holes in training datasets

Issues with HW Fuzzer Evaluations

  • Last time: discussed deficiencies in existing HW fuzzing evaluations due to bad success/feedback metrics
    • Structural coverage is too easy to hit
    • Time to rediscover old bugs is too biased and forces us to use old RTL
  • Bad metrics ⮕ dubious conclusions
    • We should save >50% of mutated stimuli (vs <1% for SW fuzzers)
    • RTL-level feedback is useless for hitting bugs or improving coverage (vs SW fuzzers making no progress without feedback)

Can we synthesize metrics that lead to reasonable HW fuzzer evaluations?

Coverpoint Synthesis

  • Specification mining takes waveforms of an RTL design and synthesizes properties involving 2+ signals that are unfalsified on all traces
  • Coverpoint synthesis is an alternative take on spec mining where we synthesize μArch properties that we want to see more of
  • This technique is far more effective if we have many unique, realistic traces
    • Leverage interval clustering and sampled RTL simulation
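
To make the specification mining step concrete, the sketch below mines two-signal implications of the form "whenever a is 1, b is 1" that are unfalsified on all traces. It assumes each trace is a list of per-cycle dicts mapping signal names to 0 or 1 with the same signals in every cycle; the property template and the function name are illustrative only, and how mined properties would be turned into coverpoints is not shown.

```python
from itertools import permutations


def mine_implications(traces):
    """Return sorted (a, b) pairs where a == 1 implied b == 1 in every cycle.

    Pairs whose antecedent a was never 1 are dropped, since they hold only
    vacuously.
    """
    signals = sorted(traces[0][0])
    candidates = set(permutations(signals, 2))
    triggered = set()
    for trace in traces:
        for cycle in trace:
            for a, b in list(candidates):
                if cycle[a]:
                    if cycle[b]:
                        triggered.add((a, b))
                    else:
                        candidates.discard((a, b))  # falsified
    return sorted(pair for pair in candidates if pair in triggered)
```

With more unique, realistic traces, fewer spurious candidates survive, which is the reason the interval clustering and sampled RTL simulation matter here.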

Bootstrapping Fuzzing

  • Most HW fuzzers start from reset and run a binary on the SoC to hit some objective
  • Interesting objectives are harder to hit from reset vs from the middle of a workload
    • e.g. post-OS boot, in the middle of an application
  • Arch and μArch checkpoints from TidalSim guarantee reachability and provide starting points for HW fuzzers

Conclusion

  • We want rapid iteration wrt PPA evaluation and verification objectives
  • We need fast RTL-level simulation with trace extraction
  • We propose TidalSim, a multi-level simulation methodology to enable rapid HW iteration

TidalSim (github.com/euphoric-hardware/tidalsim)