Discussion with Intel (March 2024)

TidalSim: Fast and Accurate Microarchitectural Simulation via Sampled RTL Simulation

Vighnesh Iyer, Raghav Gupta, Dhruv Vaish, Charles Hong, Sophia Shao, Bora Nikolic

Overview for Intel

Tuesday, March 5th, 2024

Talk Outline

  1. Motivation
  2. What is TidalSim?
  3. Background on Sampled Simulation
  4. Our Prototype Implementation
  5. Functional Warmup for Sampled RTL Simulation
  6. Next steps
    • Taking multiple samples per cluster
    • Binary-agnostic interval embeddings
    • Streaming sampled simulation
    • FIRRTL pass for injection testharness generation

Motivation

Microarchitecture Design Challenges

  • We want optimal designs for heterogeneous, domain-specialized, workload-tuned SoCs
  • Limited time to iterate on microarchitecture and optimize PPA on real workloads
  • Time per evaluation (microarchitectural iteration loop) limits number of evaluations

More evaluations = more opportunities for optimization

The Microarchitectural Iteration Loop

We want an "Evaluator" that has low latency, high throughput, high accuracy, low cost, and rich output collateral

Existing tools cannot deliver on all axes

Existing Hardware Simulation Techniques

Examples Throughput Latency Accuracy Cost
JIT-based Simulators / VMs qemu, KVM, VMWare Fusion 1-3 GIPS <1 second None Minimal
Architectural Simulators spike, dromajo 10-100+ MIPS <1 second None Minimal
General-purpose μArch Simulators gem5, Sniper, ZSim, SST 100 KIPS (gem5) - 100 MIPS (Sniper) <1 minute 10-50% IPC error Minimal
Bespoke μArch Simulators Industry performance models ≈ 0.1-1 MIPS <1 minute Close $1M+
RTL Simulators Verilator, VCS, Xcelium 1-10 KIPS 2-10 minutes Cycle-exact Minimal
FPGA-Based Emulators Firesim ≈ 10 MIPS 2-6 hours Cycle-exact $10k+
ASIC-Based Emulators Palladium, Veloce ≈ 0.5-10 MIPS <1 hour Cycle-exact $10M+
Multi-level Sampled Simulation TidalSim 10+ MIPS <1 minute <1% IPC error Minimal

Our Proposal

What if we had a simulator that:

  1. Is fast enough to run real workloads on heterogeneous SoCs
  2. Is accurate enough to use confidently for microarchitectural DSE
  3. Has low latency to not bottleneck the hardware iteration loop
  4. Can produce RTL-level collateral for applications from power modeling to application-level profiling to verification
    • Can automatically isolate and extract benchmarks from long workloads by identifying unique aspects with respect to power, performance, and functionality

TidalSim Overview

TidalSim: a fast, accurate, low latency, low cost microarchitectural simulation methodology that produces RTL-level collateral for performance estimation and verification on real workloads.

TidalSim Components

Overview of the components of TidalSim.

TidalSim is not a new simulator. It is a simulation methodology that combines the strengths of architectural simulators, uArch models, and RTL simulators.

TidalSim Execution

TidalSim moves simulation execution back and forth between architectural, uArch, and RTL simulators based on dynamic workload analysis.

Background on Sampled Simulation

Sampled Simulation

Instead of running the entire program in uArch simulation, run the entire program in functional simulation and only run samples in uArch simulation

The full workload is represented by a selection of sampling units.

  1. How should sampling units be selected?
  2. How can we accurately estimate the performance of a sampling unit?
  3. How can we estimate errors when extrapolating from sampling units?

Existing Sampling Techniques

SimPoint

  • Program execution traces aren’t random
    • They execute the same code again-and-again
    • Workload execution traces can be split into phases that exhibit similar μArch behavior
  • SimPoint-style representative sampling
    • Compute an embedding for each program interval (e.g. blocks of 100M instructions)
    • Cluster interval embeddings using k-means
    • Choose representative intervals from each cluster as sampling units

SMARTS

  • Rigorous statistical sampling enables computation of confidence bounds
    • Use random sampling on a full execution trace to derive a population sample
    • Central limit theorem provides confidence bounds
  • SMARTS-style random sampling
    • Pick a large number of samples to take before program execution
    • If the sample variance is too high after simulation, then collect more sampling units
    • Use CLT to derive a confidence bound for the aggregate performance metric

Functional Warmup

The state from a sampling unit checkpoint is only architectural state. The microarchitectural state of the uArch simulator starts at the reset state!

  • We need to seed long-lived uArch state at the beginning of each sampling unit
  • This process is called functional warmup

Importance of Functional Warmup

Long-lived microarchitectural state (caches, branch predictors, prefetchers, TLBs) has a substantial impact on the performance of a sampling unit

AMAT Error vs # of detailed warmup instructions [1]
MPKI vs warmup vs sampling unit length for different branch predictors[2]

[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Eeckhout, L., 2008. Sampled processor simulation: A survey. Advances in Computers. Elsevier.

Why RTL-Level Sampled Simulation?

  • Eliminate modeling errors
    • Remaining errors can be handled via statistical techniques
  • No need to correlate performance model and RTL
    • Let the RTL serve as the source of truth
  • Can produce RTL-level collateral
    • Leverage for applications in verification and power modeling

This RTL-first evaluation flow is enabled by highly parameterized RTL generators and SoC design frameworks (e.g. Chipyard).

Implementation of TidalSim v0.1

Sampled RTL simulation with no warmup

Overview of the TidalSim v0.1 Flow

Implementation Details For TidalSim v0.1

  • Basic block identification
    • BB identification from spike commit log or from static ELF analysis
  • Basic block embedding of intervals
  • Clustering and checkpointing
    • k-means, PCA-based n-clusters
    • spike-based checkpoints
  • RTL simulation and performance metric extraction
    • Custom force-based RTL state injection, out-of-band IPC measurement
  • Extrapolation
    • Estimate IPC of each interval based on its embedding and distances to RTL-simulated intervals

IPC Trace Prediction: huffbench

  • Huffman compression from Embench (huffbench)
  • N=10000, C=18
  • Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
  • Large IPC variance

IPC Trace Prediction: wikisort

  • Merge sort benchmark from Embench (wikisort)
  • N=10000, C=18
  • Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
  • Can capture general trends and time-domain IPC variation

Aggregate IPC Prediction for Embench Suite

Typical IPC error (without functional warmup and with fine time-domain precision of 10k instructions) is < 5%

CoreMark Smoke Test

  • NO functional warmup
  • 10k instruction intervals, 30 clusters, 2k detailed warmup
  • Larger working set means functional warmup is crucial

Functional Warmup of L1d Cache + Early Results

Overall Functional Warmup Flow

  • uarch-agnostic cache checkpoints as memory timestamp record (MTR) checkpoints
  • Convert MTR checkpoints into concrete cache state with specific cache parameters DRAM contents
  • RTL simulation harness injects cache state into L1d tag+data arrays via 2d reg forcing

Memory Timestamp Record

  • Construct MTR table from a memory trace, save MTR tables at checkpoint times
  • Given a cache with n sets, group block addresses by set index
  • Given a cache with k ways, pick the k most recently accessed addresses from each set
  • Knowing every resident cache line, fetch the data from the DRAM dump

Results on wikisort

No functional warmup, there are significant IPC underpredictions
L1d functional warmup, errors are substantially reduced

L1d functional warmup brings IPC error from 7% to 2%

Caveats

  • Currently all cache lines are marked as dirty, even if read-only
  • There seems to be a lingering bug causing unaligned DRAM accesses from L1, still under investigation
  • Injection harness is hardcoded for a specific Rocket L1 cache configuration
  • Cannot perform L2 injection or handle multiple cores

Next Steps

Multiple Samples per Cluster

  • Currently we take the one interval closest to each cluster centroid for RTL simulation
  • Problems
    • We have no idea how representative the chosen interval is of the cluster
    • We have few data points when performing extrapolation
    • No simulations are performed for points that are in-between two or more clusters
  • Pick both the cluster centroid point and random points within each cluster for simulation

Binary-agnostic Interval Embeddings

  • BBVs are tied to a binary
    • Can't share embeddings and perf data across simulations
    • Embeddings are based on PCs, which is complicated with virtual memory
    • There may be different execution patterns of the same code
  • We will perform a detailed study of alternative embeddings[1]
    • Instruction mix: loads, stores, control, arith, integer, fp
    • ILP: in varying window sizes (32, 64, ...)
    • Register traffic: avg input operands, number of times a register is consumed, register dependency chains
    • Working set: number of unique 32B/4K blocks touched in an interval
    • Data stream strides: measure of spatial locality in temporally adjacent memory accesses
    • Branch predictability: use an upper-limit branch prediction algorithm (Prediction by Partial Matching)
  • Need to balance embedding complexity and accuracy
    • Some 'neural' approaches are quite expensive (see: NPS)

[1]: Eeckout, Lieven, et. al. - Exploiting Program Microarchitecture Independent Characteristics and Phase Behavior for Reduced Benchmark Suite Simulation (IISWC 2005)

Streaming Sampled simulation

  • Currently, we need multiple passes of spike (2 full runs, 2 log traversals)
    • Collect commit log to extract basic blocks
    • Re-traverse commit log to build embedding matrix
    • After clustering, collect arch checkpoints
  • For longer programs, this is expensive and wasteful
  • Integrate embedding, clustering, and checkpointing in one pass
    • Fixed feature vector size
    • Streaming clustering algorithm
    • Ability to take checkpoints programatically during spike execution
    • Multicore pipelining to mitigate throughput bottlenecks

Leveraging HDLs for TidalSim Methodology

  • HW DSE with TidalSim requires an RTL injection harness
  • Automatic harness generation using high-level HDLs
    • Chisel API to semantically mark arch and uArch state
    • FIRRTL pass to generate a state-injecting test harness

  class RegFile(n: Int, w: Int, zero: Boolean = false) {
    val rf = Mem(n, UInt(w.W))
    (0 until n).map { archStateAnnotation(rf(n), Riscv.I.GPR(n)) }
    // ...
  }
    

  class L1MetadataArray[T <: L1Metadata] extends L1HellaCacheModule()(p) {
    // ...
    val tag_array = SyncReadMem(nSets, Vec(nWays, UInt(metabits.W)))
    (0 until nSets).zip((0 until nWays)).map { case (set, way) =>
      uArchStateAnnotation(tag_array.read(set)(way), Uarch.L1.tag(set, way, cacheType=I))
    }
  }
    

Conclusion

  • We want to enable rapid RTL iteration with performance evaluation and generation of RTL-level collateral
  • We need fast, low-startup-latency RTL-level simulation
  • We propose TidalSim, a simulation methodology based on sampled RTL simulation
  • Everything is open source