More evaluations = more opportunities for optimization
We want an "Evaluator" that has low latency, high throughput, high accuracy, low cost, and rich output collateral
Existing tools cannot deliver on all axes
Category | Examples | Throughput | Latency | Accuracy | Cost |
---|---|---|---|---|---|
JIT-based Simulators / VMs | QEMU, KVM, VMware Fusion | 1-3 GIPS | <1 second | None | Minimal |
Architectural Simulators | spike, dromajo | 10-100+ MIPS | <1 second | None | Minimal |
General-purpose μArch Simulators | gem5, Sniper, ZSim, SST | 100 KIPS (gem5) - 100 MIPS (Sniper) | <1 minute | 10-50% IPC error | Minimal |
Bespoke μArch Simulators | Industry performance models | ≈ 0.1-1 MIPS | <1 minute | Close | $1M+ |
RTL Simulators | Verilator, VCS, Xcelium | 1-10 KIPS | 2-10 minutes | Cycle-exact | Minimal |
FPGA-Based Emulators | FireSim | ≈ 10 MIPS | 2-6 hours | Cycle-exact | $10k+ |
ASIC-Based Emulators | Palladium, Veloce | ≈ 0.5-10 MIPS | <1 hour | Cycle-exact | $10M+ |
Multi-level Sampled Simulation | TidalSim | 10+ MIPS | <1 minute | <1% IPC error | Minimal |
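To make the throughput/latency trade-off in the table concrete, here is a rough back-of-the-envelope sketch. It assumes that "latency" means the startup time before the first instruction is simulated (build, compile, or bitstream generation), which the table does not spell out; `TimeToResult` and its parameter names are hypothetical.

```scala
// Sketch only: a simple time-to-result model, not a measured result.
object TimeToResult {
  // Wall-clock seconds to evaluate one design point:
  // startup latency plus (instructions / throughput).
  def seconds(insts: Double, instsPerSec: Double, startupSec: Double): Double = {
    require(instsPerSec > 0, "throughput must be positive")
    startupSec + insts / instsPerSec
  }
}
```

Under these assumptions, a 1-billion-instruction workload on an RTL simulator at 10 KIPS with a 10-minute startup gives `TimeToResult.seconds(1e9, 1e4, 600)`, about 100,600 seconds (roughly 28 hours). An FPGA emulator at 10 MIPS with a 2-hour build gives `TimeToResult.seconds(1e9, 1e7, 7200)`, about 7,300 seconds, almost all of it startup.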
What if we had a simulator that:
TidalSim is not a new simulator. It is a simulation methodology that combines the strengths of architectural simulators, uArch models, and RTL simulators.
Instead of running the entire program in uArch simulation, run the entire program in functional simulation and only run samples in uArch simulation
The full workload is represented by a selection of sampling units.
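The idea of representing a workload by sampling units can be sketched as follows. This is a minimal illustration that assumes simple periodic sampling and a total-instructions-over-total-cycles extrapolation; the slides do not specify how TidalSim selects or weights its sampling units, and all names here (`SampledSim`, `SampleResult`, `sampleStarts`, `estimateIpc`) are hypothetical.

```scala
// Sketch only: periodic sampling is an assumption, not TidalSim's selection policy.
object SampledSim {
  // One sampling unit: where it starts in the dynamic instruction stream and
  // what the detailed (uArch or RTL) simulation measured for it.
  final case class SampleResult(startInst: Long, insts: Long, cycles: Long) {
    def ipc: Double = insts.toDouble / cycles.toDouble
  }

  // Split the program into units of `unitLen` instructions and return the
  // starting instruction index of every `period`-th unit.
  def sampleStarts(totalInsts: Long, unitLen: Long, period: Int): Seq[Long] = {
    require(unitLen > 0 && period > 0)
    val numUnits = totalInsts / unitLen
    (0L until numUnits by period.toLong).map(_ * unitLen)
  }

  // Estimate whole-program IPC from the sampled units only.
  def estimateIpc(samples: Seq[SampleResult]): Double = {
    require(samples.nonEmpty)
    samples.map(_.insts).sum.toDouble / samples.map(_.cycles).sum.toDouble
  }

  // Extrapolate the cycle count of the full run from the sampled IPC.
  def estimateCycles(totalInsts: Long, samples: Seq[SampleResult]): Double =
    totalInsts / estimateIpc(samples)
}
```

The functional simulator would run the whole program and take a checkpoint at each index returned by `sampleStarts`; only those units are then replayed in detailed simulation to produce the `SampleResult` values.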
The state from a sampling unit checkpoint is only architectural state. The microarchitectural state of the uArch simulator starts at the reset state!
Long-lived microarchitectural state (caches, branch predictors, prefetchers, TLBs) has a substantial impact on the performance of a sampling unit
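One common remedy is functional warmup: replaying the memory accesses seen during functional simulation through a lightweight cache model, so that the detailed simulator can start from warmed tag state instead of reset. The sketch below assumes a set-associative cache with LRU replacement and non-negative addresses; `FunctionalWarmup`, `CacheTags`, and `warm` are hypothetical names, not part of any existing tool.

```scala
// Sketch only: a tag-only cache model for warmup, under the assumptions stated above.
object FunctionalWarmup {
  // Tag state of a set-associative cache with LRU replacement.
  // Each set stores its resident tags, most recently used first.
  final class CacheTags(val nSets: Int, val nWays: Int, val blockBytes: Int) {
    private val sets = Array.fill(nSets)(Vector.empty[Long])

    // Touch one address and return whether it hit. The accessed tag moves to
    // the front of its set; the least recently used tag is dropped if the set overflows.
    def access(addr: Long): Boolean = {
      val block = addr / blockBytes
      val set = (block % nSets).toInt
      val tag = block / nSets
      val hit = sets(set).contains(tag)
      sets(set) = (tag +: sets(set).filterNot(_ == tag)).take(nWays)
      hit
    }

    // Resident tags of one set, most recently used first.
    def tagsIn(set: Int): Vector[Long] = sets(set)
  }

  // Replay the addresses observed before the checkpoint to warm the tag state.
  def warm(cache: CacheTags, addrs: Iterable[Long]): CacheTags = {
    addrs.foreach(a => cache.access(a))
    cache
  }
}
```

The warmed tags from `tagsIn` would then be injected into the detailed simulator's cache arrays alongside the architectural checkpoint. Only tags are tracked here because hit/miss behavior depends on which blocks are resident, not on their data.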
[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Eeckhout, L., 2008. Sampled processor simulation: A survey. Advances in Computers. Elsevier.
This RTL-first evaluation flow is enabled by highly parameterized RTL generators and SoC design frameworks (e.g. Chipyard).
N=10000, C=18
Typical IPC error (without functional warmup and with fine time-domain precision of 10k instructions) is < 5%
L1d functional warmup brings IPC error from 7% to 2%
[1]: Eeckhout, Lieven, et al. "Exploiting Program Microarchitecture Independent Characteristics and Phase Behavior for Reduced Benchmark Suite Simulation." IISWC 2005.
class RegFile(n: Int, w: Int, zero: Boolean = false) {
  val rf = Mem(n, UInt(w.W))
  // Annotate each register as architectural state, mapped to the matching RISC-V GPR
  (0 until n).foreach { i => archStateAnnotation(rf(i), Riscv.I.GPR(i)) }
  // ...
}
class L1MetadataArray[T <: L1Metadata] extends L1HellaCacheModule()(p) {
  // ...
  val tag_array = SyncReadMem(nSets, Vec(nWays, UInt(metabits.W)))
  // Annotate every (set, way) entry of the tag array as microarchitectural state
  for (set <- 0 until nSets; way <- 0 until nWays) {
    uArchStateAnnotation(tag_array.read(set.U)(way), Uarch.L1.tag(set, way, cacheType=I))
  }
}