Category | Examples | Throughput | Latency | Accuracy | Cost |
---|---|---|---|---|---|
JIT-based Simulators / VMs | qemu, KVM, VMware Fusion | 1-3 GIPS | <1 second | None | Minimal |
Architectural Simulators | spike, dromajo | 10-100+ MIPS | <1 second | None | Minimal |
General-purpose μArch Simulators | gem5, Sniper, ZSim, SST | 100 KIPS (gem5) - 100 MIPS (Sniper) | <1 minute | 10-50% IPC error | Minimal |
Bespoke μArch Simulators | Industry performance models | ≈ 0.1-1 MIPS | <1 minute | Close | $1M+ |
RTL Simulators | Verilator, VCS, Xcelium | 1-10 KIPS | 2-10 minutes | Cycle-exact | Minimal |
FPGA-Based Emulators | FireSim | ≈ 10 MIPS | 2-6 hours | Cycle-exact | $10k+ |
ASIC-Based Emulators | Palladium, Veloce | ≈ 0.5-10 MIPS | <1 hour | Cycle-exact | $10M+ |
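To make the throughput/latency trade-off in the table concrete, here is a rough back-of-the-envelope sketch. It assumes the "Latency" column can be treated as a fixed startup cost added to the run time, picks single points from the ranges in the table, and uses a hypothetical 100-billion-instruction workload; none of these choices come from the original slides.

```python
# Rough wall-clock estimate: startup latency + instructions / throughput.
# The throughput and latency values are single points picked from the ranges
# in the table above (an assumption); the workload size is hypothetical.

PLATFORMS = {
    # name: (instructions per second, startup latency in seconds)
    "JIT-based simulator (qemu)": (1e9, 1),
    "Architectural simulator (spike)": (50e6, 1),
    "uArch simulator (gem5)": (100e3, 60),
    "RTL simulator (Verilator)": (5e3, 5 * 60),
    "FPGA emulator (FireSim)": (10e6, 4 * 3600),
}


def wall_clock_seconds(instructions, ips, latency_s):
    """Time to simulate `instructions`, including the startup latency."""
    return latency_s + instructions / ips


if __name__ == "__main__":
    workload = 1e11  # hypothetical workload size in dynamic instructions
    for name, (ips, latency) in PLATFORMS.items():
        hours = wall_clock_seconds(workload, ips, latency) / 3600
        print(f"{name:35s} {hours:12.1f} h")
```

Under these assumptions the RTL simulator needs thousands of hours for the full workload, which is the gap that sampling is meant to close.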
Trends aren't enough[2]. Note the sensitivity differences - gradients are critical!
uArch simulators are not accurate enough for microarchitectural evaluation.
We want a tool to evaluate (RTL-level) microarchitectural changes on real workloads at high fidelity
Build a critical and novel tool to enable the "design-first" methodology and leverage our RTL designs
Instead of running the entire program in uArch simulation, run the entire program in functional simulation and only run samples in uArch simulation
The full workload is represented by a selection of sampling units.
Our proposal: Combine SimPoint-style representative sampling with SMARTS-style small intervals
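The sampling flow described above can be sketched roughly as follows. This is a minimal illustration, not the tool's implementation: it assumes each interval of $N$ instructions is summarized by a basic-block vector (BBV) collected during functional simulation, clusters the intervals with a plain k-means (with a deterministic farthest-point initialization chosen here for reproducibility), simulates only one representative per cluster in detail, and extrapolates whole-program IPC by weighting each representative's CPI by its cluster size. The function names `kmeans`, `pick_sampling_units`, and `estimate_ipc` are hypothetical.

```python
import numpy as np


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    for _ in range(1, k):
        # distance of every point to its nearest already-chosen centroid
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def pick_sampling_units(bbvs, num_clusters):
    """Return (indices of representative intervals, cluster weights)."""
    bbvs = np.asarray(bbvs, dtype=float)
    bbvs = bbvs / bbvs.sum(axis=1, keepdims=True)  # compare code mix, not counts
    centroids, labels = kmeans(bbvs, num_clusters)
    reps, weights = [], []
    for c in range(num_clusters):
        members = np.flatnonzero(labels == c)
        if len(members) == 0:
            continue
        d = np.linalg.norm(bbvs[members] - centroids[c], axis=1)
        reps.append(int(members[d.argmin()]))  # interval closest to the centroid
        weights.append(len(members) / len(bbvs))
    return reps, weights


def estimate_ipc(rep_ipcs, weights):
    """Intervals have equal instruction counts, so average CPI and invert."""
    cpi = sum(w / ipc for ipc, w in zip(rep_ipcs, weights))
    return 1.0 / cpi


if __name__ == "__main__":
    # Hypothetical two-phase workload: 60 intervals of one code mix, 40 of another.
    bbvs = [[8000, 2000, 0]] * 60 + [[1000, 1000, 8000]] * 40
    reps, weights = pick_sampling_units(bbvs, num_clusters=2)
    # IPCs of the representatives would come from detailed (uArch/RTL) simulation;
    # the values below are made up for the example.
    print(reps, weights, estimate_ipc([2.0, 0.5], weights))
```

Averaging CPI rather than IPC matters here: because every sampling unit covers the same number of instructions, total cycles are proportional to the weighted sum of per-unit CPI.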
The state from a sampling unit checkpoint is only architectural state. The microarchitectural state of the uArch simulator starts at the reset state!
wikisort benchmark from embench, $N = 10000$, $C = 18$, $n_{\text{detailed}} = 2000$
huffbench benchmark from embench, $N = 10000$, $C = 18$, $n_{\text{detailed}} = 2000$
Let's build a methodology for answering these questions
WIP: This methodology enables error analysis of sampling and warmup.
For a given workload interval and an interval length $N$ (e.g. $N = 10000$), and without functional warmup, we can compute this table. Each cell is the IPC error with respect to the full RTL simulation.
Detailed warmup offset ($n_{\text{offset}}$) ↓ / Detailed warmup instructions ($n_{\text{warmup}}$) → | 0 | 100 | 500 | 1000 | 2000 | 5000 |
---|---|---|---|---|---|---|
0 | Worst case | Offset error ↑, warmup error ↓ | Offset error 2↑, warmup error 2↓ | Offset error 3↑, warmup error 3↓ | Offset error 4↑, warmup error 4↓ | Maximum offset error |
-100 | Invalid | No offset error | '' | '' | '' | '' |
-500 | Invalid | Invalid | No offset error | '' | '' | '' |
-1000 | Invalid | Invalid | Invalid | No offset error | '' | '' |
-2000 | Invalid | Invalid | Invalid | Invalid | No offset error | '' |
-5000 | Invalid | Invalid | Invalid | Invalid | Invalid | No offset error, best case |
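The structure of the table above can be sketched in code. This is one reading of the table and not a confirmed definition: it assumes detailed warmup starts $n_{\text{offset}}$ instructions relative to the interval start (negative means before it) and runs for $n_{\text{warmup}}$ instructions, that a configuration is invalid when warmup ends before the interval begins, that any warmup instructions falling inside the interval are excluded from the IPC measurement (the source of offset error), and that each cell holds a relative IPC error. The helper names are hypothetical.

```python
WARMUP_LENGTHS = [0, 100, 500, 1000, 2000, 5000]       # n_warmup (columns)
WARMUP_OFFSETS = [0, -100, -500, -1000, -2000, -5000]  # n_offset (rows)


def classify(n_offset, n_warmup):
    """Label a table cell under the assumptions stated in the lead-in."""
    gap = -n_offset - n_warmup  # instructions between warmup end and interval start
    if gap > 0:
        return "invalid"  # warmup would end before the interval begins
    if gap == 0:
        return "no warmup" if n_warmup == 0 else "no offset error"
    return "offset error"  # warmup overlaps the start of the interval


def measured_window(interval_start, N, n_offset, n_warmup):
    """Return (warmup_start, measure_start, measure_end) as instruction indices."""
    if classify(n_offset, n_warmup) == "invalid":
        raise ValueError("warmup ends before the interval starts")
    warmup_start = interval_start + n_offset
    measure_start = warmup_start + n_warmup
    return warmup_start, measure_start, interval_start + N


def ipc_error(ipc_sampled, ipc_full_rtl):
    """Relative IPC error of a sampled run against the full RTL simulation."""
    return abs(ipc_sampled - ipc_full_rtl) / ipc_full_rtl


if __name__ == "__main__":
    for n_offset in WARMUP_OFFSETS:
        row = [classify(n_offset, n_warmup) for n_warmup in WARMUP_LENGTHS]
        print(f"{n_offset:6d} | " + " | ".join(f"{cell:15s}" for cell in row))
```

Printing the grid reproduces the triangular shape of the table: invalid cells below the diagonal, no offset error on it, and growing offset error to its right.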
Given the data in the table for every interval and for different interval lengths $N$, fit the following model: