Fast RTL-level μArch simulation and performance trace estimation enables rapid RTL iteration with performance evaluation on real workloads
How can we achieve high throughput, high fidelity, low latency μArch simulation?
Simulator | Throughput | Latency | Fidelity |
---|---|---|---|
ISA Simulation | 10-100+ MIPS | <1 second | None |
μArch Perf Sim | 100 KIPS (gem5) | 5-10 seconds | 5-10% avg IPC error |
RTL Simulation | 1-10 KIPS | 5-10 minutes | cycle-exact |
FireSim (FPGA) | 1-50 MIPS | 2-6 hours | cycle-exact |
TidalSim | 10 MIPS (unoptimized) | <1 minute | <5% error, 10k intervals |
What makes RTL-level sampled simulation interesting?
Multi-level simulation with RTL-level injection hasn't been done before. So we should try!
For each cluster, take the sample that is closest to its centroid
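A minimal sketch of this selection step, assuming each sample has already been reduced to a numeric feature vector and assigned a cluster label. The names `closest_to_centroid`, `features`, and `labels` are hypothetical and are not taken from the TidalSim code base; Euclidean distance is also an assumption.

```python
import math

def closest_to_centroid(features, labels):
    """Return {cluster_id: index of the sample nearest that cluster's centroid}."""
    # Group sample indices by their cluster label.
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    chosen = {}
    for label, idxs in members.items():
        dim = len(features[idxs[0]])
        # Centroid = per-dimension mean of the cluster's feature vectors.
        centroid = [sum(features[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Keep the member with the smallest Euclidean distance to the centroid.
        chosen[label] = min(idxs, key=lambda i: math.dist(features[i], centroid))
    return chosen

# Toy data (illustrative only): six samples, two clusters.
features = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5],
            [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = [0, 0, 0, 1, 1, 1]
representatives = closest_to_centroid(features, labels)
```

Only the chosen representatives need to be simulated in RTL; every other sample in a cluster reuses its representative's result.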
Capture arch checkpoints at the start of each chosen sample
pc = 0x0000000080000332
priv = M
fcsr = 0x0000000000000000
mtvec = 0x8000000a00006000
...
x1 = 0x000000008000024a
x2 = 0x0000000080028fa0
...
An arch checkpoint = arch state + raw memory contents
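A sketch of how the arch-state half of such a checkpoint could be read back, assuming it is stored as `name = value` text lines like the dump above. The function name `parse_arch_state` is hypothetical, and the on-disk format actually used by TidalSim may differ.

```python
def parse_arch_state(text):
    """Parse 'name = value' lines into a dict; hex values become ints."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and elided ('...') lines
        name, value = (part.strip() for part in line.split("=", 1))
        # Registers are hex; fields such as priv stay as strings.
        state[name] = int(value, 16) if value.startswith("0x") else value
    return state

dump = """\
pc = 0x0000000080000332
priv = M
fcsr = 0x0000000000000000
...
x1 = 0x000000008000024a
x2 = 0x0000000080028fa0
"""
state = parse_arch_state(dump)
```

The raw memory contents would be loaded separately; this sketch covers only the register state.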
force/release via custom test harness
fflags in fcsr are FP exception bits from FPU μArch state
FPRs in Rocket are stored in recoded 65-bit format (not IEEE floats)
cycles,instret
1219,100
125,100
126,100
123,100
114,100
250,100
113,100
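A short sketch of turning these counters into IPC, assuming each CSV row covers one measurement window with the `cycles` and `instret` columns shown above. The helper name `ipc_per_row` is hypothetical.

```python
import csv
import io

perf_csv = """\
cycles,instret
1219,100
125,100
126,100
123,100
114,100
250,100
113,100
"""

def ipc_per_row(text):
    """Return (cycles, instret, ipc) for every row of the perf CSV."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        cycles, instret = int(row["cycles"]), int(row["instret"])
        rows.append((cycles, instret, instret / cycles))
    return rows

rows = ipc_per_row(perf_csv)
ipcs = [ipc for _, _, ipc in rows]
# Aggregate IPC weights each row by its cycle count rather than averaging ratios.
aggregate_ipc = sum(i for _, i, _ in rows) / sum(c for c, _, _ in rows)
```

The aggregate is total instructions over total cycles (700 / 2070 here), which is not the same as the mean of the per-row IPC values.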
Performance metrics for one sample in a cluster are representative of all samples in that cluster
Extrapolate on the entire execution trace to get a full IPC trace
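The extrapolation can be sketched as a lookup: every interval in the full trace takes the IPC measured for its cluster's representative sample. The names and toy values below are hypothetical, and equal-length intervals of N instructions are assumed.

```python
def extrapolate_ipc(labels, sample_ipc):
    """Assign each interval the IPC of its cluster's representative sample."""
    return [sample_ipc[label] for label in labels]

def overall_ipc(ipc_trace, interval_len):
    """Whole-program IPC estimate from a per-interval IPC trace."""
    total_instrs = interval_len * len(ipc_trace)
    # Estimated cycles for an interval = instructions / IPC.
    total_cycles = sum(interval_len / ipc for ipc in ipc_trace)
    return total_instrs / total_cycles

# Toy inputs (illustrative only): cluster label per interval, and the
# IPC measured in RTL simulation for each cluster's representative.
labels = [0, 0, 1, 2, 1, 0]
sample_ipc = {0: 0.8, 1: 0.4, 2: 0.5}

ipc_trace = extrapolate_ipc(labels, sample_ipc)
estimate = overall_ipc(ipc_trace, interval_len=100)
```

The estimate is only as good as the assumption stated above, that one sample's performance is representative of its whole cluster.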
N=1000, C=12
N=10000, C=18
[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Barr, Kenneth C., et al. "Accelerating multiprocessor simulation with a memory timestamp record." ISPASS 2005.
TidalSim (github.com/euphoric-hardware/tidalsim): forks of spike, chipyard, testchipip + top-level runner