Fast RTL-level μArch simulation and performance metric / interesting trace extraction enables rapid RTL iteration with performance, power modeling, and verification evaluation on real workloads
How can we achieve high-throughput, high-fidelity, low-latency μArch simulation with RTL-level interesting trace extraction?
Simulator | Throughput | Latency | Fidelity |
---|---|---|---|
ISA Simulation | 10-100+ MIPS | <1 second | None |
μArch Perf Sim | 100 KIPS (gem5) | 5-10 seconds | 5-10% avg IPC error |
RTL Simulation | 1-10 KIPS | 5-10 minutes | cycle-exact |
FireSim (FPGA) | 1-50 MIPS | 2-6 hours | cycle-exact |
TidalSim | 10 MIPS | <1 minute | <5% error, 10k intervals |
Basic blocks are extracted from the dynamic commit log emitted by Spike
core 0: >>>> memchr
core 0: 0x00000000800012f6 (0x0ff5f593) andi a1, a1, 255
core 0: 0x00000000800012fa (0x0000962a) c.add a2, a0
core 0: 0x00000000800012fc (0x00c51463) bne a0, a2, pc + 8
core 0: 0x0000000080001304 (0x00054783) lbu a5, 0(a0)
core 0: 0x0000000080001308 (0xfeb78de3) beq a5, a1, pc - 6
0: 0x8000_12f6 ⮕ 0x8000_12fc
1: 0x8000_1304 ⮕ 0x8000_1308
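A minimal sketch of how basic blocks could be pulled out of a commit log in the format shown above. It assumes that a block ends at any control-flow instruction and identifies each block by its (start PC, end PC) pair. The names `parse_commit_log`, `extract_basic_blocks`, and the `CONTROL_FLOW` mnemonic set are hypothetical and not exhaustive; this simple version also does not split a block when a later jump lands in its middle.

```python
import re

# Matches commit-log lines such as:
#   core 0: 0x00000000800012f6 (0x0ff5f593) andi a1, a1, 255
# Symbol lines such as "core 0: >>>> memchr" do not match and are skipped.
LINE_RE = re.compile(r"core\s+\d+:\s+0x([0-9a-fA-F]+)\s+\(0x[0-9a-fA-F]+\)\s+(\S+)")

# Assumption: a non-exhaustive set of mnemonics that terminate a basic block.
CONTROL_FLOW = {
    "beq", "bne", "blt", "bge", "bltu", "bgeu",
    "jal", "jalr", "j", "jr", "ret",
    "c.j", "c.jr", "c.jal", "c.jalr", "c.beqz", "c.bnez",
}


def parse_commit_log(lines):
    """Return a list of (pc, mnemonic) tuples for every committed instruction."""
    trace = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            trace.append((int(m.group(1), 16), m.group(2)))
    return trace


def extract_basic_blocks(trace):
    """Map each (start_pc, end_pc) block to an integer ID in order of discovery."""
    blocks = {}
    start = None
    for pc, mnemonic in trace:
        if start is None:
            start = pc  # first instruction after a control-flow instruction
        if mnemonic in CONTROL_FLOW:
            blocks.setdefault((start, pc), len(blocks))
            start = None
    return blocks
```

Run on the five-instruction excerpt above, this yields two blocks, 0x800012f6 to 0x800012fc and 0x80001304 to 0x80001308, which matches the listing.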
An execution trace is captured from ISA-level simulation
core 0: >>>> memchr
core 0: 0x00000000800012f6 (0x0ff5f593) andi a1, a1, 255
core 0: 0x00000000800012fa (0x0000962a) c.add a2, a0
core 0: 0x00000000800012fc (0x00c51463) bne a0, a2, pc + 8
core 0: 0x0000000080001304 (0x00054783) lbu a5, 0(a0)
core 0: 0x0000000080001308 (0xfeb78de3) beq a5, a1, pc - 6
core 0: 0x000000008000130c (0x00000505) c.addi a0, 1
core 0: 0x000000008000130e (0x0000b7fd) c.j pc - 18
core 0: 0x00000000800012fc (0x00c51463) bne a0, a2, pc + 8
The trace is grouped into intervals of N instructions
Typical N for SimPoint samples is 1M
Typical N for SMARTS samples is 10-100k
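The grouping step can be sketched as a simple chunking of the trace. The helper name `split_into_intervals` is hypothetical, and keeping the final partial interval (rather than dropping it) is an assumption.

```python
def split_into_intervals(trace, n):
    """Split a list of committed instructions into consecutive intervals of n.

    The last interval may be shorter than n if len(trace) is not a multiple of n.
    """
    if n <= 0:
        raise ValueError("interval length must be positive")
    return [trace[i:i + n] for i in range(0, len(trace), n)]
```

For example, a trace of 10 instructions with n = 4 gives intervals of length 4, 4, and 2.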
Embed each interval as a vector that counts how often it traversed each identified basic block
Interval index | Interval length | Embedding |
---|---|---|
n | 100 | [40, 50, 0, 10] |
n+1 | 100 | [0, 50, 10, 40] |
n+2 | 100 | [0, 20, 20, 80] |
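One way to compute such an embedding, as a sketch: count, for every instruction PC in the interval, which basic block it falls into. This assumes blocks are given as sorted, non-overlapping, inclusive (start PC, end PC) ranges and that the count is per instruction rather than per block entry; the name `embed_interval` is hypothetical.

```python
from bisect import bisect_right


def embed_interval(interval_pcs, blocks):
    """Return a vector with one instruction count per basic block.

    blocks: list of (start_pc, end_pc) tuples, inclusive, sorted by start_pc
    and non-overlapping (assumption). PCs outside every block are ignored.
    """
    starts = [start for start, _ in blocks]
    vec = [0] * len(blocks)
    for pc in interval_pcs:
        i = bisect_right(starts, pc) - 1  # last block starting at or before pc
        if i >= 0 and pc <= blocks[i][1]:
            vec[i] += 1
    return vec
```

With per-instruction counting, the entries of an embedding sum to at most the interval length, and to exactly the interval length when every instruction belongs to an identified block.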
Intervals are clustered using k-means clustering on their embeddings
For each cluster, take the sample that is closest to its centroid
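The clustering and sample selection can be sketched in plain Python as below. This is a basic Lloyd's k-means with random initialization and squared Euclidean distance, which is an assumption; the source does not say which k-means variant, initialization, or distance metric is used. The names `kmeans` and `pick_representatives` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Cluster points into k groups; returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def pick_representatives(points, assignment, centroids):
    """For each non-empty cluster, return the index of the point nearest its centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            reps[c] = min(members, key=lambda i: sq_dist(points[i], centroid))
    return reps
```

The indices returned by `pick_representatives` are interval indices, so they identify where in the trace each chosen sample begins.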
Capture arch checkpoints at the start of each chosen sample
pc = 0x0000000080000332
priv = M
fcsr = 0x0000000000000000
mtvec = 0x8000000a00006000
...
x1 = 0x000000008000024a
x2 = 0x0000000080028fa0
...
An arch checkpoint = arch state + raw memory contents
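As an illustration, a checkpoint could be represented as a small record holding the parsed register state and a memory image. The `ArchCheckpoint` class and `parse_arch_state` helper are hypothetical, and the `name = value` text format is inferred only from the excerpt above.

```python
from dataclasses import dataclass


@dataclass
class ArchCheckpoint:
    state: dict          # register/CSR name -> int (hex values) or str (e.g. priv)
    memory: bytes = b""  # raw memory contents at the checkpoint


def parse_arch_state(text):
    """Parse lines of the form 'name = value' into a dict.

    Values starting with 0x are converted to integers; anything else
    (such as the privilege mode 'M') is kept as a string. Lines without
    '=' (such as '...') are skipped.
    """
    state = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, value = (s.strip() for s in line.split("=", 1))
        state[name] = int(value, 16) if value.startswith("0x") else value
    return state
```

A checkpoint would then be built as `ArchCheckpoint(parse_arch_state(dump), memory_image)`, where `dump` and `memory_image` come from the ISA simulator.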
fflags in fcsr are FP exception bits from FPU μArch state
FPRs in Rocket are stored in recoded 65-bit format (not IEEE floats)
cycles,instret
1219,100
125,100
126,100
123,100
114,100
250,100
113,100
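A short sketch of turning a `cycles,instret` log like the one above into per-row IPC values. The function name `ipc_from_perf_csv` is hypothetical; the column names are taken from the header shown.

```python
import csv
import io


def ipc_from_perf_csv(text):
    """Return instret / cycles for every row of a 'cycles,instret' CSV string."""
    reader = csv.DictReader(io.StringIO(text))
    return [int(row["instret"]) / int(row["cycles"]) for row in reader]
```

For instance, a row of `125,100` corresponds to an IPC of 0.8.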
Performance metrics for one sample in a cluster are representative of all samples in that cluster
Extrapolate on the entire execution trace to get a full IPC trace
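The extrapolation can be sketched as a lookup: every interval inherits the IPC measured for its cluster's representative sample. The function names are hypothetical, and computing whole-program IPC as total instructions over total estimated cycles assumes all intervals have the same length `n`.

```python
def extrapolate_ipc(assignment, cluster_ipc):
    """Return one estimated IPC per interval, taken from its cluster's sample.

    assignment: cluster ID for each interval, in trace order.
    cluster_ipc: mapping from cluster ID to the IPC measured on its sample.
    """
    return [cluster_ipc[c] for c in assignment]


def overall_ipc(assignment, cluster_ipc, n):
    """Estimate whole-trace IPC as total instructions / total estimated cycles."""
    total_instructions = n * len(assignment)
    total_cycles = sum(n / cluster_ipc[c] for c in assignment)
    return total_instructions / total_cycles
```

Because cycles add up across intervals, the whole-trace figure is a harmonic rather than arithmetic mean of the per-interval IPC values.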
N=1000, C=12
[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Barr, Kenneth C., et al. "Accelerating multiprocessor simulation with a memory timestamp record." ISPASS 2005.
Can we synthesize metrics that lead to reasonable HW fuzzer evaluations?