End of Moore's Law
→ $/transistor not falling
→ Transistors are no longer free
→ Need aggressive PPA optimization
End of Dennard Scaling
→ Power density increasing
→ GPP performance stagnating
→ Need domain-specialization to not stagnate
Motivates two trends in SoC design
More evaluations = more opportunities for optimization
We want an "Evaluator" that has low latency, high throughput, high accuracy, low cost, and rich output collateral
Existing tools cannot deliver on all axes
We will propose a simulator that can deliver on all axes.
What if we had a simulator that:
Sampling Technique | Interval Length (instructions) | # of Intervals Simulated | Interval Selection | Functional Warmup | Detailed Warmup (instructions) | Time Granularity |
---|---|---|---|---|---|---|
SimPoint | 10-100M | 50-100 | BBFV + k-means | Optional | ≈0.1-1M | Interval length |
SMARTS | 10-100k | 10k | Reservoir sampling | Required | 1k | Entire workload |
TidalSim | 10k | 10-100 | BBFV + k-means | Required | 1k | Interval length |
TidalSim leverages RTL simulation for performance estimation!
Multi-level simulation with RTL-level injection hasn't been done before. So we should try!
Basic blocks are extracted from the dynamic commit log emitted by spike
core 0: >>>> memchr
core 0: 0x00000000800012f6 (0x0ff5f593) andi a1, a1, 255
core 0: 0x00000000800012fa (0x0000962a) c.add a2, a0
core 0: 0x00000000800012fc (0x00c51463) bne a0, a2, pc + 8
core 0: 0x0000000080001304 (0x00054783) lbu a5, 0(a0)
core 0: 0x0000000080001308 (0xfeb78de3) beq a5, a1, pc - 6
0: 0x8000_12f6 ⮕ 0x8000_12fc
1: 0x8000_1304 ⮕ 0x8000_1308
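A minimal sketch of this extraction step, assuming the commit log lines look like the excerpt above. The `CONTROL_FLOW` set and the `extract_basic_blocks` helper are hypothetical names for illustration and are not taken from the TidalSim code. The sketch ends a block at every control-flow instruction and does not split a block when a later branch targets its middle.

```python
import re

# Assumed, non-exhaustive set of control-flow mnemonics as printed by spike.
CONTROL_FLOW = {
    "beq", "bne", "blt", "bge", "bltu", "bgeu", "beqz", "bnez",
    "jal", "jalr", "j", "jr", "ret",
    "c.beqz", "c.bnez", "c.j", "c.jal", "c.jr", "c.jalr",
}

# Matches "core   0: 0x<pc> (0x<insn>) <mnemonic> ..."
LINE_RE = re.compile(
    r"^core\s+\d+:\s+0x([0-9a-fA-F]+)\s+\(0x[0-9a-fA-F]+\)\s+(\S+)"
)

def extract_basic_blocks(log_lines):
    """Return sorted, de-duplicated (start_pc, end_pc) pairs seen in the log."""
    blocks = set()
    start = None
    for line in log_lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip symbol markers such as ">>>> memchr"
        pc = int(m.group(1), 16)
        mnemonic = m.group(2)
        if start is None:
            start = pc  # first instruction after a control-flow instruction
        if mnemonic in CONTROL_FLOW:
            blocks.add((start, pc))  # a control-flow instruction ends the block
            start = None
    return sorted(blocks)

log = [
    "core   0: >>>>  memchr",
    "core   0: 0x00000000800012f6 (0x0ff5f593) andi    a1, a1, 255",
    "core   0: 0x00000000800012fa (0x0000962a) c.add   a2, a0",
    "core   0: 0x00000000800012fc (0x00c51463) bne     a0, a2, pc + 8",
    "core   0: 0x0000000080001304 (0x00054783) lbu     a5, 0(a0)",
    "core   0: 0x0000000080001308 (0xfeb78de3) beq     a5, a1, pc - 6",
]
blocks = extract_basic_blocks(log)
for i, (start, end) in enumerate(blocks):
    print(f"{i}: {start:#x} -> {end:#x}")
```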
Embed each interval as a vector counting how often it executed each identified basic block
Interval index | Interval length | Embedding |
---|---|---|
n | 100 | [40, 50, 0, 10] |
n+1 | 100 | [0, 50, 10, 40] |
n+2 | 100 | [0, 20, 20, 80] |
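The embedding step can be sketched as follows. This assumes the trace is available as a list of committed PCs and that the basic blocks are sorted, non-overlapping `(start_pc, end_pc)` pairs. The `embed_intervals` name is hypothetical, and counting one unit per executed instruction (so each vector sums to at most the interval length) is an assumption about the weighting.

```python
from bisect import bisect_right

def embed_intervals(pcs, blocks, interval_length):
    """Split the PC trace into fixed-length intervals and count, per interval,
    how many executed instructions fall inside each basic block."""
    starts = [start for start, _ in blocks]
    embeddings = []
    for i in range(0, len(pcs), interval_length):
        vec = [0] * len(blocks)
        for pc in pcs[i:i + interval_length]:
            idx = bisect_right(starts, pc) - 1  # last block starting at or before pc
            if idx >= 0 and pc <= blocks[idx][1]:
                vec[idx] += 1
        embeddings.append(vec)
    return embeddings

# Toy trace with two basic blocks and an interval length of 4 instructions.
blocks = [(0x100, 0x108), (0x200, 0x204)]
pcs = [0x100, 0x104, 0x108, 0x200, 0x204, 0x200, 0x204, 0x100]
embeddings = embed_intervals(pcs, blocks, interval_length=4)
print(embeddings)
```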
Intervals are clustered using k-means clustering on their embeddings
For each cluster, take the sample that is closest to its centroid
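A minimal sketch of the clustering and sample selection, written as plain Lloyd's k-means in the standard library so it stays self-contained. The helper names (`kmeans`, `pick_representatives`), the seeded random initialization, and the toy points are assumptions for illustration; a real flow would more likely call an existing k-means library.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def pick_representatives(points, centroids, labels):
    """Map each non-empty cluster to the index of its point nearest the centroid."""
    reps = {}
    for c in range(len(centroids)):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            reps[c] = min(members, key=lambda i: sq_dist(points[i], centroids[c]))
    return reps

# Toy embeddings forming two well-separated groups.
points = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
centroids, labels = kmeans(points, k=2)
reps = pick_representatives(points, centroids, labels)
print(labels, reps)
```

Only the representative interval of each cluster then needs to be simulated in RTL.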
Capture arch checkpoints at the start of each chosen sample
pc = 0x0000000080000332
priv = M
fcsr = 0x0000000000000000
mtvec = 0x8000000a00006000
...
x1 = 0x000000008000024a
x2 = 0x0000000080028fa0
...
An arch checkpoint = arch state + raw memory contents
fflags in fcsr are FP exception bits from FPU μArch state
FPRs in Rocket are stored in recoded 65-bit format (not IEEE floats)
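As an illustration of consuming the architectural state portion of a checkpoint, the sketch below parses a `name = value` dump like the one shown above into a dictionary. The `parse_arch_state` helper is hypothetical, and the assumption that the state is stored as text in exactly this form is mine; the raw memory contents are not handled here.

```python
def parse_arch_state(text):
    """Parse lines of the form 'name = value' into a dict.
    Hex values become ints; anything else (e.g. 'M') stays a string."""
    state = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip elided or blank lines such as "..."
        name, value = (part.strip() for part in line.split("=", 1))
        state[name] = int(value, 16) if value.startswith("0x") else value
    return state

dump = """\
pc = 0x0000000080000332
priv = M
fcsr = 0x0000000000000000
...
x1 = 0x000000008000024a
x2 = 0x0000000080028fa0
"""
state = parse_arch_state(dump)
print(hex(state["pc"]), state["priv"])
```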
cycles,instret
1219,100
125,100
126,100
123,100
114,100
250,100
113,100
Performance metrics for one sample in a cluster are representative of all samples in that cluster
Extrapolate over the entire execution trace to get a full IPC trace
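The extrapolation can be sketched as below, assuming each simulated sample yields a `cycles,instret` CSV like the one above and each interval carries the cluster label assigned during clustering. The function names and the `warmup_rows` parameter (rows discarded before measuring) are hypothetical, and computing IPC as total `instret` over total `cycles` is an assumption.

```python
import csv
import io

def sample_ipc(perf_csv, warmup_rows=0):
    """IPC of one simulated sample: total instret / total cycles,
    optionally skipping the first warmup_rows rows."""
    rows = list(csv.DictReader(io.StringIO(perf_csv)))[warmup_rows:]
    cycles = sum(int(r["cycles"]) for r in rows)
    instret = sum(int(r["instret"]) for r in rows)
    return instret / cycles

def extrapolate_ipc(labels, cluster_ipc):
    """Assign every interval the IPC measured for its cluster's representative."""
    return [cluster_ipc[label] for label in labels]

def overall_ipc(ipc_trace):
    """Whole-program IPC for equal-length intervals: the harmonic mean,
    because cycles per interval are proportional to 1 / IPC."""
    return len(ipc_trace) / sum(1.0 / ipc for ipc in ipc_trace)

perf = "cycles,instret\n1219,100\n125,100\n"
ipc0 = sample_ipc(perf, warmup_rows=1)  # skips the first row
trace = extrapolate_ipc([0, 1, 1, 0], {0: ipc0, 1: 0.5})
print(trace, overall_ipc(trace))
```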
Running TidalSim on the Embench Wikisort benchmark (~2M dynamic instructions) and reconstructing an IPC trace.
N=1000, C=12
N=10000, C=18
Typical IPC error (without functional warmup) is < 5%
[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Barr, Kenneth C., et al. "Accelerating multiprocessor simulation with a memory timestamp record." ISPASS 2005.
TidalSim (github.com/euphoric-hardware/tidalsim): forks of spike, chipyard, and testchipip, plus a top-level runner