Rapid RTL performance validation that can be used in the RTL design cycle is valuable.
Academics rarely write RTL partly due to the difficulty of evaluation, instead opting for uArch simulators.
Academia needs rapid RTL evaluation as part of an RTL-first research methodology.
We will propose a simulation methodology that can deliver on all axes (accuracy, throughput, startup latency, cost).
TidalSim is not a new simulator. It is a simulation methodology that combines the strengths of architectural simulators, uArch models, and RTL simulators.
TidalSim enables new design methodologies for industry, academia, and lean chip design teams.
Simulation techniques cover a wide range on each axis (throughput, startup latency, accuracy, cost). Each simulation technique assumes a particular hardware abstraction.
Technique | Examples | Throughput | Latency | Accuracy | Cost |
---|---|---|---|---|---|
JIT-based Simulators / VMs | qemu, KVM, VMware Fusion | 1-3 GIPS | <1 second | None | Minimal |
Architectural Simulators | spike, dromajo | 10-100+ MIPS | <1 second | None | Minimal |
General-purpose μArch Simulators | gem5, Sniper, ZSim, SST | 100 KIPS (gem5) - 100 MIPS (Sniper) | <1 minute | 10-50% IPC error | Minimal |
Bespoke μArch Simulators | Industry performance models | ≈ 0.1-1 MIPS | <1 minute | Close | $1M+ |
RTL Simulators | Verilator, VCS, Xcelium | 1-10 KIPS | 2-10 minutes | Cycle-exact | Minimal |
FPGA-Based Emulators | FireSim | ≈ 10 MIPS | 2-6 hours | Cycle-exact | $10k+ |
ASIC-Based Emulators | Palladium, Veloce | ≈ 0.5-10 MIPS | <1 hour | Cycle-exact | $10M+ |
Multi-level Sampled Simulation | TidalSim | 10+ MIPS | <1 minute | <1% IPC error | Minimal |
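To make the throughput and latency columns concrete, here is a minimal sketch of the wall-clock arithmetic. The `SimTime` object and its parameter names are hypothetical, and the figures in the usage note are round numbers taken from the ranges in the table, not measurements.

```scala
object SimTime {
  // Wall-clock seconds to simulate `instructions` at `ips` instructions per
  // second, after paying a one-time startup latency (build, compile, etc.).
  def wallClockSeconds(instructions: Double, ips: Double, startupSeconds: Double): Double = {
    require(ips > 0, "throughput must be positive")
    startupSeconds + instructions / ips
  }
}
```

For a 1-billion-instruction workload, `SimTime.wallClockSeconds(1e9, 1e4, 600)` (10 KIPS, 10-minute startup) comes to 100,600 seconds, or roughly 28 hours. `SimTime.wallClockSeconds(1e9, 1e7, 60)` (10 MIPS, 1-minute startup) comes to 160 seconds.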
TidalSim combines the strengths of each technique to produce a meta-simulator that achieves high throughput, low latency, high accuracy, and low cost.
Trends aren't enough[2]. Note the sensitivity differences - gradients are critical!
uArch simulators are not accurate enough for microarchitectural evaluation.
[1]: Akram, A. and Sawalha, L., 2019. A survey of computer architecture simulation techniques and tools. IEEE Access.
[2]: Nowatzki, T., Menon, J., Ho, C.H. and Sankaralingam, K., 2015. Architectural simulators considered harmful. Micro.
Instead of running the entire program in uArch simulation, run the entire program in functional simulation and run only the samples in uArch simulation.
The full workload is represented by a selection of sampling units.
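One way to extrapolate from sampling units back to the full workload is sketched below. It assumes the program is split into equal-length intervals and that each sampling unit stands for a known number of them. The names `SamplingUnit`, `intervalsRepresented`, and `estimateIpc` are hypothetical, and this is not necessarily the weighting TidalSim itself uses.

```scala
object SampledIpc {
  // One sampling unit: the IPC measured in detailed simulation, and how many
  // equal-length intervals of the full workload it stands for (assumed).
  final case class SamplingUnit(ipc: Double, intervalsRepresented: Int)

  // Extrapolate whole-program IPC. With equal-length intervals, total cycles
  // are proportional to sum(weight / ipc), so the estimate is a weighted
  // harmonic mean of the per-unit IPCs.
  def estimateIpc(units: Seq[SamplingUnit]): Double = {
    require(units.nonEmpty && units.forall(u => u.ipc > 0 && u.intervalsRepresented > 0))
    val totalIntervals = units.map(_.intervalsRepresented.toDouble).sum
    val totalCycleShare = units.map(u => u.intervalsRepresented / u.ipc).sum
    totalIntervals / totalCycleShare
  }

  // Relative IPC error of an estimate against a full-run reference.
  def relativeError(estimated: Double, reference: Double): Double =
    math.abs(estimated - reference) / reference
}
```

The harmonic mean is used rather than an arithmetic mean because IPC is a rate: cycles add across intervals, IPC values do not.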
The state from a sampling unit checkpoint is only architectural state. The microarchitectural state of the uArch simulator starts at the reset state!
Long-lived microarchitectural state (caches, branch predictors, prefetchers, TLBs) has a substantial impact on the performance of a sampling unit.
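One way to avoid starting from the reset state is functional warmup: keep a cheap model of the long-lived structure up to date during functional simulation and carry its contents in the checkpoint. The sketch below models only the tags of a set-associative cache with LRU replacement. The class name, parameters, and replacement policy are assumptions for illustration, not the TidalSim implementation.

```scala
object FunctionalWarmup {
  // Functional model of a set-associative cache's tags with LRU replacement.
  // It tracks only which lines are resident, not data or timing.
  final class CacheTagModel(nSets: Int, nWays: Int, blockBytes: Int) {
    require(nSets > 0 && nWays > 0 && blockBytes > 0)
    // Per set: resident tags ordered from most to least recently used.
    private val sets = Array.fill(nSets)(Vector.empty[Long])

    // Record one memory access (non-negative address); true on a model hit.
    def access(addr: Long): Boolean = {
      val block = addr / blockBytes
      val set = (block % nSets).toInt
      val tag = block / nSets
      val hit = sets(set).contains(tag)
      // Move the tag to the MRU position and evict beyond nWays entries.
      sets(set) = (tag +: sets(set).filterNot(_ == tag)).take(nWays)
      hit
    }

    // Snapshot of resident tags per set: the state a checkpoint would carry.
    def residentTags: Vector[Vector[Long]] = sets.toVector
  }
}
```

Feeding the model every load and store address seen by the functional simulator yields, at each checkpoint, a tag snapshot that could be used to initialize the detailed simulator's cache.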
[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Eeckhout, L., 2008. Sampled processor simulation: A survey. Advances in Computers. Elsevier.
This RTL-first evaluation flow is enabled by highly parameterized RTL generators and SoC design frameworks (e.g. Chipyard).
N=10000, C=18
Typical IPC error (without functional warmup and with fine time-domain precision of 10k instructions) is < 5%.
Demonstrate we can hit <1% IPC error
class RegFile(n: Int, w: Int, zero: Boolean = false) {
  val rf = Mem(n, UInt(w.W))
  // Annotate each register i with the architectural state it holds.
  (0 until n).foreach { i => archStateAnnotation(rf(i), Riscv.I.GPR(i)) }
  // ...
}
class L1MetadataArray[T <: L1Metadata] extends L1HellaCacheModule()(p) {
  // ...
  val tag_array = SyncReadMem(nSets, Vec(nWays, UInt(metabits.W)))
  // Annotate every (set, way) pair; zipping the two ranges would only pair equal indices.
  for (set <- 0 until nSets; way <- 0 until nWays) {
    uArchStateAnnotation(tag_array.read(set)(way), Uarch.L1.tag(set, way, cacheType=I))
  }
}
TidalSim provides a way to extract many small, unique RTL waveforms from large workloads with low latency.
[1]: Iyer, Vighnesh, et al., 2019. RTL bug localization through LTL specification mining. MEMOCODE.
Introduce a bug in the riscv-mini cache
- hit := v(idx_reg) && rmeta.tag === tag_reg
+ hit := v(idx_reg) && rmeta.tag =/= tag_reg
Template | $\textbf{a}$ | $\textbf{b}$ | Violated at Time |
---|---|---|---|
Until | Tile.arb_io_dcache_r_ready | Tile.dcache.hit | 418 |
Until | Tile.dcache_io_nasti_r_valid | Tile.dcache.hit | 418 |
Until | Tile.dcache.is_alloc | Tile.dcache.hit | 418 |
Until | Tile.arb.io_dcache_ar_ready | Tile.arb_io_nasti_r_ready | 640 |
The violated properties point to an anomaly with the hit signal and localize the bug.
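To illustrate what "violated at time" could mean for the Until template, here is a minimal sketch that checks `a Until b` over a finite trace of two sampled Boolean signals. The weak-until semantics and the `UntilCheck` name are assumptions for illustration; the mining tool in [1] may define the template differently.

```scala
object UntilCheck {
  // First time step at which "a Until b" is violated on a finite trace, if any.
  // Assumed (weak-until) semantics: `a` must hold at every step strictly before
  // the first step where `b` holds; a trace where `b` never holds is not flagged.
  def firstViolation(a: IndexedSeq[Boolean], b: IndexedSeq[Boolean]): Option[Int] = {
    require(a.length == b.length, "signals must be sampled on the same clock")
    val released = b.indexWhere(x => x)   // -1 if b never holds
    val end = if (released < 0) a.length else released
    val t = a.indexWhere(x => !x)         // first step where a is false
    if (t >= 0 && t < end) Some(t) else None
  }
}
```

A property that holds on every waveform from the known-good design and returns `Some(t)` on a waveform from the modified design is a candidate for localization: the signals it names and the time `t` indicate where to look.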
TidalSim (github.com/euphoric-hardware/tidalsim): forks of spike, chipyard, and testchipip, plus a top-level runner.