TidalSim Overview and Applications

Vighnesh Iyer, Raghav Gupta, Dhruv Vaish, Charles Hong, Sophia Shao, Bora Nikolic

ATHLETE Internal Meeting

Monday, March 4th, 2024

Talk Outline

Overview of TidalSim
Collateral produced by TidalSim and Applications

Overview of TidalSim

TidalSim Overview

TidalSim: a fast, accurate, low-latency, low-cost microarchitectural simulation methodology that produces RTL-level collateral for performance estimation and verification on real workloads.

TidalSim Components

Overview of the components of TidalSim.

TidalSim is not a new simulator. It is a simulation methodology that combines the strengths of architectural simulators, uArch models, and RTL simulators.

Sampled Simulation

Instead of running the entire program in uArch simulation, run the entire program in functional simulation and only run samples in uArch simulation

The full workload is represented by a selection of sampling units.
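A minimal sketch of how sampled results could be extrapolated to the full workload. It assumes every interval has already been assigned to a cluster and that one representative per cluster was run in detailed simulation; the names `estimate_ipc`, `ipc_trace`, `cluster_of_interval`, and `sampled_ipc` are hypothetical and not part of TidalSim.

```python
def ipc_trace(cluster_of_interval, sampled_ipc):
    """Per-interval IPC trace: each interval inherits the IPC measured
    for its cluster's representative sample."""
    return [sampled_ipc[c] for c in cluster_of_interval]


def estimate_ipc(cluster_of_interval, sampled_ipc, insts_per_interval):
    """Whole-program IPC estimate from per-cluster samples.

    cluster_of_interval: cluster label of every interval, in program order
    sampled_ipc: cluster label -> IPC from detailed simulation (assumed input)
    insts_per_interval: fixed interval length in instructions
    """
    total_insts = insts_per_interval * len(cluster_of_interval)
    # Cycles for one interval = instructions / IPC; sum over all intervals.
    total_cycles = sum(insts_per_interval / ipc
                       for ipc in ipc_trace(cluster_of_interval, sampled_ipc))
    return total_insts / total_cycles


labels = [0, 0, 1, 1]           # four intervals, two clusters
samples = {0: 1.0, 1: 0.5}      # IPC of each cluster's sampled interval
print(ipc_trace(labels, samples))
print(estimate_ipc(labels, samples, insts_per_interval=10_000))
```

Note that the aggregate is computed from total cycles rather than by averaging the per-interval IPC values, since averaging IPC directly would overweight fast intervals.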

Why RTL-Level Sampled Simulation?

Eliminate modeling errors
- Remaining errors can be handled via statistical techniques
No need to correlate performance model and RTL
- Let the RTL serve as the source of truth
Can produce RTL-level collateral
- Leverage for applications in verification and power modeling

Overview of the TidalSim v0.1 Flow

IPC Trace Prediction: wikisort

Merge sort benchmark from Embench (wikisort)
N=10000 (instructions per interval), C=18 (clusters)
Full RTL sim takes 15 minutes; TidalSim runs in 10 seconds
Can capture general trends and time-domain IPC variation

CoreMark Smoke Test

NO functional warmup
10k instruction intervals, 30 clusters, 2k detailed warmup
Larger working set means functional warmup is crucial

Overall Functional Warmup Flow

Capture uArch-agnostic cache checkpoints as memory timestamp record (MTR) checkpoints
Convert MTR checkpoints into concrete cache state using specific cache parameters and DRAM contents
RTL simulation harness injects cache state into L1d tag+data arrays via 2D reg forcing

Memory Timestamp Record

Construct MTR table from a memory trace, save MTR tables at checkpoint times
Given a cache with n sets, group block addresses by set index
Given a cache with k ways, pick the k most recently accessed addresses from each set
Knowing every resident cache line, fetch the data from the DRAM dump
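The reconstruction steps above can be sketched as follows. This is a simplified illustration, not the TidalSim implementation: it assumes LRU replacement, a 64-byte line (`BLOCK_BYTES`), set index taken from the low bits of the block address, trace entries in increasing timestamp order, and a DRAM dump indexed from address 0. It tracks only the last access time per block and ignores dirty state. All function names are hypothetical.

```python
from collections import defaultdict

BLOCK_BYTES = 64  # assumed cache line size


def build_mtr(trace):
    """trace: iterable of (timestamp, byte_address) in time order.
    Returns {block_address: timestamp of last access}."""
    mtr = {}
    for timestamp, addr in trace:
        mtr[addr // BLOCK_BYTES] = timestamp
    return mtr


def reconstruct_cache(mtr, n_sets, k_ways):
    """Group blocks by set index, then keep the k most recently
    accessed blocks in each set (most recent first)."""
    by_set = defaultdict(list)
    for block, timestamp in mtr.items():
        by_set[block % n_sets].append((timestamp, block))
    cache = {}
    for set_idx, entries in by_set.items():
        entries.sort(reverse=True)  # newest timestamp first
        cache[set_idx] = [block for _, block in entries[:k_ways]]
    return cache


def fetch_lines(cache, dram):
    """Look up the data of every resident line in the DRAM dump."""
    return {block: bytes(dram[block * BLOCK_BYTES:(block + 1) * BLOCK_BYTES])
            for blocks in cache.values() for block in blocks}


trace = [(0, 0), (1, 64), (2, 128), (3, 0), (4, 192)]
mtr = build_mtr(trace)
print(reconstruct_cache(mtr, n_sets=2, k_ways=1))
print(reconstruct_cache(mtr, n_sets=2, k_ways=2))
```

Because the MTR itself carries no cache geometry, the same table can be replayed against different `n_sets` and `k_ways` values, which is what makes the checkpoint uArch-agnostic.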

wikisort with Functional Warmup

Without functional warmup, there are significant IPC underpredictions

With L1d functional warmup, errors are substantially reduced

L1d functional warmup brings IPC error from 7% to 2%

TidalSim Collateral and Applications

Embedding Matrix

Each interval of a program is represented with an embedding
Basic block vectors indicate which code paths were traversed in each interval
uArch-aware embeddings
- Instruction mix: loads, stores, control, arith, integer, fp
- ILP: in varying window sizes (32, 64, ...)
- Register traffic: avg input operands, number of times a register is consumed, register dependency chains
- Working set: number of unique 32B/4K blocks touched in an interval
- Data stream strides: measure of spatial locality in temporally adjacent memory accesses
- Branch predictability: use an upper-limit branch prediction algorithm (Prediction by Partial Matching)
Generally useful for analyzing time-varying program behavior
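As an illustration of the basic block vector embedding, the sketch below builds one row per interval from a basic block execution trace. It assumes the trace is a list of `(basic_block_id, instruction_count)` pairs, closes an interval at the first basic block boundary at or past the interval length, and normalizes each row by the instructions executed in that interval. The function name and trace format are hypothetical.

```python
def bbv_matrix(block_trace, interval_insts):
    """block_trace: list of (basic_block_id, n_insts) in execution order.
    Returns (block_ids, rows): rows[i][j] is the fraction of interval i's
    instructions that were executed in basic block block_ids[j]."""
    block_ids = sorted({block for block, _ in block_trace})
    column = {block: i for i, block in enumerate(block_ids)}

    rows = []
    counts, executed = [0] * len(block_ids), 0
    for block, n_insts in block_trace:
        counts[column[block]] += n_insts
        executed += n_insts
        if executed >= interval_insts:  # interval ends on a block boundary
            rows.append([c / executed for c in counts])
            counts, executed = [0] * len(block_ids), 0
    if executed:  # trailing partial interval
        rows.append([c / executed for c in counts])
    return block_ids, rows


trace = [(0x100, 4), (0x200, 6), (0x100, 4), (0x100, 4), (0x300, 2)]
ids, matrix = bbv_matrix(trace, interval_insts=10)
for row in matrix:
    print(row)
```

Rows with similar distributions correspond to intervals that traversed similar code paths, so the resulting matrix can be fed to a clustering step. The uArch-aware features listed above would be appended as extra columns.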

Sampled Simulation Inputs/Outputs

Architectural and (partial) uArch states for every sampling interval that's simulated
Outputs from RTL simulation of each sampling interval


    checkpoints
      0x80000000.680000
        loadarch (all arch register state)
        mem.bin (all DRAM state)
        mtr (memory timestamp record - cache uArch agnostic)
        dcache_{data,tag}_array (reconstructed concrete L1 cache state)
        perf.csv (IPC, MPKI, cache miss performance metrics)
        dump.fsdb (full waveform dump)
      0x80000000.120000
        loadarch
        mem.bin
        mtr
        dcache_{data,tag}_array
        perf.csv
        dump.fsdb

Each of these files is quite small
Can be used in several ways

Applications

Power modeling
- Per-interval FSDB dumps can be used for proxy signal selection and macromodel training
- Embedding matrix with Joules power simulation can be used to reconstruct full workload power trace
- Explore power/perf pareto curves of different core microarchitectures
Coverpoint synthesis
- Per-interval FSDB dumps can be used for specification mining / coverpoint synthesis
- Synthesized coverpoints can be used for evaluation of instruction generators or as fuzzing targets for bughunting
Event graph based uArch analysis
- Once we can emit event graphs from RTL sim, we can analyze them to find anomalies and pinpoint differences between uArches
Hardware parameter DSE
- Investigate the impact of changing a HW parameter on a large workload
- Requires uArch-specific functional warmup + specialized injection harness for each design point

More Applications

Benchmark extraction
Application-level profiling
Combining the above two intelligently
- e.g. a JS runtime is built around an event loop that, viewed naively, might seem to benefit from higher IPC. But improving that IPC has no impact on end-to-end application behavior!