ATHLETE Internal (March 2024)

TidalSim Overview and Applications

Vighnesh Iyer, Raghav Gupta, Dhruv Vaish, Charles Hong, Sophia Shao, Bora Nikolic

ATHLETE Internal Meeting

Monday, March 4th, 2024

Talk Outline

  1. Overview of TidalSim
  2. Collateral produced by TidalSim and Applications

Overview of TidalSim

TidalSim Overview

TidalSim: a fast, accurate, low-latency, low-cost microarchitectural simulation methodology that produces RTL-level collateral for performance estimation and verification on real workloads.

TidalSim Components

Overview of the components of TidalSim.

TidalSim is not a new simulator. It is a simulation methodology that combines the strengths of architectural simulators, uArch models, and RTL simulators.

Sampled Simulation

Instead of running the entire program in uArch simulation, run the entire program in functional simulation and run only selected samples in uArch simulation.

The full workload is represented by a selection of sampling units.
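
As a minimal sketch of how a few sampling units can stand in for the whole workload: assume every interval has already been assigned to a cluster, and one representative interval per cluster has been simulated in RTL to measure its IPC. The names `labels`, `sampled_ipc`, and both function names are hypothetical, and the numbers are made up for illustration. Because intervals have equal instruction counts, the whole-program IPC is total instructions over total cycles (a harmonic mean), not the arithmetic mean of per-interval IPCs.

```python
from typing import Dict, List


def reconstruct_ipc_trace(labels: List[int], sampled_ipc: Dict[int, float]) -> List[float]:
    """Give every interval the IPC measured for its cluster's sampled representative."""
    return [sampled_ipc[cluster] for cluster in labels]


def overall_ipc(ipc_trace: List[float], interval_len: int) -> float:
    """Whole-program IPC = total instructions / total cycles (equal-length intervals)."""
    total_insts = interval_len * len(ipc_trace)
    total_cycles = sum(interval_len / ipc for ipc in ipc_trace)
    return total_insts / total_cycles


# Hypothetical data: 6 intervals, 3 clusters, one RTL-simulated sample per cluster.
labels = [0, 0, 1, 1, 0, 2]
sampled_ipc = {0: 1.0, 1: 0.5, 2: 2.0}

ipc_trace = reconstruct_ipc_trace(labels, sampled_ipc)
estimated_ipc = overall_ipc(ipc_trace, interval_len=10_000)
print(ipc_trace, estimated_ipc)
```

The reconstructed trace gives the time-domain view, while `overall_ipc` gives a single aggregate estimate for the workload.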

Why RTL-Level Sampled Simulation?

  • Eliminate modeling errors
    • Remaining errors can be handled via statistical techniques
  • No need to correlate performance model and RTL
    • Let the RTL serve as the source of truth
  • Can produce RTL-level collateral
    • Leverage for applications in verification and power modeling

Overview of the TidalSim v0.1 Flow

IPC Trace Prediction: wikisort

  • Merge sort benchmark from Embench (wikisort)
  • N=10000, C=18
  • Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
  • Can capture general trends and time-domain IPC variation

CoreMark Smoke Test

  • NO functional warmup
  • 10k instruction intervals, 30 clusters, 2k detailed warmup
  • Larger working set means functional warmup is crucial

Overall Functional Warmup Flow

  • uarch-agnostic cache checkpoints as memory timestamp record (MTR) checkpoints
  • Convert MTR checkpoints into concrete cache state, given specific cache parameters and DRAM contents
  • RTL simulation harness injects cache state into L1d tag+data arrays via 2d reg forcing

Memory Timestamp Record

  • Construct MTR table from a memory trace, save MTR tables at checkpoint times
  • Given a cache with n sets, group block addresses by set index
  • Given a cache with k ways, pick the k most recently accessed addresses from each set
  • Knowing every resident cache line, fetch the data from the DRAM dump
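
The steps above can be sketched roughly as follows. This is a simplified illustration, not the TidalSim implementation: it assumes the MTR stores one last-access timestamp per block, a 64-byte block size, a set index taken from the low bits of the block address, and a DRAM dump held as a flat `bytes` object starting at a hypothetical `dram_base`. All function names are made up for the example. Picking the k most recent blocks per set corresponds to the contents of an LRU-managed set.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BLOCK_BYTES = 64  # assumed cache block size


def build_mtr(trace: Iterable[Tuple[int, int]], block_bytes: int = BLOCK_BYTES) -> Dict[int, int]:
    """Map each block address to the timestamp of its most recent access.

    `trace` yields (timestamp, byte_address) pairs.
    """
    mtr: Dict[int, int] = {}
    for ts, addr in trace:
        block = addr // block_bytes
        mtr[block] = max(mtr.get(block, ts), ts)
    return mtr


def reconstruct_cache(mtr: Dict[int, int], n_sets: int, n_ways: int) -> Dict[int, List[int]]:
    """Group blocks by set index, then keep the n_ways most recently accessed per set."""
    by_set = defaultdict(list)
    for block, ts in mtr.items():
        by_set[block % n_sets].append((ts, block))
    cache = {}
    for set_idx, entries in by_set.items():
        entries.sort(reverse=True)  # newest timestamp first
        cache[set_idx] = [block for _, block in entries[:n_ways]]
    return cache


def fetch_lines(cache: Dict[int, List[int]], dram: bytes, dram_base: int = 0,
                block_bytes: int = BLOCK_BYTES) -> Dict[int, bytes]:
    """Read the data of every resident block out of the DRAM dump."""
    lines = {}
    for blocks in cache.values():
        for block in blocks:
            off = block * block_bytes - dram_base
            lines[block] = dram[off:off + block_bytes]
    return lines


# Hypothetical trace: blocks 0, 2, 4 map to set 0; blocks 1, 3 map to set 1.
trace = [(0, 0x000), (1, 0x040), (2, 0x080), (3, 0x100), (4, 0x000), (5, 0x0C0)]
mtr = build_mtr(trace)
cache = reconstruct_cache(mtr, n_sets=2, n_ways=2)
dram = bytes(range(256)) * 2  # stand-in for a DRAM dump
lines = fetch_lines(cache, dram)
print(cache)
```

Because the MTR itself does not depend on `n_sets` or `n_ways`, the same checkpoint can be replayed into different cache geometries by calling `reconstruct_cache` with different parameters.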

wikisort with Functional Warmup

Without functional warmup, IPC is significantly underpredicted
With L1d functional warmup, errors are substantially reduced

L1d functional warmup brings IPC error from 7% to 2%

TidalSim Collateral and Applications

Embedding Matrix

  • Each interval of a program is represented with an embedding
  • Basic block vectors indicate which code paths were traversed in each interval
  • uArch-aware embeddings
    • Instruction mix: loads, stores, control, arith, integer, fp
    • ILP: in varying window sizes (32, 64, ...)
    • Register traffic: avg input operands, number of times a register is consumed, register dependency chains
    • Working set: number of unique 32B/4K blocks touched in an interval
    • Data stream strides: measure of spatial locality in temporally adjacent memory accesses
    • Branch predictability: use an upper-limit branch prediction algorithm (Prediction by Partial Matching)
  • Generally useful for analyzing time-varying program behavior
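
To make the basic block vector part concrete, here is a minimal sketch that builds one embedding row per interval. It assumes a trace of (basic block start PC, instruction count) pairs in execution order, weights each block by the instructions it contributes, and normalizes each row to sum to 1. The weighting, the normalization, and the function name are illustrative choices, not a statement of how TidalSim computes its embeddings.

```python
from typing import List, Sequence, Tuple


def basic_block_vectors(bb_trace: Sequence[Tuple[int, int]],
                        interval_len: int) -> Tuple[List[int], List[List[float]]]:
    """Return (column PCs, embedding rows), one normalized row per interval.

    A block that straddles an interval boundary is counted in the interval
    where it starts, so intervals are only approximately interval_len long.
    """
    blocks = sorted({pc for pc, _ in bb_trace})
    col = {pc: i for i, pc in enumerate(blocks)}

    rows: List[List[float]] = []
    counts = [0.0] * len(blocks)
    insts = 0
    for pc, n_insts in bb_trace:
        counts[col[pc]] += n_insts
        insts += n_insts
        if insts >= interval_len:
            rows.append([c / insts for c in counts])
            counts = [0.0] * len(blocks)
            insts = 0
    if insts > 0:  # trailing partial interval
        rows.append([c / insts for c in counts])
    return blocks, rows


# Hypothetical trace: a two-block loop followed by a different single-block loop.
bb_trace = [(0x100, 4), (0x110, 6), (0x100, 4), (0x110, 6), (0x200, 10), (0x200, 10)]
blocks, embedding = basic_block_vectors(bb_trace, interval_len=20)
print(blocks, embedding)
```

Rows that look alike indicate intervals that exercised the same code paths, which is what makes them candidates for sharing a single sampled simulation.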

Sampled Simulation Inputs/Outputs

  • Architectural and (partial) uArch states for every sampling interval that's simulated
  • Outputs from RTL simulation of each sampling interval

    checkpoints
      0x80000000.680000
        loadarch (all arch register state)
        mem.bin (all DRAM state)
        mtr (memory timestamp record - cache uArch agnostic)
        dcache_{data,tag}_array (reconstructed concrete L1 cache state)
        perf.csv (IPC, MPKI, cache miss performance metrics)
        dump.fsdb (full waveform dump)
      0x80000000.120000
        loadarch
        mem.bin
        mtr
        dcache_{data,tag}_array
        perf.csv
        dump.fsdb
    
  • Each of these files is quite small
  • Can be used in several ways

Applications

  • Power modeling
    • Per-interval fsdb's can be used for proxy signal selection and macromodel training
    • Embedding matrix with Joules power simulation can be used to reconstruct full workload power trace
    • Explore power/perf pareto curves of different core microarchitectures
  • Coverpoint synthesis
    • Per-interval fsdb's can be used for specification mining / coverpoint synthesis
    • Synthesized coverpoints can be used for evaluation of instruction generators or as fuzzing targets for bughunting
  • Event graph based uArch analysis
    • Once we can emit event graphs from RTL sim, we can also analyze them to find anomalies and pinpoint differences between uArches
  • Hardware parameter DSE
    • Investigate the impact of changing a HW parameter on a large workload
    • Requires uArch-specific functional warmup + specialized injection harness for each design point

More Applications

  • Benchmark extraction
  • Application-level profiling
  • Combining the above two intelligently
    • e.g. a JS runtime is an event loop that, viewed naively, might seem to benefit from higher IPC. But improving that IPC has no impact on end-to-end application behavior!