Heterogeneous cores targeting different power/perf curves + workloads
Domain-specific accelerators
Need a pre-silicon evaluation strategy for rapid, PPA-optimal design of these units
Limited time per design cycle → limited time per evaluation
More evaluations = more opportunities for optimization
The Microarchitectural Iteration Loop (Industry)
An idealized iteration loop for microarchitectural design. The 'Evaluator' starts off as a performance simulator and transitions to RTL as the design is iterated.
During RTL implementation, we need performance validation against the model.
Existing techniques for RTL performance validation are too costly to use within the RTL design cycle.
Industry needs a methodology for rapid RTL performance validation that can be used in the RTL design cycle.
The Microarchitectural Iteration Loop (Academia)
The typical manner in which microarchitectural ideas are evaluated in academia. Academics rarely write RTL due to the difficulty of evaluation, instead opting for uArch simulators.
Academia needs a methodology for rapid RTL evaluation as a part of an RTL-first evaluation strategy
The Microarchitectural Iteration Loop (Startup)
Startups have limited human resources to allocate to performance modeling, instead choosing to focus on architecture and RTL design.
Cost-constrained, lean teams can't depend on expensive emulation/prototyping platforms.
Chip startups need a methodology for low-cost and rapid RTL evaluation to enable high productivity and confidence
Limitations of Existing Evaluators
ISA simulation: no timing accuracy
Trace/Cycle uArch simulation: low accuracy
RTL simulation: low throughput
FPGA prototyping: high startup latency
HW emulators: high cost
We will propose a simulation methodology that can deliver on all axes (accuracy, throughput, startup latency, cost) and is useful for industry, academia, and startups.
Backup Slides
The End of "Free" Technology Scaling
Moore's Law: transistor counts double while cost/transistor halves every 2 years
Per-wafer and per-transistor costs continue to grow with process scaling[1, 2] unless heavily amortized
We want optimal designs for heterogeneous, domain-specialized, workload-tuned SoCs
Limited time to iterate on microarchitecture and optimize PPA on real workloads
Time per evaluation (microarchitectural iteration loop) limits number of evaluations
More evaluations = more opportunities for optimization
2. What: Our Vision for TidalSim
What if we had a magic box that:
Is fast enough to run real workloads
Is accurate enough to use confidently for microarchitectural DSE
Has low latency to not bottleneck the hardware iteration loop
Can produce RTL-level collateral for performance or functional verification
Can run real workloads by identifying unique aspects of the program automatically
What: TidalSim
TidalSim: a fast, accurate, low-latency, low-cost microarchitectural simulation methodology that produces RTL-level collateral for performance estimation and verification on real workloads.
TidalSim Components
Overview of the components of TidalSim.
TidalSim is not a simulator. It is a simulation methodology that combines the strengths of architectural simulators, uArch models, and RTL simulators.
TidalSim Execution
TidalSim moves simulation execution back and forth between architectural, uArch, and RTL simulators based on dynamic workload analysis.
What Does TidalSim Enable?
| | Status Quo | With TidalSim |
|---|---|---|
| Industry | RTL performance validation is too costly. | Rapid RTL performance validation becomes viable. |
| Academia | Academics resort to inaccurate uArch simulators. | RTL-first evaluation strategy becomes viable. |
| Startup / Lean Team | No rapid performance evaluation strategy for RTL. | RTL-first design strategy becomes viable. |
TidalSim enables new design methodologies for industry, academia, and lean chip design teams.
Scope of Thesis
Implementation of TidalSim
Evaluation of TidalSim for performance estimation on realistic and large workloads
Register-transfer level (RTL) models with full fidelity state and timing
Examples: Verilog, Chisel
Each abstraction makes an accuracy / latency / throughput tradeoff.
Simulator Metrics
Simulation techniques span the gamut on various axes. Each simulation technique assumes a particular hardware abstraction.
Throughput
How many instructions can be simulated per real second? (MIPS = millions of instructions per second)
Accuracy
Do the output metrics of the simulator match those of the modeled SoC in its real environment?
Startup latency
How long does it take from the moment the simulator's parameters/inputs are modified to when the first instruction is executed?
Cost
What hardware platform does the simulator run on?
How much does it cost to run a simulation?
Existing Hardware Simulation Techniques
| Technique | Examples | Throughput | Latency | Accuracy | Cost |
|---|---|---|---|---|---|
| Architectural simulators | spike, qemu | 10-100+ MIPS | <1 second | None | Minimal |
| μArch simulators | gem5, Sniper, ZSim, SST | 100 KIPS (gem5) - 100 MIPS (Sniper) | <1 minute | 10-50% IPC error | Minimal |
| RTL simulators | Verilator, VCS, Xcelium | 1-10 KIPS | 2-10 minutes | Cycle-exact | Minimal |
| FPGA-based emulators | FireSim | ≈10 MIPS | 2-6 hours | Cycle-exact | $10k+ |
| ASIC-based emulators | Palladium, Veloce | ≈0.5-10 MIPS | <1 hour | Cycle-exact | $10M+ |
| Multi-level sampled simulation | TidalSim | 10+ MIPS | <1 minute | <1% IPC error | Minimal |
TidalSim combines the strengths of each technique to produce a meta-simulator that achieves high throughput, low latency, high accuracy, and low cost.
Accuracy of Microarchitectural Simulators
Raw IPC errors on 64-bit workloads vs real Haswell[1]. Microarchitectural simulators have substantial errors exceeding 20%.
Impact of using a bimodal branch predictor vs the Haswell BP[1].
Simulators disagree with each other! The sensitivity of each simulator is wildly different!
Trends aren't enough[2]. Note the sensitivity differences - gradients are critical!
uArch simulators are not accurate enough for microarchitectural evaluation.
[1]: Akram, A. and Sawalha, L., 2019. A survey of computer architecture simulation techniques and tools. IEEE Access
[2]: Nowatzki, T., Menon, J., Ho, C.H. and Sankaralingam, K., 2015. Architectural simulators considered harmful. Micro.
Sampled Simulation
Instead of running the entire program in uArch simulation, run the entire program in functional simulation and only run samples in uArch simulation
The full workload is represented by a selection of sampling units.
How should sampling units be selected?
How can we accurately estimate the performance of a sampling unit?
How can we estimate errors when extrapolating from sampling units?
Existing Sampling Techniques
SimPoint
Program execution traces aren’t random
They execute the same code again and again
Workload execution traces can be split into phases that exhibit similar μArch behavior
SimPoint-style representative sampling
Compute an embedding for each program interval (e.g. blocks of 100M instructions)
Cluster interval embeddings using k-means
Choose representative intervals from each cluster as sampling units
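A minimal sketch of this selection step, assuming per-interval basic block frequency vectors (BBFVs) have already been collected; the function and its interfaces are illustrative, not TidalSim's actual implementation:

```python
# SimPoint-style selection: normalize BBFVs, cluster with k-means, and pick
# the interval closest to each centroid as that cluster's representative.
import numpy as np
from sklearn.cluster import KMeans

def pick_representatives(bbfvs: np.ndarray, n_clusters: int) -> list[int]:
    """bbfvs: (n_intervals, n_basic_blocks) matrix of per-interval BB counts."""
    norm = bbfvs / bbfvs.sum(axis=1, keepdims=True)  # relative BB frequencies
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(norm)
    reps = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        # Representative = member interval closest to the cluster centroid
        dists = np.linalg.norm(norm[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return reps
```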
SMARTS
Rigorous statistical sampling enables computation of confidence bounds
Use random sampling on a full execution trace to derive a population sample
Central limit theorem provides confidence bounds
SMARTS-style random sampling
Pick a large number of samples to take before program execution
If the sample variance is too high after simulation, then collect more sampling units
Use CLT to derive a confidence bound for the aggregate performance metric
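A hedged sketch of the stopping check via the CLT; the z value and the 3% relative-error target below are illustrative constants, not SMARTS's exact ones:

```python
# SMARTS-style confidence bound: given per-sampling-unit IPC measurements,
# compute a CLT confidence interval and decide whether more units are needed.
import math
from statistics import mean, stdev

def ipc_confidence(ipcs: list[float], z: float = 2.0) -> tuple[float, float]:
    """Return (mean IPC, half-width of the ~95% confidence interval)."""
    mu = mean(ipcs)
    half_width = z * stdev(ipcs) / math.sqrt(len(ipcs))
    return mu, half_width

mu, hw = ipc_confidence([0.82, 0.79, 0.91, 0.85, 0.80])  # hypothetical units
need_more_units = hw / mu > 0.03  # collect more units if the bound is too wide
```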
Functional Warmup
The state from a sampling unit checkpoint is only architectural state. The microarchitectural state of the uArch simulator starts at the reset state!
We need to seed long-lived uArch state at the beginning of each sampling unit
This process is called functional warmup
Importance of Functional Warmup
Long-lived microarchitectural state (caches, branch predictors, prefetchers, TLBs) has a substantial impact on the performance of a sampling unit
[1]: Hassani, Sina, et al. "LiveSim: Going live with microarchitecture simulation." HPCA 2016.
[2]: Eeckhout, L., 2008. Sampled processor simulation: A survey. Advances in Computers. Elsevier.
Why RTL-Level Sampled Simulation?
Eliminate modeling errors
Remaining errors can be handled via statistical techniques
No need to correlate performance model and RTL
Let the RTL serve as the source of truth
Can produce RTL-level collateral
Leverage for applications in verification and power modeling
This RTL-first evaluation flow is enabled by highly parameterized RTL generators and SoC design frameworks (e.g. Chipyard).
Backup Slides
A Broad View of Simulation
A high-level, generic view of the input and outputs of a simulator.
Simulation is the workhorse of architecture evaluation
Simulation inputs can have wide variation of fidelity
Hardware spec: high-level models to detailed microarchitecture
Workload: high-level algorithmic description to concrete binary
The fidelity of simulation outputs tracks that of the inputs
Hardware Abstractions
There are roughly 4 levels of hardware abstractions used in architecture evaluation
Architectural (functional) models
"Rough" microarchitectural models with approximate state and timing
"Detailed" microarchitectural models with more refined state and timing
Register-transfer level (RTL) models with full fidelity state and timing
[1]: Shao, Y.S., Reagen, B., Wei, G.Y. and Brooks, D., 2014. Aladdin: A pre-rtl, power-performance accelerator simulator enabling large design space exploration of customized architectures. ACM SIGARCH.
Hardware Abstractions: "Rough" uArch Models - ML Accelerators
Rough uArch models are used for evaluating ML accelerator architectures, dataflows, and workload mappings
Prior Work: Timeloop[1] and Accelergy[2] provide a framework for describing accelerator microarchitecture with parameterizable blocks (PEs, scratchpads) and workload mappings, and for simulating workloads to produce PPA estimates
An example microarchitecture modeled by Timeloop
Microarchitecture description schema provided by Timeloop
Timeloop's schema for defining workloads and their mapping
[1]: Parashar, A., et. al., 2019. Timeloop: A systematic approach to dnn accelerator evaluation. ISPASS.
[2]: Wu, Y.N., Emer, J.S. and Sze, V., 2019. Accelergy: An architecture-level energy estimation methodology for accelerator designs. ICCAD.
Rough uArch models are also common for evaluating core microarchitectures
Prior Work: McPAT[1] models CPUs with a parameterizable out-of-order pipeline and uncore components coupled to a timing simulator. CACTI[2] models the PPA of SRAM-based caches and DRAM.
The simulation flow provided by McPAT.
McPAT results. Inputs: workload and microarch description. Outputs: PPA
[1]: Li, S., et. al., 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. MICRO.
[2]: Muralimanohar, et al., 2009. CACTI 6.0: A tool to model large caches. HP laboratories.
The most popular way to evaluate core microarchitectural optimizations is with a detailed execution-driven simulator that models many microarchitectural structures.
Prior Work: gem5, ZSim, SST, MARSSx86, Sniper, ESESC. These simulators model the core pipeline and uncore components with cycle-level time-granularity.
The modular architecture of gem5.
Detailed per-instruction core pipeline visualization using Konata.
Hardware Abstractions: RTL
Register-transfer level (RTL) (e.g. Verilog) is the lowest abstraction used in pre-silicon architecture evaluation
Every bit of state and logic is explicitly modeled. RTL is the highest fidelity hardware model.
Can extract very precise power, performance, and area metrics
Which Hardware Abstraction is Suitable?
Architectural (functional) models
Microarchitectural models ("rough") with approximate state and timing
Microarchitectural models ("detailed") with more refined state and timing
Register-transfer level (RTL) models with full fidelity state and timing
We can't compromise on accuracy or latency if we want meaningful and fast microarchitectural iteration.
"Detailed" uArch models or RTL are the only viable options for our performance simulator.
Can we Use Microarchitectural Simulators?
uArch simulators seem to satisfy most of our requirements
Low startup latency: seconds to 1 minute
Metrics: IPC traces
Cost: minimal
Can we adapt uArch simulators to perform better in terms of accuracy and throughput?
Accuracy of Microarchitectural Simulators
Comparison of estimated IPC from various uArch simulators vs real IPC from Haswell.[1]
MAPE on MiBench: 9.5% (Sniper), 44.6% (gem5), 38.2% (PTLSim) and 47.06% (Multi2Sim).
Raw IPC errors on 64-bit workloads vs real Haswell[2]. Simulators not only disagree with each other, but have substantial errors exceeding 20%.
Impact of halving the pipeline width (widths of fetch, decode, issue/rename, dispatch, and commit are halved)[2]. Simulators disagree with each other.
Impact of halving all the cache sizes[2]. Note how MARSSx86 shows increased IPC for some benchmarks! Again, disagreements are substantial.
Impact of using a bimodal branch predictor vs the Haswell BP.[2].
The sensitivity of each simulator is wildly different!
uArch simulators are not accurate enough for microarchitectural iteration.
Trends aren't enough. Note the sensitivity differences - gradients are critical!
[1]: Akram, A. and Sawalha, L., 2016, October. x86 computer architecture simulators: A comparative study. ICCD.
[2]: Akram, A. and Sawalha, L., 2019. A survey of computer architecture simulation techniques and tools. IEEE Access
Flexibility vs Accuracy Tradeoff of uArch Simulators
Sniper, though it shows greater accuracy, is not flexible enough to model new microarchitectural features compared to gem5.
On the other hand, gem5 and PTLsim are more flexible and can be used for studies of particular microarchitectural blocks and full-system workloads, with gem5 being more configurable and showing higher accuracy.[1]
There is an unfavorable tradeoff between simulator flexibility and accuracy: accuracy comes from precise silicon calibration, which constrains flexibility.
[1]: Akram, A. and Sawalha, L., 2019. A survey of computer architecture simulation techniques and tools. IEEE Access
The Trends Myth
It is casually stated as, “Although specific details of the simulation are wrong, the overall trends will be correct.”
Relative performance comparisons may be correct, even if there is absolute error caused by a poor assumption or bug.[1]
However, for this to be true, the new technique being evaluated through simulation must be insulated from or statistically uncorrelated with the source of simulation errors.
Because simulators can have significant errors, which are completely unknown, only in rare cases can we be sure this argument holds.[1]
Even accurate relative trends are not enough for microarchitectural iteration - the gradients must also be precise!
[1]: Nowatzki, T., Menon, J., Ho, C.H. and Sankaralingam, K., 2015. Architectural simulators considered harmful. IEEE Micro.
Accuracy of gem5 for RISC-V Cores
Cycles ratio of custom + 3 MiBench (qsort, stringsearch) baremetal binaries on RSD (OoO RV32IMF RISC-V core) vs a gem5 model matching the core microarchitecture parameters[1].
Up to 40% IPC error.
Significant mismatches in L1 MPKI/MAKI are main contributor to error.[1]
The complexity and effort to 'calibrate' a uArch simulator to RTL simulation is significant.[2]
There is no evidence that uArch simulators can achieve sub 5% accuracy for microarchitectural iteration
[1]: Chatzopoulos, O., et. al., 2021. Towards Accurate Performance Modeling of RISC-V Designs. arXiv preprint
[2]: Ta, T., Cheng, L. and Batten, C., 2018. Simulating multi-core RISC-V systems in gem5. CARRV
But Can't We Calibrate uArch Simulators?
[1] calibrates Sniper to Cortex A54 and A72 cores, using ML to fine-tune model parameters against silicon measurements of microbenchmarks, reaching average IPC errors of 7% and 15% (up to 50%) on SPEC CPU2017.
[2] calibrates MARSSx86 to i7-920 with post-calibration IPC error of 7% on SPEC
Absolute errors are still too high when architects must make decisions based on tiny IPC changes
Calibration only applies to a specific design point!
Gradients and errors are not understood nor bounded when microarchitecture parameters are altered to perform HW parameter DSE
uArch simulators are not suitable for pre-silicon microarchitectural iteration
[1]: Adileh, A., et al., 2019. Racing to hardware-validated simulation. ISPASS.
[2]: Asri, M., et al., 2016. Simulator calibration for accelerator-rich architecture studies. SAMOS
False Confidence from Validation
A common misconception is that if the parameters are changed and configured for some other design point, the accuracy will be similar.[1]
Tool validation is often carried out by fitting parameters to the "specific" validation targets, not by ensuring the underlying modeling is accurate for individual phenomena or their interactions.[1]
Calibration / validation of a uArch simulator against silicon doesn't make it suitable for microarchitectural iteration
[1]: Nowatzki, T., Menon, J., Ho, C.H. and Sankaralingam, K., 2015. Architectural simulators considered harmful. IEEE Micro.
Our Decision Tree Thus Far
Need high fidelity hardware abstractions → must use detailed uArch / RTL abstractions
Need high-accuracy simulators, but uArch simulators are inaccurate: 1) absolutely, 2) in relative trends and gradients, and 3) across parameterizations
Therefore, we must use RTL simulation as the lowest level performance simulator
But RTL simulation has low throughput!
Let's take a technique from uArch simulators that improves their throughput
Random Sampling
Prior work: "SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling"
Before the workload is launched, the number of sampling units is determined
If the sample variance is too high to achieve the target confidence bound, then more sampling units must be collected
Sampling units are selected either using random, reservoir, or systematic sampling
Central limit theorem is used to derive a confidence bound around the performance metrics reported by uArch simulation of the sampling units
Comparison Between Sampling Techniques
| Sampling Technique | Interval Length | # of Intervals Simulated | Interval Selection | Functional Warmup | Detailed Warmup | Time Granularity |
|---|---|---|---|---|---|---|
| SimPoint | 10-100M | 50-100 | BBFV + k-means | Optional | ≈0.1-1M | Interval length |
| SMARTS | 1-10k | 10k | Systematic sampling | Required | 1k | Entire workload |
| TidalSim | 10k | 10-100 | BBFV + k-means | Required | 1k | Interval length |
Final Takeaways
Microarchitectural iteration requires high accuracy
→ we must use RTL simulation as our performance simulator
RTL simulation has low throughput
→ we must employ simulation sampling techniques to combine architectural and RTL simulation to improve throughput
We can't execute long sampling units in RTL simulation
→ we must use uArch functional warmup models to minimize errors due to stale uArch state
We want time-domain power, performance, and RTL collateral. We want the ability to extract tiny and unique benchmarks from large workloads.
→ we must combine the SimPoint and SMARTS sampling methodologies
4. How (pt 1): TidalSim v0.1 (A Prototype Implementation)
Implementation details of the TidalSim v0.1 prototype
Cache functional warmup model
Results for IPC trace reconstruction
Going from TidalSim v0.1 to v1
Overview of the TidalSim v0.1 Flow
Implementation Details For TidalSim v0.1
Basic block identification
BB identification from spike commit log or from static ELF analysis
Basic block embedding of intervals
Clustering and checkpointing
k-means, PCA-based n-clusters
spike-based checkpoints
RTL simulation and performance metric extraction
Custom force-based RTL state injection, out-of-band IPC measurement
Extrapolation
Estimate IPC of each interval based on its embedding and distances to RTL-simulated intervals
Functional Cache Warmup with Memory Timestamp Record
Memory Timestamp Record (MTR)[1] is a cache warmup model that can be constructed once and reused for many different cache parameterizations
Updating entries in the MTR data structure from a memory access trace
Reconstructing a cache state from a MTR checkpoint for a specific cache parameterization
[1]: Barr, Kenneth C., et al. "Accelerating multiprocessor simulation with a memory timestamp record." ISPASS 2005.
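A simplified, uniprocessor sketch of the MTR bookkeeping (the real MTR in [1] also tracks per-processor readers/writers and dirty data for multiprocessor coherence; everything here is illustrative):

```python
# MTR idea: record the last access time per cache block address; reconstruct
# any set-associative LRU cache by keeping the most recently touched blocks
# that map to each set under the chosen parameterization.
from collections import defaultdict

class MTR:
    def __init__(self, block_bytes: int = 64):
        self.block_bytes = block_bytes
        self.last_access: dict[int, int] = {}  # block address -> timestamp

    def update(self, timestamp: int, addr: int) -> None:
        # Record the most recent access to this block (reads and writes alike)
        self.last_access[addr // self.block_bytes] = timestamp

    def reconstruct(self, n_sets: int, n_ways: int) -> dict[int, list[int]]:
        # Group blocks by the set they map to under this parameterization
        sets = defaultdict(list)
        for block, ts in self.last_access.items():
            sets[block % n_sets].append((ts, block))
        # An LRU set holds the n_ways most recently accessed blocks
        return {s: [b for _, b in sorted(v, reverse=True)[:n_ways]]
                for s, v in sets.items()}
```

Because `reconstruct` takes the cache geometry as an argument, one MTR checkpoint can seed many different cache parameterizations without re-running the trace.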
IPC Trace Prediction: huffbench
Huffman compression from Embench (huffbench)
N=10000, C=18
Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
Large IPC variance
IPC Trace Prediction: wikisort
Merge sort benchmark from Embench (wikisort)
N=10000, C=18
Full RTL sim takes 15 minutes, TidalSim runs in 10 seconds
Can capture general trends and time-domain IPC variation
Aggregate IPC Prediction for Embench Suite
Typical IPC error (without functional warmup and with fine time-domain precision of 10k instructions) is < 5%
Backup Slides
Basic Block Identification
Basic blocks are extracted from the dynamic commit trace emitted by spike
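For illustration, a sketch of recovering dynamic block boundaries from a committed-PC trace (e.g. parsed from a spike commit log); fixed 4-byte instructions are assumed, and RVC plus static block splitting at branch targets are ignored:

```python
# A new dynamic basic block is assumed wherever the committed PC does not
# advance sequentially (taken branch, jump, or trap).
def basic_block_boundaries(pcs: list[int], inst_bytes: int = 4) -> list[int]:
    boundaries = [pcs[0]]  # the first PC starts the first block
    for prev, cur in zip(pcs, pcs[1:]):
        if cur != prev + inst_bytes:  # non-sequential PC -> new block entry
            boundaries.append(cur)
    return boundaries
```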
As workload traces grow to billions of dynamic instructions, $N$ will have to go up too, to avoid too many clusters
→ we need to subsample sampling units using a SMARTS-like methodology to tolerate the low throughput of RTL simulation
Functional warmup can provide us with microarchitectural state at the start of each sampling unit, but injecting that state in RTL simulation is error-prone
Correlating microarchitectural cache state via RTL hierarchical paths is tricky and requires manual effort
If the hardware parameterization changes (cache hierarchy/sizing, choice of branch predictor)
→ the functional warmup models and state injection logic must also change
Extrapolation
We gather performance metrics for one sampling unit in each cluster that is taken to be representative of that cluster ($\vec{p}$)
To compute the estimated performance of a given interval
Compute the distances $\vec{d}$ of that interval's embedding to each cluster centroid
Compute a weighted mean using $\vec{d}$ and $\vec{p}$
Compute the estimated performance of all intervals to extrapolate to a full performance trace
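A sketch of this weighted extrapolation, assuming inverse-distance weighting over cluster centroids (the precise weighting function is a design choice not fixed here):

```python
# p[c] is the measured IPC of cluster c's representative sampling unit.
import numpy as np

def estimate_ipc(embedding: np.ndarray, centroids: np.ndarray,
                 p: np.ndarray, eps: float = 1e-9) -> float:
    d = np.linalg.norm(centroids - embedding, axis=1)  # distance to centroids
    w = 1.0 / (d + eps)                                # closer clusters weigh more
    return float(np.dot(w, p) / w.sum())

# Apply to every interval's embedding to reconstruct a full IPC trace:
# ipc_trace = [estimate_ipc(e, centroids, p) for e in embeddings]
```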
IPC Trace Prediction: aha-mont64
Montgomery multiplication from Embench (aha-mont64)
N=1000, C=12
Full RTL sim takes 10 minutes, TidalSim runs in 10 seconds
import chisel3._

object MemOp extends ChiselEnum {
  val read, write = Value
}

class MemTx extends Bundle {
  val addr = UInt(32.W)
  val data = UInt(64.W)
  val op   = MemOp()
}

// Uniformly random transaction generator from the stimulus DSL of [1]
val memTxGen: Gen[MemTx] = Gen[MemTx].uniform
[1]: Iyer, Vighnesh, et. al., New Embedded DSLs for HW Design and Verification, PLARCH Workshop 2023
RTL Coverage for Simulation Feedback
Coverage implemented as a hardware IR compiler pass rather than baked into the RTL simulator
Easy to add new coverage metrics via static analysis of the RTL netlist
Leverage a simulator-independent coverage methodology for coverage instrumentation of long-lived uArch RTL
Leverage fast RTL simulation APIs, parametric stimulus generators, and coverage instrumentation for parametric fuzzing[1]
Parametric fuzzing mutates the bytestream going into a parametric generator rather than the DUT directly[2]
We augment typical parametric fuzzing with mark-driven mutation
[1]: Iyer, Vighnesh, et. al., New Embedded DSLs for HW Design and Verification, PLARCH Workshop 2023
[2]: Padhye, R., Lemieux, C., Sen, K., Papadakis, M. and Le Traon, Y., 2019. Semantic fuzzing with zest. ACM SIGSOFT.
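A toy illustration of the parametric idea (the generator, byte format, and mutator below are all hypothetical; the actual work uses the Chisel-embedded DSL from [1]):

```python
# Parametric fuzzing: mutate the parameter bytes feeding a stimulus generator,
# not the DUT inputs, so every mutant still decodes to a well-formed stimulus.
import random

def gen_mem_tx(params: bytes) -> dict:
    # Hypothetical parametric generator: interpret 3 bytes as one transaction
    assert len(params) >= 3
    return {"addr": (params[0] << 8) | params[1],
            "op": "read" if params[2] % 2 == 0 else "write"}

def mutate(params: bytes, n_flips: int = 2) -> bytes:
    # Bit-flip mutation on the parameter bytestream
    b = bytearray(params)
    for _ in range(n_flips):
        b[random.randrange(len(b))] ^= 1 << random.randrange(8)
    return bytes(b)

tx = gen_mem_tx(mutate(b"\x80\x00\x01"))  # always a structurally valid MemTx
```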
Parametric Fuzzing - Demo Using Spike
Use spike's L1 dcache model's miss rate as feedback to produce RISC-V programs that maximize it
Using parametric fuzzing, we can automatically construct RISC-V programs to maximize any uArch metric given a small set of RISC-V sequence primitives
6.b: Putting the 'Tidal' in TidalSim
Issues with Time Modeling in Sampled Simulation
Prior work runs uArch simulators in "syscall emulation" mode when evaluating workloads (e.g. SPEC), not modeling any OS-application interactions
Real workloads contain many interactions between processes and the OS which are sensitive to the modeling of time
Consider timer interrupts: naive functional simulators will just advance one timestep per committed instruction, not matching RTL!
TidalSim to Model Time Accurately
We propose bouncing between functional and RTL simulation, where performance metrics from RTL sim impacts time advancement in functional sim
To avoid simulating every interval in RTL sim, we leverage interval embeddings to estimate IPC on the fly
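A minimal sketch of the proposed coupling, assuming the functional simulator exposes a cycle counter and an on-the-fly per-interval IPC estimate (names are illustrative):

```python
# Advance functional-sim time using RTL-derived IPC estimates, so that
# cycle-driven events (e.g. timer interrupts) fire at realistic times.
def advance_cycles(cycle: int, insts_retired: int, est_ipc: float) -> int:
    return cycle + round(insts_retired / est_ipc)  # cycles ≈ instructions / IPC

# Naive functional sim: advance_cycles(c, n, 1.0) -> one cycle per instruction.
# TidalSim: est_ipc comes from RTL simulation of a similar interval.
```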
6.c: Sampled Simulation with Accelerators
What Makes Accelerators Suitable for Sampled Simulation?
Accelerator architectural state is large and explicit
→ snapshotting is easy
→ functional warmup is unnecessary
Accelerator usage is often repeated in workloads
→ clustering accelerator usage embeddings is reasonable
→ potential for massive simulation throughput improvement
Accelerator behavior is consistent
→ accelerator performance is consistent from one dataset to another
→ embeddings don't need to be aware of the accelerator microarchitecture / latent state
Extending Interval Embeddings to Accelerators
Incorporate accelerator state and the semantics of the accelerator ISA into the embedding
Can capture and embed accelerator interactions with system memory and with internal compute units
In the case of Gemmini, we must also consider instruction dependencies and out-of-order execution + memory contention from multiple accelerators
Backup: 6.d: Generalizing the Spectrum of Sampled Simulation
Sampled Simulation Techniques
Simulation techniques encompass SMARTS, SimPoint, hybrid variants, eager parallel RTL simulation, and many more
A Formalization and Simulation Model
Only considering techniques that can operate in a streaming fashion, develop a parameterized version of TidalSim
Streaming necessitates new incremental unsupervised learning algorithms
Formalize the interfaces between arch sim, uArch models, and RTL sim
Formalize and parameterize simulation methodology to encompass all prior techniques
Consider input parameters such as interval length ($N$), number of host cores ($n_{cores}$), RTL simulation throughput ($T_{rtl}$), sampling technique ($i \rightarrow \{0, 1\}$)
Produce estimated output metrics such as cost, runtime, aggregate throughput, latency, time-granularity of output, error bounds
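For intuition, a back-of-the-envelope runtime model under simplifying assumptions (one functional pass plus parallel RTL simulation of the chosen intervals; all formulas here are assumptions, not the thesis's formalization):

```python
# Estimate wall-clock runtime from interval length N, number of simulated
# intervals, RTL throughput T_rtl, and functional throughput T_arch (insts/s).
def est_runtime_s(total_insts: float, n_intervals: int, N: float,
                  T_rtl: float, T_arch: float, n_cores: int = 1) -> float:
    functional = total_insts / T_arch            # full functional simulation
    rtl = n_intervals * N / (T_rtl * n_cores)    # sampled RTL sims parallelize
    return functional + rtl

# e.g. 1B insts, 50 intervals of 10k insts, 10 KIPS RTL, 100 MIPS functional:
t = est_runtime_s(1e9, n_intervals=50, N=1e4, T_rtl=1e4, T_arch=1e8, n_cores=8)
```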
PC/Binary-Agnostic Embeddings
Basic block embedding assumes
There is a static PC → basic block mapping
Intervals with similar basic block traversals have similar uArch behavior
Our embeddings should be PC/binary-agnostic to support portability and multi-process workloads in an OS
Most prior work only runs single-process workloads using syscall proxies
Real workloads are heavily affected by interactions between the OS and userspace processes
We will explore embeddings with features such as
Instruction mix, function call frequency, instruction dependencies
Microarchitectural behaviors: I/D cache misses, BP model mispredicts, TLB behavior
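A sketch of one such feature vector, assuming per-interval opcode classes and uArch event counts are available (the feature set is exploratory):

```python
# PC/binary-agnostic interval embedding: instruction mix fractions plus
# per-instruction uArch event rates.
from collections import Counter

def embed_interval(opcode_classes: list[str], icache_misses: int,
                   dcache_misses: int, bp_mispredicts: int) -> list[float]:
    n = len(opcode_classes)
    mix = Counter(opcode_classes)
    classes = ["load", "store", "branch", "jump", "alu", "mul_div"]
    return ([mix[c] / n for c in classes] +          # instruction mix fractions
            [icache_misses / n, dcache_misses / n,   # misses per instruction
             bp_mispredicts / n])
```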
7. Thesis Outline
Outline
Motivation and background
Implementation and evaluation of TidalSim v0.1 - Completed