Performance Analysis¶
This document analyzes VAJAX performance characteristics, explaining the overhead profile relative to VACASK (C++ reference simulator) and the GPU acceleration crossover point.
Benchmark Results¶
All measurements from GitHub Actions CI runners (CPU: ubuntu-latest, GPU: nvidia-runner-1). VACASK numbers on CPU use live execution; GPU comparisons use reference values.
CPU: VAJAX vs VACASK¶
| Benchmark | Nodes | Steps | JAX (ms/step) | VACASK (ms/step) | Ratio |
|---|---|---|---|---|---|
| rc | 4 | 1M | 0.023 | 0.002 | 12.2x |
| graetz | 6 | 1M | 0.033 | 0.004 | 8.6x |
| mul | 8 | 500k | 0.040 | 0.004 | 10.4x |
| ring | 47 | 20k | 0.516 | 0.108 | 4.8x |
| c6288 | ~5000 | 20 | 163.684 | 288.800 | 0.57x |
| mul64 | ~133k | 15 | 8324.949 | timeout | — |
c6288: with the UMFPACK sparse solver on CPU, VAJAX is 1.8x faster than VACASK. The crossover where VAJAX overtakes VACASK occurs at roughly 5000 nodes.
mul64 (64x64 array multiplier): ~266k MOSFETs, ~666k unknowns (133k nodes + 533k internal). VACASK times out on CI (>5 min per step). VAJAX completes at ~8.3s/step using UMFPACK sparse solver. On GPU with float32 cuDSS factorization and iterative refinement, mul64 runs at ~648 ms/step on Tesla T4 (16GB) — a 12.8x GPU speedup.
GPU: VAJAX Acceleration¶
| Benchmark | Nodes | GPU (ms/step) | CPU (ms/step) | GPU Speedup | vs VACASK CPU |
|---|---|---|---|---|---|
| mul64 | ~133k | 648.00 | 8324.95 | 12.8x | N/A (timeout) |
| c6288 | ~5000 | 19.81 | 163.68 | 8.3x | 0.07x (faster) |
| ring | 47 | 1.49 | 0.52 | 0.3x | 14x (slower) |
| graetz | 6 | 0.30 | 0.03 | 0.1x | 75x (slower) |
| rc | 4 | 0.24 | 0.02 | 0.08x | 120x (slower) |
mul64 uses float32 cuDSS factorization with iterative refinement, reducing VRAM usage from >16GB (f64 OOM) to ~10GB on Tesla T4 (16GB). The f32 factorization provides near-f64 accuracy via a single refinement step: solve in f32, compute residual r=f+J@x in f64 (SpMV only), solve correction in f32, return x+d.
GPU results for circuits below ~500 nodes reflect GPU kernel overhead on tiny
workloads, not simulation inefficiency. The auto-threshold (gpu_threshold=500)
prevents this in normal usage; the benchmark uses --force-gpu to measure all
circuits for tracking purposes.
Accuracy (vs VACASK)¶
| Benchmark | Dense RMS | Sparse RMS | Threshold |
|---|---|---|---|
| rc | 0.00% | 0.00% | 5% |
| graetz | 0.00% | 0.00% | 15% |
| mul | 0.00% | 0.00% | 2% |
| c6288 | - | 2.01% | 10% |
Per-Step Overhead Analysis¶
The overhead ratio decreases as circuit size increases, confirming a fixed per-step overhead in the JAX path that VACASK doesn't have:
Overhead ratio vs circuit size (CPU):
rc (4 nodes) ████████████████████████████████████ 12.2x
mul (8 nodes) ██████████████████████████████████ 10.4x
graetz (6 nodes) ██████████████████████████ 8.6x
ring (47 nodes) ███████████████ 4.8x
c6288 (5k nodes) ██ 0.57x (VAJAX faster)
mul64 (133k nodes) █ N/A (VACASK timeout)
Overhead Breakdown¶
Each timestep in full_mna.py body_fn executes ~11 major operations beyond the
core Newton-Raphson solve:
| Operation | Estimated Cost | VACASK Equivalent | Notes |
|---|---|---|---|
| LTE estimation | 2-3 us | Similar | Per-node tolerance checking, error coefficients |
| Voltage prediction | 1-2 us | Simpler predictor | Lagrange polynomial from multi-point history |
| BDF2 coefficients | 0.5-1 us | Pre-computed | Variable-step formula recomputed every step |
| History management | 1-2 us | In-place mutation | jnp.roll on V_history, dt_history arrays |
| jnp.where branching | 2-3 us | Runtime if | Evaluates both branches unconditionally |
| Vmap device eval | 1-2 us/NR iter | Sequential loop | Overhead dominates for <10 devices |
| COO assembly | 1-2 us/NR iter | Direct stamping | Concatenate + sum duplicates |
| Simparams .at[].set() | 0.5-1 us/NR iter | Struct writes | Creates intermediate arrays |
| Total fixed overhead | ~10-15 us/step | ~0 us | |
Scaling Model¶
The per-step cost can be modeled as:
T_jax(n) = T_fixed + T_compute(n)
T_vacask(n) = T_compute_vacask(n)
where:
T_fixed ~ 10-15 us (JAX overhead, independent of circuit size)
T_compute(n) ~ T_compute_vacask(n) for large n (same algorithm)
For large circuits (c6288, mul64), VAJAX's UMFPACK sparse solver is actually faster than VACASK — the fixed overhead is negligible and the solver implementation is competitive:
| Circuit | T_fixed | T_compute | Overhead % | Overall Ratio |
|---|---|---|---|---|
| rc | 10 us | 2 us | 83% | 12.2x |
| graetz | 10 us | 4 us | 71% | 8.6x |
| ring | 10 us | 100 us | 9% | 4.8x |
| c6288 | 10 us | 163,000 us | <0.01% | 0.57x |
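The overhead percentages in the table follow directly from the model. A quick check, using the table's own T_fixed and T_compute estimates:

```python
# overhead = T_fixed / (T_fixed + T_compute), per the scaling model above
def overhead_pct(t_fixed_us, t_compute_us):
    return 100.0 * t_fixed_us / (t_fixed_us + t_compute_us)

t_compute = {"rc": 2, "graetz": 4, "ring": 100, "c6288": 163_000}
pct = {name: overhead_pct(10, tc) for name, tc in t_compute.items()}
# rc ~83%, graetz ~71%, ring ~9%, c6288 well under 0.01%
```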
Why JAX Has Per-Step Overhead¶
1. Functional Array Updates¶
JAX arrays are immutable. Inside lax.while_loop, conditional updates use
jnp.where which evaluates both branches and selects elementwise:
# JAX: evaluates both new_val and old_val, then selects
state = jnp.where(accept_step, new_val, old_val)
The body_fn has ~15 such conditional updates per step for history management, step acceptance, and NR failure handling.
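A numpy analogue makes the cost concrete: like `jnp.where`, Python evaluates both branch expressions before the select ever happens, so only the final elementwise choice is conditional.

```python
import numpy as np

calls = []

def branch(tag, x):
    # Record that this branch expression was actually computed.
    calls.append(tag)
    return x + 10 if tag == "new" else x

a = np.arange(4.0)
out = np.where(a > 1.0, branch("new", a), branch("old", a))
# Both "new" and "old" run regardless of the mask; out == [0, 1, 12, 13]
```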
2. Vmap Batching for Small Counts¶
Device evaluation uses jax.vmap to batch all instances of each device type.
This is essential for GPU parallelism on large circuits but adds overhead for
small batch sizes:
# 2 resistors: vmap adds vectorization overhead > sequential evaluation
vmapped_eval = jax.vmap(device_eval)(batched_params)
For rc (2 resistors), the vmap overhead exceeds the actual computation. For c6288 (~86,000 transistors), vmap amortizes perfectly.
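The trade-off can be illustrated with a numpy analogue of vmap-style batching (a sketch, not VAJAX's actual device-evaluation code): all instances of a device type are evaluated in one vectorized expression instead of a per-device loop.

```python
import numpy as np

# Batched evaluation: one vectorized op covers every resistor instance.
G = np.array([1e-3, 2e-3, 5e-3])   # conductances, one per instance
V = np.array([1.0, 0.5, 2.0])      # branch voltages
I_batch = G * V                    # single batched evaluation

# Sequential equivalent (what a C++ simulator's per-device loop does).
I_loop = np.array([g * v for g, v in zip(G, V)])
```

With 2-3 instances the batching machinery costs more than the arithmetic it saves; with ~86,000 instances the fixed cost is amortized and the parallelism pays off.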
3. COO Matrix Assembly¶
VAJAX builds the Jacobian in COO (coordinate) format:

1. Each device type produces (row, col, value) triplets.
2. Triplets are concatenated across device types.
3. Duplicate indices are summed to form the final matrix.
VACASK stamps directly into a pre-allocated matrix with known positions. The COO approach enables JAX tracing and GPU parallelism but adds indirection.
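The duplicate-summing step can be sketched in numpy (hypothetical stamp values; the real assembly works on much larger sparse triplet arrays):

```python
import numpy as np

# Two devices stamp the same (0, 0) Jacobian position.
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
vals = np.array([2.0, 3.0, 4.0])

J = np.zeros((2, 2))
np.add.at(J, (rows, cols), vals)   # unbuffered add: duplicate indices accumulate
# J[0, 0] == 5.0 (2.0 + 3.0), J[1, 1] == 4.0
```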
4. Integration Coefficient Recomputation¶
Variable-step BDF2 requires recomputing integration coefficients each step based on the step-size ratio. VACASK may use lookup tables or simplified formulas for common step ratios.
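For reference, the standard variable-step BDF2 first-derivative coefficients look like this (a textbook formula; VAJAX's exact implementation may differ):

```python
def bdf2_coeffs(h_n, h_prev):
    # Coefficients for y'(t_n) ~= a0*y_n + a1*y_{n-1} + a2*y_{n-2},
    # where h_n is the current step and rho the step-size ratio.
    rho = h_n / h_prev
    a0 = (1 + 2 * rho) / (h_n * (1 + rho))
    a1 = -(1 + rho) / h_n
    a2 = rho * rho / (h_n * (1 + rho))
    return a0, a1, a2

# BDF2 is exact for quadratics: with y = t**2 it recovers y'(1.0) = 2.0
# using history points at t = 1.0, 0.9, 0.85 (steps 0.1 and 0.05).
a0, a1, a2 = bdf2_coeffs(0.1, 0.05)
dy = a0 * 1.0**2 + a1 * 0.9**2 + a2 * 0.85**2
```

Recomputing these three divisions every step is cheap in absolute terms, but it is part of the fixed per-step cost that VACASK avoids with pre-computed tables.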
GPU Acceleration: Why Large Circuits Win¶
GPU acceleration becomes beneficial when the per-step compute time exceeds the kernel launch and memory overhead (~100-500 us per step). This happens around 500+ nodes.
mul64 (133k nodes): GPU Stress Test¶
- ~266k PSP103 MOSFET evaluations per NR iteration
- Sparse Jacobian: 666,409 unknowns (133k nodes + 533k internal + 130 currents)
- 3,259,918 CSR entries (from 30.5M COO triplets)
- Dense solve impossible (~3.3TB memory); sparse solver required
- Float32 factorization: cuDSS factorizes J in f32 (halves VRAM), iterative refinement recovers near-f64 accuracy. Reduces peak VRAM from >16GB to ~10GB.
- Tesla T4 (16GB): 648 ms/step — 12.8x faster than CPU (8325 ms/step)
- CPU uses UMFPACK sparse solver at ~8.3s/step; VACASK times out (>5 min/step)
c6288 (5000 nodes): GPU Advantage¶
- Jacobian: ~5000x5000 sparse matrix (~86k transistors, ~5k nodes)
- Uses cuDSS sparse solver on GPU, UMFPACK on CPU
- GPU: 19.81 ms/step, CPU: 163.68 ms/step — 8.3x GPU speedup
- vs VACASK CPU (288.8 ms/step): 14.6x faster on GPU
rc (4 nodes): GPU Disadvantage¶
- Jacobian: 4x4 matrix operations
- Dense solve: 64 multiply-adds per NR iteration
- GPU overhead: kernel launch > actual compute
- Result: ~12x slower than CPU (expected and correct)
The gpu_threshold parameter (default: 500 nodes) automatically routes small
circuits to CPU, avoiding this overhead in normal usage.
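The routing decision reduces to a size check. A hypothetical helper mirroring the described behavior (the actual VAJAX API may differ):

```python
def choose_backend(n_nodes, gpu_threshold=500, force_gpu=False):
    # Small circuits stay on CPU unless the benchmark forces GPU.
    if force_gpu:
        return "gpu"
    return "gpu" if n_nodes >= gpu_threshold else "cpu"

# rc (4 nodes) -> "cpu"; c6288 (~5000 nodes) -> "gpu";
# --force-gpu overrides the threshold for benchmark tracking.
```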
Potential Optimizations¶
These are known areas where the per-step overhead could be reduced:
- Fused operations: Combine BDF2 coefficient computation with history updates into a single kernel to reduce launch overhead.
- Direct stamping: For the CPU path, stamp directly into a pre-allocated Jacobian instead of building COO triplets.
- Sequential device eval on CPU: Use a simple loop instead of vmap when running on CPU with few device instances.
- Pre-computed coefficients: Cache BDF2 coefficients for common step-size ratios.
- Reduced history depth: BDF2 currently maintains multi-step history; simpler methods (trapezoidal) would reduce overhead.
These optimizations would primarily benefit small-circuit CPU performance without affecting the large-circuit GPU path where VAJAX already outperforms VACASK.