
Performance Analysis

This document analyzes VAJAX's performance characteristics, explaining its overhead profile relative to VACASK (the C++ reference simulator) and the crossover point where GPU acceleration pays off.

Benchmark Results

All measurements are from GitHub Actions CI runners (CPU: ubuntu-latest; GPU: nvidia-runner-1). VACASK numbers on CPU come from live execution; GPU comparisons use reference values.

CPU: VAJAX vs VACASK

Benchmark   Nodes    Steps   JAX (ms/step)   VACASK (ms/step)   Ratio
rc          4        1M      0.023           0.002              12.2x
graetz      6        1M      0.033           0.004              8.6x
mul         8        500k    0.040           0.004              10.4x
ring        47       20k     0.516           0.108              4.8x
c6288       ~5000    20      163.684         288.800            0.57x
mul64       ~133k    15      8324.949        timeout            -

c6288: on CPU, VAJAX with the UMFPACK sparse solver is 1.8x faster than VACASK. The crossover where VAJAX overtakes VACASK occurs around 5000 nodes.

mul64 (64x64 array multiplier): ~266k MOSFETs, ~666k unknowns (133k nodes + 533k internal). VACASK times out on CI (>5 min per step); VAJAX completes at ~8.3 s/step using the UMFPACK sparse solver. On GPU with float32 cuDSS factorization and iterative refinement, mul64 runs at ~648 ms/step on a Tesla T4 (16GB), a 12.8x GPU speedup.

GPU: VAJAX Acceleration

Benchmark   Nodes    GPU (ms/step)   CPU (ms/step)   GPU Speedup   vs VACASK CPU
mul64       ~133k    648.00          8324.95         12.8x         N/A (timeout)
c6288       ~5000    19.81           163.68          8.3x          0.07x (faster)
ring        47       1.49            0.52            0.3x          14x (slower)
graetz      6        0.30            0.03            0.1x          75x (slower)
rc          4        0.24            0.02            0.08x         120x (slower)

(GPU Speedup = CPU time / GPU time. "vs VACASK CPU" = GPU time / VACASK CPU time, so values below 1x mean the GPU run beats VACASK.)

mul64 uses float32 cuDSS factorization with iterative refinement, reducing VRAM usage from >16GB (f64 factorization OOMs) to ~10GB on a Tesla T4 (16GB). The f32 factorization provides near-f64 accuracy via a single refinement step: solve J@x = -f in f32, compute the residual r = J@x + f in f64 (SpMV only), solve the correction J@d = -r in f32, and return x + d.
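
A minimal sketch of this single refinement step, assuming a generic f32 solve handle and an f64 sparse matvec (the names are illustrative, not the actual VAJAX API):

import jax.numpy as jnp

def solve_with_refinement(solve_f32, matvec_f64, f):
    # solve_f32:  rhs -> x using the f32-factorized Jacobian (e.g. cuDSS)
    # matvec_f64: x -> J @ x computed in float64 (sparse SpMV only)
    # f:          float64 Newton residual; the system solved is J x = -f
    x = solve_f32((-f).astype(jnp.float32)).astype(jnp.float64)
    r = matvec_f64(x) + f                                  # f64 residual
    d = solve_f32((-r).astype(jnp.float32)).astype(jnp.float64)
    return x + d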

GPU results for circuits below ~500 nodes reflect GPU kernel overhead on tiny workloads, not simulation inefficiency. The auto-threshold (gpu_threshold=500) prevents this in normal usage; the benchmark uses --force-gpu to measure all circuits for tracking purposes.

Accuracy (vs VACASK)

Benchmark   Dense RMS   Sparse RMS   Threshold
rc          0.00%       0.00%        5%
graetz      0.00%       0.00%        15%
mul         0.00%       0.00%        2%
c6288       -           2.01%        10%

Per-Step Overhead Analysis

The overhead ratio decreases as circuit size increases, confirming a fixed per-step overhead in the JAX path that VACASK doesn't have:

Overhead ratio vs circuit size (CPU):

    rc (4 nodes)       ████████████████████████████████████  12.2x
    mul (8 nodes)      ██████████████████████████████████    10.4x
    graetz (6 nodes)   ██████████████████████████            8.6x
    ring (47 nodes)    ███████████████                       4.8x
    c6288 (5k nodes)   ██                                    0.57x (VAJAX faster)
    mul64 (133k nodes) █                                     N/A (VACASK timeout)

Overhead Breakdown

Each timestep in full_mna.py body_fn executes ~11 major operations beyond the core Newton-Raphson solve:

Operation               Estimated Cost     VACASK Equivalent   Notes
LTE estimation          2-3 us             Similar             Per-node tolerance checking, error coefficients
Voltage prediction      1-2 us             Simpler predictor   Lagrange polynomial from multi-point history
BDF2 coefficients       0.5-1 us           Pre-computed        Variable-step formula recomputed every step
History management      1-2 us             In-place mutation   jnp.roll on V_history, dt_history arrays
jnp.where branching     2-3 us             Runtime if          Evaluates both branches unconditionally
Vmap device eval        1-2 us/NR iter     Sequential loop     Overhead dominates for <10 devices
COO assembly            1-2 us/NR iter     Direct stamping     Concatenate + sum duplicates
Simparams .at[].set()   0.5-1 us/NR iter   Struct writes       Creates intermediate arrays
Total fixed overhead    ~10-15 us/step     ~0 us
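
The history-management row, for instance, corresponds to functional shifts along these lines (a sketch with toy shapes, not the exact VAJAX code):

import jax.numpy as jnp

V_history = jnp.zeros((3, 4))    # 3 stored steps x 4 nodes (toy shapes)
dt_history = jnp.zeros(3)
V_new, dt = jnp.ones(4), 1e-9

# Shift the stored solutions by one slot and write the newest value;
# jnp.roll allocates a new array where VACASK would mutate a ring buffer.
V_history = jnp.roll(V_history, shift=1, axis=0).at[0].set(V_new)
dt_history = jnp.roll(dt_history, shift=1).at[0].set(dt)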

Scaling Model

The per-step cost can be modeled as:

T_jax(n) = T_fixed + T_compute(n)
T_vacask(n) = T_compute_vacask(n)

where:
  T_fixed ~ 10-15 us (JAX overhead, independent of circuit size)
  T_compute(n) ~ T_compute_vacask(n) for large n (same algorithm)

For large circuits (c6288, mul64), VAJAX's UMFPACK sparse solver is actually faster than VACASK — the fixed overhead is negligible and the solver implementation is competitive:

Circuit   T_fixed   T_compute    Overhead %   Overall Ratio
rc        10 us     2 us         83%          12.2x
graetz    10 us     4 us         71%          8.6x
ring      10 us     100 us       9%           4.8x
c6288     10 us     163,000 us   <0.01%       0.57x

Why JAX Has Per-Step Overhead

1. Functional Array Updates

JAX arrays are immutable. Inside lax.while_loop, conditional updates use jnp.where, which evaluates both branches and selects elementwise:

# JAX: evaluates both new_val and old_val, then selects elementwise
state = jnp.where(accept_step, new_val, old_val)

// VACASK (C++): skips the untaken branch entirely
if (accept_step) state = new_val;

The body_fn has ~15 such conditional updates per step for history management, step acceptance, and NR failure handling.

2. Vmap Batching for Small Counts

Device evaluation uses jax.vmap to batch all instances of each device type. This is essential for GPU parallelism on large circuits but adds overhead for small batch sizes:

# 2 resistors: vmap adds vectorization overhead > sequential evaluation
vmapped_eval = jax.vmap(device_eval)(batched_params)

For rc (2 resistors), the vmap overhead exceeds the actual computation. For c6288 (~86,000 transistors), vmap amortizes perfectly.
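
A toy illustration of the trade-off, with an invented linear-resistor model standing in for the real device interface:

import jax
import jax.numpy as jnp

def resistor_eval(g, v):
    # Linear resistor: branch current and conductance stamp
    return g * v, g

# A batch of 2 resistors: vmap's vectorization machinery costs more than the
# two multiplies it performs. At ~86k devices the same call saturates the
# hardware and the per-call overhead amortizes away.
g = jnp.array([1e-3, 5e-4])
v = jnp.array([1.0, 0.5])
currents, stamps = jax.vmap(resistor_eval)(g, v)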

3. COO Matrix Assembly

VAJAX builds the Jacobian in COO (coordinate) format:

  1. Each device type produces (row, col, value) triplets.
  2. Triplets are concatenated across device types.
  3. Duplicate indices are summed to form the final matrix.

VACASK stamps directly into a pre-allocated matrix with known positions. The COO approach enables JAX tracing and GPU parallelism but adds indirection.
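
A self-contained sketch of the three steps on a toy 3x3 system, using jax.experimental.sparse rather than VAJAX's actual assembly code:

import jax.numpy as jnp
from jax.experimental import sparse

# Step 1: each device type produces (row, col, value) triplets.
res_rows, res_cols, res_vals = jnp.array([0, 1]), jnp.array([0, 1]), jnp.array([1e-3, 2e-3])
mos_rows, mos_cols, mos_vals = jnp.array([1, 2]), jnp.array([1, 2]), jnp.array([5e-4, 5e-4])

# Step 2: concatenate triplets across device types.
rows = jnp.concatenate([res_rows, mos_rows])
cols = jnp.concatenate([res_cols, mos_cols])
vals = jnp.concatenate([res_vals, mos_vals])

# Step 3: duplicate indices are summed into the final matrix;
# entry (1, 1) becomes 2e-3 + 5e-4.
J = sparse.BCOO((vals, jnp.stack([rows, cols], axis=1)), shape=(3, 3))
J = J.sum_duplicates()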

4. Integration Coefficient Recomputation

Variable-step BDF2 requires recomputing integration coefficients each step based on the step-size ratio. VACASK may use lookup tables or simplified formulas for common step ratios.
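
For reference, the variable-step derivative weights follow from differentiating the quadratic interpolant through the last three solution points. The sketch below assumes one common sign convention, not necessarily VAJAX's:

def bdf2_coeffs(h0, h1):
    # Weights a0, a1, a2 with x'(t_n) ~= a0*x_n + a1*x_{n-1} + a2*x_{n-2},
    # where h0 = t_n - t_{n-1} and h1 = t_{n-1} - t_{n-2}.
    a0 = (2.0 * h0 + h1) / (h0 * (h0 + h1))
    a1 = -(h0 + h1) / (h0 * h1)
    a2 = h0 / (h1 * (h0 + h1))
    return a0, a1, a2

# Equal steps recover the fixed-step BDF2 weights (3/2, -2, 1/2) / h:
print(bdf2_coeffs(1e-9, 1e-9))  # (1.5e9, -2.0e9, 0.5e9)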


GPU Acceleration: Why Large Circuits Win

GPU acceleration becomes beneficial when the per-step compute time exceeds the kernel-launch and memory-transfer overhead (~100-500 us per step). That crossover occurs at roughly 500 nodes.

mul64 (133k nodes): GPU Stress Test

  • ~266k PSP103 MOSFET evaluations per NR iteration
  • Sparse Jacobian: 666,409 unknowns (133k nodes + 533k internal + 130 currents)
  • 3,259,918 CSR entries (from 30.5M COO triplets)
  • Dense solve impossible (~3.3TB memory); sparse solver required
  • Float32 factorization: cuDSS factorizes J in f32 (halves VRAM), iterative refinement recovers near-f64 accuracy. Reduces peak VRAM from >16GB to ~10GB.
  • Tesla T4 (16GB): 648 ms/step — 12.8x faster than CPU (8325 ms/step)
  • CPU uses UMFPACK sparse solver at ~8.3s/step; VACASK times out (>5 min/step)

c6288 (5000 nodes): GPU Advantage

  • Jacobian: ~5000x5000 sparse matrix (~86k transistors, ~5k nodes)
  • Uses cuDSS sparse solver on GPU, UMFPACK on CPU
  • GPU: 19.81 ms/step, CPU: 163.68 ms/step — 8.3x GPU speedup
  • vs VACASK CPU (288.8 ms/step): 14.6x faster on GPU

rc (4 nodes): GPU Disadvantage

  • Jacobian: 4x4 matrix operations
  • Dense solve: 64 multiply-adds per NR iteration
  • GPU overhead: kernel launch > actual compute
  • Result: ~12x slower than CPU (expected and correct)

The gpu_threshold parameter (default: 500 nodes) automatically routes small circuits to CPU, avoiding this overhead in normal usage.
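
A hedged sketch of that routing decision (gpu_threshold comes from the text above; the function itself is hypothetical):

import jax

def pick_device(n_nodes, gpu_threshold=500):
    # Route small circuits to CPU; use the GPU only when the circuit is large
    # enough to amortize kernel-launch and transfer overhead.
    try:
        gpus = jax.devices("gpu")
    except RuntimeError:
        gpus = []
    return gpus[0] if n_nodes >= gpu_threshold and gpus else jax.devices("cpu")[0]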


Potential Optimizations

These are known areas where the per-step overhead could be reduced:

  1. Fused operations: Combine BDF2 coefficient computation with history updates into a single kernel to reduce launch overhead.

  2. Direct stamping: For CPU path, stamp directly into pre-allocated Jacobian instead of building COO triplets.

  3. Sequential device eval on CPU: Use a simple loop instead of vmap when running on CPU with few device instances.

  4. Pre-computed coefficients: Cache BDF2 coefficients for common step-size ratios.

  5. Reduced history depth: Currently maintains multi-step history for BDF2; simpler methods (trapezoidal) would reduce overhead.

These optimizations would primarily benefit small-circuit CPU performance without affecting the large-circuit GPU path where VAJAX already outperforms VACASK.