GPU Solver Architecture¶
Historical document
This document describes an early GPU solver architecture. The codebase has
since been restructured: DC analysis is in analysis/dc.py, transient
analysis is in analysis/transient/, and Jacobians are now computed
analytically via OpenVAF. The design principles and trade-offs described
here remain relevant.
This document describes the GPU-native solver architecture for VAJAX, including DC operating point and transient analysis implementations.
Overview¶
VAJAX provides two complementary GPU-native solvers:
- DC Solver (
analysis/dc.py) - Computes steady-state operating point - Transient Solver (
analysis/transient/) - Time-domain simulation with adaptive BDF2
Jacobians are computed analytically via OpenVAF-compiled Verilog-A models.
Key Design Decisions¶
Why Transient Instead of DC for Digital Circuits?¶
For digital MOSFET circuits (like the C6288 multiplier), transient analysis with
icmode='uic' is preferred over DC operating point analysis. This matches the
approach used by VACASK and other production simulators.
The problem with DC analysis: - Digital circuits have many possible stable states (high/low combinations) - Newton-Raphson can converge to wrong states or oscillate between them - Floating gate nodes and feedback paths make convergence difficult - Source stepping and GMIN stepping help but don't always work
Why transient works better: - Natural settling behavior from initial conditions - No ambiguity about which state to converge to - The circuit "finds" its own solution through time evolution - Capacitances (intrinsic or explicit) provide natural damping
VACASK Example:
// From c6288.sim - VACASK uses transient, not DC
analysis tranmul tran stop=2n step=2p icmode="uic"
// analysis op1 op <- DC is commented out
Initial Condition Modes (icmode)¶
The transient solver supports two initial condition modes:
icmode='op'(default): Compute DC operating point first, then run transient- Good for analog circuits where DC solution is well-defined
-
Can fail for digital circuits with multiple stable states
-
icmode='uic': Use Initial Conditions directly (zeros + supply voltages) - Skip DC computation entirely
- Let circuit settle through transient simulation
- Preferred for digital logic circuits
- Matches VACASK's behavior for benchmarks
Solver Architecture¶
Data Flow¶
MNASystem (devices)
│
▼
build_device_groups() ◄── Groups devices by type (MOSFET, R, V, C)
│
▼
VectorizedDeviceGroup ◄── Pre-computed JAX arrays per device type
│
├──────────────────┐
▼ ▼
build_gpu_residual_fn() build_transient_circuit_data_fast()
│ │
▼ ▼
residual_fn() TransientCircuitData
│ │
▼ ▼
sparsejac.jacrev() build_transient_residual_fn()
│ │
▼ ▼
Sparse Jacobian residual_fn(V_curr, V_prev, dt)
(BCOO format) │
│ ▼
▼ sparsejac.jacrev()
Newton-Raphson │
Iteration ▼
│ Newton-Raphson per timestep
▼ │
Solution ▼
Time-domain solution
Key Components¶
1. VectorizedDeviceGroup¶
Groups devices of the same type with pre-computed JAX arrays:
class VectorizedDeviceGroup:
device_type: DeviceType # MOSFET, RESISTOR, VSOURCE, etc.
device_names: List[str]
node_indices: Array # Shape: (n_devices, n_terminals)
params: Dict[str, Array] # Device parameters as batched arrays
This enables vectorized device evaluation instead of per-device Python loops.
2. GPU Residual Function¶
The residual function computes KCL violations at each node:
def gpu_residual_fn(V: Array) -> Array:
"""Compute f(V) where f=0 at solution.
Returns:
residual: Array of shape (n_nodes - 1,) - current imbalance at each node
"""
# For each device type, compute currents vectorized:
# - Voltage sources: I = G_big * (V_actual - V_target)
# - MOSFETs: Ids = f(Vgs, Vds, W, L, model_params)
# - Resistors: I = G * (Vp - Vn)
# - Capacitors: I = C/dt * (V - V_prev) [transient only]
# Sum contributions at each node
residual = sum_at_nodes(device_currents)
# Add GMIN to ground (numerical stability)
residual += gmin * V
return residual
3. Sparse Jacobian via sparsejac¶
Instead of manually computing Jacobian entries, we use automatic differentiation:
# Create sparsity pattern (which entries are non-zero)
sparsity = jsparse.BCOO((data, indices), shape=(n, n))
# Create sparse Jacobian function via graph coloring
jacobian_fn = sparsejac.jacrev(residual_fn, sparsity=sparsity)
# Evaluate at a point
J_sparse = jacobian_fn(V) # Returns BCOO sparse matrix
Benefits: - No need for analytical derivatives - Correct derivatives even for complex models - Efficient evaluation via graph coloring
Key insight: The sparsity pattern comes from circuit topology - a device only contributes to Jacobian entries connecting its terminal nodes.
4. Newton-Raphson Iteration¶
Both solvers use Newton-Raphson with voltage limiting:
for iteration in range(max_iterations):
f = residual_fn(V)
if max(abs(f)) < abstol:
break # Converged
J = jacobian_fn(V)
delta_V = solve(J, -f) # Linear solve
# Voltage limiting (prevent overshooting)
max_step = 0.5 * vdd
if max(abs(delta_V)) > max_step:
delta_V *= max_step / max(abs(delta_V))
V = V + delta_V
V = clip(V, -2*vdd, 2*vdd) # Safety clamp
MOSFET Models¶
Transient Solver: Level-1 Model¶
The transient solver uses a simplified level-1 MOSFET model:
beta = kp * W / L
Vgst = Vgs - Vth0
# Smooth subthreshold (avoids discontinuity at Vth)
Vgst_eff = log1p(exp(alpha * Vgst)) / alpha
# Saturation with soft clipping
Vdsat = max(Vgst_eff, epsilon)
Vds_eff = Vdsat * tanh(abs(Vds) / Vdsat)
# Drain current
Ids = beta * Vgst_eff * Vds_eff * (1 + lambda * abs(Vds))
DC Solver: BSIM-like Model¶
The DC solver uses a more sophisticated BSIM-like model from mosfet_simple.py:
- Body effect (gamma, phiB)
- Velocity saturation (vsat)
- Subthreshold conduction (n_sub)
- Short channel effects (theta, a0)
- Temperature dependence
Performance Characteristics¶
JIT Compilation¶
Both solvers benefit from JAX's JIT compilation:
Tip: For benchmarking, always exclude the first iteration or use pre-warming.
CPU vs GPU¶
| Circuit Size | CPU Time | GPU Time | Speedup |
|---|---|---|---|
| Small (< 100 nodes) | Faster | Slower | 0.5x |
| Medium (1K nodes) | Similar | Similar | 1x |
| Large (5K+ nodes) | Slower | Faster | 3-10x |
GPU acceleration becomes beneficial at ~1000+ nodes due to data transfer overhead.
Memory Usage¶
Sparse Jacobian memory scales with O(devices * terminals²) instead of O(nodes²):
C6288 (5123 nodes, 10112 MOSFETs):
- Dense Jacobian: ~200 MB
- Sparse Jacobian: ~1.3 MB
- Savings: 150x
Usage Examples¶
DC Operating Point¶
from vajax import CircuitEngine
engine = CircuitEngine("circuit.sim")
engine.parse()
# DC operating point is computed automatically before transient
# Or run DC explicitly:
# engine.run_dcinc()
Transient Analysis¶
from vajax import CircuitEngine
engine = CircuitEngine("circuit.sim")
engine.parse()
# For digital circuits - use icmode='uic' to skip DC
engine.prepare(t_stop=2e-9, dt=2e-12)
result = engine.run_transient()
# For analog circuits with sparse solver
engine.prepare(t_stop=1e-6, dt=1e-9, use_sparse=True)
result = engine.run_transient()
Current State¶
Since this document was written, major improvements have been made:
- Adaptive timestepping: BDF2 with LTE control is implemented
- Unified device models: All devices use OpenVAF-compiled Verilog-A (PSP103, etc.)
- GPU-resident time loops: Transient uses
lax.scanfor full GPU execution - Sparse solver: BCOO/BCSR with
spsolvefor large circuits - AC and noise analysis: Frequency-domain analyses are available
References¶
- VACASK simulator behavior for digital circuit analysis
- sparsejac: https://github.com/mfschubert/sparsejac
- JAX documentation: https://jax.readthedocs.io
- Newton-Raphson for circuit simulation (Nagel, 1975)