Parallelism Architecture: c6288 Case Study

This document traces how VAJAX exploits parallelism at every stage of circuit simulation, using the c6288 16x16 combinational multiplier as a concrete example.

Circuit Overview

The c6288 circuit is a 16-bit Wallace-tree multiplier:

  • 256 AND gates (6 MOSFETs each: 3 PMOS + 3 NMOS)
  • 2,128 NOR gates (4 MOSFETs each: 2 PMOS + 2 NMOS)
  • 10,048 PSP103 MOSFETs total (single model, TYPE parameter distinguishes NMOS/PMOS)
  • 32 resistors (in input drivers)
  • 34 voltage sources (32 driver + VDD + VSS)
  • ~5,089 external circuit nodes

After node collapse (PSP103 has 8 internal nodes, of which 6 collapse), the system has:

  • 5,089 external nodes
  • 20,096 internal nodes (2 per MOSFET)
  • 34 voltage-source branch currents

for a total of ~25,219 unknowns in the augmented MNA system.
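The unknown count can be checked with a few lines of arithmetic:

```python
# Bookkeeping check for the augmented MNA system size quoted above.
external = 5_089           # external circuit nodes
internal = 2 * 10_048      # 2 surviving internal nodes per MOSFET
branches = 34              # one branch current per voltage source
n_unknowns = external + internal + branches
# n_unknowns == 25_219
```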

End-to-End Pipeline

The entire simulation -- all timesteps, all Newton-Raphson iterations, all device evaluations -- compiles into a single XLA program via jax.jit. After the one-time JIT warmup, zero Python interpreter overhead remains.
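The shape of this pattern can be sketched in a few lines. This is a toy stand-in (the state tuple and `step` body are hypothetical placeholders, not VAJAX's actual API), showing how `jax.jit` plus `lax.while_loop` keep the entire loop inside one compiled XLA program:

```python
import jax
import jax.numpy as jnp
from jax import lax

# Toy sketch: the whole time-stepping loop is traced once and compiled.
# State = (t, v); the body stands in for NR solve + timestep control.
@jax.jit
def run_transient(v0, t_stop, dt):
    def step(state):
        t, v = state
        v_new = v * (1.0 - dt)      # placeholder for an implicit solve
        return (t + dt, v_new)

    def not_done(state):
        t, _ = state
        return t < t_stop

    # lax.while_loop executes inside XLA: no Python per-step overhead.
    return lax.while_loop(not_done, step, (0.0, v0))

t_final, v_final = run_transient(jnp.ones(4), t_stop=1.0, dt=0.25)
```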

flowchart TB
    subgraph once ["One-time setup (Python)"]
        direction TB
        A[Parse .sim netlist] --> B[Compile PSP103.va via OpenVAF]
        B --> C[Group 10,048 MOSFETs by model type]
        C --> D["Split params: shared vs per-device"]
        D --> E[Pre-compute COO stamp indices]
        E --> F[Trial eval → discover sparsity pattern]
        F --> G[Pre-compute COO→CSR permutation]
    end

    subgraph jit ["JIT-compiled XLA program (GPU/CPU)"]
        direction TB
        H["Transient loop (lax.while_loop)"]
        H --> I[Compute integration coefficients]
        I --> J[Evaluate source waveforms]
        J --> K[Predict voltages from history]
        K --> L["NR loop (lax.while_loop)"]
        L --> M{Converged?}
        M -->|No| L
        M -->|Yes| N[LTE estimation]
        N --> O{Accept step?}
        O -->|No, halve dt| H
        O -->|Yes| P[Store solution, advance time]
        P --> H
    end

    once --> jit

Newton-Raphson Iteration Detail

Each NR iteration is the performance-critical inner loop. Here is what happens inside build_system_mna + linear_solve:

flowchart TB
    subgraph nr ["Single NR Iteration"]
        direction TB

        subgraph extract ["1. Voltage Extraction (vectorized gather)"]
            V["V: solution vector (25,219)"]
            VE["V[node1] - V[node2]<br/>10,048 × 13 terminal voltages"]
            V --> VE
        end

        subgraph eval ["2. Batched Device Evaluation (jax.vmap)"]
            direction TB
            E1["PSP103 batch: vmap over 10,048 devices<br/>shared_params (broadcast) + device_params (10,048 × ~20)"]
            E2["Resistor batch: vmap over 32 devices"]
            E1 --> R1["10,048 × residuals + 10,048 × Jacobian entries"]
            E2 --> R2["32 × residuals + 32 × Jacobian entries"]
        end

        subgraph stamp ["3. COO Stamping (pre-computed index arrays)"]
            direction TB
            S1["Map local → global indices via stamp_indices"]
            S2["mask_coo_vector: zero out ground-node entries"]
            S1 --> S2
        end

        subgraph asm ["4. Matrix Assembly (segment_sum)"]
            direction TB
            A1["~320K COO triplets (row, col, val)"]
            A2["segment_sum scatter-add → BCOO sparse matrix"]
            A1 --> A2
        end

        subgraph solve ["5. Sparse Linear Solve"]
            direction TB
            L1["COO→CSR via pre-computed permutation + segment_sum"]
            L2["Enforce NOI constraints (zero rows/cols, unit diagonal)"]
            L3["UMFPACK (CPU) or cuDSS (GPU) sparse LU solve"]
            L1 --> L2 --> L3
            L3 --> D1["δV: Newton update (25,219)"]
        end

        extract --> eval --> stamp --> asm --> solve
    end

Where Parallelism Happens

Each stage has a distinct parallelism mechanism:

flowchart LR
    subgraph stages ["Pipeline Stages"]
        direction TB
        S1["Voltage<br/>Extraction"]
        S2["Device<br/>Evaluation"]
        S3["COO<br/>Stamping"]
        S4["Matrix<br/>Assembly"]
        S5["Linear<br/>Solve"]
    end

    subgraph parallel ["Parallelism Mechanism"]
        direction TB
        P1["Vectorized gather<br/>V[indices] - V[indices]"]
        P2["jax.vmap<br/>10,048 parallel threads"]
        P3["Vectorized index<br/>mapping + masking"]
        P4["jax.ops.segment_sum<br/>parallel scatter-add"]
        P5["Sparse LU factorization<br/>(cuDSS on GPU)"]
    end

    subgraph scale ["Scale for c6288"]
        direction TB
        D1["10,048 × 13 lookups"]
        D2["10,048 PSP103 evals<br/>+ 32 resistor evals"]
        D3["~320K COO entries"]
        D4["320K → ~200K unique<br/>in 25K × 25K matrix"]
        D5["25,219 × 25,219<br/>sparse system"]
    end

    S1 --- P1 --- D1
    S2 --- P2 --- D2
    S3 --- P3 --- D3
    S4 --- P4 --- D4
    S5 --- P5 --- D5
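The stage-1 mechanism, a vectorized gather, can be illustrated directly (toy indices and shapes; the real index tables are precomputed during setup):

```python
import jax.numpy as jnp

# Toy example: extract branch voltages for a batch of devices with one
# vectorized gather instead of a per-device Python loop.
V = jnp.array([0.0, 1.8, 0.9, 1.2, 0.0])   # global solution vector
node_p = jnp.array([[1, 2], [3, 1]])       # per-device "+" terminal indices
node_n = jnp.array([[0, 0], [2, 4]])       # per-device "-" terminal indices

# Shape (n_devices, n_terminals): every lookup happens in parallel.
v_branch = V[node_p] - V[node_n]
```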

Parameter Splitting: Shared vs Per-Device

The key optimization for batched evaluation is separating parameters that are constant across all 10,048 MOSFETs from those that vary per device.

graph LR
    subgraph shared ["Shared (broadcast, 1D)"]
        SP["~800 model params<br/>(TOX, VFB0, NSUBO, ...)"]
        SC["~400 cache values<br/>(computed by init)"]
        SIM["simparams<br/>(analysis_type, gmin)"]
    end

    subgraph varying ["Per-device (batched, 2D)"]
        VP["device_params<br/>10,048 × ~20<br/>(voltages + W, L, TYPE)"]
        VC["device_cache<br/>10,048 × ~60<br/>(init results that vary)"]
        LS["limit_state<br/>10,048 × n_lim"]
    end

    subgraph vmap_call ["jax.vmap(eval_fn, in_axes=(None, 0, None, 0, None, 0))"]
        EVAL["PSP103 compact<br/>model equations"]
    end

    shared --> vmap_call
    varying --> vmap_call
    vmap_call --> OUT["10,048 × residuals<br/>10,048 × Jacobian entries"]

The in_axes=(None, 0, None, 0, None, 0) specification tells JAX:

  • None: broadcast this input to all 10,048 invocations (shared params, shared cache, simparams)
  • 0: slice along the first dimension, one row per device (device params, device cache, limit state)

This means the ~800 shared model parameters are loaded once into registers/cache, while only the ~20 varying parameters differ per thread.
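A minimal sketch of the None-vs-0 split, with a toy stand-in for the compact-model equations (the function body and parameter names are illustrative, not PSP103's):

```python
import jax
import jax.numpy as jnp

# Toy stand-in for a compact-model eval: shared params are broadcast to
# every device, per-device params are sliced along axis 0.
def eval_fn(shared, device):
    gain, offset = shared          # identical for all devices
    w, v = device                  # varies per device
    return gain * w * v + offset

shared = jnp.array([2.0, 0.5])             # loaded once, broadcast
device = jnp.array([[1.0, 0.1],
                    [2.0, 0.2],
                    [3.0, 0.3]])           # (n_devices, n_params)

# in_axes=(None, 0) mirrors the pattern described above.
batched = jax.vmap(eval_fn, in_axes=(None, 0))
out = batched(shared, device)              # one result per device
```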

COO Stamping and Assembly

Each device produces local residuals and Jacobian entries indexed by local node numbers (0..12 for PSP103). These must be mapped to global circuit indices.

flowchart TB
    subgraph local ["Per-device local output"]
        direction LR
        LR["residual[0..5]<br/>(6 node contributions)"]
        LJ["jacobian[0..31]<br/>(up to 32 dI/dV entries)"]
    end

    subgraph mapping ["Stamp index mapping"]
        direction LR
        RI["res_indices: (10,048 × 6)<br/>local node → global row"]
        JR["jac_row_indices: (10,048 × 32)"]
        JC["jac_col_indices: (10,048 × 32)"]
    end

    subgraph global ["Global COO arrays"]
        direction LR
        GR["f_resist: ~60K valid entries<br/>→ segment_sum → f[25,219]"]
        GJ["J entries: ~320K triplets<br/>→ segment_sum → J[25,219 × 25,219]"]
    end

    local --> mapping --> global

Ground-node entries are mapped to index -1 and masked to zero, so they don't pollute the system.
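The mask-then-scatter-add pattern can be sketched as follows (toy indices and values; the real stamp tables are precomputed at setup):

```python
import jax.numpy as jnp
from jax.ops import segment_sum

# Device-local residual contributions and their global row indices;
# ground-node entries carry index -1 and must not reach the system.
rows = jnp.array([0, 2, -1, 2, 1])
vals = jnp.array([1.0, 3.0, 9.0, 4.0, 2.0])

# Mask ground entries to zero and remap -1 to a valid dummy row (0),
# so the masked value contributes nothing.
mask = rows >= 0
safe_rows = jnp.where(mask, rows, 0)
safe_vals = jnp.where(mask, vals, 0.0)

# Parallel scatter-add: contributions to the same row are summed
# (row 2 receives 3.0 + 4.0).
f = segment_sum(safe_vals, safe_rows, num_segments=3)
```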

Sparse Solver Path

The c6288 system is too large for dense linear algebra (25K × 25K × 8 bytes = ~5GB). The sparse path avoids materializing the full matrix:

flowchart TB
    subgraph coo ["BCOO from assembly"]
        C1["~320K COO triplets<br/>(row, col, val)"]
    end

    subgraph convert ["COO → CSR (pre-computed)"]
        C2["sort by (row, col) via permutation"]
        C3["segment_sum to merge duplicates"]
        C4["CSR: ~200K stored elements"]
    end

    subgraph noi ["NOI Constraint Enforcement"]
        N1["Zero out NOI rows/cols in CSR data"]
        N2["Set NOI diagonal to 1.0"]
        N3["Zero NOI entries in RHS"]
    end

    subgraph solve ["Backend-specific solve"]
        direction LR
        GPU["GPU: cuDSS/Spineax<br/>Cached symbolic factorization<br/>Only numerical refactor per NR iter"]
        CPU["CPU: UMFPACK via FFI<br/>Direct sparse LU"]
    end

    coo --> convert --> noi --> solve

The COO→CSR conversion is itself parallel: the permutation and segment_sum are pre-computed during setup, so at runtime it's just a gather + scatter-add.
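The split between one-time setup and the runtime hot path can be sketched like this (toy triplets; variable names are illustrative, not VAJAX's):

```python
import numpy as np
import jax.numpy as jnp
from jax.ops import segment_sum

# Toy COO triplets with a duplicate entry at (row=1, col=1).
rows = np.array([1, 0, 1, 1])
cols = np.array([1, 0, 0, 1])
vals = jnp.array([5.0, 1.0, 2.0, 3.0])

# Setup time (NumPy, runs once): sort by (row, col), then give each
# sorted entry the id of its unique (row, col) pair.
perm = np.lexsort((cols, rows))              # row-major, col-minor sort
sr, sc = rows[perm], cols[perm]
new_group = np.concatenate([[1], (np.diff(sr) != 0) | (np.diff(sc) != 0)])
dedup_ids = np.cumsum(new_group) - 1         # 0 .. n_unique-1
n_unique = int(dedup_ids[-1]) + 1

# Runtime (JAX, every NR iteration): a gather + segment_sum merges
# duplicates into CSR-ordered data. No sorting on the hot path.
csr_data = segment_sum(vals[perm], jnp.asarray(dedup_ids),
                       num_segments=n_unique)
```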

Transient Time-Stepping Loop

The outer loop is also JIT-compiled via lax.while_loop:

stateDiagram-v2
    [*] --> ComputeCoeffs: t < t_stop

    ComputeCoeffs: Compute integration coefficients
    ComputeCoeffs --> EvalSources: c0, c1 from BDF/Trap

    EvalSources: Evaluate source waveforms
    EvalSources --> Predict: vsource_vals at t+dt

    Predict: Extrapolate V from history
    Predict --> NRSolve: V_pred as initial guess

    NRSolve: Newton-Raphson solve
    NRSolve --> CheckNR

    CheckNR: NR converged?
    CheckNR --> LTE: Yes
    CheckNR --> Reject: No (halve dt)

    LTE: Estimate local truncation error
    LTE --> CheckLTE

    CheckLTE: LTE acceptable?
    CheckLTE --> Accept: Yes
    CheckLTE --> Reject: No (reduce dt)

    Reject: Reduce timestep
    Reject --> ComputeCoeffs

    Accept: Store solution
    Accept --> AdvanceTime

    AdvanceTime: Update history, advance t
    AdvanceTime --> ComputeCoeffs: t < t_stop
    AdvanceTime --> [*]: t >= t_stop
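The accept/reject branch above also stays inside XLA, typically via lax.cond. A hedged sketch with a hypothetical LTE test (not VAJAX's actual timestep controller):

```python
import jax
from jax import lax

# Hypothetical controller: on rejection stay at t and halve dt; on
# acceptance advance t and keep dt. Both branches are traced once and
# lax.cond selects one at runtime, keeping the loop inside XLA.
@jax.jit
def control_step(t, dt, lte, lte_tol):
    def accept(_):
        return t + dt, dt          # advance time, keep dt
    def reject(_):
        return t, dt * 0.5         # stay at t, halve dt
    return lax.cond(lte <= lte_tol, accept, reject, None)

t_new, dt_new = control_step(1.0, 0.1, lte=2.0, lte_tol=1.0)  # rejected
```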

Performance Profile

For c6288 on CPU (CI benchmark results):

| Metric | Value |
|---|---|
| Timesteps | ~1,000 |
| NR iterations/step | 5-20 |
| Device evals/NR iter | 10,048 PSP103 + 32 resistors |
| Per-step time (VAJAX) | 90 ms |
| Per-step time (VACASK) | 80 ms |
| Total wall time (VAJAX) | 65.7 s (includes JIT) |
| Total wall time (VACASK) | 80.2 s |
| Speedup (total) | 1.22x faster (JIT amortized) |

The per-step overhead (~10ms) comes from adaptive timestep machinery, jnp.where branching, and COO assembly. This overhead is fixed regardless of circuit size, which is why c6288 (~90ms/step) is competitive while small circuits (rc: 0.014ms VAJAX vs 0.002ms VACASK) show higher ratios.

On GPU, the vmap'd device evaluation and cuDSS sparse solve provide additional speedup for large circuits, as the 10,048 parallel PSP103 evaluations map directly to GPU threads.

Scale Comparison: Benchmarks

| Metric | c6288 (16x16) | mul64 (64x64) | Ratio |
|---|---|---|---|
| Architecture | Wallace-tree | Array multiplier | |
| Partial product ANDs | 256 | 4,096 | 16x |
| Total MOSFETs | ~10,048 | ~266,408 | ~27x |
| Estimated unknowns | ~25,219 | ~400K+ | ~16x |
| Adder cells | ~2,160 | ~3,969 | ~2x |

The mul64 benchmark serves as a GPU stress test — at ~266K MOSFETs and ~400K+ unknowns, it exercises the sparse solver and vmap'd device evaluation at a scale where GPU parallelism is essential.