AI Chip Architectures

At the 2018 International Symposium on Computer Architecture, John Hennessy and David Patterson delivered their Turing Lecture: "A New Golden Age for Computer Architecture".

In the 1980s, when Hennessy and Patterson did their Turing Award-winning research, single-threaded CPU performance grew 52% a year. By 2018, with the end of Moore's Law and Dennard Scaling, the rate was 3%.

Their answer was domain-specific architectures (DSAs): hardware tailored to one class of workload rather than to general-purpose code. Their worked example was Google's TPU v1, already in production: 29× the throughput of a CPU on neural-network inference, at 80× better energy efficiency. The closing prediction: "the next decade will see a Cambrian explosion of novel computer architectures."

This prediction came true. Today there are dozens of architectures in serious development: GPUs, TPUs, LPUs, NPUs, DPUs, ASICs, wafer-scale engines, reconfigurable dataflow, neuromorphic, photonic, and analog. Most of them target compute for AI.

This post aims to survey these varying approaches - their philosophy, architecture, scaling methods (scale-up and scale-out), and software stack (how you program the chip).

For now, I'm starting with NVIDIA GPUs, Google TPUs, and AMD GPUs.

The Problem

AI compute is dominated by matrix multiplication. A transformer is a sequence of matmuls (Q/K/V projection, attention, output projection, FFN) interleaved with element-wise ops: normalisation, activation, residual adds. Training a frontier model takes on the order of $10^{25}$ multiply-accumulate operations (a matmul is a sequence of multiply-accumulates).

The shape of those matmuls depends on the workload. Training pushes a batch of sequences forward through every layer, backpropagates the loss, and updates the weights, with thousands of tokens flowing through the same weight matrix at once. Prefill is the prompt-ingestion phase of inference: the full input sequence projected through the model in a single pass, before the first output token has been produced. Both training & prefill stack many tokens against the same weight matrix, so each layer's math is a large matrix-matrix multiply (GEMM), with high arithmetic intensity (compute-bound). Decode is autoregressive: the model emits one token at a time, each conditioned on every token before it, and token N+1 cannot begin until token N has been produced. Only one token gets projected per step, so every matmul becomes a matrix-vector product (GEMV). Producing one token requires a full pass over every weight in the model, plus a full read of the KV Cache for attention. Arithmetic intensity drops by orders of magnitude versus prefill.
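To put rough numbers on the GEMM versus GEMV gap, here is a minimal sketch of arithmetic intensity (FLOPs per byte moved). It assumes each operand and the output cross the memory interface exactly once, which ignores caching and tiling effects. The function name `matmul_intensity`, the 8192 × 8192 layer size, and the 4096-token prefill are illustrative assumptions, not figures from any specific model.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumption: A, B and the output each cross the memory interface once.
    """
    flops = 2 * m * k * n  # one multiply and one add per multiply-accumulate
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical layer: an 8192 x 8192 weight matrix stored in 16-bit precision.
K = N = 8192
prefill = matmul_intensity(4096, K, N)  # 4096 tokens share one weight read (GEMM)
decode = matmul_intensity(1, K, N)      # one token per step (GEMV)

print(f"prefill: {prefill:.0f} FLOPs/byte")
print(f"decode:  {decode:.2f} FLOPs/byte")
print(f"ratio:   {prefill / decode:.0f}x")
```

Under these assumptions the decode step does about one FLOP per byte of weights read, while prefill does thousands, which is the "orders of magnitude" drop described above.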

Inference systems recover some of that intensity by batching tokens to promote those GEMVs back to GEMMs: continuous batching stacks many users' decode steps, speculative decoding stacks K drafted tokens per request and verifies them in one pass, and multi-token prediction folds the same trick inside the model itself. This achieves higher utilisation of the matmul units, and pushes up the Ops/B. For continuous batching, each user's request still reads its own KV Cache, so long-context decode shifts from weight-bandwidth-bound to KV-bandwidth-bound.
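The shift from weight-bound to KV-bound decode can be sketched with a simple byte count. This is a rough model under stated assumptions: weights are read once per step and shared by the whole batch, each request reads its own KV Cache, and attention FLOPs are ignored. The model shape (70e9 parameters, 80 layers, 8 KV heads, head dimension 128, 16-bit values) is hypothetical and chosen only to make the arithmetic concrete.

```python
def kv_bytes_per_request(context_len, layers=80, kv_heads=8, head_dim=128,
                         bytes_per_elem=2):
    # K and V (the factor 2) for every layer, KV head and cached token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_len


def decode_step(batch, context_len, params=70e9, bytes_per_weight=2):
    """Return (FLOPs per byte, fraction of bytes that are KV Cache reads)."""
    weight_bytes = params * bytes_per_weight                # read once, shared by the batch
    kv_bytes = batch * kv_bytes_per_request(context_len)    # read once per request
    flops = 2 * params * batch                              # weight matmuls only
    total_bytes = weight_bytes + kv_bytes
    return flops / total_bytes, kv_bytes / total_bytes


for batch, context in [(1, 4096), (64, 4096), (64, 32768)]:
    intensity, kv_share = decode_step(batch, context)
    print(f"batch={batch:3d} context={context:6d} "
          f"intensity={intensity:6.1f} FLOPs/B  KV share={kv_share:.0%}")
```

In this toy model, batching raises intensity as long as the weights dominate the traffic; once context grows, the per-request KV reads take over the byte budget and intensity falls again.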

The architecture problem is moving the numbers to where the matmuls happen, fast enough. This is known as the memory wall: compute has scaled exponentially, memory bandwidth has not.
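One common way to reason about the memory wall is a roofline-style bound: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. The sketch below uses round, hypothetical chip numbers (1000 TFLOP/s peak, 4 TB/s memory bandwidth); they are assumptions for illustration, not the specification of any product discussed here.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: compute-limited or bandwidth-limited, whichever is lower."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical chip (assumed round numbers, not a real part).
PEAK = 1000e12   # FLOP/s
BW = 4e12        # bytes/s

ridge = PEAK / BW  # intensity above which the chip becomes compute-bound
print(f"ridge point: {ridge:.0f} FLOPs/byte")

for label, intensity in [("GEMV-like", 1.0), ("GEMM-like", 2048.0)]:
    achieved = attainable_flops(intensity, PEAK, BW)
    print(f"{label}: {achieved / PEAK:.1%} of peak")
```

A kernel whose intensity sits below the ridge point cannot use the matmul units fully no matter how many there are; the only fixes are more bandwidth or more reuse per byte.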

Each architecture proposes a different strategy for winning the data-movement game. Understanding a chip reduces to four questions: where does data live, how does it move to the compute units, what do the compute units look like, and how do chips talk to each other at scale.

NVIDIA GPU

The NVIDIA GPU is a massively parallel processor. The philosophy is that a programmable chip with thousands of threads, orchestrated by a host CPU and exposed through CUDA, is the right machine to run parallelisable workloads. Each generation adds acceleration primitives onto programmable Streaming Multiprocessors without changing the programming model. The same chip trains transformers, serves inference, renders graphics, and runs scientific simulation (accelerated computing).

Genealogy

2006: Tesla (G80). The first CUDA-capable GPU; unified shaders and the SIMT execution model.
2010: Fermi (GF100). First true compute architecture: unified L1/L2 caches, dual warp schedulers, IEEE-754 FP64.
2012: Kepler (K20, K40). SMX, dynamic parallelism, Hyper-Q; the GPU can launch its own work.
2014: Maxwell (M40). Redesigned SM with ~2× perf-per-watt over Kepler.
2016: Pascal (P100). NVLink 1.0, HBM2, native FP16 throughput; the first GPU designed explicitly for deep learning.
2017: Volta (V100). First Tensor Cores; independent thread scheduling.
2018: Turing (T4). 2nd-gen Tensor Cores with INT8/INT4; first RT Cores.
2020: Ampere (A100). 3rd-gen Tensor Cores with TF32 and structured sparsity; Multi-Instance GPU partitioning.
2022: Hopper (H100, H200, GH200). 4th-gen Tensor Cores, FP8, Transformer Engine; HBM3, TMA, thread block clusters, async wgmma.
2024: Blackwell (B100, B200, GB200). 5th-gen Tensor Cores with FP4, Tensor Memory (TMEM), two-die chiplet GPU, NVLink 5.
2025: Blackwell Ultra (B300, GB300). Mid-cycle refresh: ~1.5× FP4 throughput, 288 GB HBM3e. Tuned for long-context reasoning.
2026: Rubin (Rubin, VR200, Rubin CPX). HBM4, 3rd-gen Transformer Engine, Vera CPU pairing, disaggregated prefill via Rubin CPX.
2027: Rubin Ultra (Rubin Ultra). 4-die GPU package, 1 TB HBM4e per package. Deployed in 600 kW NVL576 Kyber racks at 100 PetaFLOPS FP4 per GPU.

Architecture

An NVIDIA GPU is a group of throughput-oriented cores, a deep memory hierarchy to keep them fed, and just enough scheduling silicon to keep thousands of threads in flight. The cores are Streaming Multiprocessors, replicated roughly a hundred or more times per package: 80 on V100, 108 on A100, 132 on H100, 148 on B200, 160 on B300, 224 on Rubin. Inside every SM sits the same recipe: four SM Sub-Partitions, each with its own warp scheduler, dispatch unit, 16k×32-bit register file, scalar CUDA Core lanes, a Special Function Unit for transcendentals, and a private port into the SM's Tensor Cores. The four partitions share an L1/shared-memory block and the TMA. Threads are grouped into warps of 32 that execute in SIMT lock-step; dozens of resident warps per partition let the scheduler hide memory and arithmetic stalls by switching between them.

Zoom into one Streaming Multiprocessor — four sub-partitions, each with its own warp scheduler, dispatch, register file and Tensor Memory, drawing on shared L1/SMEM and the TMA below.

Compute

CUDA Cores are the original compute throughput, and for AI they still own everything that isn't a matmul: activations, residual adds, normalization, address arithmetic. But a transformer block is ~99% matmul FLOPs, so the overwhelming majority of compute throughput comes from the Tensor Cores.

These cores execute fused matrix multiply-accumulate on small matrix tiles, $D = A \cdot B + C$. The full matmul is broken into output tiles: to produce one output tile, a kernel walks the shared inner dimension $K$, drawing $A$ from a row-strip of the left input matrix and $B$ from a column-strip of the right, and folds each partial product into a running accumulator. $C$ holds the partial sum so far; $D$ is the updated value carried into the next step. After the inner loop completes, $D$ is one finished tile of the full output matrix; the whole matmul is built from many of these tile MMAs.

Tile shapes are written $M \times N \times K$: $M \times N$ is the output tile size, and $K$ is how much of the inner dimension the instruction contracts over in one issue; the rest of the matmul's $K$ axis is walked by the kernel's inner loop. The accumulator is sticky across that loop: each MMA's output $D$ becomes the next MMA's input $C$, so the equation is really $C \leftarrow A \cdot B + C$ in place. Successive instructions fold their partial products into the same storage until the $K$ axis is fully walked.
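The tile loop can be sketched in plain Python. This is a functional illustration only, with lists standing in for registers and memory: `mma` plays the role of one tile instruction, and the inner loop over `k0` is the kernel walking the K axis with a sticky accumulator. The function names and the tiny 2×2×2 tile shape are assumptions for readability, not any hardware's actual shape.

```python
def mma(a_tile, b_tile, c_tile):
    """One tile MMA: returns D = A . B + C for small tiles given as lists of rows."""
    m, k, n = len(a_tile), len(b_tile), len(b_tile[0])
    return [[c_tile[i][j] + sum(a_tile[i][p] * b_tile[p][j] for p in range(k))
             for j in range(n)]
            for i in range(m)]


def tiled_matmul(A, B, tm=2, tn=2, tk=2):
    """Full matmul built from tile MMAs; dimensions must divide evenly by the tile."""
    M, K, N = len(A), len(B), len(B[0])
    assert M % tm == 0 and N % tn == 0 and K % tk == 0
    out = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tm):            # one output tile per (i0, j0)
        for j0 in range(0, N, tn):
            acc = [[0.0] * tn for _ in range(tm)]   # accumulator starts at zero
            for k0 in range(0, K, tk):              # walk the shared K axis
                a = [row[k0:k0 + tk] for row in A[i0:i0 + tm]]  # row-strip of A
                b = [row[j0:j0 + tn] for row in B[k0:k0 + tk]]  # column-strip of B
                acc = mma(a, b, acc)                # D becomes the next step's C
            for i in range(tm):
                out[i0 + i][j0:j0 + tn] = acc[i]    # write the finished tile
    return out


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(tiled_matmul(A, B))
```

The result matches an untiled triple loop; tiling changes only the order of the additions and where the partial sums live.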

V100's first-gen unit (8 per SM) ran a warp-level 16×16×16 FP16 MMA. A100's 3rd-gen unit added TF32, BF16, FP64 matmul, and 2:4 structured sparsity. H100's 4th-gen unit added native FP8 and pulled the abstraction up from a warp to a warp group: 128 cooperating threads firing an asynchronous wgmma at 64×256×16 shape that runs in the background while the issuing warps load the next tile. B200's 5th-gen unit went further still: a two-SM MMA of 256×256×16 with operands split across a pair of SMs, native FP4, and a dedicated 256 KB Tensor Memory (TMEM) scratchpad per SM that holds accumulator tiles instead of bleeding into the register file. Rubin's 6th-gen unit extends FP4 throughput, adds native FP6, and pairs with a 3rd-gen Transformer Engine that does adaptive NVFP4 micro-block scaling in hardware, keeping the per-tile quantization metadata on the Tensor Core path, rather than through the CUDA Cores.

What stays constant across all six generations is that the matmul lives inside the thread/warp hierarchy, but the number of threads it takes to issue one has shrunk, and the issue itself has decoupled from execution. Volta's mma.sync is warp-collective and synchronous: all 32 threads in a warp execute it together, each lane holding register fragments of A, B, and the accumulator D, and the warp blocks until it completes. Hopper's wgmma.mma_async widens the issuer to a warp-group of 128 threads, moves B into a shared-memory descriptor (A becomes optional: either registers or a descriptor, kernel's choice), and returns immediately: the matmul runs in the background while the warp-group queues the next tile, with completion tracked via wgmma.commit_group / wgmma.wait_group.

Blackwell's tcgen05.mma completes the migration: A joins B in shared-memory descriptors (or A comes from TMEM directly), and the accumulator D lands in TMEM rather than the register file. With every operand off the lanes, there is no per-thread state for an issue to coordinate, so a single thread fires the instruction and returns immediately, with completion signalled by an mbarrier the consumer warp waits on. The rest of the warp, and the issuing thread itself, is free for other work in the meantime. A CTA-pair variant scales the same model across two SMs: one thread on each SM in a paired cluster issues coordinated MMAs that share operands across the pair, composing the 256×256×16 two-SM tile under the same async/mbarrier completion, just promoted to a cluster-level barrier so the pair stays in step.

The matmul has grown bigger and lighter on the issuing threads at the same time: an instruction that started as 32 lanes acting in lockstep is now closer to a single descriptor-driven command, dispatched from inside the warp model but no longer executed by it.

That decoupling is what makes transformer attention kernels efficient on a GPU. The warp can run softmax, apply a mask, or pre-load the next tile while the matmul is in flight; the overlap of matmul and the surrounding element-wise work is the structure of every modern attention kernel (FlashAttention-3, FA4), and it depends on the matrix instruction not blocking the warp.

Memory

The on-chip hierarchy is hardware-managed caches at every level, with software hints layered on top. Off-chip is HBM: 32 GB HBM2 on V100, 80 GB HBM3 on H100, 192 GB HBM3e on B200, 288 GB on B300, 288 GB HBM4 on Rubin. A chip-level L2 Cache sits between HBM and the SMs: 6 MB on V100, 40 MB on A100, 50 MB on H100, 60 MB on B200 (split into two 30 MB banks across the two-die package, with locality-aware residency controls so that hot tiles can be pinned to the near die). Inside each SM, 256 KB of unified L1/SMEM is partitioned at kernel launch between hardware-managed L1 and a programmer-controlled scratchpad. The register file is another ~256 KB per SM, sliced four ways across the partitions.

Blackwell adds a fifth tier: TMEM, 256 KB per SM dedicated to MMA accumulators and addressed only by the Tensor Core, pulling the operand-residency pressure out of the general register file.

Movement between tiers has been progressively decoupled from the warp. Pre-Ampere, loading a tile was synchronous: each thread issued its own global load, the warp blocked until every fragment landed in registers, and a second pass copied them to shared memory; every tile burned warp lanes on address arithmetic and on the wait. Ampere introduced cp.async: per-thread async copies HBM → SMEM that bypass registers entirely, with the warp committing groups of in-flight copies and waiting only when the consumer needs the data. Hopper replaced that with the TMA, a dedicated DMA engine: one thread submits a multi-dimensional tile descriptor (base address, leading dimension, swizzle), the engine handles all the address arithmetic and writes into shared memory, and completion is signalled by an mbarrier. The whole warp is freed from load issue and address math; the kernel just queues descriptors. TMA also supports cluster-level multicast: one HBM read fans out to every SM in a thread-block cluster, turning what used to be N separate loads into one. Blackwell extends TMA again: direct loads into TMEM, so accumulator tiles stream in without staging through SMEM. The trajectory is one less thing the warp has to do per tile, generation after generation.

Warp Specialisation

The Hopper-era programming idiom is warp specialisation: inside one block, some warps act as producers that issue back-to-back TMA loads; others act as consumers that fire wgmma on freshly-arrived tiles. Synchronisation between them is no longer the old SM-wide __syncthreads() barrier; it is mbarrier (memory barriers in shared memory) and asynchronous transaction barriers attached to TMA completions, allowing fine-grained producer/consumer handshakes at warp granularity rather than block granularity. The pattern that has become the reference for every modern attention kernel (FlashAttention-3, CUTLASS ping-pong GEMMs, the Blackwell FA4 kernel) is the same recipe: a TMA-driven producer pipeline feeds a wgmma consumer pipeline through shared memory and TMEM, with mbarrier handshakes and thread-block clusters (Hopper+) tying multiple SMs into one cooperative compute unit so that the two-SM MMA of Blackwell composes naturally on top.
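The producer/consumer structure can be mimicked on a CPU with two threads and a bounded queue. This is only an analogy under loud assumptions: the `queue.Queue` stands in for a ring of shared-memory stages, its blocking `put`/`get` stand in for the "empty" and "full" mbarrier handshakes, and the sum of squares stands in for the matmul. None of these names correspond to CUDA APIs.

```python
import queue
import threading


def run_pipeline(tiles, stages=2):
    """Producer thread 'loads' tiles; consumer thread 'computes' on them."""
    ring = queue.Queue(maxsize=stages)   # bounded buffer: stand-in for SMEM stages
    results = []

    def producer():
        for tile in tiles:
            ring.put(tile)    # blocks while every stage is still in use
        ring.put(None)        # sentinel: no more tiles

    def consumer():
        acc = 0
        while True:
            tile = ring.get() # blocks until a tile has arrived
            if tile is None:
                break
            acc += sum(x * x for x in tile)   # stand-in for the MMA on this tile
        results.append(acc)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]


print(run_pipeline([[1, 2], [3, 4], [5, 6]]))
```

The design point carried over from the GPU idiom is that the two roles synchronise only through the buffer, so the producer can run ahead by up to `stages` tiles while the consumer is busy.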

Numerics

FP32 was the historical default; Volta brought FP16 with FP32 accumulate and the loss-scaling tricks that made it trainable; Ampere added TF32 (FP32 range, FP16 mantissa, drop-in for FP32 matmul), BF16, and 2:4 structured sparsity that doubles effective throughput on pruned weights. Hopper introduced native FP8 in both E4M3 and E5M2, paired with the Transformer Engine which auto-scales activations layer-by-layer to keep them inside FP8 dynamic range. Blackwell halved precision again with FP4 and shipped microscaling MX formats (block-level shared exponents that recover most of the accuracy lost at FP4), together with a 2nd-gen Transformer Engine that retargets the auto-scaling pipeline to FP4. Rubin's 3rd-gen Transformer Engine adds NVFP4 (NVIDIA's tightened FP4 variant) and native FP6 with more aggressive sparsity. The chip layout itself is now part of the numerics story: B100/B200/B300 are two reticle-limit dies stitched by a ~10 TB/s NV-HBI link and presented to software as one logical GPU, with 8 HBM stacks on the package; Rubin extends the chiplet recipe to dual-die at ~336 B transistors with 8 HBM4 stacks. Every generation buys roughly 2× per-watt throughput by cutting bits in half and restoring accuracy with a finer-grained scaling scheme, and increasingly, by bonding more silicon into the package.
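To show why block-level scaling recovers accuracy at 4 bits, here is a simplified sketch. It assumes FP4 in the E2M1 layout (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and one power-of-two scale shared by each block. The scale selection and round-to-nearest rule below are my own simplification for illustration; they are not the exact procedure of the MX specification or of NVFP4, and the helper names are hypothetical.

```python
import math

FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes


def quantize_block(block):
    """Return (shared_scale, fp4_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Smallest power-of-two scale that brings the largest value within +/-6.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    quantized = []
    for x in block:
        nearest = min(FP4_E2M1, key=lambda v: abs(abs(x) / scale - v))
        quantized.append(math.copysign(nearest, x))
    return scale, quantized


def dequantize_block(scale, quantized):
    return [scale * v for v in quantized]


small = [0.01, 0.02, -0.03, 0.015]
large = [100.0, 200.0, -300.0, 150.0]

# One scale for all eight values: the large values set the scale.
coarse = dequantize_block(*quantize_block(small + large))
# One scale per block of four: each block gets a scale that fits it.
fine = dequantize_block(*quantize_block(small)) + dequantize_block(*quantize_block(large))

print("single scale:", coarse[:4])
print("per-block:   ", fine[:4])
```

With a single scale the small values all round to zero; with a scale per block they survive with a relative error bounded by the FP4 grid spacing.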

Bets

Bet 1: Programmability. The workload is a moving target (attention variants, novel model architectures), so keep every block programmable and let the developer write CUDA. Even the specialised units are exposed through that model rather than as fixed-function blocks.
Bet 2: Hide Latency with Massive Multithreading. Latency is unpredictable and data-dependent, so hide it not with a static schedule but with massive thread overcommit, up to 64 resident warps per SM, with the hardware warp scheduler picking a ready warp every cycle.
Bet 3: Warp-wrapped Matmul. The matrix unit is the overwhelming compute throughput, but it must live behind the same warp/thread abstraction that everything else uses, so wrap it in mma.sync → wgmma → tcgen05.mma - rather than expose it as a fixed-function pipe. This enables a single kernel to fuse matmul, softmax, and element-wise ops in one pass.
Bet 4: Async Memory Hierarchy. Make the memory hierarchy explicit and programmer-managed rather than implicit and compiler-scheduled. Keep the L2 cache, but expose SMEM and TMEM as named scratchpads, and layer async machinery on top: TMA for bulk copies, TMEM for the matmul accumulator, mbarrier for the producer/consumer handshake. The hierarchy is software-pipelined inside a programmable kernel, not statically scheduled by a compiler against a known-latency scratchpad.
Bet 5: Amortised SIMT Tax. Every transistor spent on a warp scheduler, register-file, or coherent cache is a transistor not spent on a MAC; accept the tax, and pay it down two ways: a Tensor Core now big enough that the SIMT machinery is amortised across a much larger MAC count, and units like TMEM trading away some general-purpose flexibility for MAC density.

Scaling

There are two regimes for scaling: scale-up and scale-out.

Scale-up

Bind several GPUs into one coherent memory domain. Any GPU can load or store any other GPU's HBM directly over NVLink at nanosecond latencies: one address space, no explicit transfers.

Scale-out

Network those domains together at the rack and cluster level. Data crosses via explicit RDMA at microsecond latencies: separate address spaces, but tens of thousands of chips per cluster.

AI infrastructure uses both: bandwidth-hungry collectives (tensor parallelism, MoE expert routing) stay inside the scale-up domain; data parallelism and pipeline parallelism cross the scale-out fabric.

Scale-up

The scale-up stack is NVLink plus NVSwitch. NVLink implements a cache-coherent fabric between GPUs, so a load or store on one GPU can target another GPU's HBM with the hardware handling address translation and coherence. But NVLink by itself is point-to-point: one link connects exactly two chips. NVSwitch is a dedicated crossbar chip that every GPU connects to, routing traffic so every GPU can simultaneously communicate with every other at full NVLink bandwidth, non-blocking and all-to-all.

Together they defined the HGX 8-GPU baseboard, pairing eight H100 SXM modules with x86 hosts (AMD EPYC or Intel Xeon) over PCIe Gen5. Hopper also shipped a Grace-paired form: the GH200 Grace Hopper Superchip bonded one Grace ARM CPU to one H100 over NVLink-C2C at 900 GB/s, eliminating the PCIe host-device hop. Modules scaled up into GH200 NVL2 pairs and rack-level GH200 NVL32. Blackwell makes the pairing the default. The GB200 module fuses one Grace with two B200s over NVLink-C2C, and NVL72 stitches 36 of them into a single liquid-cooled scale-up domain: 72 GPUs, 36 Grace CPUs, 13.5 TB of HBM and 17 TB of LPDDR5X as one flat, coherent address space. Rubin takes this in two steps. NVL144 ships in 2026 as a Rubin-generation refresh inside the same Oberon-class rack: 72 Rubin packages, badged as 144 GPUs under NVIDIA's new die-counting convention, with HBM4 and NVLink 6 doubling per-package bandwidth. The actual rack-scale jump is Rubin Ultra in 2027: NVL576 packs 144 four-die Rubin Ultra packages into the new Kyber chassis for 576 GPU dies in one coherent domain.

NVL72 — 72 Blackwell GPUs sit under a row of NVSwitch ASICs that form one non-blocking crossbar, so any GPU can address any other GPU's HBM at full NVLink bandwidth. The whole fabric runs over a passive copper backplane: ~5,184 cables blind-mated, ~130 TB/s of all-to-all bandwidth, ~20 kW of transceiver power saved vs an optical equivalent.

That density is held together by passive copper. NVL72's NVLink fabric runs over 5,184 cables blind-mated through a backplane (~2 miles of cabling per rack, no in-cable retimers, the SerDes living on the GPU and switch ASICs themselves), carrying ~130 TB/s of all-to-all bandwidth across the 72 GPUs. NVIDIA estimates the copper choice saves roughly 20 kW per rack against an optical equivalent that would have needed pluggable transceivers on every link. Copper is what makes the "rack as one GPU" design economically practical: at sub-2-metre runs it still wins on power, cost, and signal integrity per dollar; beyond that, the bits have to go on glass.

NVL144 stays inside Oberon and copper continues to work because the package count (72) is unchanged from NVL72; the cabling doesn't have to lengthen, just transmit faster on Gen 6 SerDes. Rubin Ultra's NVL576 holds the same copper line by reshaping the rack: the new Kyber form factor is roughly twice the height of Oberon and packs all 576 GPU dies into one enclosure, sized specifically so every NVLink path stays within passive-copper reach even at 144 four-die packages and tens of thousands of cables.

Scale-out

The scale-out stack comes from NVIDIA's acquisition of Mellanox. Unlike NVLink, scale-out fabrics are not coherent: nodes keep separate address spaces, and data crosses only via explicit RDMA initiated by software, typically wrapped in NCCL collectives like all-reduce or all-to-all. The reference cluster is the DGX SuperPOD: eight NVL72 racks stitched together over Quantum-X800 InfiniBand yield 576 Blackwell GPUs under a single scheduler, and training clusters scale further by tiling SuperPODs. Rubin SuperPODs in 2026 keep the same 8-rack pattern with NVL144 (yielding 1,152 GPUs per SuperPOD instead of 576). Rubin Ultra in 2027 scales the recipe up an order of magnitude: Kyber racks of 576 GPU dies each, stitched together over Quantum-X Photonics CPO, putting thousands of GPUs under one scheduler.
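The number 576 shows up twice with different meanings (GPUs in an eight-rack NVL72 SuperPOD, and dies in one NVL576 rack), so a small sketch of the counting may help. The package and die counts are taken from the text; the field names and the `counts_dies` flag are my own labels for the naming convention, not NVIDIA terminology.

```python
# Per-rack figures as described in the text. Blackwell racks are badged by
# package, Rubin-generation racks by die (the "die-counting convention").
RACKS = {
    "NVL72":  {"packages": 72,  "dies_per_package": 2, "counts_dies": False},
    "NVL144": {"packages": 72,  "dies_per_package": 2, "counts_dies": True},
    "NVL576": {"packages": 144, "dies_per_package": 4, "counts_dies": True},
}


def badged_gpus(rack):
    """GPU count as it appears in the rack's name."""
    per_package = rack["dies_per_package"] if rack["counts_dies"] else 1
    return rack["packages"] * per_package


def superpod_gpus(rack, racks_per_pod=8):
    return racks_per_pod * badged_gpus(rack)


for name, rack in RACKS.items():
    print(f"{name}: {rack['packages']} packages, "
          f"{badged_gpus(rack)} badged GPUs per rack")

print("NVL72 SuperPOD: ", superpod_gpus(RACKS["NVL72"]), "GPUs")
print("NVL144 SuperPOD:", superpod_gpus(RACKS["NVL144"]), "GPUs")
```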

DGX SuperPOD — eight NVL72 racks (576 GPUs total) sit beneath a Quantum-X800 InfiniBand spine. Per-GPU scale-out is a ConnectX-8 NIC at 800 Gbps; inter-rack hops cross OSFP-RHS pluggable optical transceivers, paying microsecond latencies instead of the nanosecond latencies of the in-rack NVLink fabric above.

Every GPU has its own ConnectX NIC into that fabric. Blackwell nodes run ConnectX-8 at 800 Gbps per GPU, an order of magnitude less bandwidth than per-GPU NVLink, and latencies climb from nanoseconds to microseconds. Rubin moves to ConnectX-9 at 1.6 Tbps per GPU, doubling the per-GPU scale-out bandwidth as the per-rack scale-up domain grows from 72 to 576 GPUs. Alongside each NIC sits a BlueField DPU, adding ARM cores and accelerators to offload storage, networking, and security from the host CPU. For customers who prefer Ethernet to InfiniBand, Spectrum-X is a lossless-Ethernet alternative tuned for AI traffic.
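A back-of-envelope model makes the scale-up versus scale-out gap concrete. The sketch below uses the standard bandwidth-only cost of a ring all-reduce, in which each participant moves about 2(N-1)/N times the message size. The per-GPU NVLink figure is derived from the ~130 TB/s across 72 GPUs quoted earlier, and the scale-out figure from 800 Gbps per GPU. Latency, protocol overhead, and the distinction between unidirectional and bidirectional bandwidth are all ignored, so treat the output as an order-of-magnitude estimate only.

```python
def ring_allreduce_seconds(message_bytes, n_gpus, per_gpu_bytes_per_s):
    """Bandwidth-only estimate: each GPU moves 2*(N-1)/N of the message."""
    traffic = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic / per_gpu_bytes_per_s


NVLINK_PER_GPU = 130e12 / 72   # bytes/s, from ~130 TB/s shared by 72 GPUs
IB_PER_GPU = 800e9 / 8         # bytes/s, from 800 Gbps per GPU

message = 1e9                  # hypothetical 1 GB gradient bucket
t_scale_up = ring_allreduce_seconds(message, 72, NVLINK_PER_GPU)
t_scale_out = ring_allreduce_seconds(message, 72, IB_PER_GPU)

print(f"scale-up:  {t_scale_up * 1e3:.2f} ms")
print(f"scale-out: {t_scale_out * 1e3:.2f} ms")
print(f"ratio:     {t_scale_out / t_scale_up:.1f}x")
```

The ratio is simply the bandwidth ratio, which is why bandwidth-hungry collectives are kept inside the NVLink domain whenever the parallelism layout allows it.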

The crossover from copper to glass happens at the rack boundary. Inside the NVL72 the spine is copper; once a link has to cross racks at 800 Gbps it is optical. Passive copper DAC tops out at roughly 1.5–2 metres at 200 G/lane, well short of cross-rack reach, so today's SuperPOD spine rides over OSFP-RHS pluggable transceivers, each module carrying its own laser, modulator, photodetector, and DSP. A SuperPOD spine fanning out to thousands of GPUs is, in optical terms, tens of thousands of pluggables drawing tens of kilowatts on transceiver lasers alone.

With Rubin, that optical layer collapses into the switch ASIC. Quantum-X Photonics (InfiniBand) and Spectrum-X Photonics (Ethernet) replace the pluggables with co-packaged optics: lasers, modulators, and photodetectors bonded onto the switch package via TSMC COUPE. NVIDIA claims ~4× fewer lasers and ~3.5× lower link power than the OSFP-pluggable equivalent. The chiplet logic that turned the GPU into a two-die package and stacked HBM next to it is now showing up at the network layer: vertical integration of compute, memory, and photonics on one substrate.

NVLink Fusion recently opened the scale-up fabric itself: third-party CPUs and XPUs can now join NVLink domains, letting hyperscalers build semi-custom racks around NVIDIA's interconnect without designing their own coherent fabric from scratch.

Software

CUDA is the natural programming model for a massively parallel processor. You write a kernel (one piece of code executed once per thread) and launch it across thousands of threads organised into blocks and warps; the programmer decides what they share, when they synchronise, and which piece of the problem each one handles. That is why the abstraction has barely changed in eighteen years, and why every CUDA kernel written since 2007 would still compile and run on Blackwell.
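The launch model can be illustrated without a GPU. The sketch below is a sequential Python emulation, not CUDA: `launch` is a hypothetical helper that calls the kernel once per (block, thread) pair, and the kernel computes its global index the same way a CUDA kernel would from its block and thread indices. Real hardware runs these invocations in parallel, in warps of 32.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulated launch: run `kernel` once per (block, thread) pair, one at a time."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(out):                          # guard: the grid may overshoot the data
        out[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n elements
launch(vector_add, grid_dim, block_dim, a, b, out)
print(out)
```

The kernel body describes the work of one thread; the grid and block dimensions chosen at launch decide how many copies run and which element each one owns.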

That continuity is both the moat and the constraint. Each new generation introduces new hardware (Tensor Cores, TMA, TMEM) onto the same kernel-and-warps model, exposed as intrinsics in PTX and SASS: mma.sync, wgmma.mma_async, and so on. NVIDIA cannot radically rethink the SM because too much code depends on it; in return, every investment in CUDA software compounds across generations.

On top of PTX sits a stack constructed over two decades: cuBLAS and cuDNN for math and DNN primitives; CUTLASS, encoding decades of GEMM expertise in templated C++; TensorRT-LLM for paged attention, in-flight batching, and speculative decoding; and framework bindings through PyTorch, Triton, and JAX.

FlashAttention, one of the most important algorithmic rewrites in modern AI, tiles attention to avoid materialising the $O(N^2)$ matrix. Its four generations (FA1 through FA4) have each been hand-optimised for the latest NVIDIA silicon (FA3 for Hopper's async pipelines, FA4 for Blackwell), with ports to other hardware trailing by months or years.
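The core trick behind tiling attention is an online softmax: process keys and values one tile at a time while carrying a running maximum, a running normaliser, and a running weighted sum, rescaling the carried state whenever the maximum changes. The sketch below shows this for a single query vector in plain Python. It illustrates the idea only; it is not the FlashAttention kernel, it omits the usual 1/sqrt(d) scaling, and the function names are my own.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialise every score, then softmax, then weight the values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first tile
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

Both functions agree to floating-point rounding; the tiled version never holds more than `tile` scores at once, which is what lets a real kernel keep the working set in on-chip memory.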

Most of this stack is written by people NVIDIA does not pay. The moat is not CUDA itself; it is two decades of third-party kernels, libraries, and tooling, and the millions of developers who have learned the API along the way.

NVIDIA also ships human expertise alongside the silicon. They embed dozens of their own engineers inside frontier labs and hyperscaler teams, writing kernels for each new model architecture and tuning them to each new silicon generation. Whatever a lab wants to train next month tends to run well on NVIDIA much faster than other platforms. Switching off NVIDIA is therefore not just rewriting the kernels and libraries. It is re-training the mental models of an entire engineering workforce, and losing the NVIDIA engineers who today sit inside the building.

Google TPU

The TPU is a matrix multiplication machine. The philosophy is, rather than a programmable chip that can run any massively-parallel workload, focus on a single primitive (dense matrix-multiplication on a large systolic array) and let the XLA compiler plan every cycle and every byte of memory ahead of time. No hardware scheduler, no cache, no threads/warps. Each generation grows the pod, with thousands of chips wired through the ICI interconnect into one coherent machine. A TPU has no ambition to render graphics or run scientific simulation; it exists to train and serve Google's workloads (search, translation, recommendation, Gemini) more efficiently per watt than any general-purpose alternative.