Alphawave Semi Bridges from Theory to Reality in Chiplet-Based AI

GenAI, the most talked-about manifestation of AI these days, imposes two tough constraints on a hardware platform. First, it demands massive memory to serve large language models with billions of parameters. That is feasible in principle for a processor plus big off-chip DRAM, and perhaps for some inference applications, but too slow and power-hungry for fast datacenter training applications. Second, GenAI cores are physically big, already running to reticle limits. Control, memory management, IO, and other logic must often go somewhere else, though still be tightly connected for low latency. The solution, of course, is an implementation based on chiplets connected through an interposer in a single package: one or more chiplets for the AI core, HBM memory stacks, and control and other logic perhaps on one or more additional chiplets. All nice in principle, but how do even hyperscalers with deep pockets make this work in practice? Alphawave Semi has already proven a very practical solution, as I learned from a presentation by Mohit Gupta (SVP and GM of Custom Silicon and IP at Alphawave Semi), delivered at the recent MemCon event in Silicon Valley.

Start with connectivity

This and the next section are intimately related, but I have to start somewhere. Silicon connectivity (and compute) is what Alphawave Semi does: PCIe, CXL, UCIe, Ethernet, HBM; complete IP subsystems with controllers and PHYs integrated into chiplets and custom silicon.

Memory performance is critical. Training first requires memory for parameters (weights, activations, etc.), but the platform must also provide pre-allocated working memory to handle transformer calculations. If you once took (and remember) a linear algebra course, you will recognize that a big chunk of these calculations is devoted to lots and lots of matrix/vector multiplications. Big matrices and vectors. Working space needed for intermediate storage is significant; I have seen estimates running over 100GB (the latest version of Nvidia Grace Hopper reportedly includes over 140GB). This data must also move very quickly between HBM memory and/or IOs and the AI engine. Alphawave Semi supports an aggregate (HBM/PCIe/Ethernet) bandwidth of better than a terabyte per second. For the HBM interface they provide a memory management subsystem with an HBM controller and PHY in the SoC, communicating with the controller sitting at the base of each HBM memory stack, ensuring not only protocol compliance but also interoperability between the memory subsystem and memory stack controllers.
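To put rough numbers on why capacity and bandwidth both matter, here is a back-of-envelope sketch. The 70-billion-parameter model size and 2 bytes per parameter are my own illustrative assumptions, not figures from the presentation, and the function names are hypothetical. It counts weights only (no activations, optimizer state, or working space), so real training footprints will be larger.

```python
def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def transfer_ms(size_gb: float, bandwidth_tb_per_s: float) -> float:
    """Time to move size_gb once at the given bandwidth, in milliseconds.

    1 GB / (1 TB/s) = 1e9 / 1e12 s = 1 ms, so the ratio is already in ms.
    """
    return size_gb / bandwidth_tb_per_s


# Hypothetical example values (assumptions, not from the presentation).
NUM_PARAMS = 70e9          # assumed model size
BYTES_PER_PARAM = 2        # assumed 16-bit weights
BANDWIDTH_TB_PER_S = 1.0   # the "terabyte per second" order of magnitude

size = weights_gb(NUM_PARAMS, BYTES_PER_PARAM)
print(f"weights: {size:.0f} GB")
print(f"one full pass at {BANDWIDTH_TB_PER_S} TB/s: "
      f"{transfer_ms(size, BANDWIDTH_TB_PER_S):.0f} ms")
```

Even under these simplified assumptions, just streaming the weights once takes on the order of a hundred milliseconds at a terabyte per second, which is why anything slower than in-package HBM quickly becomes the bottleneck.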

Connectivity between chiplets is managed through Alphawave UCIe IP (protocol and PHY), delivering 24Gbps per data lane. This IP has already been proven in 3nm silicon. A major application for this connectivity might well be connecting the AI accelerator to an Arm Neoverse compute subsystem (CSS) charged with managing the interface between the AI world (networks, ONNX and the like) and the datacenter world (PyTorch, containers, Kubernetes and so on). That conveniently segues into the next topic: Alphawave Semi’s partnership with Arm in the Total Design program and how to build these chiplet-based systems in practice.
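As an aside on the 24Gbps-per-lane figure, a quick sketch of what it implies for raw die-to-die bandwidth. The lane counts (16 and 64 per module) are assumptions I chose for illustration; the presentation summary does not state how many lanes a given design uses, and the result ignores protocol overhead.

```python
GBPS_PER_LANE = 24  # per-lane data rate quoted in the text


def module_bandwidth_gbytes(lanes: int, gbps_per_lane: float = GBPS_PER_LANE) -> float:
    """Raw one-direction bandwidth of a group of lanes, in GB/s (8 bits per byte)."""
    return lanes * gbps_per_lane / 8


# Assumed module widths, for illustration only.
for lanes in (16, 64):
    print(f"{lanes} lanes: {module_bandwidth_gbytes(lanes):.0f} GB/s raw, per direction")
```

Reaching terabyte-per-second territory between chiplets therefore means ganging several such modules together, which is exactly the kind of integration work the packaging and IP have to support.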
