Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models
By Huwan Peng¹, Scott Davidson¹, Richard Shi¹, Shuaiwen Leon Song², Michael Taylor¹
¹University of Washington, ²The University of Sydney
Large language models (LLMs) such as OpenAI's ChatGPT and Google's Gemini have demonstrated the unprecedented capabilities of autoregressive AI models across a wide range of tasks, triggering disruptive technology innovations around the world. However, as models continue to grow, the cost of serving them also continues to grow, threatening the democratization of LLMs.
To address this issue, we propose Chiplet Cloud, a chiplet-based ASIC LLM supercomputer architecture designed to minimize the total cost of ownership (TCO) per generated token. Chiplet Cloud is a highly parameterizable ASIC and server-level architecture that leverages thousands of replicated accelerator modules working together to scale up LLM serving performance at cloud scale. To determine specific parameterizations of the Chiplet Cloud architecture, we developed a two-phase hardware-software co-design methodology that searches the massive design space and fine-tunes the architecture across a collection of LLMs based on accurate inference simulation. Because memory access performance is a common bottleneck for LLMs, we introduce CC-MEM, a scalable on-chip memory system for Chiplet Cloud architectures. Using CC-MEM, Chiplet Clouds can be built entirely from SRAM for design points where the power and performance of memory access are critical. CC-MEM also includes a compression decoder module that adds support for sparse models without impacting the compute units, using a Store-as-Compressed, Load-as-Dense mechanism.
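The Store-as-Compressed, Load-as-Dense idea can be illustrated with a minimal sketch: sparse weights are held in on-chip memory in compressed form and expanded back to dense operands by a decoder before they reach the compute units, so the compute datapath never sees the sparse format. The bitmap encoding below is an illustrative assumption, not the compression format specified by the paper.

```python
import numpy as np

def store_as_compressed(dense_block: np.ndarray):
    """Compress a weight block into a (bitmap, nonzeros) pair.
    Assumed format: one bit per element marks a nonzero; only the
    nonzero values are kept in on-chip memory."""
    bitmap = dense_block != 0.0
    nonzeros = dense_block[bitmap]
    return bitmap, nonzeros

def load_as_dense(bitmap: np.ndarray, nonzeros: np.ndarray) -> np.ndarray:
    """Decoder path: expand the compressed block back to a dense tensor
    so the compute units operate on ordinary dense operands."""
    dense = np.zeros(bitmap.shape, dtype=nonzeros.dtype)
    dense[bitmap] = nonzeros
    return dense

# Example: a 60%-sparse block stores roughly 40% of its values plus a bitmap.
rng = np.random.default_rng(0)
block = rng.standard_normal((8, 8)).astype(np.float32)
block[rng.random((8, 8)) < 0.6] = 0.0   # induce ~60% sparsity
bitmap, nnz = store_as_compressed(block)
assert np.array_equal(load_as_dense(bitmap, nnz), block)
```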
We evaluate Chiplet Cloud architectures across eight popular LLMs. Fine-tuned Chiplet Cloud servers achieve a 97× and 18× improvement in TCO/Token over rented GPU and TPU clouds, respectively, or an 8.3× and 3.7× improvement over fabricated GPU and TPU clouds. Chiplet Cloud can also support 1.7× larger models at a sparsity of 60%.
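To make the TCO/Token objective concrete, the following is a rough, self-contained sketch of how a design-space search might score candidate design points by cost per generated token (amortized capital cost plus energy cost, divided by lifetime token throughput). All parameter names and numbers here are illustrative assumptions, not figures from the paper or its co-design framework.

```python
def tco_per_token(chip_cost_usd, server_overhead_usd, power_w,
                  tokens_per_sec, lifetime_years=3.0,
                  electricity_usd_per_kwh=0.10):
    """Rough TCO/Token estimate: (capex + lifetime energy cost) divided by
    the total number of tokens generated over the deployment lifetime.
    All values are illustrative placeholders."""
    seconds = lifetime_years * 365 * 24 * 3600
    capex = chip_cost_usd + server_overhead_usd
    opex = (power_w / 1000.0) * (seconds / 3600.0) * electricity_usd_per_kwh
    total_tokens = tokens_per_sec * seconds
    return (capex + opex) / total_tokens

# Compare two hypothetical design points; the lower TCO/Token wins.
candidates = {
    "design_A": tco_per_token(8000, 4000, 300, 1500),
    "design_B": tco_per_token(12000, 4000, 450, 2600),
}
best = min(candidates, key=candidates.get)
print(best, candidates[best])
```

A real co-design search would sweep chiplet size, per-chip SRAM capacity, and server composition while evaluating each point with an inference simulator, but the scoring metric takes this same cost-over-throughput form.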