Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models
By Huwan Peng, Scott Davidson, Richard Shi, Shuaiwen Leon Song, Michael Taylor (University of Washington)
Large language models (LLMs) such as ChatGPT have demonstrated unprecedented capabilities in multiple AI tasks. However, hardware inefficiencies have become a significant factor limiting the democratization of LLMs. We propose Chiplet Cloud, an ASIC supercomputer architecture that optimizes total cost of ownership (TCO) per token for serving generative LLMs. Chiplet Cloud fits all model parameters inside on-chip SRAM to eliminate bandwidth limitations, moderates the die size to reduce system cost, and leverages software mappings to overcome data communication overhead. We propose a comprehensive design methodology that explores the major design trade-offs in the joint hardware-software space and generates a detailed performance-cost analysis for every valid design point. We evaluate Chiplet Cloud on four popular LLMs. Compared to GPUs and TPUs, our architecture achieves up to 94x and 15x improvement in TCO/Token, respectively, significantly reducing the cost of realistically serving modern LLMs.
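The TCO/Token metric named above can be illustrated with a short sketch. The abstract does not give the cost model, so the formula below is an assumption: capital cost plus operating cost over the system lifetime, divided by the tokens served in that time. The function name, its parameters, and the example numbers are all hypothetical and are not taken from the paper.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def tco_per_token(capex_usd, opex_usd_per_year, lifetime_years, tokens_per_second):
    """Return an assumed TCO/Token: lifetime cost divided by lifetime tokens.

    Assumption (not from the paper): TCO = CapEx + OpEx/year * lifetime,
    and the system sustains a constant token throughput for its whole life.
    """
    if lifetime_years <= 0 or tokens_per_second <= 0:
        raise ValueError("lifetime_years and tokens_per_second must be positive")
    total_cost = capex_usd + opex_usd_per_year * lifetime_years
    total_tokens = tokens_per_second * SECONDS_PER_YEAR * lifetime_years
    return total_cost / total_tokens


# Hypothetical inputs, chosen only to show how two systems would be compared.
baseline = tco_per_token(capex_usd=2.0e6, opex_usd_per_year=4.0e5,
                         lifetime_years=3.0, tokens_per_second=5.0e4)
candidate = tco_per_token(capex_usd=3.0e6, opex_usd_per_year=2.0e5,
                          lifetime_years=3.0, tokens_per_second=4.0e5)
improvement = baseline / candidate  # ratio > 1 means the candidate is cheaper per token
```

Under this assumed model, an "Nx improvement in TCO/Token" corresponds to the ratio of the baseline value to the candidate value, so higher throughput can offset a higher up-front cost.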