Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM
By Zhongkai Yu1,2,†, Shengwen Liang1,†, Tianyun Ma3, Yunke Cai1,2, Ziyuan Nan1,2, Di Huang1, Xinkai Song1, Yifan Hao1, Jie Zhang4, Tian Zhi1, Yongwei Zhao1, Zidong Du1,5, Xing Hu1,5,∗, Qi Guo1, Tianshi Chen6
1 SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
2 University of Chinese Academy of Sciences, Beijing, China
3 University of Science and Technology of China, Beijing, China
4 Peking University, Beijing, China
5 Shanghai Innovation Center for Processor Technologies
6 Cambricon Technologies Co., Ltd., China
Deploying advanced large language models (LLMs) on edge devices such as smartphones and robots is a growing trend: it strengthens user data privacy and resilience to network outages while preserving intelligent capabilities. However, on-device inference is typically single-batch computing with extremely low arithmetic intensity, which imposes a huge memory footprint and bandwidth demands that far exceed the limited resources of edge devices. To address these issues, we introduce Cambricon-LLM, a chiplet-based hybrid architecture that couples an NPU with a dedicated NAND flash chip to enable efficient on-device inference of 70B LLMs. The hybrid architecture exploits both the high computing capability of the NPU and the data capacity of the NAND flash chip, together with a proposed hardware-tiling strategy that minimizes data movement between the NPU and the flash chip. Specifically, the NAND flash chip, enhanced by our in-flash computing and on-die ECC techniques, performs precise, lightweight on-die processing, while the NPU collaborates with the flash chip on matrix operations and handles the special-function computations that are beyond the flash's on-die processing capabilities. Overall, Cambricon-LLM achieves on-device inference of 70B LLMs at 3.44 token/s and of 7B LLMs at 36.34 token/s, 22x to 45x faster than existing flash-offloading techniques, demonstrating the potential of deploying powerful LLMs on edge devices.
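For intuition on the bandwidth problem: in single-batch decoding every generated token touches every weight roughly once, so a 70B-parameter model stored at 8 bits per weight running at 3.44 token/s would need about 70 GB x 3.44 ≈ 240 GB/s of effective weight bandwidth if the weights crossed an external bus; keeping the heavy matrix traffic inside the flash dies avoids this. The toy simulation below illustrates the NPU/flash division of labor the abstract describes. It is a minimal sketch, not the paper's implementation: the names (NUM_DIES, flash_die_matvec, npu_reduce_and_activate), the column tiling, and the choice of SiLU as the special function are all illustrative assumptions.

```python
import numpy as np

# Toy model of the hybrid NPU/flash split sketched in the abstract.
# All names, the column tiling, and the dimensions are assumptions
# for illustration, not the paper's actual design.

NUM_DIES = 4               # assumed number of flash dies
D_IN, D_OUT = 4096, 4096   # illustrative layer dimensions

rng = np.random.default_rng(0)
# Column-tile the weight matrix so each die stores one contiguous
# slice and never ships its weights off-die.
weight_tiles = np.split(rng.standard_normal((D_OUT, D_IN)), NUM_DIES, axis=1)

def flash_die_matvec(tile, x_slice):
    """Lightweight on-die processing: a partial matvec over one tile."""
    return tile @ x_slice

def npu_reduce_and_activate(partials):
    """NPU side: accumulate the partial sums, then apply a special
    function (SiLU here) beyond the flash's on-die capability."""
    y = partials.sum(axis=0)
    return y / (1.0 + np.exp(-y))   # SiLU: y * sigmoid(y)

# One single-batch decoding step: only the activation slices (a few
# KB) travel to the dies, and only the partial results travel back.
x = rng.standard_normal(D_IN)
partials = [flash_die_matvec(w, s)
            for w, s in zip(weight_tiles, np.split(x, NUM_DIES))]
print(npu_reduce_and_activate(np.stack(partials)).shape)  # (4096,)
```

The point of the tiling shows up in the traffic pattern: per token, each die reads its full weight tile internally but exchanges only D_IN/NUM_DIES inputs and D_OUT partial outputs with the NPU, which is the data-movement reduction the abstract attributes to the hardware-tiling strategy.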
Related Chiplets
- High Performance Droplet
- Interconnect Chiplet
- 12nm EURYTION RFK1 - UCIe SP based Ka-Ku Band Chiplet Transceiver
- Bridglets
- Automotive AI Accelerator
Related Technical Papers
- DCRA: A Distributed Chiplet-based Reconfigurable Architecture for Irregular Applications
- ChipAI: A scalable chiplet-based accelerator for efficient DNN inference using silicon photonics
- CATCH: a Cost Analysis Tool for Co-optimization of chiplet-based Heterogeneous systems
- AuthenTree: A Scalable MPC-Based Distributed Trust Architecture for Chiplet-based Heterogeneous Systems
Latest Technical Papers
- Thermo-mechanical co-design of 2.5D flip-chip packages with silicon and glass interposers via finite element analysis and machine learning
- High-Efficient and Fast-Response Thermal Management by Heterogeneous Integration of Diamond on Interposer-Based 2.5D Chiplets
- HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement
- A physics-constrained and data-driven approach for thermal field inversion in chiplet-based packaging
- Probing the Nanoscale Onset of Plasticity in Electroplated Copper for Hybrid Bonding Structures via Multimodal Atomic Force Microscopy