Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM

By Zhongkai Yu¹,²,†, Shengwen Liang¹,†, Tianyun Ma³, Yunke Cai¹,², Ziyuan Nan¹,², Di Huang¹, Xinkai Song¹, Yifan Hao¹, Jie Zhang⁴, Tian Zhi¹, Yongwei Zhao¹, Zidong Du¹,⁵, Xing Hu¹,⁵,∗, Qi Guo¹, Tianshi Chen⁶
¹SKL of Processors, Institute of Computing Technology, CAS, Beijing, China
²University of Chinese Academy of Sciences, Beijing, China
³University of Science and Technology of China, Beijing, China
⁴Peking University, Beijing, China
⁵Shanghai Innovation Center for Processor Technologies
⁶Cambricon Technologies Co., Ltd., China

Deploying advanced large language models (LLMs) on edge devices such as smartphones and robots is a growing trend that enhances user data privacy and resilience to unreliable network connectivity while preserving intelligent capabilities. However, on-device inference is single-batch computation with extremely low arithmetic intensity, so it places a huge memory footprint and heavy bandwidth demands on limited edge resources. To address these issues, we introduce Cambricon-LLM, a chiplet-based hybrid architecture that combines an NPU with a dedicated NAND flash chip to enable efficient on-device inference of 70B LLMs. The hybrid architecture exploits both the high computing capability of the NPU and the storage capacity of the NAND flash chip, and the proposed hardware-tiling strategy minimizes the data movement overhead between them. Specifically, the NAND flash chip, enhanced by our in-flash computing and on-die ECC techniques, performs precise, lightweight on-die processing. Meanwhile, the NPU collaborates with the flash chip on matrix operations and handles special-function computations beyond the flash chip's on-die processing capabilities. Overall, Cambricon-LLM enables on-device inference of 70B LLMs at 3.44 token/s and of 7B LLMs at 36.34 token/s, which is 22× to 45× faster than existing flash-offloading technologies, showing the potential of deploying powerful LLMs on edge devices.
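
The bandwidth pressure described in the abstract can be illustrated with a back-of-the-envelope estimate. The Python sketch below is not taken from the paper. It assumes that every weight is read once per generated token (typical of single-batch decoding) and that weights are stored at 8 bits each. The bit width and all function names are hypothetical, chosen only for illustration.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def required_bandwidth_gb_per_s(params_billion: float,
                                bits_per_weight: int,
                                tokens_per_s: float) -> float:
    """Weight-read bandwidth needed if each weight is fetched once per token."""
    return weight_footprint_gb(params_billion, bits_per_weight) * tokens_per_s


def arithmetic_intensity(bits_per_weight: int) -> float:
    """FLOPs per byte of weight traffic at batch size 1.

    Each weight contributes one multiply and one add (2 FLOPs) per token.
    """
    return 2 / (bits_per_weight / 8)


if __name__ == "__main__":
    BITS = 8  # assumed weight precision, not stated in the abstract
    for size_b, speed in [(70, 3.44), (7, 36.34)]:  # speeds from the abstract
        footprint = weight_footprint_gb(size_b, BITS)
        bandwidth = required_bandwidth_gb_per_s(size_b, BITS, speed)
        print(f"{size_b}B model: {footprint:.0f} GB of weights, "
              f"about {bandwidth:.0f} GB/s at {speed} token/s")
    print(f"Arithmetic intensity: {arithmetic_intensity(BITS):.1f} FLOPs/byte")
```

Under these assumptions the arithmetic intensity is only about two operations per byte moved, which suggests why the design performs lightweight processing inside the flash chip rather than streaming every weight to the NPU.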
