Hecaton: Training and Finetuning Large Language Models with Scalable Chiplet Systems
By Zongle Huang, Shupei Fan, Chen Tang, Xinyuan Lin, Shuwen Deng, Yongpan Liu (Tsinghua University)
Large Language Models (LLMs) have achieved remarkable success in various fields, but their training and finetuning require massive computation and memory, necessitating parallelism, which in turn introduces heavy communication overheads. Driven by advances in packaging, the chiplet architecture has emerged as a potential solution: it can integrate substantial computing power and exploit on-package links with better signal integrity, higher bandwidth, and lower energy consumption. However, most existing chiplet-related works focus on DNN inference. Porting them directly to LLM training introduces large volumes of DRAM accesses and network-on-package (NoP) traffic that cause state-of-the-art chiplet designs to fail, highlighting a research gap.
This work proposes Hecaton, a scalable and cost-effective chiplet system for LLM training and finetuning. We first provide a chiplet architecture with tailored scheduling that largely reduces DRAM accesses. We further design an efficient distributed training method that reduces NoP communication complexity and relieves constraints on SRAM capacity and layout. Theoretical analysis shows that the entire system achieves weak scaling: as the workload and hardware resources grow proportionally, the computation-to-communication ratio remains nearly constant. Experiments with various workloads and hardware configurations verify this property, and Hecaton achieves a 4.98× performance improvement and 2.35× energy reduction on Llama2-70B compared with the tensor parallelism in Megatron. To the best of our knowledge, Hecaton is the first chiplet architecture designed specifically for LLM training or finetuning, with guaranteed performance regardless of the problem scale.
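To make the weak-scaling claim concrete, the toy model below contrasts the per-chiplet computation-to-communication ratio of a Megatron-style 1D tensor-parallel GEMM (sharded weights with a ring all-reduce) against a 2D-partitioned, SUMMA-style GEMM as the problem size and the chiplet count grow proportionally. This is only an illustrative sketch built on textbook cost models; the function names, baseline sizes, and the 2D scheme are assumptions for exposition, not Hecaton's actual dataflow or analysis.

```python
# Toy analytical model of "weak scaling" for a distributed d x d x d GEMM.
# Illustrative only: textbook cost models, NOT Hecaton's scheduling or cost model.

import math

def per_chip_ratio(d: int, n_chips: int, scheme: str) -> float:
    """Compute-to-communication ratio (FLOPs per word moved) per chiplet."""
    flops = 2 * d**3 / n_chips                      # work split evenly across chips
    if scheme == "1d_all_reduce":
        # Megatron-style sharding: ring all-reduce of the full d x d output,
        # ~2*(N-1)/N * d^2 words per chip, independent of how many chips there are.
        words = 2 * (n_chips - 1) / n_chips * d * d
    elif scheme == "2d_summa":
        # SUMMA-style 2D block partition on a sqrt(N) x sqrt(N) mesh: each chip
        # exchanges O(d^2 / sqrt(N)) words of A and B panels over the whole GEMM.
        words = 2 * d * d / math.sqrt(n_chips)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return flops / words

if __name__ == "__main__":
    d0, n0 = 4096, 4        # arbitrary baseline problem size and chiplet count
    print(f"{'chips':>6} {'dim':>7} {'1D ratio':>10} {'2D ratio':>10}")
    for factor in (1, 4, 16, 64):
        n = n0 * factor
        d = int(d0 * math.sqrt(factor))   # weak scaling: memory per chip constant
        r1 = per_chip_ratio(d, n, "1d_all_reduce")
        r2 = per_chip_ratio(d, n, "2d_summa")
        print(f"{n:>6} {d:>7} {r1:>10.1f} {r2:>10.1f}")
```

In this toy model the 1D ratio degrades roughly as 1/√N while the 2D ratio stays constant as chips and workload scale together, which is the flavor of behavior the abstract's weak-scaling property describes.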
Related Chiplet
- Direct Chiplet Interface
- HBM3e Advanced-packaging chiplet for all workloads
- UCIe AP based 8-bit 170-Gsps Chiplet Transceiver
- UCIe based 8-bit 48-Gsps Transceiver
- UCIe based 12-bit 12-Gsps Transceiver
Related Technical Papers
- Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models
- Intel Delivers Cutting-Edge Process Technologies to the Data Center with Intel 18A and Advanced Chiplet Packaging
- AIG-CIM: A Scalable Chiplet Module with Tri-Gear Heterogeneous Compute-in-Memory for Diffusion Acceleration
Latest Technical Papers
- Automakers And Industry Need Specific, Extremely Robust, Heterogeneously Integrated Chiplet Solutions
- Efficient ESD Verification For 2.5/3D Automotive ICs
- Heterogeneous Integration Technologies for Artificial Intelligence Applications
- Performance Implications of Multi-Chiplet Neural Processing Units on Autonomous Driving Perception
- ChipAI: A scalable chiplet-based accelerator for efficient DNN inference using silicon photonics