AccelStack: A Cost-Driven Analysis of 3D-Stacked LLM Accelerators
By Chen Bai 1, Xin Fan 1, Zhenhua Zhu 1,2, Wei Zhang 1, Yuan Xie 1
1 The Hong Kong University of Science and Technology
2 Tsinghua University
Abstract
Large language models (LLMs) show viability for artificial general intelligence (AGI) but demand high computing power and memory bandwidth. Existing LLM accelerators rely on high-bandwidth memory (HBM) and 2.5D packaging to meet these demands, while emerging hybrid bonding techniques open new opportunities for 3D-stacked LLM accelerators. This paper proposes AccelStack, a cost-driven analysis of this new architecture built on two innovations. First, it presents a performance model that captures memory-on-logic stacking. Second, it proposes a cost model covering die-on-die (DoD), die-on-wafer (DoW), and wafer-on-wafer (WoW) bonding. Evaluations across various LLM workloads show that 3D-stacked accelerators achieve up to 7.17× and 2.09× faster inference than simulated NVIDIA A100 (FP16) and H100 (FP8) results, respectively, and that chiplet-based designs reduce recurring engineering costs by 38.09% versus monolithic implementations.
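To give a feel for why bonding style and chiplet partitioning matter in a cost analysis of this kind, the sketch below uses a textbook Poisson yield model. It is not the AccelStack cost model: every function name, parameter, and number here is a hypothetical illustration, and it ignores bonding process cost, packaging, test cost, and non-recurring engineering. It assumes only that DoD and DoW flows can bond pre-tested (known-good) dies, whereas WoW bonds untested wafers so the per-die yields multiply.

```python
import math


def die_yield(area_mm2, defects_per_mm2):
    """Poisson yield model: probability that a die has zero defects."""
    return math.exp(-area_mm2 * defects_per_mm2)


def cost_per_good_die(area_mm2, cost_per_mm2, defects_per_mm2):
    """Silicon cost of one die divided by its yield."""
    return area_mm2 * cost_per_mm2 / die_yield(area_mm2, defects_per_mm2)


def stacked_cost_known_good(logic_good_cost, memory_good_cost, bond_yield):
    """DoD/DoW-style stack (assumption): both dies are tested before bonding,
    so only the bonding step can still scrap the pair."""
    return (logic_good_cost + memory_good_cost) / bond_yield


def stacked_cost_wafer_on_wafer(area_mm2, logic_cost_per_mm2, memory_cost_per_mm2,
                                logic_defects, memory_defects, bond_yield):
    """WoW-style stack (assumption): untested dies are bonded, so a defect
    in either layer scraps the whole stack and the yields multiply."""
    raw_cost = area_mm2 * (logic_cost_per_mm2 + memory_cost_per_mm2)
    stack_yield = (die_yield(area_mm2, logic_defects)
                   * die_yield(area_mm2, memory_defects)
                   * bond_yield)
    return raw_cost / stack_yield


# Hypothetical inputs, chosen only to exercise the functions.
AREA = 200.0          # mm^2 per die
LOGIC_CPM = 0.10      # cost units per mm^2, logic wafer
MEMORY_CPM = 0.05     # cost units per mm^2, memory wafer
LOGIC_D0 = 0.001      # defects per mm^2, logic
MEMORY_D0 = 0.0005    # defects per mm^2, memory
BOND_YIELD = 0.98

kgd_cost = stacked_cost_known_good(
    cost_per_good_die(AREA, LOGIC_CPM, LOGIC_D0),
    cost_per_good_die(AREA, MEMORY_CPM, MEMORY_D0),
    BOND_YIELD,
)
wow_cost = stacked_cost_wafer_on_wafer(
    AREA, LOGIC_CPM, MEMORY_CPM, LOGIC_D0, MEMORY_D0, BOND_YIELD
)

# Monolithic 800 mm^2 die versus four 200 mm^2 chiplets (silicon cost only).
monolithic_cost = cost_per_good_die(4 * AREA, LOGIC_CPM, LOGIC_D0)
chiplet_cost = 4 * cost_per_good_die(AREA, LOGIC_CPM, LOGIC_D0)

print(f"known-good-die stack: {kgd_cost:.2f}")
print(f"wafer-on-wafer stack: {wow_cost:.2f}")
print(f"monolithic: {monolithic_cost:.2f}, four chiplets: {chiplet_cost:.2f}")
```

Under these simplifications the WoW stack is never cheaper than the known-good-die stack for the same inputs, and smaller chiplets recover yield lost on a large monolithic die. A fuller model would add the terms left out here, which can change the ranking.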