Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators
By Jingwei Cai, Zuotong Wu, Sen Peng, Yuchen Wei, Zhanhong Tan, Guiming Shi, Mingyu Gao, Kaisheng Ma
Chiplet technology enables the integration of an increasing number of transistors on a single accelerator with higher yield in the post-Moore era, addressing the immense computational demands arising from rapid AI advancements. However, it also introduces more expensive packaging costs and costly Die-to-Die (D2D) interfaces, which require more area, consume higher power, and offer lower bandwidth than onchip interconnects. Maximizing the benefits and minimizing the drawbacks of chiplet technology is crucial for developing largescale DNN chiplet accelerators, which poses challenges to both architecture and mapping. Despite its importance in the postMoore era, methods to address these challenges remain scarce. To bridge the gap, we first propose a layer-centric encoding method to encode Layer-Pipeline (LP) spatial mapping for largescale DNN inference accelerators and depict the optimization space of it. Based on it, we analyze the unexplored optimization opportunities within this space, which play a more crucial role in chiplet scenarios. Based on the encoding method and a highly configurable and universal hardware template, we propose an architecture and mapping co-exploration framework, Gemini, to explore the design and mapping space of large-scale DNN chiplet accelerators while taking monetary cost (MC), performance, and energy efficiency into account. Compared to the state-of-the-art (SOTA) Simba architecture with SOTA Tangram LP Mapping, Gemini’s co-optimized architecture and mapping achieve, on average, 1.98× performance improvement and 1.41× energy efficiency improvement simultaneously across various DNNs and batch sizes, with only a 14.3% increase in monetary cost. Moreover, we leverage Gemini to uncover intriguing insights into the methods for utilizing chiplet technology in architecture design and mapping DNN workloads under chiplet scenarios.
Related Chiplet
- Direct Chiplet Interface
- HBM3e Advanced-packaging chiplet for all workloads
- UCIe AP based 8-bit 170-Gsps Chiplet Transceiver
- UCIe based 8-bit 48-Gsps Transceiver
- UCIe based 12-bit 12-Gsps Transceiver
Related Technical Papers
- A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models
- Universal Chiplet Interconnect Express: An Open Industry Standard for Memory and Storage Applications
- Communication Characterization of AI Workloads for Large-scale Multi-chiplet Accelerators
- RapidChiplet: A Toolchain for Rapid Design Space Exploration of Chiplet Architectures
Latest Technical Papers
- Spiking Transformer Hardware Accelerators in 3D Integration
- GATE-SiP: Enabling Authenticated Encryption Testing in Systems-in-Package
- AIG-CIM: A Scalable Chiplet Module with Tri-Gear Heterogeneous Compute-in-Memory for Diffusion Acceleration
- Chiplever: Towards Effortless Extension of Chiplet-based System for FHE
- The Survey of Chiplet-based Integrated Architecture: An EDA perspective