Gemini: Mapping and Architecture Co-exploration for Large-scale DNN Chiplet Accelerators
By Jingwei Cai, Zuotong Wu, Sen Peng, Yuchen Wei, Zhanhong Tan, Guiming Shi, Mingyu Gao, Kaisheng Ma
Chiplet technology enables the integration of an increasing number of transistors on a single accelerator with higher yield in the post-Moore era, addressing the immense computational demands arising from rapid AI advancements. However, it also introduces higher packaging costs and costly Die-to-Die (D2D) interfaces, which require more area, consume more power, and offer lower bandwidth than on-chip interconnects. Maximizing the benefits and minimizing the drawbacks of chiplet technology is crucial for developing large-scale DNN chiplet accelerators, which poses challenges to both architecture and mapping. Despite the importance of these challenges in the post-Moore era, methods to address them remain scarce. To bridge the gap, we first propose a layer-centric encoding method to encode Layer-Pipeline (LP) spatial mapping for large-scale DNN inference accelerators and depict its optimization space. Based on this encoding, we analyze the unexplored optimization opportunities within this space, which play a more crucial role in chiplet scenarios. Building on the encoding method and a highly configurable and universal hardware template, we propose an architecture and mapping co-exploration framework, Gemini, to explore the design and mapping space of large-scale DNN chiplet accelerators while taking monetary cost (MC), performance, and energy efficiency into account. Compared to the state-of-the-art (SOTA) Simba architecture with SOTA Tangram LP mapping, Gemini’s co-optimized architecture and mapping achieve, on average, a 1.98× performance improvement and a 1.41× energy efficiency improvement simultaneously across various DNNs and batch sizes, with only a 14.3% increase in monetary cost. Moreover, we leverage Gemini to uncover intriguing insights into how to utilize chiplet technology in architecture design and how to map DNN workloads in chiplet scenarios.
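As a rough reading aid, the averages quoted above can be put on a per-cost basis. The sketch below is not part of Gemini; the helper name `per_cost_gain` is hypothetical, and dividing one reported average by another is only an approximation, since the abstract does not give per-workload figures.

```python
def per_cost_gain(gain: float, cost_increase_pct: float) -> float:
    """Divide an improvement factor by the relative monetary-cost factor.

    Hypothetical helper for illustration only: it treats the reported
    averages as if they applied to a single design point.
    """
    cost_factor = 1.0 + cost_increase_pct / 100.0
    return gain / cost_factor


# Figures taken from the abstract: 1.98x performance, 1.41x energy
# efficiency, 14.3% higher monetary cost than the Simba + Tangram baseline.
perf_per_cost = per_cost_gain(1.98, 14.3)
energy_eff_per_cost = per_cost_gain(1.41, 14.3)

print(f"performance per unit cost:       {perf_per_cost:.2f}x")
print(f"energy efficiency per unit cost: {energy_eff_per_cost:.2f}x")
```

Under these assumptions, both ratios stay above 1, which is consistent with the abstract's claim that the gains come with only a modest cost increase.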