Chiplet-Gym: Optimizing Chiplet-based AI Accelerator Design with Reinforcement Learning
By Kaniz Mishty and Mehdi Sadi, Member, IEEE
Modern Artificial Intelligence (AI) workloads demand computing systems with large silicon area to sustain throughput and competitive performance. However, prohibitive manufacturing costs and yield limitations at advanced technology nodes, together with die sizes reaching the reticle limit, restrain us from achieving this. With recent innovations in advanced packaging technologies, chiplet-based architectures have gained significant attention in the AI hardware domain. However, the vast design space of chiplet-based AI accelerator design and the absence of a system- and package-level co-design methodology make it difficult for designers to find the optimum design point in terms of Power, Performance, Area, and manufacturing Cost (PPAC). This paper presents Chiplet-Gym, a Reinforcement Learning (RL)-based optimization framework that explores the vast design space of chiplet-based AI accelerators, encompassing resource allocation, placement, and packaging architecture. We analytically model the PPAC of the chiplet-based AI accelerator and integrate it into an OpenAI Gym environment to evaluate design points. We also explore non-RL-based optimization approaches and combine the two to ensure the robustness of the optimizer. The optimizer-suggested design point achieves 1.52X throughput, 0.27X energy, and 0.01X die cost while incurring only 1.62X package cost of its monolithic counterpart at iso-area.
1 INTRODUCTION
As Large Language Models (LLMs) such as ChatGPT, GPT-4, LLaMA [1], etc., gain widespread use, there is a growing demand for energy-efficient hardware that can deliver high throughput. To support hundreds of trillions of operations and hundreds of gigabytes of data movement, high-performance and energy-efficient hardware demands more silicon area to accommodate more compute cores and memory capacity. Training a state-of-the-art AI or Deep Learning (DL) model with a single GPU or accelerator is nearly impossible due to extreme compute and memory demands. Data centers are therefore equipped with clusters of powerful computers and GPUs connected via PCIe, NVLink, etc. [2] [3]. Even though these supercomputers can handle large workloads, they consume a significant amount of energy [2] and incur longer latency, because off-board communication consumes at least an order of magnitude more power and time than any on-package communication [4]. The ideal scenario would be hardware capable of housing the entire model parameters and intermediate activations on-chip [5], promising optimal performance and energy efficiency. Unfortunately, this is not feasible due to the stagnation of Moore's law and Dennard scaling, die sizes reaching the reticle limit, and prohibitive manufacturing cost and yield limitations [3]. Consequently, researchers endeavor to replicate this 'hypothetical ideal' hardware by integrating multiple smaller chiplets at the package level, allowing near-ideal performance while minimizing cost and energy consumption.
With the advent of advanced packaging technologies, chiplet-based heterogeneous integration has opened up a new dimension of chip design: More-than-Moore [3]. In a chiplet-based system, multiple chiplets (i.e., SoCs) of diverse functionalities (e.g., logic dies, memories, analog IPs, accelerators, etc.) and technology nodes (e.g., 7nm or beyond), possibly from different foundries, are interconnected at the package level using advanced packaging technologies such as CoWoS, EMIB, etc. [3]. The value proposition of chiplet-based architectures is manifold. Compared to multiple monolithic SoCs interconnected via off-package or off-board links such as PCIe, NVLink, CXL, etc. [3], package-level integration of multiple monolithic SoCs via 2.5D or 3D stacking delivers higher performance and lower energy consumption by alleviating off-package communication. Chiplet-based systems also offer lower recurring engineering (RE) cost by providing higher yield, and lower non-recurring engineering (NRE) cost by enabling IP reuse and shortening the IC design cycle [6].
Commercial chiplet-based general-purpose products [7] [8] are designed and developed at vertically integrated companies without exposing much knowledge about the chiplet-based architectures' design space. Unlike these general-purpose products, chiplet-based AI accelerators demand extensive design space exploration to hit the target Power, Performance, Area, and Cost (PPAC) budget. From an architectural perspective, designers must consider resource allocation, mapping, and dataflow of the DNN workloads. From a communication and integration perspective, chiplet placement, routing protocols, stacking/packaging technologies, and interconnect types must be weighed, and finally, from an application perspective, system requirements such as reliability, scalability, etc., should be considered, all at the same time while optimizing for PPAC [9]. Existing works often focus on either the architectural or the integration aspects as a separate design flow: they explore routing and packaging given the chiplets [10] [11] [12], or explore the chiplet architecture given the packaging [5] [13] [14] [15]. Such an isolated approach, addressing individual aspects independently, may result in sub-optimal designs due to the inter-dependency among these factors. For instance, varying the resource allocation impacts communication demands, which influences the choice of packaging and its configuration, consequently leading to cost variations.
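These coupled decisions can be cast as an RL environment whose state is a design point and whose reward is a PPAC score, which is the general shape of the Gym-based formulation the paper describes. The sketch below illustrates that interface only; the knobs, packaging names, and reward model are illustrative assumptions, not Chiplet-Gym's actual formulation.

```python
class ChipletDesignEnv:
    """Toy Gym-style environment for chiplet design-space exploration.

    State: a design point (chiplet count, packaging choice).
    Actions: mutate one design knob at a time.
    Reward: a placeholder PPAC score (assumed form, not the paper's
    analytical model).
    """

    PACKAGES = ["CoWoS", "EMIB", "3D"]  # candidate packaging technologies

    def __init__(self):
        self.reset()

    def reset(self):
        # Start from an arbitrary baseline design point.
        self.state = {"chiplets": 4, "package": 0}
        return dict(self.state)

    def step(self, action):
        # Action 0: remove a chiplet; 1: add a chiplet; 2: cycle packaging.
        if action == 0 and self.state["chiplets"] > 1:
            self.state["chiplets"] -= 1
        elif action == 1:
            self.state["chiplets"] += 1
        elif action == 2:
            self.state["package"] = (self.state["package"] + 1) % len(self.PACKAGES)
        reward = self._ppac_score()
        done = False  # episode-termination criteria omitted in this sketch
        return dict(self.state), reward, done, {}

    def _ppac_score(self):
        # Stand-in reward: throughput grows sub-linearly with chiplet count,
        # while cost penalizes larger packages. A real model would plug in
        # analytical throughput, energy, area, and cost equations here.
        n = self.state["chiplets"]
        throughput = n ** 0.8
        cost = 0.1 * n * (1 + 0.5 * self.state["package"])
        return throughput - cost


env = ChipletDesignEnv()
obs = env.reset()
obs, reward, done, info = env.step(1)  # add one chiplet
```

Because the environment exposes only `reset` and `step`, the same design point evaluator can also back the non-RL search strategies the paper combines with its RL agent.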
Currently, many flavors of packaging technology, both 2.5D and 3D, are available from the industry leaders, which makes it difficult for system designers and integrators to choose the optimum set of configurations from the vast design space based on the system requirements [3]. These packaging technologies differ in fabrication cost and complexity, performance, and underlying integration technology [3]. As a result, no single packaging technology can be marked as superior to the others. Each of the other domains, such as resource allocation, chiplet granularity, placement, Network-on-Package (NoP), and interconnect architecture, to name a few, also has an extensive design space. Proper co-optimization across all these domains, based on the system and application requirements at the available cost, is necessary for a successful chiplet-based system design. Optimizing across all possible domains results in a combinatorial explosion where brute-force search is not an option and random search might not reach the optimum point. The expensive simulation environment of chip design exacerbates this problem.
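The combinatorial explosion is easy to see with back-of-the-envelope arithmetic: the design dimensions multiply, so even modest per-knob option counts yield billions of candidate points. The knob counts below are illustrative assumptions, not the paper's actual design space.

```python
from math import prod

# Illustrative option counts per design dimension (assumed values).
design_knobs = {
    "packaging technology": 5,
    "chiplet count": 16,
    "compute/memory allocation": 32,
    "placement orderings": 40320,   # 8! orderings of 8 chiplets
    "NoP topology": 6,
    "interconnect type": 4,
}

# Each dimension multiplies the search space.
space = prod(design_knobs.values())
print(f"{space:,} candidate design points")  # ~2.5 billion
```

With an expensive simulator behind every evaluation, exhaustively scoring a space of this size is infeasible, which is the motivation for learned and heuristic search.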
To read the full article, click here