System-Technology Co-Optimization for Dense Edge Architectures using 3D Integration and Non-Volatile Memory
By Leandro M. Giacomini Rocha 1; Mohamed Naeim 1,5,6; Guilherme Paim 3,4; Moritz Brunion 1; Priya Venugopal 1; Dragomir Milojevic 5; James Myers 2; Mustafa Badaroglu 7; Marian Verhelst 4; Julien Ryckaert 1; and Dwaipayan Biswas 1
1 imec, Leuven, Belgium
2 imec-UK, Cambridge, United Kingdom
3 INESC-ID, Lisbon, Portugal
4 KU Leuven, Leuven, Belgium
5 Université Libre de Bruxelles, Brussels, Belgium
6 Cadence Design Systems, San Jose, CA, USA
7 Qualcomm, San Diego, CA, USA
High-performance edge artificial intelligence (Edge-AI) inference applications aim for high energy efficiency, high memory density, and a small form factor, which requires design space exploration across the whole stack: workloads, architecture, mapping, and co-optimization with emerging technology. In this paper, we present a system-technology co-optimization (STCO) framework that interfaces workload-driven system scaling challenges with physical design-enabled technology offerings. The framework is built on three engines: a physical design characterization engine, a dataflow mapping optimizer, and a system efficiency predictor. It uses a systolic array accelerator to provide the design-technology characterization points, based on the advanced imec A10 nanosheet CMOS node along with emerging, high-density voltage-gated spin-orbit-torque (VGSOT) MRAM, combined with memory-on-logic fine-pitch 3D wafer-to-wafer hybrid bonding. We observe that 3D system integration of an SRAM-based design leads to 9% power savings with a 53% footprint reduction at iso-frequency with respect to the 2D implementation for the same memory capacity. 3D integration with VGSOT non-volatile memory (NVM) allows a 4× memory capacity increase with a 30% footprint reduction at iso-power compared to the 2D SRAM design at 1× capacity. Our exploration with two diverse workloads, image resolution enhancement (FSRCNN) and eye tracking (EDSNet), shows that more resources allow better workload mapping possibilities, which can compensate for the degradation in peak system energy efficiency in high-memory-capacity cases. We show that a 25% peak efficiency reduction at 32× memory capacity can lead to 7.4× faster execution with 5.7× higher effective TOPS/W than the 1× memory capacity case on the same technology.
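The headline ratios in the abstract can be combined with some back-of-the-envelope arithmetic. The sketch below is illustrative only and is not the paper's system efficiency predictor. The function names are hypothetical, and the second helper assumes a simple multiplicative model (effective efficiency = peak efficiency × a mapping-quality factor), which the abstract does not state.

```python
def capacity_density_gain(capacity_ratio: float, footprint_reduction: float) -> float:
    """Memory capacity per unit footprint, relative to the baseline.

    capacity_ratio: capacity relative to baseline (e.g. 4.0 for 4x).
    footprint_reduction: fractional footprint saving (e.g. 0.30 for 30%).
    """
    if not 0.0 <= footprint_reduction < 1.0:
        raise ValueError("footprint_reduction must be in [0, 1)")
    return capacity_ratio / (1.0 - footprint_reduction)


def implied_mapping_gain(peak_reduction: float, effective_ratio: float) -> float:
    """Mapping-quality gain implied by the reported ratios.

    ASSUMPTION (not from the paper): effective TOPS/W is modeled as
    peak TOPS/W multiplied by a mapping-quality factor.
    """
    if not 0.0 <= peak_reduction < 1.0:
        raise ValueError("peak_reduction must be in [0, 1)")
    return effective_ratio / (1.0 - peak_reduction)


# Figures quoted in the abstract.
density = capacity_density_gain(capacity_ratio=4.0, footprint_reduction=0.30)
mapping = implied_mapping_gain(peak_reduction=0.25, effective_ratio=5.7)

print(f"3D VGSOT capacity per footprint vs. 2D SRAM 1x: {density:.2f}x")
print(f"Implied mapping-quality gain at 32x capacity: {mapping:.1f}x")
```

Under these assumptions, the 4× capacity at 30% smaller footprint corresponds to roughly 5.71× more capacity per unit area, and the 5.7× effective TOPS/W gain despite a 25% lower peak implies that workload mapping improves by roughly 7.6× in this simplified model.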