System-Technology Co-Optimization for Dense Edge Architectures using 3D Integration and Non-Volatile Memory

By Leandro M. Giacomini Rocha 1; Mohamed Naeim 1,5,6; Guilherme Paim 3,4; Moritz Brunion 1; Priya Venugopal 1; Dragomir Milojevic 5; James Myers 2; Mustafa Badaroglu 7; Marian Verhelst 4; Julien Ryckaert 1; and Dwaipayan Biswas 1
 1 imec, Leuven, Belgium
 2 imec-UK, Cambridge, United Kingdom
 3 INESC-ID, Lisbon, Portugal
 4 KU Leuven, Leuven, Belgium
 5 Université Libre de Bruxelles, Brussels, Belgium
 6 Cadence Design Systems, San Jose, CA, USA
 7 Qualcomm, San Diego, CA, USA

High-performance edge artificial intelligence (Edge-AI) inference applications aim for high energy efficiency, high memory density, and a small form factor, which requires design space exploration across the whole stack (workloads, architecture, and mapping) together with co-optimization with emerging technology. In this paper, we present a system-technology co-optimization (STCO) framework that connects workload-driven system scaling challenges with physical design-enabled technology offerings. The framework is built on three engines: a physical design characterization engine, a dataflow mapping optimizer, and a system efficiency predictor. It uses a systolic array accelerator to provide the design-technology characterization points, based on the advanced imec A10 nanosheet CMOS node along with emerging, high-density voltage-gated spin-orbit-torque (VGSOT) MRAM, combined with memory-on-logic fine-pitch 3D wafer-to-wafer hybrid bonding. We observe that 3D system integration of the SRAM-based design leads to 9% power savings with a 53% footprint reduction at iso-frequency relative to the 2D implementation for the same memory capacity. 3D integration with VGSOT non-volatile memory (NVM) allows a 4× memory capacity increase with a 30% footprint reduction at iso-power compared to 2D SRAM at 1× capacity. Our exploration with two diverse workloads, image resolution enhancement (FSRCNN) and eye tracking (EDSNet), shows that more resources allow better workload mapping possibilities, which can compensate for the degradation in peak system energy efficiency in high-memory-capacity cases. We show that a 25% peak efficiency reduction at 32× memory capacity can lead to 7.4× faster execution with 5.7× higher effective TOPS/W than the 1× memory capacity case on the same technology.
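
The trade-off between peak and effective efficiency in the last result can be illustrated with a short back-of-the-envelope calculation. The sketch below assumes a simple multiplicative model, effective TOPS/W = peak TOPS/W × mapping utilization, which is an illustrative assumption and not necessarily how the framework's system efficiency predictor works. The function name `implied_utilization_gain` is hypothetical; only the 25% and 5.7× figures come from the abstract.

```python
def implied_utilization_gain(peak_ratio: float, effective_ratio: float) -> float:
    """Return the utilization improvement implied by two efficiency ratios.

    ASSUMPTION (not from the paper): effective = peak * utilization,
    so utilization_ratio = effective_ratio / peak_ratio.
    """
    if peak_ratio <= 0:
        raise ValueError("peak_ratio must be positive")
    return effective_ratio / peak_ratio


# Figures quoted in the abstract, 32x memory capacity vs. 1x capacity:
peak_ratio = 1.0 - 0.25   # 25% peak efficiency reduction
effective_ratio = 5.7     # 5.7x higher effective TOPS/W

gain = implied_utilization_gain(peak_ratio, effective_ratio)
print(f"Implied utilization gain under the assumed model: {gain:.1f}x")
```

Under this assumed model, the better mapping would have to raise utilization by roughly 7.6× to turn a 0.75× peak efficiency into a 5.7× effective efficiency, which is how a lower peak figure can still yield a better overall result.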
