Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET
By Gianna Paulin*, Paul Scheffler*§, Thomas Benz*§, Matheus Cavalcante†, Tim Fischer*, Manuel Eggimann*, Yichao Zhang*, Nils Wistoff*, Luca Bertaccini*, Luca Colagrande*, Gianmarco Ottavi‡, Frank K. Gürkaynak*, Davide Rossi‡, Luca Benini‡
* ETH Zurich, Switzerland
† Stanford University, USA
‡ University of Bologna, Italy
We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge FPU utilization on stencil codes (83 %) as well as sparse-dense (42 %) and sparse-sparse (49 %) matrix multiplication.
Introduction
Sparse machine learning (ML) and high-performance computing applications in fields like multiphysics simulation and graph analytics often rely on sparse linear algebra (LA), stencil codes, and graph pattern matching [1]. These workloads achieve low FPU utilization (typically < 10 % for sparse LA) on modern CPUs and GPUs because of their sparse, irregular memory accesses and complex, indirection-based address computations [2-4]. While many specialized accelerators have been proposed for sparse ML workloads, they lack the flexibility of instruction processors [5]. We present Occamy, a flexible, general-purpose, dual-chiplet system with two 16 GiB HBM2E stacks, optimized for a wide range of irregular-memory-access workloads. Each chiplet integrates a RISC-V host core and 216 lightweight, latency-tolerant RISC-V compute cores with domain-specific ISA extensions, organized hierarchically into six groups of four nine-core compute clusters. Occamy demonstrates three innovations in silicon: (A) efficient multi-precision compute cores with sparse streaming units (SUs) supporting indirection, intersection, and union operations to accelerate general sparse computations; (B) a scalable, latency-tolerant, hierarchical architecture with separate data and control interconnects and distributed DMA units for agile on-die and die-to-die traffic; and (C) a 2.5D system-in-package integrating two compute chiplets with two 16 GiB HBM2E stacks.
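To make the utilization problem concrete, the following minimal C sketch (illustrative only, not Occamy's actual kernel code; all names are hypothetical) shows a sparse matrix-vector product over a CSR matrix. The data-dependent load x[col_idx[j]] in the inner loop is the kind of indirection-based access that the sparse streaming units are designed to stream directly into the FPU, rather than stalling a scalar core on address computation.

    #include <stddef.h>

    /* Sparse matrix-vector product y = A * x, with A in CSR format.
     * Hypothetical example kernel; names and layout are assumptions. */
    void spmv_csr(size_t n_rows,
                  const size_t *row_ptr,  /* CSR row offsets, length n_rows + 1 */
                  const size_t *col_idx,  /* column index of each nonzero       */
                  const double *vals,     /* nonzero values                     */
                  const double *x,        /* dense input vector                 */
                  double *y)              /* dense output vector                */
    {
        for (size_t i = 0; i < n_rows; ++i) {
            double acc = 0.0;
            for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
                /* Indirect access: the address into x depends on loaded data,
                 * so a conventional core spends most cycles on loads and index
                 * arithmetic rather than floating-point work. */
                acc += vals[j] * x[col_idx[j]];
            }
            y[i] = acc;
        }
    }

On a conventional core, each multiply-add in this loop is preceded by dependent loads and an index computation, which is why sparse LA typically stays below 10 % FPU utilization, as noted above.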