Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET

By Gianna Paulin*, Paul Scheffler*, Thomas Benz*, Matheus Cavalcante†, Tim Fischer*, Manuel Eggimann*, Yichao Zhang*, Nils Wistoff*, Luca Bertaccini*, Luca Colagrande*, Gianmarco Ottavi‡, Frank K. Gürkaynak*, Davide Rossi‡, Luca Benini*‡
* ETH Zurich, Switzerland
† Stanford University, USA
‡ University of Bologna, Italy

We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, and 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom ISA extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge FPU utilization on stencils (83%) as well as sparse-dense (42%) and sparse-sparse (49%) matrix multiplication.

Introduction

Sparse machine learning (ML) and high-performance computing applications in fields such as multiphysics simulation and graph analytics often rely on sparse linear algebra (LA), stencil codes, and graph pattern matching [1]. On modern CPUs and GPUs, these workloads achieve low FPU utilization (typically below 10% for sparse LA) because of their irregular memory accesses and complex, indirection-based address computations [2-4]. While many specialized accelerators have been proposed for sparse ML workloads, they lack the flexibility of instruction-set processors [5].

We present Occamy, a flexible, general-purpose, dual-chiplet system with two 16 GiB HBM2E stacks, optimized for a wide range of workloads with irregular memory accesses. Each chiplet integrates a RISC-V host core and 216 lightweight, latency-tolerant RISC-V compute cores with domain-specific ISA extensions, organized hierarchically into six groups of four nine-core compute clusters. Occamy demonstrates three innovations in silicon: (A) efficient multi-precision compute cores with sparse streaming units (SUs) supporting indirection, intersection, and union operations to accelerate general sparse computations; (B) a scalable, latency-tolerant, hierarchical architecture with separate data and control interconnects and distributed DMA units for agile on-die and die-to-die traffic; and (C) an innovative 2.5D system-in-package integration of two compute chiplets with two 16 GiB HBM2E stacks.
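To illustrate the indirection pattern behind the low FPU utilization cited above, consider a minimal C sketch of a compressed-sparse-row (CSR) sparse matrix-vector product. This is not code from Occamy's software stack, and all names are illustrative: every access to the dense vector x goes through the column-index array col_idx, so load addresses depend on data rather than on the loop counter. On conventional CPUs and GPUs, this data-dependent address generation stalls the FPU; it is the kind of indirected operand stream that Occamy's sparse streaming units feed to the FPU in hardware.

#include <stddef.h>

/* Minimal CSR SpMV sketch (illustrative only, not the paper's code):
 * y = A * x, with A stored as (row_ptr, col_idx, vals). */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *vals, const double *x, double *y) {
  for (size_t i = 0; i < n_rows; ++i) {
    double acc = 0.0;
    /* Nonzeros of row i occupy vals[row_ptr[i] .. row_ptr[i+1]). */
    for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
      acc += vals[j] * x[col_idx[j]]; /* data-dependent (indirect) load */
    }
    y[i] = acc;
  }
}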