Occamy: A 432-Core 28.1 DP-GFLOP/s/W 83% FPU Utilization Dual-Chiplet, Dual-HBM2E RISC-V-based Accelerator for Stencil and Sparse Linear Algebra Computations with 8-to-64-bit Floating-Point Support in 12nm FinFET
Gianna Paulin*, Paul Scheffler*§, Thomas Benz*§, Matheus Cavalcante†, Tim Fischer*, Manuel Eggimann*, Yichao Zhang*, Nils Wistoff*, Luca Bertaccini*, Luca Colagrande*, Gianmarco Ottavi‡, Frank K. Gürkaynak*, Davide Rossi‡, Luca Benini*‡
* ETH Zurich, Switzerland
† Stanford University, USA
‡ University of Bologna, Italy
We present Occamy, a 432-core RISC-V dual-chiplet 2.5D system for efficient sparse linear algebra and stencil computations on FP64 and narrow (32-, 16-, 8-bit) SIMD FP data. Occamy features 48 clusters of RISC-V cores with custom extensions, two 64-bit host cores, and a latency-tolerant multi-chiplet interconnect and memory system with 32 GiB of HBM2E. It achieves leading-edge utilization on stencils (83 %), sparse-dense (42 %), and sparse-sparse (49 %) matrix multiply.
Introduction
Sparse machine learning (ML) and high-performance computing applications in fields like multiphysics simulation and graph analytics often rely on sparse linear algebra (LA), stencil codes, and graph pattern matching [1]. These workloads achieve low FPU utilization (typically < 10 % for sparse LA) on modern CPUs and GPUs because of their sparse, irregular memory accesses and complex, indirection-based address computations [2-4]. While many specialized accelerators have been proposed for sparse ML workloads, they lack the flexibility of instruction processors [5].

We present Occamy, a flexible, general-purpose, dual-chiplet system with two 16 GiB HBM2E stacks, optimized for a wide range of irregular-memory-access workloads. Each chiplet integrates a RISC-V host core and 216 lightweight, latency-tolerant RISC-V compute cores with domain-specific ISA extensions, organized hierarchically into six groups of four nine-core compute clusters. Occamy demonstrates three innovations in silicon: (A) efficient multi-precision compute cores with sparse streaming units (SUs) supporting indirection, intersection, and union operations to accelerate general sparse computations, (B) a scalable, latency-tolerant, hierarchical architecture with separate data and control interconnects and distributed DMA units for agile on-die and die-to-die traffic, and (C) an innovative system-in-package 2.5D integration of two compute chiplets with two 16 GiB HBM2E stacks.
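To make the targeted access patterns concrete, the following is a minimal software-level sketch (plain Python, not Occamy's ISA extensions or the SU programming model) of the two patterns named above: the indirect gather at the heart of sparse-dense products (CSR SpMV) and the sorted-index intersection underlying sparse-sparse products. On conventional cores, both patterns serialize address computation and FP work; the SUs stream them in hardware.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse-dense matrix-vector product on a CSR matrix.
    The gather x[col_idx[j]] is the 'indirection' access pattern."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[j] * x[col_idx[j]]  # indirect load via index array
    return y

def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Sparse-sparse dot product: walk two sorted index lists and multiply
    only where indices match (the 'intersection' access pattern)."""
    i = j = 0
    acc = 0.0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:      # matching nonzeros: do FP work
            acc += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:     # advance the lagging stream
            i += 1
        else:
            j += 1
    return acc
```

A union traversal is the same two-pointer walk but emits an element on every step rather than only on matches, as needed for sparse additions.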