REED: Chiplet-based Accelerator for Fully Homomorphic Encryption

Aikata Aikata1, Ahmet Can Mert1, Sunmin Kwon2, Maxim Deryabin2, and Sujoy Sinha Roy1
1 Graz University of Technology, Graz, Austria
2 Samsung Advanced Institute of Technology, Samsung Electronics, Suwon, Korea

Abstract. Fully Homomorphic Encryption (FHE) enables privacy-preserving computation and has many applications. However, its practical implementation faces massive computation and memory overheads. To address this bottleneck, several Application-Specific Integrated Circuit (ASIC) FHE accelerators have been proposed. All these prior works put every component needed for FHE onto one chip (monolithic), hence offering high performance. However, they encounter common challenges associated with large-scale chip design, such as inflexibility, low yield, and high manufacturing costs. In this paper, we present the first-of-its-kind multi-chiplet-based FHE accelerator ‘REED’ for overcoming the limitations of prior monolithic designs. To utilize the advantages of multi-chiplet structures while matching the performance of larger monolithic systems, we propose and implement several novel strategies in the context of FHE. These include a scalable chiplet design approach, an effective framework for workload distribution, a custom inter-chiplet communication strategy, and advanced pipelined Number Theoretic Transform and automorphism designs to enhance performance. Our instruction-set and power simulation experiments with a pre-layout netlist indicate that the REED 2.5D microprocessor consumes 96.7 mm2 chip area and 49.4 W average power in 7 nm technology. It could achieve a remarkable speedup of up to 2,991× compared to a CPU (24-core 2× Intel X5690) and offer 1.9× better performance, along with a 50% reduction in development costs, when compared to state-of-the-art ASIC FHE accelerators. Furthermore, our work presents the first instance of benchmarking encrypted deep neural network (DNN) training. Overall, the REED architecture offers a highly effective solution for accelerating FHE, thereby significantly advancing the practicality and deployability of FHE in real-world applications.

Keywords: Homomorphic Encryption, Hardware Acceleration, Chiplets, CKKS

1. Introduction

Data breaches can put millions of private accounts at risk because data is often stored or processed without encryption, making it vulnerable to attacks. Fully Homomorphic Encryption (FHE) is a solution that allows secure, private computations, communications, and storage. It enables servers to compute on homomorphically encrypted data and return encrypted outputs. FHE has a wide range of applications, including cloud computing, data processing, and machine learning. The concept of FHE was introduced in 1978 by Rivest, Adleman, and Dertouzos, and the first FHE scheme was constructed in 2009 by Gentry. Since then, many FHE schemes have emerged: BGV, FV, CGGI, and CKKS. These schemes allow computations to be outsourced without the need to trust the service provider, providing a functional and dependable privacy layer.

Despite significant progress in the mathematical aspects of FHE, state-of-the-art FHE schemes typically introduce a 10,000× to 100,000× slowdown compared to plaintext calculations. This overhead can be attributed to plaintexts expanding into large polynomials when encrypted using an FHE scheme. Subsequently, simple operations, like plaintext multiplication, translate into complex polynomial operations. FHE’s massive computation and data overhead hinders its deployment in real-life applications. To bridge this performance gap, researchers have proposed acceleration techniques on various platforms, including GPU, FPGA, and ASIC. Software implementations offer flexibility but poor performance. Attempts have been made to provide GPU- and FPGA-based solutions. However, the performance gap is still 2-3 orders of magnitude compared to plain computation.

Currently, the fastest hardware acceleration results for FHE have been reported using ASIC modeling. These works place all FHE building blocks onto a single large chip to maximize performance, hence the term monolithic. While simulations of these architectures show that they can achieve high performance for FHE workloads, the limitations of current manufacturing capabilities, such as inflexibility, low yield, and higher manufacturing costs [Gon21], impact their real-world deployment. For instance, the large architectures, with area consumption of approximately 400 mm2, result in a manufacturing yield of only 67%, a chip fabrication cost of over 25 million US$ [MUS], and a long time-to-market (>3 years).

Additionally, several of these proposals overlook the crucial need for communication-computation parallelism, as off-chip to on-chip communication is slower than the chip’s computation speed. Our analysis shows that this feature is important in an FHE accelerator for achieving good performance when running complex tasks like neural network training. Prior works also utilize higher on-chip bandwidth due to readily available on-chip memory (20 TB/s [KLK+22], 36 TB/s [KKC+23], and 84 TB/s [SFK+22]). Replacing this on-chip memory with cheaper HBM3 (1.2 TB/s bandwidth) would require 17 to 70 HBM3 modules to match the necessary bandwidth.
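The module counts above follow from simple arithmetic: divide each reported on-chip bandwidth by the roughly 1.2 TB/s of a single HBM3 module and round up. A minimal sketch of this back-of-the-envelope check (the bandwidth figures are those cited above; the per-module HBM3 figure is the approximate value used in this paper):

```python
import math

# Reported on-chip bandwidths of prior monolithic ASIC designs (TB/s).
onchip_bw_tbps = {"[KLK+22]": 20.0, "[KKC+23]": 36.0, "[SFK+22]": 84.0}

HBM3_BW_TBPS = 1.2  # approximate bandwidth of one HBM3 module (TB/s)

for work, bw in onchip_bw_tbps.items():
    # Number of HBM3 modules needed to match the on-chip bandwidth.
    modules = math.ceil(bw / HBM3_BW_TBPS)
    print(f"{work}: {bw} TB/s on-chip -> {modules} HBM3 modules")
```

This reproduces the 17-to-70 range quoted in the text: 17 modules for the 20 TB/s design and 70 for the 84 TB/s design.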

In summary, while the large and complex monolithic FHE architectures proposed in prior works show promise, they face practical challenges such as high manufacturing costs, low yield, and extended time-to-market. Addressing these challenges opens the door to exploring new approaches like chiplet-based architecture design. Chiplet-based architecture design utilizes multiple smaller chiplets instead of one large monolithic chip to realize a large system. Chiplets are modular building blocks that are combined to create more complex integrated circuits, such as CPUs, GPUs, Systems-on-Chip (SoCs), or Systems-in-Package (SiPs).

The transition to chiplet-integrated systems represents both the present and future of architectural designs [Gon21,ZSB21,Man22,YLK+18,GPG23,MWW+22]. In the DATE 2024 keynote talk [RD24], the speaker remarked that chiplet-based designs help ‘push the performance boundaries, with maximum efficiency, while managing costs associated with manufacturing and yield’. Chiplet-based architectures also feature the advantage of tiling beyond the reticle limit (858 mm2) [GPG23], as multiple chiplets can be integrated for better performance. Although chiplet-based architectures enjoy the aforementioned advantages, they also face a trade-off between performance and yield. Multiple smaller chiplets offer high yields and reduced manufacturing costs but, at the same time, experience performance overhead due to slower chiplet-to-chiplet communication. Taking the advantages and challenges of chiplet-based systems into consideration, we investigate the following research question:

How can we design and optimize a multi-chiplet accelerator for FHE that matches the performance of large monolithic FHE accelerators while overcoming the inherent challenges of monolithic designs?

To investigate the question mentioned above, we present REED, a multi-chiplet architecture for FHE acceleration. We propose a holistic design methodology covering all aspects of FHE acceleration, from low-level building blocks to high-level protocols, and reduce the area to 43.9 mm2 for one REED chiplet. This includes the first scalable design methodology for one chiplet and ensures full utilization of chiplets for varying amounts of available off-chip data bandwidth. After finalizing an efficient design of one chiplet, we move to a data and task distribution study for multiple chiplets in the context of CKKS [CKKS17] routines. Towards this, we contribute novel strategies that offer long-term computation and communication parallelism. Finally, we synthesize the proposed design methodology for ASIC and report application benchmarks.

Contributions

To the best of our knowledge, this is the first chiplet-based architecture for accelerating FHE. Throughout this work, we have followed Occam’s razor, seeking the simplest solutions for the best results. We unfold our major contributions as follows:

  • Chiplet-based FHE accelerator: We present a novel and cost-effective chiplet-based FHE implementation approach, which is inherently scalable. The chiplets are homogeneous (i.e., identical), which reduces testing and integration costs. REED with 2.5D packaging surpasses the state-of-the-art work SHARP64 [KKC+23] with 1.9× better performance and 2× lower development cost.
  • Workload division strategy: The first step to realizing a multi-chiplet architecture is to develop an efficient disintegration strategy that helps us divide the workloads among multiple chiplets and reduces memory consumption. Hence, we propose an interleaved data and workload distribution technique for all FHE routines.
  • FHE-tailored efficient C2C communication: Chiplet-based architectures suffer from slow C2C (chiplet-to-chiplet) communication. We address this by proposing the first non-blocking ring-based inter-chiplet communication strategy tailored to FHE. This mitigates data exchange overhead during the KeySwitch macro-routine, accelerating Bootstrapping (the most expensive FHE routine).
  • Scalable design: To attain scalability by design, we propose a configuration-based design methodology such that the memory read/write and computational throughput are the same. Changing the configuration parameters allows the architecture to adapt to the desired area and throughput requirements. This also offers inherent communication-computation parallelism in the design of every chiplet.
  • Novel compute acceleration: Furthermore, we present new design techniques for the micro-procedures of FHE: the number-theoretic transform (NTT) and automorphism (AUT). Our approach introduces a Hybrid NTT, eliminating the need for expensive transpose operations and scratchpad memory. It is easily scalable to higher or lower polynomial degrees. Hence, other applications, such as zero-knowledge proofs, where transposition is expensive due to high polynomial degrees, can also benefit from this. Additionally, we have prototyped these building blocks on an FPGA (Alveo U250).
  • Application benchmark: Finally, we choose parameters offering high precision and good performance. REED is the first work to benchmark encrypted deep neural network training, showcasing practical, real-world impact. While a CPU (24-core, 2× Intel Xeon X5690 @ 3.47 GHz) requires 29 days to finish it, REED 2.5D would take only 15.4 minutes, a realistic time for NN training. We also use DNN training to run accuracy/precision experiments and validate our parameter choice.
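The interleaved workload division above can be illustrated with a toy sketch (our illustration, not REED’s actual scheduler; `interleave_limbs` is a hypothetical name): the RNS limbs of a ciphertext polynomial are assigned round-robin across homogeneous chiplets, so every chiplet holds an equal share and limb-parallel routines stay load-balanced.

```python
def interleave_limbs(num_limbs: int, num_chiplets: int) -> dict[int, list[int]]:
    """Round-robin (interleaved) assignment of RNS limbs to chiplets."""
    assignment = {c: [] for c in range(num_chiplets)}
    for limb in range(num_limbs):
        assignment[limb % num_chiplets].append(limb)  # limb i goes to chiplet i mod C
    return assignment

# Example: 24 limbs over 4 identical chiplets -> 6 limbs per chiplet,
# with chiplet 0 holding limbs 0, 4, 8, ...
print(interleave_limbs(24, 4))
```

Because the chiplets are homogeneous and each holds the same number of limbs, limb-wise operations finish at the same time on every chiplet, avoiding stragglers.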

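The non-blocking ring-based C2C strategy from the contributions above can be modelled abstractly: with C chiplets on a unidirectional ring, C-1 single-hop forwarding steps let every chiplet see every peer’s share, and each hop can overlap with computation on the share just received. A toy model (our illustration; `ring_all_gather` is a hypothetical name, not a REED interface):

```python
def ring_all_gather(num_chiplets: int) -> list[set]:
    """Model a unidirectional ring all-gather: each chiplet starts with its
    own share and forwards the share it holds to its neighbour every step."""
    held = [{i} for i in range(num_chiplets)]   # shares seen by each chiplet
    in_flight = list(range(num_chiplets))       # share each chiplet will forward next
    for _ in range(num_chiplets - 1):           # C-1 hops complete the all-gather
        # Every chiplet receives from its left neighbour, non-blocking.
        in_flight = [in_flight[(i - 1) % num_chiplets] for i in range(num_chiplets)]
        for i, share in enumerate(in_flight):
            held[i].add(share)                  # compute on `share` can overlap here
    return held

# After C-1 hops every chiplet has seen all shares, with no global barrier.
print(ring_all_gather(4))
```

Each chiplet only ever talks to its single neighbour link, so no chiplet blocks on a many-to-many exchange; this is what mitigates the data-exchange overhead during KeySwitch.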
Connection and comparison with chiplet designs for ML

While prior chiplet-based Machine Learning (ML) works address similar problems, our solutions are tailored to meet FHE requirements more effectively. For instance, [SCV+21] addresses MCMs’ long “tail-latency” issue using non-uniform work distribution and communication-aware data placement. In the context of FHE, we resolve this by running parallel computations over extended periods, ensuring uniform task distribution and data placement. Our chiplet interconnections are ring-like and unidirectional. Although we do not propose an automatic tool, our analysis, similar to [TCDM21], focuses on long-term chiplet utilization based on FHE’s computational depth. Our methodology introduces a new configuration-based design, built from scratch with novel building blocks and high-level protocols. In contrast to [HKKR20], which combines heterogeneous chiplets, we propose homogeneous chiplets, observing the unique data-flow of FHE. A common limitation of the prior works is that they propose very small chiplet sizes (2 to 6 mm2), which a recent study [GPG23] finds too small. Thus, we ensure that our chiplet sizes fall within the optimal range.
