MFIT : Multi-FIdelity Thermal Modeling for 2.5D and 3D Multi-Chiplet Architectures
Abstract
Rapidly evolving artificial intelligence and machine learning applications require ever-increasing computational capabilities, while monolithic 2D design technologies approach their limits. Heterogeneous integration of smaller chiplets using a 2.5D silicon interposer and 3D packaging has emerged as a promising paradigm to address this limit and meet performance demands. These approaches offer a significant cost reduction and higher manufacturing yield than monolithic 2D integrated circuits. However, the compact arrangement and high compute density exacerbate the thermal management challenges, potentially compromising performance. Addressing these thermal modeling challenges is critical, especially as system sizes grow and different design stages require varying levels of accuracy and speed. Since no single thermal modeling technique meets all these needs, this paper introduces MFIT, a range of multi-fidelity thermal models that effectively balance accuracy and speed. These multi-fidelity models can enable efficient design space exploration and runtime thermal management. Our extensive testing on systems with 16, 36, and 64 2.5D integrated chiplets and 16x3 3D integrated chiplets demonstrates that these models can reduce execution times from days to mere seconds and milliseconds with negligible loss in accuracy.
I Introduction
Massive data from different modalities, including text, images, video, and speech, are continuously produced by various sensors. At the same time, increasingly complex artificial intelligence (AI) and machine learning (ML) algorithms process this data to enable new applications that were previously impractical. This trend dictates the design of large-scale chips with high memory and compute capabilities, offering a high degree of parallelism [1, 2]. Traditional 2D chip design and packaging technologies cannot sustain this need due to the low yield of large monolithic planar chips and the corresponding increase in fabrication cost [3]. Therefore, new design approaches are required to meet the increasing demand for computing power and memory capacity [1].
2.5D and 3D chiplet-based architectures have emerged as promising alternatives to traditional monolithic 2D chips due to their lower fabrication costs [4, 5, 6]. Compared to conventional monolithic systems, chiplet-based systems integrate multiple small pre-fabricated chips (chiplets) on a silicon interposer, which facilitates data exchange, as illustrated in Figure 1(a). 3D packaged systems expand on this approach by stacking multiple chiplets vertically and connecting them with vertical vias, creating a more compact system as illustrated in Figure 1(b). The smaller size of these chiplets enables a higher yield and lower overall manufacturing cost than traditional monolithic dies [7]. Additionally, this modular approach facilitates scaling the system sizes and enables heterogeneous integration of different chiplet types, e.g., memory, processing, and processing-in-memory chiplets. Hence, emerging 2.5D and 3D architectures enable a new cost-effective avenue for compact scale-out implementations of various emerging compute- and data-intensive applications, including AI/ML. Indeed, these advantages have led to industrial adoption by companies including Intel [8, 9], AMD [10, 11, 12], and NVIDIA [4].


Thermal bottlenecks have long been a significant barrier to increasing the performance of computing systems. 2.5D and 3D integrated systems exacerbate this barrier due to their dense integration and unique physical structure [13]. In contrast to a monolithic chip, where heat is spread directly across the die, a 2.5D chiplet-based system conducts heat between different chiplets through the interposer and heat spreader. Likewise, heat also flows vertically between adjacent stacked chiplets in a 3D chiplet-based system.
These factors introduce unique challenges for effective thermal management in these systems. Traditional design flows and physical floor planning focus on reducing wire lengths to meet timing constraints and minimizing area to reduce fabrication costs. However, these objectives can also lead to thermal crosstalk and hotspots, compromising performance. Chiplet-based systems introduce additional design parameters such as inter-chiplet link length, spacing, chiplet placement, sizing, inter-layer communication, and design partitioning. Tuning traditional and chiplet-based design parameters while maintaining thermal stability is critical to ensure a thermally-efficient design.

The semiconductor chip design cycle spans multiple phases: system specification, architecture exploration, logic design, physical design and validation, fabrication, and post-silicon optimization/validation. Each phase has a unique set of design constraints and requirements. For example, lacking a test chip during the pre-silicon phases requires simulation and analytical models. Finite Element Method (FEM) simulations offer the most accurate approach for pre-silicon thermal analysis [14]. They can serve as a reference and enable heat flow studies to guide the design process. However, they are too slow for practical architecture and design space exploration (DSE). Modeling the package as a thermal RC (resistive-capacitive) network can significantly accelerate simulations with acceptable accuracy loss [15, 16]. Since each node in the thermal circuit corresponds to a specific location in the package, thermal RC models solve discretized versions of the FEM models in space. Hence, they enable thermally-aware DSE and optimization with a finite number of discrete hotspot nodes. However, the thermal resistance/capacitance values and the circuit topology must accurately reflect the chip geometry and material properties for reliable results. Since the thermal RC models solve continuous-time ordinary differential equations (ODEs), they have execution times in the order of seconds to minutes. Therefore, they cannot be used for runtime optimization tasks such as dynamic thermal and power management (DTPM). One can discretize them in the time domain with a given sampling period [17, 18]. The resulting discrete state-space (DSS) models significantly reduce runtime at the cost of further abstracting the model from the physical package. Consequently, they are applicable only to the specific configurations for which they are developed.
There is a strong need for tools to accurately analyze the thermal behavior of 2.5D and 3D integrated systems and guide their design process. However, no single modeling technique can alone address the needs of all design phases. To fill this gap, this paper proposes MFIT, a multi-fidelity thermal modeling framework that synergistically exploits the strengths of each class of models (FEM, thermal RC, and DSS). We use this framework to produce a set of thermal models that can guide the entire design cycle, unlike a point solution that can serve a specific portion of the design process. The elements of this set not only cover complementary parts of the design cycle but also support each other and produce consistent results. We first develop a fine-grained FEM model of the target package as a reference. Since it is slow and computationally expensive, we next judiciously design an abstracted version of this fine-grained FEM model to simulate an entire package in days while maintaining accuracy. To enable fast DSE, MFIT also incorporates thermal RC circuit models verified against the reference FEM models. Our thermal RC models run in the order of seconds while leading to less than 1.7∘C error, as summarized in Figure 2. Hence, they can be used for pre-silicon architectural optimization, such as mapping the workloads to chiplets, network-on-interposer design, and chiplet placement for 2.5D and 3D stacked systems. Finally, MFIT derives one more class of models by discretizing the thermal RC models, enabling runtime thermal management and large-scale DSE in the order of milliseconds. However, they work only for a specific sampling period and configuration. Hence, the parameters must be regenerated from the RC model if the target configuration changes. In summary, we obtain a set of multi-fidelity thermal models that guide and complement each other to cover all design phases.
The key contributions of this work are as follows:
• A novel thermal modeling approach that systematically abstracts fine-grained FEM models to produce abstract FEM, thermal RC, and DSS models to achieve varying speed and accuracy trade-offs,
• A family of open-source multi-fidelity thermal models that span a wide accuracy (reference to 1.7∘C) and speed (days to milliseconds) range,
• Extensive evaluations with 16-, 36-, and 64-chiplet 2.5D and 16x3-chiplet 3D integrated systems running AI/ML workloads to demonstrate the accuracy and speed-up benefits of our multi-fidelity thermal models,
• Open-sourced code for thermal RC and DSS models at github.com/AlishKanani/MFIT. Additionally, we plan to make our FEM models publicly accessible in the near future.
II Related Work
2.5D and 3D integration-based systems are becoming mainstream due to higher performance and lower manufacturing costs than monolithic chips. Both domain-specific and general-purpose 2.5D and 3D architectures have been explored to date [11, 19, 20, 5, 4, 21, 12]. SIMBA is one of the first prototype multi-chip modules with 36 chiplets designed for inference with deep models [4]. Similarly, Floret is a data center-scale architecture for accelerating convolutional neural network (CNN) inference tasks by exploiting dataflow knowledge [5]. Loi et al. [22] analyze the performance benefits of a vertically integrated (3D) processor-memory hierarchy under thermal constraints. Similarly, Eckert et al. [21] consider processing-in-memory (PIM) architectures implemented using 3D die stacking. They study the thermal constraints across different processor organizations and cooling solutions to identify viable solutions. Our proposed open-source thermal models catalyze similar thermal analysis and optimization studies for 2.5D and 3D integrated systems.
The most accurate and direct thermal evaluation approach is temperature measurement on a hardware system. It can be performed using thermal imaging [23, 24] or temperature sensors [25, 26]. However, the availability of the target hardware is a significant limitation. For example, large-scale 2.5D and 3D chiplet systems with tens of chiplets do not exist yet, while smaller prototypes and commercial systems provide limited insights applicable to larger systems [10, 27]. This limitation motivated FEM-based modeling as the most accurate way to analyze the heat flow and temperature [14]. Proprietary software, such as ANSYS Fluent [28] and COMSOL [29], is commonly used for FEM simulations. Since FEM suffers from high computational cost, detailed FEM solutions are suitable only for small designs and for validating analytical models [28]. For example, the authors of [30] employ FEM to simulate a two-chiplet system on an interposer. They employ abstracted FEM models for both μ-bumps and C4 bumps to speed up the process, effectively reducing computational complexity before tackling the entire system.
The computational overhead and impractically high execution time of FEM solvers motivate analytical models that enable rapid thermal evaluation in early design phases. The most common method involves constructing thermal RC networks and solving the corresponding system of ODEs. Popular thermal simulators such as HotSpot [15] leverage this method, focusing on the microarchitectural layout blocks to facilitate design space exploration and early-stage thermal-aware layout and placement. Similarly, 3D-ICE [16] models liquid cooling with microchannels embedded between silicon layers. PACT [31] also employs a similar methodology by utilizing SPICE tools as solvers, focusing on standard-cell-level thermal analysis for 2.5D systems. However, these tools are not fast enough for large-scale, thermally-aware DSE of multi-chiplet 2.5D and 3D integrated systems.
2.5D and 3D packages often involve materials with varying thermal conductivity across different directions. For example, the thermal conductivity of the C4 layer is higher in the vertical direction than in the lateral direction. The existing thermal models [15, 16, 31] do not account for these variations. Moreover, they assume a uniform grid size for all material layers (e.g., interposer, C4 bumps, chiplets). In contrast, the thermal RC models in our multi-fidelity set allow varying thermal conductivity across different directions and grid sizes for each layer and block.
Architecture-level thermal RC models are well-suited for offline studies, such as architectural exploration, temperature sensor placement [32, 33], and thermal-aware chiplet placement [34]. However, DTPM requires a much faster temperature estimation time, in the order of milliseconds, for real-time temperature management [18, 35]. DSS models address this need by deriving a discrete-time linear time-invariant system that models the thermal dynamics at fixed locations as a function of the power consumption. For instance, TILTS [36] discretizes the power inputs to the chip over fixed time intervals to accelerate thermal simulations. Hence, it needs to be reproduced when the timing requirements or the underlying hardware configuration change. The speedup gain offsets the loss of the explicit connection to the hardware parameters (e.g., thermal conductance and capacitance) and generality.
The results of FEM, thermal RC, or other thermal simulations can also be used to train physics-informed machine learning techniques that model heat transfer in integrated circuits and reduce the thermal modeling effort [37]. For example, a recent technique collects data from numeric simulations and trains a random forest model to predict the convection heat transfer coefficients for a nonlinear heat transfer problem [38]. Similarly, Hwang et al. [39] present closed-form models derived from numerical simulations for tapered micro-channels to analyze the heat transfer performance as a function of the channel geometry. In contrast to individual classes of thermal models, this work proposes a framework to produce a family of multi-fidelity thermal models for 2.5D and 3D chiplet-based systems. The specific set of models designed with this framework covers a wide range of accuracy and execution time trade-offs, making them suitable for different design phases. Additionally, they can be augmented by additional models, such as physics-informed ML models, with complementary accuracy and execution time trade-offs.

III FEM for Thermal Analysis
FEM analysis begins by dividing the problem domain into small finite elements, converting the continuous governing partial differential equations (PDEs) into algebraic equations. Next, the system’s geometry is broken down into a lattice of small discrete cells called a “mesh” which approximates a larger, continuous block [28]. After applying the PDEs and boundary conditions to each element, the equations are assembled into a global algebraic system, maintaining continuity between adjacent elements. This global system, representing the discretized PDEs over the whole domain, is solved numerically for the field variables at each mesh node. In this work, only the equation governing solid conduction is solved [14]:
(1)  $\nabla \cdot (k\,\nabla T) + q = \rho\, c_p\, \frac{\partial T}{\partial t}$

where $k$ is the thermal conductivity, $T$ is the temperature, $q$ is the heat generation rate, $\rho$ is the density, and $c_p$ is the volumetric specific heat.
III-A Stages of the FEM Simulation Pipeline
Performing FEM simulations involves several key processing steps visualized in Figure 3. First, the geometry, a 3D representation of the 2.5D or 3D integrated package, is created using computer-aided design tools. This geometry should be as detailed as possible while allowing the setup and simulation to be completed within the given time constraints. Next, a volumetric mesh is generated by transforming the 3D model into one consisting of many individual cells on which the FEM software operates. This step can be iterated since creating an appropriate mesh (a.k.a. meshing) is critical for the computation time and accuracy of the solutions. Once an acceptable mesh has been created, it is imported into the solver. The simulation is then set up, including boundary conditions, material parameters, power source terms, and other general model parameters. Our specific system setup is expanded upon in Section V. Finally, the FEM software simulates the model by solving the governing heat transfer equations.
III-B Impracticality of FEM Simulations
While FEM simulations offer high accuracy, they are impractical for DSE or runtime thermal management due to their time-consuming setup and operation. The process of geometry creation, meshing, solver setup, and execution is intricate and often exceeds the simulation runtime itself. Because the simulation process requires multiple iterations for reliable results, this time overhead quickly becomes prohibitive. The setup of 2.5D or 3D integrated systems is especially complex due to the large number of discrete power sources. These systems also involve numerous small and large bodies, dramatically increasing the computational complexity and the solver runtime [28]. Simulation times range from hours to days, directly impacted by geometric detail, complexity, size, and setup parameters such as the time step. Consequently, analyzing 2.5D or 3D integrated systems with intricate geometries and operating conditions using FEM simulations becomes prohibitively time-consuming, highlighting the need for alternative approaches.



IV Multi-Fidelity Thermal Modeling
IV-A Overview of the Proposed Approach
Our multi-fidelity thermal model set involves four individual models, visualized in Figure 4. The process of creating these models is identical for any packaging technology. MFIT considers 2.5D chiplets on silicon interposers and 3D direct-bonded systems [9].
We start by creating fine-grained FEM models of specific components within the package. For example, we model individual links within the interposer and μ-bumps connecting a chiplet to the interposer. The cost of this level of detail is model complexity and execution time, which limit the simulation scope. Therefore, the fine-grained models are used as a reference to design abstracted FEM models, as explained in the following subsection. This abstraction enables system-level FEM simulations of systems with much higher chiplet counts than would otherwise be possible with negligible accuracy loss. The third class in the MFIT framework is the thermal RC models. Since these models are constructed using the geometry and material parameters of the system, a new RC model of a different system configuration can be created without re-running FEM simulations, allowing for rapid DSE. Finally, the continuous-time state-space equations that govern the thermal RC models are discretized with a given sampling period to create DSS models as detailed in Section IV-D.
IV-B Fine-Grained to Abstracted FEM Modeling
First, fine-grained models of key system components are constructed with as much detail as possible. Fine-grained modeling of the entire system at the highest level of detail is infeasible due to the memory, CPU, and execution time requirements. The second step is systematically designing abstracted models by replacing detailed structures with homogeneous blocks. During this process, we find the material parameters for these blocks such that their thermal behavior matches the original structure.
MFIT focuses on two structures within a chiplet-based package for this work: the μ-bumps connecting each chiplet to the interposer and the links that enable communication between chiplets. The rationale behind selecting these components is elaborated on in the following subsections. These two structures are present in both 2.5D and 3D chiplet-based packages, as shown in Figures 5 and 6, and the results of the abstraction experiments are applied to both the 2.5D and 3D full-system abstracted models.
While we apply our abstract modeling approach to only two structures in this work, the same approach applies to other structures in the package, such as the substrate or C4 bumps. In addition to these abstractions, MFIT also models the heatsink as a heat transfer coefficient (HTC) instead of a physical model. This choice removes the need to model fluids in our simulations, as fluid flow is only used for convective heat transfer in the heatsink.
To capture 3D chiplet-based systems, we consider direct die-to-die bonding between vertically stacked chiplets as an example. With this bonding method, no additional layers are modeled between stacked chiplets. Instead, chiplets are modeled directly contacting each other, stacked one on top of the other. Alternatively, other bonding methods, such as TSVs or μ-bumps between stacked chiplets, can be used. When modeling the geometry of TSVs or μ-bumps between stacked chiplets, an additional block of homogeneous material is created between each pair of stacked chiplets. This block has the same thickness as the bonding layer. Then, to determine the material parameters of this new block, a process identical to that described in the following section is followed.
IV-B1 μ-bump Abstracted Model
The μ-bumps are particularly important for thermal behavior since they are one of the two paths to dissipate heat away from a chiplet, as seen in Figure 5. Due to the density and the total number of μ-bumps present, which number in the thousands, it is impractical to simulate an entire package with individually modeled μ-bumps. Therefore, a small block of the μ-bump layer, along with the associated chiplet and interposer, is simulated in isolation to determine thermal coefficients that can be applied to the final abstract models as illustrated in Figure 7.

First, the detailed block containing μ-bumps and underfill material is simulated with static heat and convection boundaries, which are applied to create a measurable thermal gradient across the μ-bump layer. Then, the thermal conductivity is calculated as:

(2)  $k = \frac{Q\,L}{A\,\Delta T}$

where $Q$ is the heat flow rate, $L$ is the thickness of the material, $A$ is the cross-sectional area, and $\Delta T$ is the temperature difference across the material [40]. Thermal capacitance and specific heat are calculated via a weighted body average [41]. These parameters are applied to a model containing a homogeneous block in place of the previously modeled μ-bumps and underfill material. Finally, the same boundary conditions are used, as shown in Figure 7.
We observe an identical temperature drop across the μ-bump layer of the abstracted model and less than a tenth of a degree difference in interface temperatures in this sub-block, as presented in Table I, while achieving an approximately 1.5× speedup.
Model | Chiplet-side temp. (∘C) | Interposer-side temp. (∘C) | ΔT across layer (∘C)
---|---|---|---
Detailed μ-bumps | 39.13 | 31.05 | 8.08
Abstracted μ-bumps | 39.26 | 31.18 | 8.08
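As a concrete illustration of Equation (2), the snippet below extracts an effective vertical conductivity for the abstracted μ-bump layer from a simulated heat flow and the resulting temperature drop. The heat flow rate, layer thickness, and cross-sectional area are hypothetical placeholders rather than the values used in our FEM setup; only the 8.08∘C drop is taken from Table I.

```python
# Worked example of Eq. (2): effective vertical conductivity of the abstracted
# mu-bump layer. Q, L, and A are hypothetical; dT is the drop reported in Table I.
Q = 5.0        # W, heat flow rate forced through the sub-block (assumed)
L = 70e-6      # m, thickness of the mu-bump/underfill layer (assumed)
A = 16e-6      # m^2, cross-sectional area of the sub-block (assumed, 4 mm x 4 mm)
dT = 8.08      # K, temperature drop across the layer (Table I)

k_eff = Q * L / (A * dT)
print(f"Effective conductivity: {k_eff:.2f} W/(m K)")  # ~2.71 W/(m K)
```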
IV-B2 Link Abstracted Model
A link is a group of wires embedded in the interposer to interconnect chiplets. Depending on the thermal crosstalk over the links, the NoI architecture can play a significant role in a system's thermal behavior. To determine how links are modeled in the complete system simulations, we tested three different configurations of a two-chiplet package. These configurations model the link (1) in full detail, (2) as an abstracted block, and (3) not at all. We use two different power configurations: (a) the power dissipation is static over time, and (b) it varies dynamically over time. These power consumption profiles are applied to one chiplet while the temperature of the other chiplet is calculated through FEM simulation. The mean absolute error (MAE) of the receiving chiplet temperature compared to the detailed model case is recorded in Table II. Only a minimal accuracy loss is observed for both cases, while the execution time savings are significant, as shown in Table III. Therefore, we choose not to model links in our full-system simulations.
Model | MAE (∘C), static power (a) | MAE (∘C), dynamic power (b)
---|---|---
Abstracted links | 0.05 | 0.02
No links | 0.34 | 0.13
Model | Execution time, static power (a) | Execution time, dynamic power (b)
---|---|---
Detailed links | 489.23 | 503.86
Abstract links | 164.29 | 172.64
No links | 123.80 | 132.13
IV-B3 Heatsink Abstracted Model
FEM simulations that involve a heatsink must model the convective heat transfer from the heatsink to the atmosphere using fluid models [30]. However, modeling fluid dramatically increases the setup and simulation time. Additionally, the geometry must be modified for every heatsink configuration, further increasing the time needed for design iteration. Due to their complexity, high-performance cooling methods, such as liquid cooling, are also difficult to model in FEM-based simulations.
MFIT removes the need to model the heatsink by abstracting the cooling solution to a single HTC. We then apply this coefficient to the top of the lid where a heatsink would typically sit. Modeling a heatsink as an HTC is an active area of research that has been studied for many different cooling solutions [42, 43]. This approach allows for a great deal of flexibility in FEM modeling. Instead of completing the time-consuming pipeline in Figure 3, the HTC can be modified to test the behavior of different cooling solutions.
MFIT assumes an active air-cooled heatsink. The value of the HTC of an air-cooled heatsink is determined by:

(3)  $\mathrm{HTC} = \frac{\bar{h}\left[A_t - N_f\,A_f\,(1 - \eta_f)\right]}{L \cdot W}$

where $A_t$ is the total area of the heatsink, $A_f$ is the fin area, $N_f$ is the number of fins, $\eta_f$ is the fin efficiency, and $L$ and $W$ are the length and width of the base plate. The average convective HTC ($\bar{h}$) can be calculated using the Nusselt number [41]. We select values consistent with a basic copper heatsink with forced airflow provided by a typical commercial computer fan. The parameters can be easily tuned to reflect different cooling solutions. Applying this method accelerates the simulation process substantially.
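The sketch below shows one way the effective HTC of Equation (3) could be evaluated for a finned, air-cooled heatsink. All numerical values (fin count, dimensions, fin efficiency, and the average convective coefficient) are illustrative assumptions rather than the parameters used in MFIT.

```python
# Minimal sketch of Eq. (3): effective heat-transfer coefficient of a finned,
# air-cooled heatsink referred to the base-plate area. All values are assumed.
h_avg = 55.0                  # W/(m^2 K), average convective HTC (from a Nusselt correlation)
n_fins = 20                   # number of fins
A_fin = 2 * 0.04 * 0.03       # m^2, exposed area of one fin (two 40 mm x 30 mm faces)
L_base, W_base = 0.04, 0.04   # m, base-plate length and width
A_total = L_base * W_base + n_fins * A_fin   # m^2, total exposed heatsink area
eta_fin = 0.85                # fin efficiency (depends on fin geometry and material)

# Effective HTC applied to the top of the lid in place of a physical heatsink model
htc_eff = h_avg * (A_total - n_fins * A_fin * (1.0 - eta_fin)) / (L_base * W_base)
print(f"Effective HTC: {htc_eff:.0f} W/(m^2 K)")
```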
While this method is effective for systems where heat is dissipated primarily through the lid, a different approach may be required for other cooling methods, such as inter-tier liquid cooling, where microchannels contact the chip directly [39, 38]. In such an approach, heat is dissipated directly from the chip without moving to a heat spreader like the lid. Hence, additional heat transfer coefficients or abstraction techniques may be needed to capture the cooling behavior.
IV-C FEM to Thermal RC models
Notation | Definition
---|---
$k_x$, $k_y$, $k_z$ | Thermal conductivity of a layer along the x, y, and z axes
$g_x$, $g_y$, $g_z$ | Thermal conductance of a node along the x, y, and z axes
$\rho$ | Material density
$c_p$ | Volumetric specific heat of a layer
$g_{conv}$ | Convection conductance of a boundary node
HTC | Heat transfer coefficient of heatsink
N | Total number of nodes in the RC and DSS models
$T$, $\dot{T}$ | N×1 temperature matrix and its derivative
$Q$ | N×1 matrix of heat generation
$C$, $G$ | N×N thermal capacitance and conductance matrices
This section describes the process of constructing a thermal RC model from the geometry of a given package. MFIT applies this technique to both 2.5D and 3D systems, demonstrating the flexibility of the proposed methodology across different packaging technologies. Moreover, this process can easily be applied to any package.
The package is first divided into horizontal layers, with the slicing process starting at the bottom substrate layer and ending at the top lid layer. Depending on the package design, each layer may be composed of a uniform material or various material blocks. This flexibility enables the thermal RC model to simulate packages with heterogeneous designs, where different chiplets are manufactured with various technologies, resulting in different material parameters within the same layer. A layer with uniform material properties is divided into a 2D grid of nodes. When a layer contains different material blocks, each block can be divided into grids with different levels of granularity. The result is a 3D network of thermal nodes that discretizes the package geometry in space.
Since a layer or material block may be anisotropic, with thermal conductivity that differs along the x, y, and z axes (represented by $k_x$, $k_y$, and $k_z$), we calculate the thermal conductance ($g_x$, $g_y$, $g_z$) for each node using the following equations:

(4)  $g_x = \frac{k_x\,\Delta y\, t}{\Delta x}, \quad g_y = \frac{k_y\,\Delta x\, t}{\Delta y}, \quad g_z = \frac{k_z\,\Delta x\,\Delta y}{t}$

where $\Delta x$ and $\Delta y$ are the node lengths in the x and y dimensions, respectively, and $t$ represents the thickness of the layer. The thermal capacitance of each node is then calculated as $C = \rho\, c_p\, \Delta x\, \Delta y\, t$, where $\rho$ is the material density and $c_p$ is the volumetric specific heat.
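A minimal sketch of Equation (4) and the node capacitance calculation is given below; the function signature and the silicon example values are illustrative, not MFIT's internal interface.

```python
# Minimal sketch of Eq. (4): thermal conductances and capacitance of one node in
# an anisotropic layer. Function name and example values are illustrative only.
def node_parameters(kx, ky, kz, dx, dy, t, rho, cp):
    """Return (gx, gy, gz, C) for a node with footprint dx x dy and layer thickness t."""
    gx = kx * dy * t / dx        # W/K, lateral conductance along x
    gy = ky * dx * t / dy        # W/K, lateral conductance along y
    gz = kz * dx * dy / t        # W/K, vertical conductance through the layer
    C = rho * cp * dx * dy * t   # J/K, thermal capacitance of the node volume
    return gx, gy, gz, C

# Example: a silicon node with a 1 mm x 1 mm footprint in a 100 um thick layer
gx, gy, gz, C = node_parameters(kx=149, ky=149, kz=149,       # W/(m K)
                                dx=1e-3, dy=1e-3, t=100e-6,   # m
                                rho=2330, cp=700)             # kg/m^3, J/(kg K)
```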
Heat is dissipated from the package primarily through a heatsink, which is simulated using a convective heat transfer coefficient as detailed in Section IV-B3. MFIT assumes forced convection is applied to the heatsink while passive convection occurs on the other external boundaries of the package. Consequently, convective conductance ($g_{conv}$) is incorporated into the nodes of the top and bottom layers.
The conductance between neighboring nodes $i$ and $j$ ($g_{ij}$) of the same layer is determined by the lateral conductances ($g_x$ and $g_y$). Unlike existing thermal RC models [44, 16, 31], our thermal RC model allows non-uniform grid sizes for different layers and blocks, so a node in one layer can be connected to multiple nodes from adjacent layers. Thus, vertical conductances between nodes of different layers are calculated from $g_z$, considering the overlap in the x-y plane. Once this RC network is established, we can formulate an ODE based on Kirchhoff's current law for a node $i$ as:

(5)  $C_i \frac{dT_i}{dt} = Q_i + \sum_{j} g_{ij}\,(T_j - T_i)$
The heat generation ($Q_i$) from node $i$ is analogous to electric current, and temperature ($T_i$) is analogous to voltage. Since only the chiplet layers consume power, heat generation for the nodes in other layers is zero. Solving the system of ODEs by forming a matrix equation is a well-studied approach. It can be represented by:

(6)  $C\,\dot{T} = Q - G\,T$

where $T$, $\dot{T}$, and $Q$ are matrices representing node temperatures, temperature derivatives, and generated heat. $C$ is a diagonal matrix, where each element corresponds to a node's thermal capacitance. The conductance matrix $G$ can be expressed as:

(7)  $G_{ij} = \begin{cases} \sum_{m \neq i} g_{im} + g_{conv,i} & \text{if } i = j \\ -g_{ij} & \text{if } i \neq j \end{cases}$

where $g_{ij}$ represents the conductance between the neighboring nodes $i$ and $j$, and $g_{conv,i}$ is nonzero only for boundary nodes.
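The following sketch assembles a toy conductance matrix following Equation (7) from a hypothetical list of node-pair conductances and boundary convection conductances, stored in a sparse format as discussed next.

```python
# Minimal sketch of Eq. (7): assembling a sparse conductance matrix G from a
# hypothetical list of node-pair conductances and boundary convection terms.
import numpy as np
from scipy.sparse import lil_matrix

N = 4                                     # total number of nodes (toy example)
edges = [(0, 1, 1.5), (1, 2, 1.5),        # (node i, node j, g_ij in W/K)
         (2, 3, 1.5), (0, 3, 0.8)]
g_conv = np.array([0.2, 0.0, 0.0, 0.2])   # W/K, convection conductance of boundary nodes

G = lil_matrix((N, N))
for i, j, g in edges:
    G[i, j] -= g                          # off-diagonal entries: -g_ij
    G[j, i] -= g
    G[i, i] += g                          # diagonal: sum of conductances at each node
    G[j, j] += g
G.setdiag(G.diagonal() + g_conv)          # add convection terms on the diagonal
G = G.tocsr()                             # compressed format for fast matrix-vector products
```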
MFIT employs the highly adaptive solver LSODA [45] to solve this system of ODEs. LSODA is designed to handle both stiff and non-stiff systems efficiently. It dynamically switches between different numerical integration methods depending on the characteristics of the ODE system being solved. This switching capability is particularly useful for thermal ODEs, as the equations’ stiffness can vary over time depending on the power consumption. It is worth noting that the matrices representing the system are highly sparse since each node is connected to only a few neighboring nodes. MFIT leverages this sparsity to accelerate the solver’s execution time. Finally, we fine-tune the capacitance values of each layer utilizing FEM results as a reference to improve the accuracy of our model.
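As a minimal, self-contained sketch of this step, the code below solves $C\,\dot{T} = Q - G\,T$ with SciPy's LSODA implementation (via scipy.integrate.solve_ivp) for a toy two-node network. The conductances, capacitances, and power trace are illustrative, and temperatures are referenced to a 25∘C ambient through the convection terms.

```python
# Minimal sketch: solving the thermal RC system C*dT/dt = Q(t) - G*(T - T_amb)
# with the LSODA solver on a toy two-node network. All values are illustrative.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.sparse import csr_matrix

N = 2
# Nodes coupled by 1.5 W/K; node 1 additionally loses 0.2 W/K to ambient
G = csr_matrix(np.array([[ 1.5, -1.5],
                         [-1.5,  1.7]]))
C_diag = np.array([1.6e-4, 3.2e-4])   # J/K, per-node thermal capacitances
T_amb = 25.0                          # deg C

def power(t):
    """Piecewise-constant heat generation per node (W); only node 0 is active."""
    q = np.zeros(N)
    q[0] = 3.0 if t < 5.0 else 0.0
    return q

def rhs(t, T):
    # Sparse matrix-vector product exploits the sparsity of G
    return (power(t) - G @ (T - T_amb)) / C_diag

sol = solve_ivp(rhs, t_span=(0.0, 10.0), y0=np.full(N, T_amb),
                method="LSODA", max_step=0.1)
print(sol.y[:, -1])                   # node temperatures at t = 10 s
```

In a full model, G would be the N×N sparse matrix built from Equation (7) and C_diag the vector of node capacitances.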
IV-D Thermal RC to Discrete State Space models
The thermal RC model can be discretized in the time domain to further reduce the execution time of the model at no cost in accuracy. The resulting DSS model's limitation is its dependence on an underlying continuous-time model: it cannot be constructed directly without either a thermal RC model as an intermediate step or system identification from measurement data. Additionally, a DSS model is specific to the geometry, materials, and sampling period used during the creation of the thermal RC model and the subsequent discretization process. Therefore, the DSS model must be reconstructed if any design parameter changes. Only an existing RC model and a fixed time step are required to create a DSS model; no direct information from the original FEM model is needed.
Integrating the continuous-time system given in Equation 6 over a time interval results in [46]:

(8)  $T(t) = e^{-C^{-1}G\,(t - t_0)}\,T(t_0) + \int_{t_0}^{t} e^{-C^{-1}G\,(t - \tau)}\,C^{-1}Q(\tau)\,d\tau$

Assuming constant heat generation during the sampling period $T_s$, the system can be discretized from the continuous time variable $t$ to discrete steps $n$ as:

(9)  $T[n+1] = A\,T[n] + B\,Q[n]$

where $A = e^{-C^{-1}G\,T_s}$ and $B = (I - A)\,G^{-1}$ are the state and input matrices. Equation 9 represents the discrete-time equivalent of the continuous-time thermal RC model (shown in Equation 6). MFIT uses the zero-order hold (ZOH) method for the discretization process. When power is provided as discrete inputs at each sampling period, ZOH provides an exact match to the continuous-time model. $T_s$ can be determined for discretization as a function of input power consumption and system dynamics.
The DSS model consists only of multiply-accumulate operations, allowing for extremely fast execution, as shown in Section V. The discretization process is also nearly instantaneous, allowing for rapid DSS model creation when a thermal RC model is available.
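A minimal sketch of the ZOH discretization and the resulting multiply-accumulate update is shown below for the same toy two-node network; the matrices, sampling period, and power values are illustrative assumptions, and temperatures are expressed as rises over ambient.

```python
# Minimal sketch: ZOH discretization of the thermal RC model (Eq. 9) and the
# resulting multiply-accumulate runtime loop. Toy two-node values; temperatures
# are rises over ambient.
import numpy as np
from scipy.linalg import expm

G = np.array([[ 1.5, -1.5],
              [-1.5,  1.7]])               # W/K, conductance matrix (Eq. 7)
C = np.diag([1.6e-4, 3.2e-4])              # J/K, capacitance matrix
Ts = 0.1                                   # s, sampling period (assumed)

A = expm(-np.linalg.solve(C, G) * Ts)      # discrete state matrix, exp(-C^-1 G Ts)
B = (np.eye(2) - A) @ np.linalg.inv(G)     # discrete input matrix, (I - A) G^-1

T = np.zeros(2)                            # temperature rise over ambient
Q = np.array([3.0, 0.0])                   # W, constant per-node power input
for n in range(100):                       # 10 s of simulated time
    T = A @ T + B @ Q                      # one MAC update per sampling period
print(T + 25.0)                            # absolute temperatures after 10 s
```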
V Experimental Results
V-A Experimental Setup
We evaluate the accuracy of the proposed MFIT methodology on three 2.5D systems and one 3D system representative of their respective classes. Three separate 2.5D systems are studied to demonstrate the flexibility of the proposed approach for systems with different numbers of chiplets. A 3D system is considered to demonstrate the capability of the approach to model systems beyond a single planar layer of chiplets. Our thermal RC and DSS models are open-sourced to catalyze research in this domain. The rest of this section describes the parameters and geometry of the 2.5D and 3D systems considered in this paper.
Parameter | 16 2.5D | 36 2.5D | 64 2.5D | 16x3 3D
---|---|---|---|---
Package Geometry | | | |
Package height (mm) | 1.855 | 1.855 | 1.855 | 2.255
Package side length (mm) | 15.5 | 21.5 | 27.5 | 15.5
Package area (mm2) | 240.25 | 462.25 | 756.25 | 240.25
Package volume (mm3) | 445.66 | 857.47 | 1402.84 | 534.55
Power (100% utilization) | | | |
Power per chiplet (W) | 3 | 3 | 3 | 1.2
Total power (W) | 48 | 108 | 192 | 57.6
Power density (W/mm2) | 0.199 | 0.233 | 0.253 | 0.239
Temperature | | | |
Maximum temperature (∘C) | 118.25 | 129.75 | 164.03 | 128.65
V-A1 Package Overview
Both the 2.5D and 3D systems utilize a silicon interposer with chiplets placed upon it. The interposer is connected to the underlying substrate using C4 bumps. Copper wires embedded in the interposer are used to connect neighboring chiplets. In both our 2.5D and 3D systems, each chiplet area is considered to be , consistent with prior studies [19, 20, 5]. Each chiplet consists of multiple blocks. Each of these blocks corresponds to a component such as a computational tile or a router used for inter-chiplet communication, as detailed in Figure 1(a). Each of these blocks which make up the chiplet has an individual power profile. This means that different power profiles can be applied to every computational tile and router port in each chiplet. In our experimentation, different levels of detail are applied to the chiplets in the 2.5D and 3D system, as described in the following sections.
2.5D System Specifics: The target 2.5D system consists of a grid of chiplets integrated on an interposer, as illustrated in Figure 1(a). Each chiplet is connected directly to the interposer via μ-bumps surrounded by a capillary underfill material. The physical dimensions of the router ports are compatible with the Universal Chiplet Interconnect Express (UCIe) specification [47]. The entire package is covered by a copper lid, which contacts each chiplet through a thermal interface material (TIM).
3D System Specifics: In the 3D system, stacks of three chiplets are placed in a 4x4 grid with equal spacing, consistent with [9]. The bottom chiplets are connected to the interposer through μ-bumps surrounded by a capillary underfill material, as detailed in Figure 1(b). The lid contacts only the top chiplet layer through a thermal interface material.
V-B Thermal RC Model Configuration
The number of nodes in the thermal RC network determines the model complexity, runtime, and granularity at which temperature can be observed in the model. A higher node density is used in chiplets to optimize this trade-off, while fewer nodes are used in non-chiplet components, such as the interposer, lid, substrate, and so on. Each chiplet is divided into four equal quadrants. One node is placed within each quadrant to allow for granular temperature monitoring across each chiplet. For non-chiplet layers in each model, an alternate node density is used for each model as described below. This easily configurable non-uniform node density enables higher thermal resolution in critical parts such as chiplets while decreasing the runtime with lower resolution in less critical structures such as the substrate and lid. The DSS models in our experimentation are created by discretizing the thermal RC models with sampling period. The sampling time can be chosen as a function of the application requirements.
2.5D Thermal RC Model Specifics: For the 2.5D systems, the choice of 4 nodes per chiplet leads to 64, 144, and 256 nodes in the chiplet layer of the 16, 36, and 64 chiplet systems, respectively. For all other layers, the number of nodes is equal to the total number of chiplets per layer. This allows the model to maintain higher thermal resolution in the critical chiplet layers while maintaining a fast execution time.

3D Thermal RC Model Specifics: The node densities in the 3D system are adjusted similarly to the 2.5D models. The layers that contain chiplets use an 8x8 grid, implying 4 nodes per chiplet. All other layers have a 4x4 grid, leading to a lower node density. The entire 3D system consists of 48 chiplets in total. There are a total of 192 nodes counting all nodes within chiplets.
V-B1 Input Workloads and Power Consumption
Identical workloads are considered for the 2.5D and 3D systems, with differences in chiplet power density detailed in the following subsections. The target chiplet systems are analyzed under one synthetic (WL1) and five real AI/ML application workloads (WL2-WL6). The synthetic workload starts with a stress test that applies the maximum power to all chiplets to increase temperature beyond 100∘C. Then, a pseudo-random bit sequence (PRBS) is applied to each chiplet to emulate a wide range of dynamic variations. Finally, all chiplets are turned off to let the temperature return to the ambient state, as depicted in Figure 10. Besides testing transient and steady-state behaviors, this power profile helps us to tune the thermal RC model.
The remaining scenarios consider processing-in-memory (PIM)-based chiplets for accelerating ML workloads. The computational platform is resistive random access memory (ReRAM)-based chiplets commonly used in the literature [20, 4, 48]. We select this configuration due to its ability to efficiently implement matrix-vector multiplication, which is the predominant operation in any CNN workload. Each workload consists of a sequence of deep neural networks (DNNs) that run back-to-back on the system. The workloads are listed in Table VI. The neural networks (NNs) in these workloads include several network types, such as ResNets, DenseNets, and VGG networks. For example, WL2 contains 16 ResNet34's, followed by one VGG19, then 5 ResNet50's, and so on. Each workload contains from 20 to 40 individual networks. Workloads are mapped to the system as computing resources become available, meaning a new NN is mapped to chiplets as soon as a previous NN completes execution. Consequently, these workloads consist of NNs ranging from small NNs like ResNet18, which can be mapped to a single chiplet, to larger NNs such as DenseNet169, which are spread across multiple chiplets.
Workload | Composition
---|---
WL1 | Synthetic (see Figure 10)
WL2 | 16xResNet34 (C), 1xVGG19 (C), 5xResNet50 (C), 3xDenseNet40 (C), 1xResNet152 (C), 1xVGG19 (I), 4xResNet34 (I), 1xResNet18 (I), 1xResNet50 (I), 1xVGG16 (I)
WL3 | 16xResNet34 (I), 1xVGG19 (I), 5xResNet50 (I), 3xDenseNet169 (I), 1xResNet110, 1xVGG19 (I), 4xResNet101 (I), 1xResNet152 (I), 1xResNet18 (I), 1xResNet50 (I), 1xResNet152 (I)
WL4 | 16xResNet34 (C), 2xVGG19 (I), 4xDenseNet169 (I), 3xDenseNet40 (C), 5xResNet50 (C), 3xResNet101, 7xResNet150 (I), 2xVGG19 (I), 4xResNet101, 1xVGG19 (C)
WL5 | 16xResNet34 (I), 1xResNet152 (I), 1xResNet110 (I), 3xResNet101 (I), 9xDenseNet169 (I), 4xResNet34 (I), 12xResNet18 (I), 5xResNet50 (I), 1xResNet152 (I)
WL6 | 3xDenseNet169 (I), 4xResNet34 (I), 12xResNet18 (I), 4xResNet101 (I), 2xVGG19 (I), 4xResNet101 (I), 1xVGG19 (C), 3xDenseNet40 (C)
After mapping the group of NNs to the target chiplet-based systems, the chiplet power consumption is estimated in two parts, communication and computation. We estimate computation power through NeuroSim and interconnection network power using BookSim [49, 50]. We use running average power throughout the workload execution (40-55 seconds), consistent with power measuring tools such as Intel RAPL [51] and pyNVML [52].
Differences in 2.5D and 3D system chiplet power: The experiments use different hardware parameters, such as voltage and frequency, for the 2.5D and 3D systems. This results in a lower per-chiplet power consumption of 1.2 W in the 3D system, compared to 3 W for the 2.5D systems, as detailed under the Power section of Table V. Using these parameters, the total system power per lid area of the 3D system falls between that of the 36- and 64-chiplet 2.5D systems. This power density implies that the temperatures of individual chiplets should be roughly equivalent across these systems, which is confirmed in Figure 10.
V-B2 HotSpot [15] configuration:
We also compare our proposed models to the state-of-the-art tool HotSpot [15] for all the system sizes and workload configurations. Since HotSpot was originally designed for 2D chips, we utilize an extension that adds 3D modeling capabilities [53] to model both 2.5D and 3D integrated systems. Geometry and material parameters are set to be identical to our reference FEM model. Since HotSpot does not support thermal conductivity variations along the x, y, and z directions, we use the average conductivity for anisotropic material layers (e.g., the C4 bump layer) in the chiplet package. HotSpot also lacks support for varying grid sizes across different layers. Therefore, we maintain a uniform grid size matching our chiplet layer.
V-C Execution Time Evaluation
This section evaluates the execution time of the proposed multi-fidelity thermal model set. All simulations are run on a dual Intel Xeon Gold 6242R system with 40 processing cores. We use WL1 for our timing analysis since the execution time is comparable across all workloads.
2.5D Evaluation: Abstracted FEM simulations take 2.4, 14.5, and 38.0 hours for the 16, 36, and 64 chiplet systems, respectively. While providing an accurate reference, these long simulation times and significant development effort make FEM impractical for DSE. Our thermal RC models fill this gap with execution times ranging from 1.8 to 53.0 seconds for the 16, 36, and 64 chiplet systems, as summarized in the first 3 systems in Figure 8. Coupled with the accuracy presented in the previous subsection, the 46 to 136-fold speedup demonstrates their viability as a DSE tool.
Thermal RC models are derived directly from the underlying geometry and material parameters, meaning they can be reconfigured for different hardware and design configurations without re-calibration from FEM simulations. Relaxing this connection between the model and the physical system, our DSS models reduce the execution time to 39, 96, and 944 ms for 16, 36, and 64 chiplets, respectively. This speedup with respect to the RC model enables runtime temperature prediction, which can inform dynamic thermal and power management (DTPM) decisions to increase system performance and reliability [54]. While maintaining the accuracy for a given configuration, DSS models need to be regenerated if the sampling period or hardware configuration changes.
3D Evaluation: The execution time results of the 3D system are summarized in the far right system of Figure 8. FEM simulations take approximately 3.3 hours for the single 3D system. The thermal RC model dramatically decreases this runtime to 6 seconds. Similar to the results seen for the 2.5D system, the 3D DSS model again shows a dramatically reduced runtime of 0.07 seconds. For comparison, the execution time of a similarly sized 2.5D system DSS model is 0.09 seconds.
Execution time comparison to HotSpot [15]: The commonly used thermal modeling tool HotSpot belongs to the same class as our thermal RC models. The proposed models are significantly faster than HotSpot (1862× for the 16-, 607× for the 36-, and 245× for the 64-chiplet 2.5D systems, and 817× for the 16x3 3D system), as shown in Figure 8.
The significant speedup mentioned above with similar or higher accuracy is primarily attributed to two factors. First, we employ a non-uniform grid for different layers, as explained in Section IV-C. The non-uniform grid size alone yields 13×, 17×, 19×, and 7× faster execution for the 16-, 36-, and 64-chiplet 2.5D systems and the 16x3 3D system, respectively, with respect to HotSpot. Second, we employ the adaptive solver LSODA. This solver requires fewer iterations per time step for convergence in comparison to HotSpot. Using a grid configuration identical to HotSpot's, this solver alone leads to speedups of approximately 144×, 35×, 13×, and 122×, again with respect to the baseline HotSpot execution time. Finally, we emphasize that MFIT, our multi-fidelity thermal model set, covers a much wider range of accuracy and execution time trade-offs than a specific point solution such as HotSpot.

V-D Validation of Thermal RC and DSS model




We validate the accuracy of our thermal RC and DSS models by comparing their temperature estimates to full-system FEM simulation results for the same workload and system configurations. This comparison is completed for each of the three 2.5D system sizes and the single 3D system. In addition to the visualization of the temperature estimate over time, two metrics are used to quantify the accuracy of our thermal RC and DSS models against the FEM results. The MAE metric measures the mean absolute error in temperature across the entire simulation duration. Predicting temperature violations (e.g., tracking the time steps when the temperature exceeds the allowed threshold) is often used by DTPM algorithms. Therefore, the second metric measures the accuracy of our models in predicting temperature violations. We set 85∘C as the maximum allowable temperature threshold for each system without loss of generality [55]. This metric first identifies the time steps in FEM simulations where temperature violations occur (temperature exceeds 85∘C). Then, it computes the percentage of these violations captured by the thermal RC and DSS models (e.g., 100% means all violations are detected with perfect accuracy). The proposed models conservatively flag violations within one degree of the above-mentioned threshold temperature.
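The sketch below illustrates how these two metrics could be computed from aligned temperature traces; the function and array shapes are hypothetical, and the one-degree conservative margin follows the description above.

```python
# Minimal sketch of the two validation metrics: MAE and the percentage of FEM
# temperature violations captured by a faster model. Arrays are placeholders.
import numpy as np

def validation_metrics(T_fem, T_model, threshold=85.0, margin=1.0):
    """T_fem, T_model: temperature traces of shape (time_steps, nodes) in deg C."""
    mae = np.mean(np.abs(T_fem - T_model))
    fem_violations = T_fem > threshold                 # reference violations
    # The faster model conservatively flags violations within `margin` degrees
    model_violations = T_model > (threshold - margin)
    captured = fem_violations & model_violations
    if fem_violations.sum() == 0:
        coverage = 100.0                               # no violations to capture
    else:
        coverage = 100.0 * captured.sum() / fem_violations.sum()
    return mae, coverage

# Example with random placeholder traces for a 64-chiplet system (256 nodes)
rng = np.random.default_rng(0)
T_fem = 80.0 + 10.0 * rng.random((1000, 256))
T_rc = T_fem + rng.normal(0.0, 1.0, T_fem.shape)
print(validation_metrics(T_fem, T_rc))
```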
To assist users in visualizing the thermal behavior of the system under test, the RC model also creates a heat map of each layer in the system. As an example, the heat map of the interposer layer of a 2.5D 64 chiplet system is shown in Figure 9. This figure shows the temperature gradient that occurs between the center of the interposer, where the heat producing chiplets are located, and the edges of the system, where there are no chiplets. These maps allow for quick visual verification of the system behavior instead of relying purely on numerical results.
2.5D Validation Results: Figures 10(a), 10(b), and 10(c) plot the temperature of a representative chiplet as a function of time for each 2.5D system size while running workload WL1. All three plots clearly show that the systematically constructed thermal RC and DSS models produce near-identical results to the FEM baseline. They closely follow the FEM results during the stress test (temperature increases until reaching the maximum point), the randomly changing chiplet power consumption (middle portion), and the cool-down periods.
Workload | Model | 2.5D - 16 Chiplets MAE (∘C) | 2.5D - 16 Chiplets Viol. captured (%) | 2.5D - 36 Chiplets MAE (∘C) | 2.5D - 36 Chiplets Viol. captured (%) | 2.5D - 64 Chiplets MAE (∘C) | 2.5D - 64 Chiplets Viol. captured (%) | 3D - 16x3 Chiplets MAE (∘C) | 3D - 16x3 Chiplets Viol. captured (%)
---|---|---|---|---|---|---|---|---|---
WL1 | Thermal RC | 1.23 | 93.4 | 1.42 | 96.9 | 1.17 | 99.2 | 0.82 | 99.7
 | DSS | 1.23 | 93.4 | 1.42 | 96.9 | 1.17 | 99.2 | 0.82 | 99.7
 | HotSpot | 2.74 | 67.7 | 1.64 | 96.9 | 1.69 | 98.1 | 1.32 | 94.5
WL2 | Thermal RC | 0.86 | 95.9 | 1.16 | 100 | 1.08 | 100 | 1.08 | 100
 | DSS | 0.86 | 95.9 | 1.16 | 100 | 1.08 | 100 | 1.08 | 100
 | HotSpot | 1.57 | 75.0 | 1.24 | 100 | 1.19 | 100 | 1.19 | 100
WL3 | Thermal RC | 1.02 | 100 | 1.28 | 77.2 | 1.17 | 89.3 | 1.05 | 100
 | DSS | 1.02 | 100 | 1.28 | 77.2 | 1.17 | 89.3 | 1.05 | 100
 | HotSpot | 1.97 | 100 | 1.18 | 42.3 | 1.33 | 74.4 | 1.08 | 100
WL4 | Thermal RC | 1.41 | 96.6 | 1.63 | 95.5 | 1.55 | 97.6 | 1.30 | 99.3
 | DSS | 1.41 | 96.6 | 1.63 | 95.5 | 1.55 | 97.6 | 1.30 | 99.3
 | HotSpot | 2.29 | 95.6 | 1.89 | 92.0 | 2.15 | 97.7 | 1.85 | 98.0
WL5 | Thermal RC | 1.01 | 100 | 1.25 | 87.8 | 1.07 | 82.4 | 1.03 | 100
 | DSS | 1.01 | 100 | 1.25 | 87.8 | 1.07 | 82.4 | 1.03 | 100
 | HotSpot | 1.94 | 100 | 1.16 | 60.7 | 1.40 | 52.9 | 1.05 | 100
WL6 | Thermal RC | 0.89 | 98.1 | 1.30 | 84.8 | 1.21 | 90.9 | 1.11 | 98.3
 | DSS | 0.89 | 98.1 | 1.30 | 84.8 | 1.21 | 90.9 | 1.11 | 98.3
 | HotSpot | 1.62 | 85.8 | 1.28 | 89.7 | 1.56 | 89.5 | 1.24 | 86.7
The 2.5D columns of Table VII summarize the accuracy results for all 2.5D systems and workloads. The worst-case mean absolute errors are only 1.41, 1.63, and 1.55 degrees (highlighted in dark red) for the 16, 36, and 64 chiplet systems, respectively. These results indicate that the proposed models achieve excellent accuracy across different hardware configurations and workloads. Our models also achieve high accuracy in predicting temperature violations. For example, the worst-case accuracy for the 16-chiplet system is 93.4% (i.e., violations are missed in only 6.6% of the time steps) for WL1. Our models capture 100% of the violations during WL3 and WL5 while missing a handful of them for the other workloads. The corresponding accuracy for the 36-chiplet system is above 95% for WL1, WL2, and WL4. We observe 77.2%, 84.8%, and 87.8% accuracy for WL3, WL6, and WL5, respectively. These relatively lower accuracy values stem from sudden temperature spikes that lead to short-term temperature violations in these workload-system combinations. Temperature spikes in specific chiplets occur when several grouped chiplets experience a transient power spike simultaneously. This increases the temperature of the more central chiplets in the group. In these cases, the peak temperatures are mostly at or below 85 degrees with infrequent short-term violations, which can be tolerated. For example, FEM simulations indicate only 255 temperature violations across all chiplets in WL3 compared to 11 thousand violations in WL1. Hence, missing even a few violations impacts the accuracy heavily when the RC and DSS models do not capture these spikes. In contrast, our thermal RC and DSS models effectively detect more prolonged violations, as evidenced by WL1, WL3, and WL4. Similarly, our thermal RC and DSS models achieve high accuracy for the 64-chiplet system, as shown in Table VII. The lowest accuracies are 82.4% for WL5 and 89.3% for WL3, which have few total violations.
3D Validation Results: Figure 10(d) plots the temperature of a representative chiplet in the 3D system as a function of time while running WL1. The plot shows behavior similar to the 2.5D comparison: the RC and DSS results match each other exactly and follow the reference FEM results extremely closely during the stress test, random, and cool-down portions of the workload.
The two rightmost columns of Table VII summarize the accuracy results for the 3D system for each workload. The worst-case MAE is only 1.3 degrees (highlighted again in dark red). This result shows that the proposed models maintain high levels of accuracy even when applied to a stacked-die package. The same holds when predicting temperature violations: the worst-case violation prediction accuracy is still 98.3%, occurring for WL6.
Accuracy comparison to HotSpot [15]: Table VII also lists the MAE of HotSpot simulations compared to FEM simulations for the 2.5D and 3D systems. For the 2.5D systems, the average error of HotSpot is 0.95, 0.05, and 0.35 degrees greater than that of our thermal RC and DSS models for the 16, 36, and 64 chiplet systems, respectively. Notably, HotSpot also has lower accuracy in detecting temperature violations, but our primary advantage is in execution time, as discussed in Section V-C. For the 3D results, HotSpot's average error is 0.22 degrees greater than that of our thermal RC and DSS models. HotSpot again shows lower accuracy in detecting temperature violations in workloads 1, 4, and 6. HotSpot's larger error across system configurations can be attributed to the fact that our thermal RC model parameters are tuned using the reference FEM results. Additionally, HotSpot does not support different thermal conductivities along the x, y, and z axes, which further decreases its accuracy.
VI Conclusion
Conventional monolithic 2D chips cannot sustain the increasing performance and compute capacity demands due to increasing manufacturing costs. 2.5D and 3D multi-chiplet systems have emerged as cost-effective solutions to continue the required scaling. However, substantial compute power in a small volume intensifies the power density, leading to severe heat dissipation and thermal challenges. There is a strong need for open-source thermal modeling tools that enable researchers to analyze thermal behavior and perform thermally-aware optimizations. Re-purposing existing approaches developed for monolithic chips incurs accuracy and execution time penalties, while custom-designed singular solutions have limited scope. To fill this gap, this paper proposed MFIT, a set of multi-fidelity thermal models that span a wide range of accuracy and execution time trade-offs. Since the proposed models are consistent by construction, designers can use them throughout the design cycle, from system specification to design space exploration and runtime resource management.
References
- [1] Semiconductor Research Corporation, “Decadal Plan for Semiconductors,” 2021. https://www.src.org/about/decadal-plan/decadal-plan-full-report.pdf, accessed March 31, 2024.
- [2] Semiconductor Industry Association, “International Technology Roadmap for Semiconductors, 2015 edition,” https://www.semiconductors.org/resources/2015-Intl.-technology-roadmap-for-semiconductors-itrs/.
- [3] Y.-K. Cheng et al., “Next-generation design and technology co-optimization (dtco) of system on integrated chip (soic) for mobile and hpc applications,” in Proc. of IEEE IEDM, 2020, pp. 41.3.1–41.3.4.
- [4] Y. S. Shao et al., “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proc. of IEEE/ACM MICRO, vol. 64, no. 6. ACM New York, NY, USA, 2021, pp. 107–116.
- [5] H. Sharma et al., “Florets for chiplets: Data flow-aware high-performance and energy-efficient network-on-interposer for cnn inference tasks,” ACM Transactions on Embedded Computing Systems, vol. 22, no. 5s, pp. 1–21, 2023.
- [6] G. Krishnan et al., “Big-little chiplets for in-memory acceleration of dnns: A scalable heterogeneous architecture,” in 2022 IEEE/ACM ICCAD, 2022, pp. 1–9.
- [7] D. Stow, I. Akgun, R. Barnes, P. Gu, and Y. Xie, “Cost analysis and cost-driven ip reuse methodology for soc design based on 2.5d/3d integration,” in Proc. of IEEE/ACM ICCAD, 2016, pp. 1–6.
- [8] M. Won, “Agilex fpgas deliver a game-changing combination of flexibility and agility for the data-centric world,” Intel, Tech. Rep., 2023.
- [9] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, “Die stacking (3d) microarchitecture,” in 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), 2006, pp. 469–479.
- [10] S. Naffziger et al., “Pioneering chiplet technology and design for the amd epyc™ and ryzen™ processor families : Industrial product,” in Proc. of ACM/IEEE ISCA, 2021, pp. 57–70.
- [11] S. Bharadwaj, J. Yin, B. Beckmann, and T. Krishna, “Kite: A family of heterogeneous interposer topologies enabled via accurate interconnect modeling,” in Proc. of ACM/IEEE DAC, 2020.
- [12] R. Agarwal, P. Cheng, P. Shah, B. Wilkerson, R. Swaminathan, J. Wuu, and C. Mandalapu, “3d packaging for heterogeneous integration,” in 2022 IEEE 72nd Electronic Components and Technology Conference (ECTC). IEEE, 2022, pp. 1103–1107.
- [13] J. Park, A. Kanani, L. Pfromm, H. Sharma, P. Solanki, E. Tervo, J. R. Doppa, P. P. Pande, and U. Y. Ogras, “Thermal modeling and management challenges in heterogeneous integration: 2.5D chiplet platforms and beyond,” in 2024 IEEE 42nd VLSI Test Symposium (VTS). IEEE, 2024, pp. 1–4.
- [14] H. Sultan, A. Chauhan, and S. R. Sarangi, “A survey of chip-level thermal simulators,” ACM Comput. Surv., vol. 52, no. 2, Apr. 2019.
- [15] K. Skadron et al., “Temperature-aware microarchitecture: Modeling and implementation,” ACM Trans. Archit. Code Optim., vol. 1, no. 1, pp. 94–125, Mar. 2004.
- [16] A. Sridhar, A. Vincenzi, D. Atienza, and T. Brunschwiler, “3D-ICE: A compact thermal model for early-stage design of liquid-cooled ICs,” IEEE Transactions on Computers, vol. 63, no. 10, pp. 2576–2589, 2013.
- [17] G. Bhat, G. Singla, A. K. Unver, and U. Y. Ogras, “Algorithmic optimization of thermal and power management for heterogeneous mobile platforms,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 3, pp. 544–557, 2017.
- [18] S. Sharifi and T. Š. Rosing, “Accurate direct and indirect on-chip temperature sensing for efficient dynamic thermal management,” IEEE TCAD-IC, vol. 29, no. 10, pp. 1586–1599, 2010.
- [19] G. Krishnan et al., “SIAM: Chiplet-based scalable in-memory acceleration with mesh for deep neural networks,” ACM Transactions on Embedded Computing Systems (TECS), vol. 20, no. 5s, pp. 1–24, 2021.
- [20] H. Sharma et al., “SWAP: A server-scale communication-aware chiplet-based manycore PIM accelerator,” IEEE TCAD-IC, vol. 41, no. 11, pp. 4145–4156, 2022.
- [21] Y. Eckert, N. Jayasena, and G. H. Loh, “Thermal feasibility of die-stacked processing in memory,” in Proc. of WoNDP, 2014.
- [22] G. L. Loi et al., “A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy,” in Proc. of DAC, 2006, pp. 991–996.
- [23] S. Sadiqbatcha et al., “Hot spot identification and system parameterized thermal modeling for multi-core processors through infrared thermal imaging,” in Proc. of DATE. IEEE, 2019, pp. 48–53.
- [24] H. Amrouch and J. Henkel, “Lucid infrared thermography of thermally-constrained processors,” in Proc. of IEEE/ACM ISLPED, 2015, pp. 347–352.
- [25] J. Zhang et al., “Full-chip power density and thermal map characterization for commercial microprocessors under heat sink cooling,” IEEE TCAD-IC, vol. 41, no. 5, pp. 1453–1466, 2021.
- [26] Y. Zhang, A. Srivastava, and M. Zahran, “Chip level thermal profile estimation using on-chip temperature sensors,” in IEEE Intl. Conf. on Computer Design, 2008, pp. 432–437.
- [27] F. Zaruba, F. Schuiki, and L. Benini, “A 4096-core RISC-V chiplet architecture for ultra-efficient floating-point computing,” in Proc. of IEEE Hot Chips 32 Symposium. IEEE Computer Society, 2020, pp. 1–24.
- [28] J. E. Matsson, An Introduction to Ansys Fluent 2023. Sdc Publications, 2023.
- [29] COMSOL, “COMSOL Multiphysics reference manual.” [Online]. Available: https://www.comsol.com/
- [30] M. Zhou, L. Li, F. Hou, G. He, and J. Fan, “Thermal modeling of a chiplet-based packaging with a 2.5-d through-silicon via interposer,” IEEE Transactions on Components, Packaging and Manufacturing Technology, vol. 12, no. 6, pp. 956–963, 2022.
- [31] Z. Yuan et al., “PACT: An extensible parallel thermal simulator for emerging integration and cooling technologies,” IEEE TCAD-IC, vol. 41, no. 4, pp. 1048–1061, 2021.
- [32] F. Zanini, D. Atienza, C. N. Jones, and G. De Micheli, “Temperature sensor placement in thermal management systems for MPSoCs,” in Proc. of ISCAS, 2010, pp. 1065–1068.
- [33] P. K. Chundi et al., “Hotspot monitoring and temperature estimation with miniature on-chip temperature sensors,” in Proc. of IEEE/ACM ISLPED, 2017, pp. 1–6.
- [34] Y. Ma, L. Delshadtehrani, C. Demirkiran, J. L. Abellan, and A. Joshi, “TAP-2.5D: A thermally-aware chiplet placement methodology for 2.5D systems,” in Proc. of DATE. IEEE, 2021, pp. 1246–1251.
- [35] G. Bhat, G. Singla, A. K. Unver, and U. Y. Ogras, “Algorithmic optimization of thermal and power management for heterogeneous mobile platforms,” IEEE Trans. VLSI Syst., vol. 26, no. 3, pp. 544–557, 2018.
- [36] Y. Han, I. Koren, and C. M. Krishna, “TILTS: A fast architectural-level transient thermal simulation method,” Journal of Low Power Electronics, vol. 3, no. 1, pp. 13–21, 2007.
- [37] S. Cai, Z. Wang, S. Wang, P. Perdikaris, and G. E. Karniadakis, “Physics-informed neural networks for heat transfer problems,” Journal of Heat Transfer, vol. 143, no. 6, p. 060801, 2021.
- [38] B. Kwon, F. Ejaz, and L. K. Hwang, “Machine learning for heat transfer correlations,” International Communications in Heat and Mass Transfer, vol. 116, p. 104694, 2020.
- [39] L. Hwang, B. Kwon, and M. Wong, “Accurate models for optimizing tapered microchannel heat sinks in 3D ICs,” in Proc. of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2018, pp. 58–63.
- [40] G. Nellis and S. Klein, Heat Transfer. Cambridge University Press, 2008.
- [41] T. L. Bergman, A. S. Lavine, F. P. Incropera, and D. P. DeWitt, Fundamentals of Heat and Mass Transfer. John Wiley & Sons, 2011.
- [42] W. A. Khan, J. R. Culham, and M. M. Yovanovich, “Modeling of cylindrical pin-fin heat sinks for electronic packaging,” IEEE Transactions on Components and Packaging Technologies, vol. 31, no. 3, pp. 536–545, 2008.
- [43] S. Narasimhan and J. Majdalani, “Characterization of compact heat sink models in natural convection,” IEEE Transactions on Components and Packaging Technologies, vol. 25, no. 1, pp. 78–86, 2002.
- [44] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan, “Temperature-aware microarchitecture: Modeling and implementation,” ACM Trans. Archit. Code Optim., vol. 1, no. 1, pp. 94–125, Mar. 2004. [Online]. Available: https://doi.org/10.1145/980152.980157
- [45] A. C. Hindmarsh and L. R. Petzold, “LSODA, ordinary differential equation solver for stiff or non-stiff system,” 2005.
- [46] S. H. Zak et al., Systems and control. Oxford University Press New York, 2003, vol. 198.
- [47] D. D. Sharma, G. Pasdast, Z. Qian, and K. Aygun, “Universal chiplet interconnect express (UCIe): An open industry standard for innovations with chiplets at package level,” IEEE Transactions on Components, Packaging and Manufacturing Technology, vol. 12, no. 9, pp. 1423–1431, 2022.
- [48] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A pipelined ReRAM-based accelerator for deep learning,” in Proc. of HPCA. IEEE, 2017, pp. 541–552.
- [49] P.-Y. Chen, X. Peng, and S. Yu, “NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning,” IEEE TCAD-IC, vol. 37, no. 12, pp. 3067–3080, 2018.
- [50] N. Jiang et al., “A detailed and flexible cycle-accurate network-on-chip simulator,” in Proc. of ISPASS. IEEE, 2013, pp. 86–96.
- [51] H. David et al., “RAPL: Memory power estimation and capping,” in Proc. of the ACM/IEEE ISLPED, 2010, pp. 189–194.
- [52] “pyNVML,” https://github.com/gpuopenanalytics/pynvml.
- [53] J. Meng, K. Kawakami, and A. K. Coskun, “Optimizing energy efficiency of 3-D multicore systems with stacked DRAM under power and thermal constraints,” in Proc. of DAC, 2012, pp. 648–655.
- [54] D. Brooks, R. P. Dick, R. Joseph, and L. Shang, “Power, thermal, and reliability modeling in nanometer-scale microprocessors,” IEEE Micro, vol. 27, no. 3, pp. 49–62, 2007.
- [55] M. Zhou et al., “Temperature-aware dram cache management—relaxing thermal constraints in 3-d systems,” IEEE TCAD-IC, vol. 39, no. 10, pp. 1973–1986, 2020.