A Multiscale Workflow for Thermal Analysis of 3DI Chip Stacks
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
HPC applications are increasingly utilizing cloud resources due to their cost-effectiveness. Among these resources, spot compute instances present an opportunity to run applications at deep discounts as compared to on-demand instances. However, they present unique challenges for tightly-coupled HPC applications due to potential interruptions. Traditional parallel programming models like MPI are not inherently fault-tolerant, and existing methods to handle these interruptions are inefficient and require significant programmer effort. In this paper, we present Charm++ as an alternative solution that natively supports fault tolerance, dynamic load balancing, and resource rescaling. We present a tool to run Charm++ applications with a mix of on-demand and spot instances which can detect and efficiently handle spot interruptions. We show that using spot instances can result in up to 60% cost savings for our benchmark application.
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
Marcelo Amaral
OSSEU 2023
Evaline Ju, Kelly Abuelsaad
KubeCon EU 2026
Nikoleta Iliakopoulou, Jovan Stojkovic, et al.
MICRO 2025