PowerGrad: Hierarchical Power Management for Power-Limited ML Inference Clusters

Hyoungwook Nam; Raghavendra Pothukuchi; Alper Buyuktosunoglu; Aporva Amarnath; Pradip Bose; Josep Torrellas

ISCA 2026

Conference paper

27 Jun 2026

Abstract

As machine learning (ML) workloads demand more power and datacenters integrate renewable energy, workloads have to deal with situations where power demands exceed supply. In such situations, intelligently allocating the power among the nodes is key to maximizing efficiency. However, this is hard to do for ML inference workloads, where system administrators cannot profile the workload ahead of time.

To address this challenge, this paper proposes PowerGrad, a hierarchical power-management framework for power-limited ML inference clusters. The idea is to dynamically identify the performance gradient of each running workload, which characterizes the performance sensitivity of the workload to power changes. At runtime, a Gradient Estimator collects hardware measurements and uses them to estimate performance gradients. Then, to maximize efficiency, Local Controllers and Hierarchical Controllers re-distribute the power from low-gradient workloads to high-gradient ones within a node and across nodes, respectively. PowerGrad is especially effective for severely power-limited environments, where every node demands more power than its maximum allocation.
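The redistribution idea can be illustrated with a minimal sketch. It is not the paper's implementation: the `Workload` fields, the `rebalance` function, the fixed 5 W step, and the per-workload power limits are all hypothetical names and values chosen for illustration. The sketch assumes that a gradient estimate (performance change per watt) is already available for each workload, and it performs one step that moves power from the lowest-gradient workload to the highest-gradient one while keeping the total budget constant.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    """Hypothetical per-workload state; field names are illustrative only."""
    name: str
    power_w: float   # current power cap in watts
    gradient: float  # estimated performance change per watt (assumed given)
    min_w: float     # lowest cap the workload may be given
    max_w: float     # highest cap the workload may be given


def rebalance(workloads: List[Workload], step_w: float = 5.0) -> float:
    """Move up to step_w watts from the lowest-gradient workload that can
    give up power to the highest-gradient workload that can accept it.

    Returns the number of watts moved. The total power is unchanged.
    """
    donors = [w for w in workloads if w.power_w > w.min_w]
    receivers = [w for w in workloads if w.power_w < w.max_w]
    if not donors or not receivers:
        return 0.0

    donor = min(donors, key=lambda w: w.gradient)
    receiver = max(receivers, key=lambda w: w.gradient)

    # Nothing to gain if both ends are the same workload or the receiver
    # is no more sensitive to power than the donor.
    if donor is receiver or receiver.gradient <= donor.gradient:
        return 0.0

    # Respect the step size and both workloads' limits.
    delta = min(step_w,
                donor.power_w - donor.min_w,
                receiver.max_w - receiver.power_w)
    donor.power_w -= delta
    receiver.power_w += delta
    return delta
```

Calling `rebalance` repeatedly, with gradients re-estimated between calls, would gradually shift power toward the workloads that benefit most. Under the same assumptions, the step could be applied across nodes by treating each node as one entry with an aggregate gradient.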

While PowerGrad can be applied to a variety of compute architectures, it needs dynamic hardware performance counter information that is unavailable in GPUs and accelerators. Consequently, we demonstrate PowerGrad on two CPU clusters running popular ML inference workloads in power-limited setups. The results show that PowerGrad is both effective and portable across different architectures. In traditional dual-CPU nodes, PowerGrad reduces the average and tail latencies by a mean of 22.9% and 23.0%, respectively, relative to a state-of-the-art baseline. In single-CPU nodes with ML acceleration support, PowerGrad reduces the average and tail latencies by a mean of 9.0% and 9.9%, respectively.
