Keynote

Designing RoCE Networks for AI Workloads

Abstract

The rapid growth of Artificial Intelligence (AI) workloads utilizing ever larger numbers of accelerators has placed unprecedented demand on the network. These workloads typically leverage RDMA and require a high-performance network fabric. Purpose-built, proprietary networking solutions such as InfiniBand provide high performance, but they are expensive to build and operate and do not support application diversity. In this talk, we will first discuss the challenges of designing a high-performance network for private data centers at scale, supporting a wide range of workloads over RoCE and TCP using off-the-shelf hardware and software. We will then discuss the network control plane and routing protocols that accommodate flexible reliability and performance requirements. We deployed our solution to a number of GPU clusters spanning multiple racks of servers, each server equipped with eight Nvidia H100 GPUs and eight Mellanox 400G dual-port CX7 NICs, using a 2-stage Clos topology. Using network micro-benchmarks and NCCL benchmarks that mimic the demands of LLM model training, our results show that the network sustains near-line-rate throughput under traffic patterns where up to 100% of the traffic traverses the spine layer. In production deployments, running LLM training jobs for months, we observe that our RoCE network achieves performance similar to that of InfiniBand networks.