Talk

Benchmarking GenAI Like a Pro: Scaling Experiments, Predicting Performance and Keeping Your Sanity

Abstract

Generative AI is moving fast—and if you're responsible for deploying or tuning these models, you're probably feeling the heat. New LLMs, hardware, and training methods are landing constantly. How do you make sense of it all? How do you actually know what’s performant, what’s cost-effective, and what breaks the moment your stack changes?

In this talk, we’ll show you how we went from scattered, ad-hoc experiments to a fully structured, scalable benchmarking system capable of running tens of thousands of GenAI experiments—across models, hardware, and tuning techniques—with speed and repeatability.

We’ll break down how we built the stack: Ray for scale, Pydantic for schema rigor, MySQL to persist the chaos, and a CLI that feels like kubectl. We’ll show how we explore and optimize massive configuration spaces, visualize the results with Apache Superset, and use predictive models to skip the brute-force grind and get insights faster.
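To make "schema rigor" over a large configuration space concrete, here is a minimal sketch of the general idea. It is not the system described in the talk. It uses a standard-library dataclass as a stand-in for a Pydantic model so that it runs with no dependencies, and every field name and value (`model`, `gpu`, `batch_size`, `method`, `model-a`, `A100`, `lora`, and so on) is a hypothetical placeholder. In a stack like the one described, each validated config would presumably be handed to a Ray task and its result persisted to the database; both steps are omitted here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ExperimentConfig:
    """One point in the configuration space (all fields are hypothetical)."""
    model: str
    gpu: str
    batch_size: int
    method: str

    def __post_init__(self):
        # Reject invalid configs before any compute is spent on them.
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")
        if not (self.model and self.gpu and self.method):
            raise ValueError("model, gpu and method must be non-empty")


def expand_grid(space):
    """Return one validated ExperimentConfig per combination in `space`."""
    keys = list(space)
    return [
        ExperimentConfig(**dict(zip(keys, values)))
        for values in product(*(space[k] for k in keys))
    ]


# Hypothetical search space: 2 models x 2 GPUs x 3 batch sizes x 2 methods.
space = {
    "model": ["model-a", "model-b"],
    "gpu": ["H100", "A100"],
    "batch_size": [1, 8, 32],
    "method": ["full", "lora"],
}

configs = expand_grid(space)
print(len(configs))  # 2 * 2 * 3 * 2 = 24 combinations
```

Even this toy space yields 24 experiments, and each extra dimension multiplies the count. That growth is why exhaustive grids stop being practical and why sampling or predictive models become attractive.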

If you have to answer questions such as, “Can we serve this model without melting the budget?” or “Why did fine-tuning just fall over on the H100s again?”—this talk is for you.