Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation. Yotam Perlitz, Ariel Gera, et al. 2025. NeurIPS 2025. Workshop paper.
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation. Eliya Habba, Ofir Arviv, et al. 2025. ACL 2025. Conference paper.
Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI. Elron Bandel, Yotam Perlitz, et al. 2024. NAACL 2024. Demo paper.
FastFit: Fast and Effective Few-Shot Text Classification with a Multitude of Classes. Asaf Yehudai, Elron Bandel. 2024. NAACL 2024. Demo paper.