Position: Agentic Systems Should be General | Elron Bandel, Asaf Yehudai, et al. | 2026 | ICML 2026 | Conference paper
Stop Guessing When to Stop Testing: Efficient Model Evaluation with Just Enough Data | Ofir Arviv, Kristjan Greenewald, et al. | 2026 | ACL 2026 | Conference paper
Mediocricity is the key for LLM as a Judge Anchor Selection | Shachar Don-Yehiya, Asaf Yehudai, et al. | 2026 | ACL 2026 | Conference paper
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation | Yotam Perlitz, Ariel Gera, et al. | 2025 | NeurIPS 2025 | Workshop paper
Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models | Anna A. Ivanova, Aalok Sathe, et al. | 2025 | Transactions of the Association for Computational Linguistics | Paper
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | Shivalika Singh, Angelika Romanou, et al. | 2025 | ACL 2025 | Conference paper
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation | Eliya Habba, Ofir Arviv, et al. | 2025 | ACL 2025 | Conference paper
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community | Shachar Don-Yehiya, Leshem Choshen, et al. | 2025 | ACL 2025 | Demo paper
A Hitchhiker's Guide to Scaling Law Estimation | Leshem Choshen, Yang Zhang, et al. | 2025 | ICML 2025 | Conference paper
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead | Rickard Gabrielsson, Jiacheng Zhu, et al. | 2025 | ICML 2025 | Conference paper