Hazar Yueksel, Ramon Bertran, et al.
MLSys 2020
Fine-tuning on task-specific data to boost downstream performance is a crucial step in leveraging Large Language Models (LLMs). However, previous studies have demonstrated that fine-tuning on several adversarial samples, or even on benign data, can greatly compromise a model's built-in alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker via bilevel optimization to up-rank safe, high-quality fine-tuning data and down-rank unsafe or low-quality data. Models trained with SEAL demonstrate superior quality over multiple baselines, with win-rate increases of 8.5% and 9.7% over random selection on the LLAMA-3-8B-INSTRUCT and MERLINITE-7B models, respectively. Our code is available on GitHub at https://github.com/hanshen95/SEAL.
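To make the phrase "data ranker based on bilevel optimization" concrete, a generic bilevel data-ranking problem can be sketched as follows. This is an illustrative form under stated assumptions, not necessarily the exact objective used in SEAL: the symbols $\omega$ (ranker parameters), $\theta$ (model parameters), $\sigma_i(\omega)$ (normalized weight of the $i$-th fine-tuning sample), $\ell$ (per-sample fine-tuning loss), and $L_{\mathrm{safe}}$ (loss on a trusted safe dataset) are all introduced here for illustration.

```latex
% Upper level: choose ranker parameters \omega so that the model
% fine-tuned on the re-weighted data performs well on safe data.
\min_{\omega} \; L_{\mathrm{safe}}\bigl(\theta^{*}(\omega)\bigr)
\quad \text{s.t.} \quad
% Lower level: fine-tune the model on data weighted by the ranker.
\theta^{*}(\omega) \in \arg\min_{\theta} \sum_{i=1}^{N} \sigma_i(\omega)\, \ell(\theta; x_i),
\qquad
\sigma_i(\omega) = \frac{\exp(\omega_i)}{\sum_{j=1}^{N} \exp(\omega_j)}
```

In a formulation of this kind, samples whose inclusion hurts the upper-level safety objective receive smaller weights $\sigma_i(\omega)$ (are down-ranked), and the highest-weighted samples can then be selected for fine-tuning.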
Saiteja Utpala, Alex Gu, et al.
NAACL 2024
Natalia Martinez Gil, Dhaval Patel, et al.
UAI 2024
Anming Gu, Edward Chien, et al.
ICLR 2025