Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT

Best AI papers explained - A podcast by Enoch H. Kang

This academic paper proposes an approach to fine-tuning Large Language Models (LLMs) using demonstration data, which typically provides only examples of desired outputs. Unlike standard supervised fine-tuning (SFT), which directly mimics the demonstrations, this work argues that learning a reward from the same data can significantly improve LLM alignment with human preferences. The authors introduce two algorithms, Reward-learning Fine-tune (RFT) and Implicit Reward-learning Fine-tune (IRFT), built on an Inverse Reinforcement Learning (IRL) framework that jointly learns a reward model and the language model policy. Through theoretical analysis and experiments on LLMs of different sizes and multiple datasets, they show that these reward-based methods consistently outperform standard SFT across a range of evaluations, including benchmarks from the HuggingFace Open LLM Leaderboard. The paper also reveals an interesting connection between the implicit reward-learning approach and the recent Self-Play Fine-Tuning (SPIN) algorithm, providing a theoretical grounding for that style of training.
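
Below is a minimal sketch of the kind of joint reward-and-policy training loop that an IRL-style approach implies, not the paper's actual RFT/IRFT implementation. The toy reward model, toy policy, synthetic demonstration data, and the entropy weight are all illustrative assumptions, chosen only to show the alternation between (1) training the reward to prefer demonstrations over the current policy's samples and (2) improving the policy against the learned reward.

```python
# Illustrative sketch only: joint reward/policy learning from demonstrations.
# All model classes, data, and hyperparameters here are toy assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM, VOCAB = 16, 32  # toy embedding size and number of candidate responses

class RewardModel(nn.Module):
    """Scores a (prompt, response) pair of toy embeddings."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * DIM, 32), nn.Tanh(), nn.Linear(32, 1))
    def forward(self, prompt, response):
        return self.net(torch.cat([prompt, response], dim=-1)).squeeze(-1)

class Policy(nn.Module):
    """Toy 'language model': maps a prompt embedding to a distribution over responses."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(DIM, VOCAB)
    def forward(self, prompt):
        return F.log_softmax(self.net(prompt), dim=-1)

# Synthetic demonstration data: each prompt comes with a 'human' response index.
prompts = torch.randn(64, DIM)
demo_responses = torch.randint(0, VOCAB, (64,))
response_table = torch.randn(VOCAB, DIM)  # fixed embedding per candidate response

reward_model, policy = RewardModel(), Policy()
opt_r = torch.optim.Adam(reward_model.parameters(), lr=1e-2)
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(200):
    # (1) Reward step: demonstrations should outscore the current policy's samples
    #     under a logistic (Bradley-Terry-style) comparison loss.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=policy(prompts)).sample()
    r_demo = reward_model(prompts, response_table[demo_responses])
    r_policy = reward_model(prompts, response_table[sampled])
    loss_r = -F.logsigmoid(r_demo - r_policy).mean()
    opt_r.zero_grad()
    loss_r.backward()
    opt_r.step()

    # (2) Policy step: maximize expected learned reward over the candidate responses,
    #     with an entropy bonus standing in for KL regularization to a reference model.
    log_probs = policy(prompts)                       # (batch, VOCAB)
    all_rewards = torch.stack(
        [reward_model(prompts, response_table[i].expand_as(prompts)) for i in range(VOCAB)],
        dim=-1,
    ).detach()
    probs = log_probs.exp()
    expected_reward = (probs * all_rewards).sum(-1).mean()
    entropy = -(probs * log_probs).sum(-1).mean()
    loss_p = -(expected_reward + 0.1 * entropy)
    opt_p.zero_grad()
    loss_p.backward()
    opt_p.step()

print("demo accuracy:", (policy(prompts).argmax(-1) == demo_responses).float().mean().item())
```

In this toy setting the policy gradually concentrates on the demonstrated responses because the learned reward keeps pushing their scores above those of the policy's own samples; the implicit-reward variant discussed in the episode folds the reward into the policy's own log-probabilities rather than keeping a separate reward network, which is where the connection to SPIN arises.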