
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an explicit reward model as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown that the implicit reward model of DPO can approximate a trained reward model, but it is unclear to what extent DPO can generalize under distribution shift, which can occur due to limited preference data or to shifts in the language produced by the trained model. We address this question by comparing the accuracy of DPO and RLHF rewards at distinguishing preferred from rejected answers. Our findings indicate that DPO's implicit reward performs similarly to RLHF reward models on in-distribution data, but severely under-performs them under distribution shift. Across five out-of-domain settings, DPO suffers a mean drop in accuracy of 3% and a maximum drop of 7%, highlighting the shortcomings of DPO's implicit reward model for preference optimization. These findings show that DPO's implicit reward model has limited generalization ability and substantiate the integration of an explicit reward model in iterative DPO approaches.
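For concreteness, DPO's implicit reward for a prompt x and response y is r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)), where π_θ is the fine-tuned policy and π_ref is the frozen reference model. The sketch below is illustrative rather than taken from the paper: it shows how preference accuracy (the fraction of pairs in which the preferred answer receives the higher reward) can be computed from this implicit reward, assuming the per-response sequence log-probabilities have already been obtained from the two models; the β value and tensor values are placeholder assumptions.

```python
import torch

def dpo_implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO's implicit reward: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)),
    # computed from summed token log-probabilities of the full response.
    return beta * (logp_policy - logp_ref)

def preference_accuracy(rewards_chosen, rewards_rejected):
    # Fraction of preference pairs where the chosen answer scores higher.
    return (rewards_chosen > rewards_rejected).float().mean().item()

# Toy example with precomputed sequence log-probs for four preference pairs.
# In practice these would come from scoring each (prompt, response) pair with
# the fine-tuned policy and the frozen reference model.
logp_policy_chosen   = torch.tensor([-12.3, -20.1, -15.7, -9.8])
logp_ref_chosen      = torch.tensor([-14.0, -21.5, -15.9, -11.2])
logp_policy_rejected = torch.tensor([-13.1, -19.8, -14.2, -12.5])
logp_ref_rejected    = torch.tensor([-13.5, -20.9, -15.0, -12.7])

r_chosen = dpo_implicit_reward(logp_policy_chosen, logp_ref_chosen)
r_rejected = dpo_implicit_reward(logp_policy_rejected, logp_ref_rejected)
print(f"preference accuracy: {preference_accuracy(r_chosen, r_rejected):.2f}")
```

The same accuracy metric applies to an explicit RLHF reward model by replacing the implicit reward with the scalar output of the trained reward head, which is how the in-distribution and out-of-distribution comparisons described above can be made on equal footing.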
