

Poster

Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

Yuzhong Hong · Hanshan Zhang · Junwei Bao · Hongfei Jiang · Yang Song

Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model (BTM): it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable, because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based preference model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To showcase the practical utility of replacing the BTM with our EBM in the context of offline alignment, we adapt a simple yet scalable objective function from the recent literature on fitting EBMs and name it Energy Preference Alignment (EPA). Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby validating the theoretical superiority of our EBM.
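For context, the two standard objects the abstract refers to are the Bradley-Terry preference model and the DPO loss it induces. The formulas below are the conventional ones from the DPO literature, reproduced here only as background; the notation ($\pi_\theta$ for the target policy, $\pi_{\mathrm{ref}}$ for the reference policy, $\beta$ for the KL coefficient, $\sigma$ for the sigmoid) is standard and not taken from this paper:

$$ p_{\mathrm{BT}}(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr), \qquad \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]. $$

Because the Bradley-Terry likelihood depends only on reward differences within each prompt, rewards that differ by an arbitrary prompt-dependent offset are indistinguishable under it, which is one well-known illustration of the non-unique MLE issue mentioned above; the paper's EBM and the exact EPA objective are not spelled out in this abstract.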
