Token-level Direct Preference Optimization
[ICML'24]
Finetuning pretrained LLMs is essential to align them with human values and intentions. This alignment is typically done with pairwise comparisons and a KL-divergence constraint against a reference LLM, and it evaluates full answers generated by the models. In contrast to this evaluation, however, the generation of answers occurs at the token level, in a sequential, auto-regressive fashion.
To better align the evaluation with the generation process, this work proposes Token-level Direct Preference Optimization (TDPO), which aligns LLMs with human preferences by optimizing the policy at the token level. TDPO incorporates a forward KL divergence constraint for each token, improving both alignment and diversity. By utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence while preserving simplicity, with no need for explicit reward modeling.
Preliminaries
RLHF
The RLHF (Reinforcement Learning from Human Feedback) pipeline usually includes 3 phases: supervised finetuning (SFT), preference sampling & reward learning, and RL optimization.
SFT: RLHF typically begins by fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), to obtain a model \(\pi^{SFT}\).
Reward Modeling Phase: In the second phase, the SFT model is prompted with a prompt \(x\) to produce pairs of answers \((y_1,y_2)\sim \pi^{SFT}(y|x)\). Each pair is then presented to human labelers, who decide the preferred answer (\(y_w\)) and the dispreferred answer (\(y_l\)), denoted as \(y_w\succ y_l|x\). The preferences are assumed to be generated by some latent reward model \(r^*(x,y)\), which we do not have access to. There are several approaches to modeling preferences, among which Bradley-Terry (BT) modeling is a popular choice. The BT model stipulates that the human preference distribution \(p^*\) can be written as:
\[P^*_{BT}(y_1\succ y_2|x)=\frac{\exp(r^*(x,y_1))}{\exp(r^*(x,y_1))+\exp(r^*(x,y_2))}.\]Assuming access to a static dataset of comparisons \(\mathcal{D}=\big\{x^i,y_w^i,y_l^i\big\}_{i=1}^N\) sampled from \(p^*\), we can parameterize a reward model \(r_\phi(x,y)\) and estimate its parameters via maximum likelihood. Framing the problem as binary classification, we have the NLL loss:
\[L_R(r_\phi,\mathcal{D})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}[\log\sigma(r_\phi(x,y_w)-r_\phi(x,y_l))].\]In the context of LMs, the network \(r_\phi\) is often initialized from the SFT model with the addition of a linear layer on top of the final transformer layer, producing a single scalar prediction for the reward value. To ensure a reward function with lower variance, prior works normalize the rewards such that \(\mathbb{E}_{x,y\sim\mathcal{D}}[r_\phi(x,y)]=0\) for all \(x\).
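As a minimal illustrative sketch (not from the paper), the Bradley-Terry NLL above can be computed directly from the scalar rewards that the reward model assigns to the preferred and dispreferred answers; the function name and tensor shapes below are assumptions:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood for a batch of preference pairs.

    r_chosen, r_rejected: shape (batch,), scalar rewards r_phi(x, y_w) and
    r_phi(x, y_l) for the preferred and dispreferred answers.
    """
    # L_R = -E[log sigmoid(r_phi(x, y_w) - r_phi(x, y_l))]
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```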
RL Finetuning Phase
During this phase, we use the learned reward function to provide feedback to the language model. In particular, the optimization problem is as follows:
\[\max_{\pi_\theta}\mathbb{E}_{x\sim\mathcal{D},y\sim\pi_\theta(y|x)}\big[r_\phi(x,y)\big]-\beta\cdot KL(\pi_\theta\parallel \pi_{\rm ref}),\]where \(\pi_{\rm ref}\) is the SFT model \(\pi^{SFT}\). In practice, the language model policy \(\pi_\theta\) is also initialized to the SFT model. The added constraint is important: it prevents the model from deviating too far from the distribution on which the reward model is accurate, and it maintains generation diversity, preventing mode collapse onto single high-reward answers.
Due to the discrete nature of language generation, this objective is not differentiable and is typically optimized with RL, e.g., by constructing the reward function \(r(x,y)=r_\phi(x,y)-\beta\cdot(\log\pi_\theta(y|x)-\log\pi_{\rm ref}(y|x))\) and maximizing it with PPO.
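As a hedged sketch of this reward shaping (assuming the sequence-level log-probabilities under \(\pi_\theta\) and \(\pi_{\rm ref}\) have already been computed; the function name is hypothetical):

```python
import torch

def shaped_reward(r_phi: torch.Tensor,
                  logp_policy: torch.Tensor,
                  logp_ref: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Reward handed to PPO: r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    r_phi:       (batch,) scalar rewards from the learned reward model
    logp_policy: (batch,) sequence log-probabilities under pi_theta
    logp_ref:    (batch,) sequence log-probabilities under pi_ref
    """
    return r_phi - beta * (logp_policy - logp_ref)
```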
Direct Preference Optimization (DPO)
The key insight of DPO is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences, such as the Bradley-Terry model. In essence, the policy network represents both the language model and the (implicit) reward.
Following the RL objective above, one can show that its optimal solution takes the form:
\[\pi_r(y|x)=\frac{1}{Z(x)}\cdot\pi_{\rm ref}(y|x)\cdot\exp(\frac{1}{\beta}\cdot r(x,y)),\]where \(Z(x)=\sum_y \pi_{\rm ref}(y|x)\cdot\exp(\frac{1}{\beta}\cdot r(x,y))\) is the partition function.
Rearranging this equation yields the reward function \(r\). In effect, DPO establishes a mapping between the reward model and the optimal policy under the reverse KL divergence:
\[r(x,y)=\beta\cdot\log\frac{\pi_r(y|x)}{\pi_{\rm ref}(y|x)}+\beta\cdot\log Z(x).\]By substituting the reward into the Bradley-Terry equation, we can express the human preference probability in terms of only the optimal policy \(\pi^*\) and reference policy \(\pi_{\rm ref}\):
\[p^*(y_1\succ y_2|x)=\sigma(r^*(x,y_1)-r^*(x,y_2))=\frac{1}{1+\exp(\beta\cdot\log\frac{\pi^*(y_2|x)}{\pi_{\rm ref}(y_2|x)}-\beta\cdot\log\frac{\pi^*(y_1|x)}{\pi_{\rm ref}(y_1|x)})}\]Hence, we can obtain the training objective as the NLL loss of this preference probability:
\[u(x,y_w,y_l)=\beta\cdot\log\frac{\pi_\theta(y_w|x)}{\pi_{\rm ref}(y_w|x)}-\beta\cdot\log\frac{\pi_\theta(y_l|x)}{\pi_{\rm ref}(y_l|x)},\] \[L_{DPO}=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}[\log\sigma(u(x,y_w,y_l))]\]where \(\mathcal{D}\) is the human preference dataset, \(\pi_{\rm ref}(\cdot|x)\) serves as the reference model (typically chosen as the LM after SFT), and \(\pi_\theta(\cdot|x)\) is the model undergoing RL finetuning, initialized with \(\pi_\theta=\pi_{\rm ref}\).
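A minimal sketch of this loss, assuming the summed log-probabilities of \(y_w\) and \(y_l\) under \(\pi_\theta\) and the frozen \(\pi_{\rm ref}\) are available (the function name and shapes are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy: torch.Tensor, logp_l_policy: torch.Tensor,
             logp_w_ref: torch.Tensor, logp_l_ref: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from sequence-level log-probabilities of y_w and y_l
    under the policy pi_theta and the frozen reference pi_ref (shape (batch,))."""
    # u(x, y_w, y_l) = beta * log-ratio of y_w - beta * log-ratio of y_l
    u = beta * (logp_w_policy - logp_w_ref) - beta * (logp_l_policy - logp_l_ref)
    # L_DPO = -E[log sigmoid(u)]
    return -F.logsigmoid(u).mean()
```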
This objective can equivalently be viewed as reward maximization under a reverse KL divergence constraint:
\[\max_{\pi_\theta}\mathbb{E}_{x\sim\mathcal{D},y\sim\pi_\theta(\cdot|x)}\big[r(x,y)-\beta\cdot \rm{KL}(\pi_\theta(\cdot|x)\parallel\pi_{\rm ref}(\cdot|x))\big].\]
Methodology
Let us denote a response \(y\) consisting of \(T\) tokens as \(y^{<T+1}:=[y^1,\cdots,y^T]\). When modeling text generation as a Markov decision process, a state is the combination of the prompt and the response generated up to the current step, denoted as \(s_t:=[x,y^{<t}]\). An action corresponds to the next generated token, denoted as \(a_t:=y^t\). The token-wise reward is defined as \(R_t:=R(s_t,a_t)\).
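To make the MDP formulation concrete, here is a small illustrative helper (hypothetical, not from the paper) that enumerates the \((s_t,a_t)\) pairs for a tokenized prompt and response:

```python
from typing import List, Tuple

def to_state_action_pairs(prompt_ids: List[int],
                          response_ids: List[int]) -> List[Tuple[List[int], int]]:
    """Enumerate (s_t, a_t) pairs for a prompt x and response y = [y^1, ..., y^T]:
    the state s_t is [x, y^{<t}] and the action a_t is the next token y^t."""
    pairs = []
    for t, token in enumerate(response_ids):
        state = prompt_ids + response_ids[:t]  # s_t = [x, y^{<t}]
        pairs.append((state, token))           # a_t = y^t
    return pairs
```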
Expanding on these definitions, we establish the state-action value function \(Q_\pi\), the state value function \(V_\pi\), and the advantage function \(A_\pi\) for a policy \(\pi\):
\[Q_\pi(s_t,a_t)=\mathbb{E}_\pi\big[\sum_{k=0}^{\infty}{\gamma^kR_{t+k}|s_t,a_t}\big],\] \[V_\pi(s_t)=\mathbb{E}_\pi\big[Q_\pi(s_t,a_t)|s_t\big],\] \[A_\pi(s_t,a_t)=Q_\pi(s_t,a_t)-V_\pi(s_t).\]In contrast to DPO’s sentence-level objective, this work proposes a token-level objective:
\[\max_{\pi_\theta}\mathbb{E}_{z\sim\pi_\theta(\cdot|s_t)}\big[A_{\pi_{\rm ref}}(s_t,z)-\beta\cdot{\rm KL}(\pi_\theta(\cdot|s_t)\parallel\pi_{\rm ref}(\cdot|s_t))\big].\]This constrained problem has a closed-form solution:
\[\pi^*_\theta(z|s_t)=\frac{\pi_{\rm ref}(z|s_t)\cdot\exp(\frac{1}{\beta}\cdot Q_{\pi_{\rm ref}}(s_t,z))}{Z(s_t;\beta)},\] \[Z(s_t;\beta)=\mathbb{E}_{z\sim\pi_{\rm ref}(\cdot|s_t)}[\exp(\frac{1}{\beta}\cdot Q_{\pi_{\rm ref}}(s_t,z))],\]where \(Z\) is the partition function.
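Analogous to the sentence-level case in DPO, this closed-form solution can be rearranged to express the token-level state-action function through a log-probability ratio:
\[Q_{\pi_{\rm ref}}(s_t,z)=\beta\cdot\log\frac{\pi^*_\theta(z|s_t)}{\pi_{\rm ref}(z|s_t)}+\beta\cdot\log Z(s_t;\beta).\]It is this mapping that allows the Bradley-Terry preference probability below to be written purely in terms of \(\pi^*_\theta\) and \(\pi_{\rm ref}\).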
To facilitate subsequent derivations, this work first introduces the sequential KL divergence:
\[{\rm SeqKL}(x,y;\pi_1\parallel\pi_2)=\sum_{t=1}^T KL(\pi_1(\cdot|s_t)\parallel\pi_2(\cdot|s_t)).\]In the KL-constrained advantage-maximization problem above, the Bradley-Terry model expresses the human preference probability in terms of the optimal policy \(\pi^*_\theta\) and the reference policy \(\pi_{\rm ref}\):
\[P^*_{\rm{BT}}(y_1\succ y_2|x)=\sigma(u^*(x,y_1,y_2)-\delta^*(x,y_1,y_2)),\]where \(u\) is the difference of implicit rewards defined above (as in DPO), and \(\delta\) is the difference in sequential forward KL divergences:
\[\delta(x,y_1,y_2)=\beta\cdot{\rm SeqKL}(x,y_2;\pi_{\rm ref}\parallel\pi_\theta)-\beta\cdot{\rm SeqKL}(x,y_1;\pi_{\rm ref}\parallel\pi_\theta).\]Consequently, the initial version of TDPO can be represented as:
\[L_{\rm{TDPO_1}}(\pi_\theta;\pi_{\rm ref})=-\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}[\log\sigma(u(x,y_w,y_l)-\delta(x,y_w,y_l))].\]For improved performance, we can stop the gradient through part of the sequential KL term in this loss to obtain the second version, \(\rm{TDPO}_2\). The loss functions of both versions are summarized below.
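As a concrete, hedged illustration of the \(\rm{TDPO}_1\) loss, here is a minimal sketch under assumed tensor shapes (per-token logits for the sequential KL, summed response log-probabilities for \(u\)); it is not the authors' reference implementation:

```python
import torch
import torch.nn.functional as F

def seq_kl(ref_logits: torch.Tensor, policy_logits: torch.Tensor) -> torch.Tensor:
    """SeqKL(x, y; pi_ref || pi_theta): sum over the T response positions of the
    forward KL divergence KL(pi_ref(.|s_t) || pi_theta(.|s_t)).

    ref_logits, policy_logits: (T, vocab) logits at each response position.
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    # KL(p || q) = sum_z p(z) * (log p(z) - log q(z)), summed over the T positions
    return (ref_logp.exp() * (ref_logp - policy_logp)).sum()

def tdpo1_loss(logp_w_policy: torch.Tensor, logp_w_ref: torch.Tensor,
               logp_l_policy: torch.Tensor, logp_l_ref: torch.Tensor,
               seqkl_w: torch.Tensor, seqkl_l: torch.Tensor,
               beta: float = 0.1) -> torch.Tensor:
    """TDPO_1 loss for one preference pair: -log sigmoid(u - delta), where u is the
    difference of implicit rewards and delta the difference of sequential KLs."""
    u = beta * (logp_w_policy - logp_w_ref) - beta * (logp_l_policy - logp_l_ref)
    delta = beta * seqkl_l - beta * seqkl_w  # delta(x, y_w, y_l)
    return -F.logsigmoid(u - delta)
```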