Large Language Models (LLMs) perform well across diverse tasks, but aligning them with human demonstrations is challenging. Recently, Reinforcement Learning (RL)-free methods like Direct Preference Optimization (DPO) have emerged, offering improved stability and scalability while retaining competitive performance relative to RL-based methods. However, while RL-free methods deliver satisfactory performance, they require significant data to develop a robust Supervised Fine-Tuned (SFT) model and an additional step to fine-tune this model on a preference dataset, which constrains their utility and scalability. In this paper, we introduce Triple Preference Optimization (TPO), a new preference learning method designed to align an LLM with three preferences without requiring a separate SFT step and using considerably less data. Through a combination of practical experiments and theoretical analysis, we show the efficacy of TPO as a single-step alignment strategy. Specifically, we fine-tuned the Phi-2 (2.7B) and Mistral (7B) models using TPO directly on the UltraFeedback dataset, achieving superior results compared to models aligned through other methods such as SFT, DPO, KTO, IPO, CPO, and ORPO. Moreover, TPO without the SFT component led to notable improvements on MT-Bench, with score increases of +1.27 and +0.63 over SFT and DPO, respectively. Additionally, TPO achieved higher average accuracy, surpassing DPO and SFT by 4.2% and 4.97%, respectively, on the Open LLM Leaderboard benchmarks.
(a) During the SFT step, a pre-trained model is fine-tuned to align with human expectations. (b) To further enhance the performance of the SFT model, we train it with human preferences using reinforcement learning. (c) Alternatively, we can directly align an SFT model with human preferences using RL-free methods such as DPO. (d) In TPO, we merge preference optimization with gold standard response learning, enabling direct fine-tuning of a pre-trained model based on three preferences.
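To make panel (d) concrete, the sketch below shows, in PyTorch, one way a single-step objective of this kind could be written: a preference term over the preferred/rejected pair combined with a supervised negative log-likelihood term on the gold-standard response. The function name, the weighting hyperparameters alpha and beta, and the reference-free form of the preference term are illustrative assumptions rather than the paper's exact formulation; see the paper for the precise TPO loss.

```python
import torch
import torch.nn.functional as F

def tpo_style_loss(policy_gold_logps, policy_chosen_logps, policy_rejected_logps,
                   beta=0.1, alpha=1.0):
    """Illustrative single-step objective in the spirit of TPO (not the paper's exact loss).

    Combines a preference term over the (preferred, rejected) pair with a
    supervised negative log-likelihood term on the gold-standard response,
    so a pre-trained model can be aligned without a separate SFT stage.
    All *_logps are per-sequence summed log-probabilities, shape (batch,).
    """
    # Preference term: push the preferred response above the rejected one.
    # (Whether and how a reference model enters this term follows the paper;
    # the reference-free form used here is an assumption for illustration.)
    preference_loss = -F.logsigmoid(beta * (policy_chosen_logps - policy_rejected_logps))

    # Gold-response term: plain SFT-style NLL on the best available response.
    gold_nll = -policy_gold_logps

    return (preference_loss + alpha * gold_nll).mean()

# Toy usage with random per-sequence log-probabilities for a batch of 4 prompts.
logps = torch.randn(3, 4, requires_grad=True)  # rows: gold, chosen, rejected
loss = tpo_style_loss(logps[0], logps[1], logps[2])
loss.backward()
```

The intuition behind such a combined objective is that the gold-response term plays the role of the separate SFT stage, while the preference term supplies the ranking signal, which is why no standalone SFT step is needed before alignment.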
We assessed the performance of TPO alongside SFT, DPO, KTO, IPO, CPO, and ORPO across ten different benchmarks in three distinct scenarios: 1) Aligning an SFT model fine-tuned on 10K data, 2) Aligning an SFT model fine-tuned on 200K data, and 3) Directly fine-tuning a pre-trained model.
Comparing TPO's performance with other alignment methods reveals that the Mistral+TPO model performs competitively across the individual benchmarks and outperforms the other methods on average. In particular, Mistral+TPO performed remarkably well on the TruthfulQA benchmark. Notably, Mistral+TPO is trained with TPO directly from the pre-trained model, without a separate SFT step, which contributes to its strong performance. For all benchmarks, accuracy is the metric used to gauge performance.
In our comparison of TPO with other alignment methods across additional benchmarks, Mistral+SFT+TPO and Mistral+TPO emerge as the top performers, surpassing other methods on MT-Bench, BB-causal, BB-sports, and OpenBookQA. For BB-causal, BB-sports, BB-formal, and OpenBookQA, performance is evaluated based on accuracy, while MT-Bench uses scores generated by GPT-4.
The comparison of Phi-2's performance when aligned with various methods on MT-Bench shows that Phi-2+TPO surpasses other alignment techniques.
Comparison of the performance of various alignment methods on different SFT models using MT-Bench. Notably, Mistral+SFT trained on 10K data scores 4.2, Mistral+SFT trained on 200K data scores 5.94, and Mistral+TPO trained on 10K data scores 6.66.
Comparison of the performance of various alignment methods on MT-Bench.
@misc{saeidi2024triple,
title={Triple Preference Optimization: Achieving Better Alignment with Less Data in a Single Step Optimization},
author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Chitta Baral},
year={2024},
eprint={2405.16681},
archivePrefix={arXiv},
primaryClass={cs.CL}
}