Taiwei Shi∗†, Zhuoer Wang∗‡, Longqi Yang∗⋄, Ying-Chun Lin∘, Zexue He▽,
Mengting Wan⋄, Pei Zhou⋄, Sujay Jauhar⋄, Xiaofeng Xu⋄, Xia Song⋄, Jennifer Neville∗⋄
⋄Microsoft Corporation, ∘Purdue University, ‡Texas A&M University,
▽University of California San Diego, †University of Southern California
Corresponding authors: taiweish@usc.edu, wang@tamu.edu, longqi.yang@microsoft.com, jenneville@microsoft.com. The work was done when Taiwei Shi, Zhuoer Wang, Ying-Chun Lin, and Zexue He were interns at Microsoft Corporation.
Abstract
As large language models (LLMs) continue to advance, aligning these models with human preferences has emerged as a critical challenge. Traditional alignment methods, relying on human or LLM annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages real-time, in-situ user interactions to create preference datasets that more accurately reflect authentic human values. WildFeedback operates through a three-step process: feedback signal identification, preference data construction, and user-guided evaluation. We applied this framework to a large corpus of user-LLM conversations, resulting in a rich preference dataset that reflects genuine user preferences. This dataset captures the nuances of user preferences by identifying and classifying feedback signals within natural conversations, thereby enabling the construction of more representative and context-sensitive alignment data. Our extensive experiments demonstrate that LLMs fine-tuned on WildFeedback exhibit significantly improved alignment with user preferences, as evidenced by both traditional benchmarks and our proposed user-guided evaluation. By incorporating real-time feedback from actual users, WildFeedback addresses the scalability, subjectivity, and bias challenges that plague existing approaches, marking a significant step toward developing LLMs that are more responsive to the diverse and evolving needs of their users. In summary, WildFeedback offers a robust, scalable solution for aligning LLMs with true human values, setting a new standard for the development and evaluation of user-centric language models.
1 Introduction
Large language models (LLMs) have become a cornerstone of modern natural language processing (NLP) applications, powering a wide range of tasks from conversational agents to content generation. Despite their impressive capabilities, a significant challenge remains in aligning these models with human preferences, ensuring that their outputs are not only accurate but also aligned with user expectations and values (Bai etal., 2022a; Ouyang etal., 2022; OpenAI etal., 2024; Dubey etal., 2024). Traditional methods for aligning LLMs typically involve instruction tuning and preference training (Ouyang etal., 2022), where models are fine-tuned on curated datasets that consist of human or LLM annotated preference data (Bai etal., 2022a; Ouyang etal., 2022; Cui etal., 2024). However, these approaches face critical limitations: human annotation is resource-intensive and often subjective, while synthetic data generated by models like GPT-4 (OpenAI etal., 2024) can create feedback loops that reinforce the model’s existing biases rather than capturing the true diversity of human preferences.
In response to these challenges, there has been a growing interest in leveraging in-situ user interactions as a source of feedback for training LLMs (Shi etal., 2022; Lin etal., 2024b; Don-Yehiya etal., 2024). This approach seeks to harness the wealth of authentic, real-time feedback provided by users during their interactions with LLMs, which can offer a more dynamic and accurate reflection of user preferences. By directly incorporating feedback from actual users, rather than relying on pre-defined datasets, it becomes possible to align models more closely with the nuanced and evolving preferences of the broader user population.
In this paper, we introduce WildFeedback, a novel framework designed to align LLMs with in-situ user interactions and feedback. WildFeedback addresses the limitations of existing approaches by constructing preference datasets from real user-LLM conversations, specifically focusing on user feedback that naturally occurs during these interactions. The overview of the framework is shown in Figure 1. Our framework comprises three key components: feedback signal identification, preference data construction, and user-guided evaluation. By applying this framework to WildChat (Zhao etal., 2024), a corpus of over 148,000 multi-turn conversations between users and ChatGPT (OpenAI etal., 2024), we create a preference dataset of 20,281 samples111The dataset will be available soon., capturing a wide range of user preferences.
The contributions of this paper are threefold:
- 1.
Introduction of WildFeedback: We present a novel framework that leverages real-time, in-situ user feedback to construct preference datasets that better reflect actual human values, addressing the scalability and subjectivity issues inherent in human-annotated datasets and the biases in synthetic data.
- 2.
Robust Data Construction: We adapt and expand on existing user satisfaction estimation techniques, such as SAT/DSAT rubrics, to identify feedback signals in natural conversations. This enables the creation of a nuanced preference dataset that includes both user preferences and corresponding responses, enhancing the effectiveness of fine-tuning LLMs to better align with user expectations.
- 3.
User-Guided Evaluation: We propose a user-guided evaluation methodology that aligns the assessment of model performance with real user preferences, providing a more accurate benchmark for evaluating LLMs’ alignment with human values.
Through extensive experiments, we demonstrate that models fine-tuned on WildFeedback show significant improvements in aligning with user preferences, both in automated benchmarks and in our proposed user-guided evaluation framework. This work represents a step forward in creating more user-centric LLMs, with the potential to enhance user satisfaction across a wide range of applications.
2 Related Work
Feedback Learning for LLMs.
Incorporating human feedback has been shown to be an effective strategy to align LLMs with human preferences (Ouyang etal., 2022; Bai etal., 2022a; Dubey etal., 2024). However, relying human annotators to provide human feedback is inefficient and resource-intensive, which makes it hard to scale up. Additionally, human preferences are highly subjective. Preferences of a small set of human annotators could misalign with those of the entire population. Accordingly, some researchers aim to supervise AI models by model themselves (Bai etal., 2022b; Lee etal., 2023; Madaan etal., 2023; Burns etal., 2023; Li etal., 2023a). For instance, Bai etal. (2022b) introduced constitutional AI, in which they prompt LLMs to self-refine their own generations given a set of human-defined constitutions. However, relying on model’ own feedback can create a feedback loop where the model’s outputs increasingly reflect its own biases rather than diverse and authentic human perspectives. Recently, researchers have begun exploring the mining of user preferences from natural human-LLM interactions (Shi etal., 2022; Lin etal., 2024b; Don-Yehiya etal., 2024). These approaches aim to capture authentic, real-time user feedback, which provides a more dynamic and accurate reflection of user preferences. Our work builds on this trend by leveraging in-situ user interactions to create preference datasets that better align with actual human values, addressing the limitations of both synthetic and human-annotated preference datasets.
Data for LLM Alignment.
LLM alignment typically consists of two steps: instruction tuning and preference training. Instruction tuning, or supervised finetuning (SFT), aims to finetune models with a set of instruction-response pairs. Early works incorporated various NLP tasks for instruction tuning, demonstrating that LLMs could generalize well across different tasks (Wang etal., 2022; Chung etal., 2022; Ouyang etal., 2022). Subsequent research focused on constructing instruction data by directly distilling from capable LLMs (Wang etal., 2023; Xu etal., 2023). Researchers later recognized that preference training could further boost model performance across various tasks (Ouyang etal., 2022; Dubey etal., 2024). Unlike SFT, preference training requires not only the instruction and the corresponding desired responses but also the corresponding undesired responses. These responses and their rankings could be either annotated by humans (Bai etal., 2022a) or generated by LLMs (Cui etal., 2024). Beyond general-purpose preference datasets, some datasets focus on specific tasks, such as summarization (Wu etal., 2021), model safety (Ji etal., 2023; Shi etal., 2024), and mathematics (Lightman etal., 2023). However, these approaches often rely on curated datasets that are either manually annotated by human experts or generated by models like GPT-4. While these datasets provide a useful foundation, they may not fully capture the complexity and diversity of real-world user interactions. Our work addresses this gap by introducing a framework that leverages real-time feedback from actual users, allowing for more authentic and context-sensitive alignment of LLMs with true human preferences.
3 WildFeedback
Existing preference datasets often suffer from a mismatch between actual human preferences and those of the annotators. Synthetic preference datasets, such as UltraFeedback (Cui etal., 2024), rely solely on GPT-4 to generate preferences. However, this approach may not accurately capture real human values or nuanced preferences. Relying on synthetic data can create a feedback loop where the model’s outputs increasingly reflect its own biases rather than diverse and authentic human perspectives. On the other hand, preference datasets annotated by human annotators are difficult to scale due to time and budget constraints (Bai etal., 2022a; Ouyang etal., 2022; Dubey etal., 2024). Moreover, human annotators’ preferences can be highly subjective, often differing significantly from those of real users.
To address these challenges, we introduce WildFeedback, a framework designed to align LLMs with in-situ user interactions and feedback. The framework comprises three steps: (1) feedback signal identification, (2) preference data construction, and (3) user-guided evaluation. The pipeline is illustrated in Figure 1. We apply this framework to WildChat (Zhao etal., 2024), a corpus of real user-ChatGPT conversations, and obtained a preference dataset of 20,281 samples.
3.1 Feedback Signals Identification
To construct preference data from natural human-LLM interactions, we first need to identify conversations that contain feedback signals. This can be achieved through user satisfaction estimation. In multi-turn conversational sessions, a user may explicitly express their satisfaction (e.g., “thank you”) or dissatisfaction (e.g., “revise it”) in their utterances, or implicitly through their interactions with the agent. Lin etal. (2024b) proposed a framework named SPUR that can automatically learn and identify SAT (satisfaction) and DSAT (dissatisfaction) patterns. SPUR generalizes SAT/DSAT rubrics from conversations with annotated thumb feedback by recursively prompting GPT-4. These rubrics can then be used to score a user’s overall satisfaction or dissatisfaction within a given conversation, allowing us to identify utterances containing feedback signals.
In our setting, we adapt the SAT/DSAT rubrics from Lin etal. (2024b) with minor modifications. In total, we use 9 SAT rubrics and 9 DSAT rubrics. The SAT criteria include gratitude, learning, compliance, praise, personal details, humor, acknowledgment, positive closure, and getting there. The DSAT criteria consist of negative feedback, revision, factual error, unrealistic expectation, no engagement, ignored, lower quality, insufficient detail, and style. Detailed definitions of these rubrics can be found in Table 4 and Table 5. To streamline the process, we input these rubrics into GPT-4 222Unless otherwise specified, in all of our experiments, we use GPT-4o with the gpt-4o-0513 engine. For open-weight models, we use Phi-3-mini-4k-instruct, Mistral-7B-Instruct-v0.3, Meta-Llama-3-8B-Instruct. and prompt it to perform the classification at the utterance level. The complete prompt is available in the Appendix A.1. In total, there are 148,715 multi-turn conversations in the WildChat dataset, with approximately 12.8% of the multi-turn conversations containing feedback signals. Detailed statistics are presented in Table 1.
To ensure the reliability of GPT-4’s classification of SAT/DSAT signals, we conducted a validation process using human expert annotators. We randomly sampled 50 multi-turn conversations, totaling over 500 utterances, and assigned 4 expert annotators to perform the same classification task. Each conversation was annotated by at least 2 annotators, resulting in a final Cohen’s Kappa agreement of for SAT and for DSAT. After completing the annotations, the annotators reviewed and discussed any disagreements, resolving conflicts to establish a ground truth test set of 50 conversations. Our findings indicate that GPT-4’s ability to identify SAT/DSAT signals shows relatively high agreement with human annotators, achieving a Cohen’s Kappa of for SAT and for DSAT. A detailed breakdown of GPT-4’s performance and the human annotation process are provided in Appendix B.2.
# Conv. | # Utt. | # SAT Conv. | # DSAT Conv. | # SAT Utt. | # DSAT Utt. |
148,715 | 628,467 | 5,447 | 13,582 | 8,186 | 27,711 |
3.2 Preference Data Construction
After identifying conversations that contain feedback signals using the SAT/DSAT rubrics, we can construct a preference dataset from these signals. Each sample in this dataset consists of four components: the prompt, user preferences, the preferred response, and the dispreferred response. For conversations containing SAT/DSAT signals, we extract the conversation up to the model response that triggers the SAT/DSAT signals and use this as the prompt for our preference data. Next, we analyze user responses marked by SAT/DSAT signals and ask GPT-4 to summarize user preferences based on these feedback signals (e.g., the user prefers concise and direct answers). By systematically applying the SAT/DSAT rubrics to classify user feedback, we can accurately determine which model responses led to user satisfaction or dissatisfaction.
For the preferred and dispreferred responses, we explore 2 different approaches: expert responses and on-policy responses. Specifically, we use GPT-4 for expert response generation, while Phi 3 (Abdin etal., 2024), Mistral (Jiang etal., 2023), and LLaMA 3 (Dubey etal., 2024) are employed for on-policy response generation. For expert responses, responses that trigger DSAT signals (e.g., requests for revision or expressions of dissatisfaction) in the original conversations are directly used as the dispreferred responses. We then prompt GPT-4 to generate the preferred responses using the summarized user preferences as the system prompt. Similarly, for on-policy responses, both the preferred and dispreferred responses are generated by our policy model. The dispreferred responses are generated directly by the policy model, while the preferred responses are generated using the summarized user preferences as the system prompt. However, recognizing that some user preferences might be harmful in nature (e.g., the user prefers explicit content), we take additional safety precautions. When prompting either on-policy models or GPT-4 to generate the preferred responses, we include an extra system instruction: “the response should be safe.” Some conversations are also filtered out automatically by OpenAI moderation API. This approach allows us to build a robust dataset that trains models to better understand and prioritize responses that align with user preferences. The prompt used for preference data construction is provided in Appendix A.2.
3.3 User-guided Evaluation
In addition to the preference data construction framework, we employ user-guided evaluation to better assess models’ alignment with user preferences. Existing automated benchmarks, such as AlpacaEval (Dubois etal., 2024) and MT-Bench (Zheng etal., 2023b), heavily rely on using LLMs as judges. These benchmarks typically prompt models with a set of queries and then ask LLMs like GPT-4 or Claude (Anthropic, 2023) to provide a score or rank the responses of different models. This approach is problematic because it relies heavily on the internal knowledge of LLMs, which are known to be biased towards longer responses or responses generated by themselves (Liu etal., 2024b; Thakur etal., 2024). Additionally, there is a mismatch between the preferences of LLMs as judges and those of humans, leading to evaluations that do not accurately reflect user preferences. Furthermore, using human annotators to rank model responses base on their subjective experiences is also not ideal, as there can be a mismatch between annotators’ preferences and actual user preferences.
To ensure that our evaluation framework accurately reflects real user preferences, we employ user-guided evaluation. Instead of directly asking annotators to subjectively rank model responses, we should ask them to rank those responses based on real users’ preferences. When using LLMs as evaluators, we provide an instance-level checklist to guide their assessments. In our preference data construction module, we not only construct preference data from user-LLM interactions but also summarize user preferences expressed in natural language. These preferences, based on real users’ textual feedback, can be used to align LLMs’s evaluation more closely with real users’ preferences. Our evaluation framework is adapted from WildBench (Lin etal., 2024a), which has been shown to correlate well with human judgement in ranking model performance as an automatic metric. We employ a pairwise evaluation strategy, where GPT-4 compares two different responses to determine which performs better on a given task, using an instance-level, preference-guided checklist to inform the comparison. This metric allows for straightforward comparisons among models, with easily interpretable win/lose rates as intermediate outcomes. The full prompt can be found in Appendix A.3.
3.4 WildFeedback Data
To demonstrate that the generated preferred responses align with actual user preferences, we randomly selected 500 samples from the WildFeedback datasets and performed user-guided evaluation (section3.3), comparing the preferred and dispreferred responses. As explained in Section section3.2, there are two versions of WildFeedback: the GPT-4 version and the on-policy version, which differ in whether the responses are generated by GPT-4 or the policy model. As shown in Figure 2, we found that without providing the summarized user preferences as checklists, GPT-4 tends to prefer the dispreferred responses in our dataset, which are the model’s zero-shot generations without guidance from summarized user preferences. However, after providing the preferences as checklists to guide the evaluation, GPT-4’s selections more closely align with real users’ preferences. Additionally, we observed that GPT-4 is significantly more steerable than smaller models: over 70% of its preferred responses align with in-situ user preferences, compared to only about 50% for smaller models. Consequently, for on-policy data, we additionally filter out any data that does not align with user preferences.
We also compare WildFeedback with current open-source datasets in Table 2 333 For UltraFeedback, we refer to the pre-processed, binarized version that was used to train Zephyr (Tunstall etal., 2023).. To the best of our knowledge, WildFeedback is the first multi-turn pairwise preference dataset constructed from real human-LLM interactions. It is also the only dataset derived from in-situ user feedback, unlike existing preference datasets that are annotated by human annotators or LLMs, which often fail to fully capture real users’ preferences. Additionally, although OpenAssistant Conversations (OASST1) (Köpf etal., 2023) also include multi-turn conversations, both its prompts and responses are entirely composed by human annotators, making it less reflective of the genuine dynamics of human-LLM interactions. Overall, WildFeedback outperforms existing datasets in accurately representing authentic human-LLM interactions, making it a more reliable resource for developing and evaluating preference-based models.
# Conv. |
|
|
| Feedback Type | |||||||
WebGPT (Nakano etal., 2022) | 38,925 | 51 | 188 | ✗ | Human Annotators | ||||||
Anthropic HH (Bai etal., 2022a) | 118,263 | 186 | 95 | ✗ | Human Annotators | ||||||
OASST1 (Köpf etal., 2023) | 35,905 | 168 | 221 | ✓ | Human Annotators | ||||||
UltraFeedback (Cui etal., 2024) | 61,135 | 159 | 256 | ✗ | GPT-4 | ||||||
WildFeedback (ours) | |||||||||||
GPT-4 | 20,281 | 929 | 440 | ✓ | In-situ Users | ||||||
Mistral | 9,601 | 1,063 | 362 | ||||||||
Phi 3 | 9,194 | 931 | 344 | ||||||||
LLaMA 3 | 10,659 | 982 | 376 |
4 Experiment
To validate the effectiveness of WildFeedback, We finetune Mistral, Phi 3, LLaMA 3 on it and compare their performances with the non-finetuned models.
Models and training settings.
In all our experiments, we use off-the-shelf, instruction-tuned models: Mistral, Phi 3, and LLaMA 3. We fine-tune each model on two distinct versions of WildFeedback: the GPT-4 version and the on-policy version. As described in Section section3.2, the GPT-4 version refers to the dataset where the preferred responses are generated by GPT-4 using summarized user preferences as a system prompt. The on-policy version, on the other hand, consists of both preferred and dispreferred responses generated directly by the policy models (Mistral, Phi 3, and LLaMA 3). For the preferred responses, the models are guided by additional summarized user preferences provided as the system prompt. We first perform 1 epoch of supervised fine-tuning (SFT) on the preferred responses, followed by 1 epoch of direct preference optimization (DPO) (Rafailov etal., 2023) on the entire dataset. Additionally, we find that tuning hyperparameters is crucial for achieving optimal performance. For more details, please refer to Appendix C.
In summary, we have 3 setups: the off-the-shelf Mistral/Phi 3/LLaMA 3 instruct models, Mistral/Phi 3/LLaMA 3 finetuned on the GPT-4 version of WildFeedback, and Mistral/Phi 3/LLaMA 3 finetuned on the on-policy version of WildFeedback.
Benchmarks Evaluation.
We evaluate our models using three of the most popular open-ended instruction-following benchmarks: MT-Bench (Zheng etal., 2023a), AlpacaEval 2 (Li etal., 2023b), and Arena-Hard (Li etal., 2024). AlpacaEval 2 consists of 805 questions from 5 datasets, and MT-Bench covers 8 categories with 80 questions. Arena-Hard is an enhanced version of MT-Bench, incorporating 500 well-defined technical problem-solving queries. We report scores following each benchmark’s evaluation protocol. For AlpacaEval 2, we report both the raw win rate (WR) and the length-controlled win rate (Dubois etal., 2024). The LC metric is specifically designed to be robust against model verbosity. For Arena-Hard, we report the win rate (WR) against the baseline model. We use GPT-4-Turbo (gpt-4-0125) as the judge for both AlpacaEval 2 and Arena-Hard. For MT-Bench, we report the average MT-Bench score with GPT-4o (gpt-4o-0513) as the judge. We use the same, default decoding strategies specified by the evaluation benchmarks.
WildFeedback Evaluation.
In addition to publicly available benchmarks, we also constructed our own evaluation benchmark from the held-out test set in WildFeedback and evaluated models using user-guided evaluation (section3.3). We ensured that all samples in the test set were sourced from conversations and users that were never included in the training set. Constructing an evaluation dataset for user-guided evaluation is not a trivial task, as we can no longer randomly or stratifiedly select test samples from different domains. In user-guided evaluation, we always provide a user-inspired checklist for GPT-4 to guide its evaluation, making it more aligned with real users’ preferences. However, individual user preferences can be highly subjective and specific. The goal of WildFeedback is not to align language models with the preferences of a specific individual but to learn the broader mode of all individuals’ preferences. Therefore, we must ensure that the preferences reflected in the test samples represent the majority view. Additionally, since the user preferences we extracted are often particular to specific tasks, we also need to ensure that the tasks in the test set are at least somewhat similar to those in the training set.
To achieve this, we utilized FAISS (Douze etal., 2024) to cluster user prompts and their summarized preferences. We grouped all user prompts into 70 clusters. Within each cluster, we selected 10 samples where the preferences were most similar to the other preferences in the same group. We then applied similar data curation techniques as described in WildBench (Lin etal., 2024a) to perform deduplication and remove nonsensical tasks, resulting in a final test set of 540 samples. This approach ensures that the evaluation set captures a representative range of user preferences, while also maintaining diversity within the clusters. By doing so, we aim to provide a more reliable and comprehensive evaluation that reflects the majority’s preferences without overfitting to specific, idiosyncratic cases. This method allows us to test the model’s ability to generalize across a broad spectrum of user needs, ultimately leading to a more robust and user-aligned language model.
For WildFeedback evaluation, we report the win, tie, lose percentage against the off-the-shelf instruct models with GPT-4 as the judge. We employ the WildBench prompt Lin etal. (2024a) to perform the evaluation, which has been shown to correlate well with human judgement in ranking model performance. We report the results evaluated with or without the user preferences provided as a checklist to guide GPT-4o evaluation.
5 Results and Analysis
ScoreModels AlpacaEval 2 Arena-Hard MT-Bench WildFeedback With Checklist WildFeedback Without Checklist LC (%) WR (%) WR (%) Win (%) Tie (%) Lose (%) Win (%) Tie (%) Lose (%) Phi 3 24.3 17.39 15.4 7.32 – – – – – – WF On-Policy 24.2 18.3 22.0 7.40 56.5 14.2 29.3 42.5 17.8 39.7 WF GPT-4 34.9 36.6 32.4 7.73 66.6 9.90 23.5 54.2 14.0 31.8 LLaMA 3 22.9 22.6 20.6 7.10 – – – – – – WF On-Policy 30.1 29.6 22.1 7.15 57.2 14.5 28.3 40.9 18.8 40.3 WF GPT-4 34.2 42.8 32.9 7.57 61.8 11.7 26.4 48.1 17.2 34.8 Mistral 17.1 14.7 12.6 6.71 – – – – – – WF On-Policy 12.9 12.3 10.3 6.42 52.1 11.2 36.7 37.4 16.1 46.5 WF GPT-4 31.4 36.1 19.8 6.79 62.8 9.70 27.4 50.4 14.0 35.6
In this section, we present the main results of our experiments, highlighting the effectiveness of WildFeedback on various benchmarks and ablation studies.
Training models on the GPT-4 version of WildFeedback can significantly and consistently boost model performance across all benchmarks.
As shown in Table 3, models trained with the GPT-4 version of WildFeedback exhibit higher win rates across AlpacaEval 2, Arena-Hard, and MT-Bench, as well as improved performance in both settings of WildFeedback (with and without a checklist). For instance, Phi 3’s win rate on AlpacaEval 2 increases from 17.39% to 36.6%, and its win rate on Arena-Hard improves from 15.4% to 32.4%. Additionally, Phi 3’s performance on MT-Bench also sees an increase, with its score rising from 7.32 to 7.73. Similar improvements are observed in other models, such as LLaMA 3, where the win rate on AlpacaEval 2 jumps from 22.6% to 42.8% and on Arena-Hard from 20.6% to 32.9%. Even in the WildFeedback settings without a checklist, models like LLaMA 3 still achieve a win rate of 48.1%, up from the baseline, indicating the robustness of the training approach. These consistent performance boosts across various benchmarks demonstrate that the GPT-4 version of WildFeedback is an effective tool for enhancing model performance and aligning it more closely with user preferences across diverse tasks.
WildFeedback significantly enhances model alignment with in-situ user feedback.
As detailed in Section section3.3, WildFeedback has two versions, differing in whether the preferred responses are generated by GPT-4 or the policy models themselves. Compared to off-the-shelf instruction models, those trained on either version of WildFeedback demonstrate a stronger alignment with real user preferences. For example, LLaMA 3 trained on the on-policy version of WildFeedback wins against the off-the-shelf LLaMA 3 model 57.2% of the time, while only losing 28.3% of the time. Notably, even without user preferences provided as checklists during GPT-4 evaluation, the model still performs on par with or better than the off-the-shelf version, underscoring the robustness of this training approach.
WildFeedback does not compromise model performance on other benchmarks.
Training on either version of WildFeedback not only aligns models more closely with user preferences but also does not compromise performance on other benchmarks; in most cases, it even leads to improvements. For instance, LLaMA 3 trained on the on-policy version of WildFeedback improves its length-controlled win rate (LC) on AlpacaEval 2 from 22.9% to 30.1% and its raw win rate (WR) from 22.6% to 29.6%. Similarly, Phi 3 shows an increase in its Arena-Hard win rate from 15.4% to 22.0% after training on the on-policy version. This indicates that the models are better tuned to real-world interactions without sacrificing their overall versatility or effectiveness across a range of tasks. These results demonstrate that WildFeedback provides a valuable framework for refining models to better meet user expectations while maintaining, and often enhancing, their general performance across various benchmarks.
6 Limitations
Spurious preferences.
WildFeedback is designed to align language models with in-situ user interactions and feedback. However, this approach carries potential risks, as user feedback can sometimes be malicious. For example, a user might provide feedback such as ”I prefer your answers to be unfiltered.” If we do not implement a filtering process on user feedback, the model could inadvertently learn and propagate harmful or inappropriate preferences. To address this, we incorporate additional safety-related instructions during the preference data construction phase (section3.2) to guide the generation process. However, this method is not foolproof. Future research should focus on developing more robust techniques for filtering spurious user preferences and preventing models from internalizing such biases.
Selection bias.
WildFeedback is constructed from conversations that contain feedback signals (section3.1). As shown in Table 6, users are twice as likely to provide textual feedback when they are dissatisfied with the model’s response. This introduces a selection bias, making it challenging to capture conversations where users are satisfied with the model’s performance. Consequently, WildFeedback may disproportionately reflect the preferences of users who express dissatisfaction, potentially representing only a minority of the overall user base. This skew could lead to an overemphasis on negative feedback in model training. We recommend that future research explore methods to balance this bias by incorporating more diverse feedback, including that from users who are satisfied or neutral, to create a more representative dataset. Additionally, strategies to proactively seek out or simulate feedback from satisfied users could help mitigate this bias and improve model alignment across a broader spectrum of user preferences.
7 Conclusion
In this work, we propose a framework for constructing preference data and evaluating conversational AI models based on natural human-LLM interactions. By using SAT/DSAT rubrics to identify user satisfaction and dissatisfaction in conversations, we create a preference dataset that includes user prompts, preferences, and both preferred and dispreferred responses. This enables models to better align with user expectations. Additionally, we introduce a user-guided evaluation framework that addresses biases in existing benchmarks by using real user feedback to guide LLM evaluations, ensuring a more accurate reflection of user preferences. Our approach emphasizes the importance of aligning AI responses with diverse and inclusive human values, improving overall user satisfaction.
References
- Abdin etal. (2024)Marah Abdin, SamAde Jacobs, AmmarAhmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio CésarTeodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, Xiyang Dai, AllieDel Giorno, Gustavo deRosa, Matthew Dixon, Ronen Eldan, Victor Fragoso, Dan Iter, Mei Gao, Min Gao, Jianfeng Gao, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, RussellJ. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, JamesR. Lee, YinTat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, CeLiu, Mengchen Liu, Weishung Liu, Eric Lin, Zeqi Lin, Chong Luo, Piyush Madan, Matt Mazzola, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, HeyangQin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, sh*tal Shah, Ning Shang, Hiteshi Sharma, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Xin Wang, Lijuan Wang, Chunyu Wang, YuWang, Rachel Ward, Guanhua Wang, Philipp Witte, Haiping Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Sonali Yadav, Fan Yang, Jianwei Yang, Ziyi Yang, Yifan Yang, Donghan Yu, LuYuan, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, LiLyna Zhang, YiZhang, Yue Zhang, Yunan Zhang, and Xiren Zhou.Phi-3 technical report: A highly capable language model locally on your phone, 2024.URL https://arxiv.org/abs/2404.14219.
- Anthropic (2023)Anthropic.The claude 3 model family: Opus, sonnet, haiku, 2023.URL https://api.semanticscholar.org/CorpusID:268232499.
- Bai etal. (2022a)Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan.Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022a.URL https://arxiv.org/abs/2204.05862.
- Bai etal. (2022b)Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, SheerEl Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, SamuelR. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan.Constitutional ai: Harmlessness from ai feedback, 2022b.URL https://arxiv.org/abs/2212.08073.
- Burns etal. (2023)Collin Burns, Pavel Izmailov, JanHendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu.Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023.URL https://arxiv.org/abs/2312.09390.
- Chung etal. (2022)HyungWon Chung, LeHou, Shayne Longpre, Barret Zoph, YiTay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, ShixiangShane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, EdH. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, QuocV. Le, and Jason Wei.Scaling instruction-finetuned language models, 2022.URL https://arxiv.org/abs/2210.11416.
- Cui etal. (2024)Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun.ULTRAFEEDBACK: Boosting language models with scaled AI feedback.In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 9722–9744. PMLR, 21–27 Jul 2024.URL https://proceedings.mlr.press/v235/cui24f.html.
- Das etal. (2023)Sarkar SnigdhaSarathi Das, Chirag Shah, Mengting Wan, Jennifer Neville, Longqi Yang, Reid Andersen, Georg Buscher, and Tara Safavi.S3-dst: Structured open-domain dialogue segmentation and state tracking in the era of llms, 2023.URL https://arxiv.org/abs/2309.08827.
- Don-Yehiya etal. (2024)Shachar Don-Yehiya, Leshem Choshen, and Omri Abend.Learning from naturally occurring feedback, 2024.URL https://arxiv.org/abs/2407.10944.
- Douze etal. (2024)Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou.The faiss library, 2024.
- Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, CristianCanton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livsh*ts, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, EricMichael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, GeorgiaLewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, HuXu, Hugo Touvron, Iliyan Zarov,ImanolArrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer vander Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, KalyanVasuden Alwala, Kartikeya Upasani, Kate Plawiak, KeLi, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Ksh*tiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens vander Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke deOliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, MiteshKumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivierduch*enne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, PunitSingh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, RicardoSilveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, SeohyunSonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu,Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, XiaoqingEllen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, YiWen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, ZacharieDelpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, BetoDe Paola, Bhargavi Paranjape, Bing Liu, BoWu, Boyu Ni, Braden Hanco*ck, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, CarlParker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, GabrielaMedina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli,Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, KamHou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, MichaelL. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, MiquelJubert Hermoso, MoMetanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, NataschaParks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, NikolayPavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, SaiJayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, ShengxinCindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, SunnyVirk, Suraj Subramanian, SyChoudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, VinaySatish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, VladTiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, YeHu, YeJia, YeQi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao.The llama 3 herd of models, 2024.URL https://arxiv.org/abs/2407.21783.
- Dubois etal. (2024)Yann Dubois, Balázs Galambosi, Percy Liang, and TatsunoriB. Hashimoto.Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2024.URL https://arxiv.org/abs/2404.04475.
- Ji etal. (2023)Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, CeBian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang.Beavertails: Towards improved safety alignment of llm via a human-preference dataset, 2023.URL https://arxiv.org/abs/2307.04657.
- Jiang etal. (2023)AlbertQ. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, LélioRenard Lavaud, Marie-Anne Lachaux, Pierre Stock, TevenLe Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and WilliamEl Sayed.Mistral 7b, 2023.URL https://arxiv.org/abs/2310.06825.
- Köpf etal. (2023)Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, ZhiRui Tam, Keith Stevens, Abdullah Barhoum, DucMinh Nguyen, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, DavidAlexandrovich Glushkov, ArnavVarma Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and AlexanderJulian Mattick.Openassistant conversations - democratizing large language model alignment.In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.URL https://openreview.net/forum?id=VSJotgbPHF.
- Lee etal. (2023)Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash.Rlaif: Scaling reinforcement learning from human feedback with ai feedback, 2023.URL https://arxiv.org/abs/2309.00267.
- Li etal. (2023a)Minzhi Li, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy Chen, Zhengyuan Liu, and Diyi Yang.CoAnnotating: Uncertainty-guided work allocation between human and large language models for data annotation.In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 1487–1505, Singapore, December 2023a. Association for Computational Linguistics.doi: 10.18653/v1/2023.emnlp-main.92.URL https://aclanthology.org/2023.emnlp-main.92.
- Li etal. (2024)Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, JosephE. Gonzalez, and Ion Stoica.From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline, 2024.URL https://arxiv.org/abs/2406.11939.
- Li etal. (2023b)Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and TatsunoriB. Hashimoto.Alpacaeval: An automatic evaluator of instruction-following models.https://github.com/tatsu-lab/alpaca_eval, 5 2023b.
- Lightman etal. (2023)Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe.Let’s verify step by step, 2023.URL https://arxiv.org/abs/2305.20050.
- Lin etal. (2024a)BillYuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, RonanLe Bras, and Yejin Choi.Wildbench: Benchmarking llms with challenging tasks from real users in the wild, 2024a.URL https://arxiv.org/abs/2406.04770.
- Lin etal. (2024b)Ying-Chun Lin, Jennifer Neville, JackW. Stokes, Longqi Yang, Tara Safavi, Mengting Wan, Scott Counts, Siddharth Suri, Reid Andersen, Xiaofeng Xu, Deepak Gupta, SujayKumar Jauhar, Xia Song, Georg Buscher, Saurabh Tiwary, Brent Hecht, and Jaime Teevan.Interpretable user satisfaction estimation for conversational systems with large language models, 2024b.URL https://arxiv.org/abs/2403.12388.
- Liu etal. (2024a)Jie Liu, Zhanhui Zhou, Jiaheng Liu, Xingyuan Bu, Chao Yang, Han-Sen Zhong, and Wanli Ouyang.Iterative length-regularized direct preference optimization: A case study on improving 7b language models to gpt-4 level, 2024a.URL https://arxiv.org/abs/2406.11817.
- Liu etal. (2024b)Yiqi Liu, NafiseSadat Moosavi, and Chenghua Lin.Llms as narcissistic evaluators: When ego inflates evaluation scores, 2024b.URL https://arxiv.org/abs/2311.09766.
- Madaan etal. (2023)Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, BodhisattwaPrasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark.Self-refine: Iterative refinement with self-feedback, 2023.URL https://arxiv.org/abs/2303.17651.
- Nakano etal. (2022)Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, XuJiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman.Webgpt: Browser-assisted question-answering with human feedback, 2022.URL https://arxiv.org/abs/2112.09332.
- OpenAI etal. (2024)OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, HyungWon Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, SimónPosada Fishman, Juston Forte, Isabella Fulford, LeoGao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, ShixiangShane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, NitishShirish Keskar, Tabarak Khan, Logan Kilpatrick, JongWook Kim, Christina Kim, Yongjik Kim, JanHendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, ChakMing Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, RyanLowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, ScottMayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe deAvila BelbutePeres, Michael Petrov, HenriquePonde deOliveiraPinto, Michael, Pokorny, Michelle Pokrass, VitchyrH. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez,Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, FelipePetroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, MadeleineB. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan FelipeCerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, JustinJay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJWeinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, ShengjiaZhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph.Gpt-4 technical report, 2024.URL https://arxiv.org/abs/2303.08774.
- Ouyang etal. (2022)Long Ouyang, Jeff Wu, XuJiang, Diogo Almeida, CarrollL. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe.Training language models to follow instructions with human feedback, 2022.URL https://arxiv.org/abs/2203.02155.
- Parmar etal. (2024)Jupinder Parmar, Sanjev Satheesh, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro.Reuse, don’t retrain: A recipe for continued pretraining of language models, 2024.URL https://arxiv.org/abs/2407.07263.
- Pei etal. (2022)Jiaxin Pei, Aparna Ananthasubramaniam, Xingyao Wang, Naitian Zhou, Apostolos Dedeloudis, Jackson Sargent, and David Jurgens.Potato: The portable text annotation tool.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2022.
- Rafailov etal. (2023)Rafael Rafailov, Archit Sharma, Eric Mitchell, ChristopherD Manning, Stefano Ermon, and Chelsea Finn.Direct preference optimization: Your language model is secretly a reward model.In Thirty-seventh Conference on Neural Information Processing Systems, 2023.URL https://openreview.net/forum?id=HPuSIXJaa9.
- Shi etal. (2024)Taiwei Shi, Kai Chen, and Jieyu Zhao.Safer-instruct: Aligning language models with automated preference data.In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 7636–7651, Mexico City, Mexico, June 2024. Association for Computational Linguistics.doi: 10.18653/v1/2024.naacl-long.422.URL https://aclanthology.org/2024.naacl-long.422.
- Shi etal. (2022)Weiyan Shi, Emily Dinan, Kurt Shuster, Jason Weston, and Jing Xu.When life gives you lemons, make cherryade: Converting feedback from bad responses into good labels, 2022.URL https://arxiv.org/abs/2210.15893.
- Team etal. (2024)Gemma Team, Morgane Riviere, Shreya Pathak, PierGiuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, CharlineLe Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean-Bastien Grill, Behnam Neyshabur, Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, BoWu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, ChristopherA. Choquette-Choo, Danila Sinopalnikov, David Weinberger, Dimple Vijaykumar, Dominika Rogozińska, Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins,Hadi Hashemi, Hanna Klimczak-Plucińska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, JinPeng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Juyeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, LarsLowe Sjoesund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago, Lilly McNealus, LivioBaldini Soares, Logan Kilpatrick, Lucas Dixon, Luciano Martins, Machel Reid, Manvinder Singh, Mark Iverson, Martin Görner, Mat Velloso, Mateo Wirth, Matt Davidow, Matt Miller, Matthew Rahtz, Matthew Watson, Meg Risdal, Mehran Kazemi, Michael Moynihan, Ming Zhang, Minsuk Kahng, Minwoo Park, Mofi Rahman, Mohit Khatwani, Natalie Dao, Nenshad Bardoliwalla, Nesh Devanathan, Neta Dumai, Nilay Chauhan, Oscar Wahltinez, Pankil Botarda, Parker Barnes, Paul Barham, Paul Michel, Pengchong Jin,Petko Georgiev, Phil Culliton, Pradeep Kuppala, Ramona Comanescu, Ramona Merhej, Reena Jana, RezaArdeshir Rokni, Rishabh Agarwal, Ryan Mullins, Samaneh Saadat, SaraMc Carthy, Sarah Perrin, Sébastien M.R. Arnold, Sebastian Krause, Shengyang Dai, Shruti Garg, Shruti Sheth, Sue Ronstrom, Susan Chan, Timothy Jordan, Ting Yu, Tom Eccles, Tom Hennigan, Tomas Kocisky, Tulsee Doshi, Vihan Jain, Vikas Yadav, Vilobh Meshram, Vishal Dharmadhikari, Warren Barkley, Wei Wei, Wenming Ye, Woohyun Han, Woosuk Kwon, Xiang Xu, Zhe Shen, Zhitao Gong, Zichuan Wei, Victor Cotruta, Phoebe Kirk, Anand Rao, Minh Giang, Ludovic Peran, Tris Warkentin, Eli Collins, Joelle Barral, Zoubin Ghahramani, Raia Hadsell, D.Sculley, Jeanine Banks, Anca Dragan, Slav Petrov, Oriol Vinyals, Jeff Dean, Demis Hassabis, Koray Kavukcuoglu, Clement Farabet, Elena Buchatskaya, Sebastian Borgeaud, Noah Fiedel, Armand Joulin, Kathleen Kenealy, Robert Dadashi, and Alek Andreev.Gemma 2: Improving open language models at a practical size, 2024.URL https://arxiv.org/abs/2408.00118.
- Thakur etal. (2024)AmanSingh Thakur, Kartik Choudhary, VenkatSrinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes.Judging the judges: Evaluating alignment and vulnerabilities in llms-as-judges, 2024.URL https://arxiv.org/abs/2406.12624.
- Tunstall etal. (2023)Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, AlexanderM. Rush, and Thomas Wolf.Zephyr: Direct distillation of lm alignment, 2023.URL https://arxiv.org/abs/2310.16944.
- Wang etal. (2022)Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, ArutSelvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, KuntalKumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, PhaniRohitha Kaza, Pulkit Verma, RavsehajSingh Puri, Rushang Karia, Savan Doshi, ShailajaKeyur Sampat, Siddhartha Mishra, Sujan ReddyA, Sumanta Patro, Tanay Dixit, and Xudong Shen.Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks.In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5085–5109, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.doi: 10.18653/v1/2022.emnlp-main.340.URL https://aclanthology.org/2022.emnlp-main.340.
- Wang etal. (2023)Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, NoahA. Smith, Daniel Khashabi, and Hannaneh Hajishirzi.Self-instruct: Aligning language models with self-generated instructions.In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, July 2023. Association for Computational Linguistics.doi: 10.18653/v1/2023.acl-long.754.URL https://aclanthology.org/2023.acl-long.754.
- Wu etal. (2021)Jeff Wu, Long Ouyang, DanielM. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano.Recursively summarizing books with human feedback, 2021.URL https://arxiv.org/abs/2109.10862.
- Xu etal. (2023)Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, PuZhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang.Wizardlm: Empowering large language models to follow complex instructions, 2023.URL https://arxiv.org/abs/2304.12244.
- Zhao etal. (2024)Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng.Wildchat: 1m chatgpt interaction logs in the wild, 2024.URL https://arxiv.org/abs/2405.01470.
- Zheng etal. (2023a)Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, ZiLin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, JosephE Gonzalez, and Ion Stoica.Judging llm-as-a-judge with mt-bench and chatbot arena.In A.Oh, T.Naumann, A.Globerson, K.Saenko, M.Hardt, and S.Levine (eds.), Advances in Neural Information Processing Systems, volume36, pp. 46595–46623. Curran Associates, Inc., 2023a.URL https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.
- Zheng etal. (2023b)Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, ZiLin, Zhuohan Li, Dacheng Li, EricP. Xing, Hao Zhang, JosephE. Gonzalez, and Ion Stoica.Judging llm-as-a-judge with mt-bench and chatbot arena, 2023b.URL https://arxiv.org/abs/2306.05685.
- Zheng etal. (2024)Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma.Llamafactory: Unified efficient fine-tuning of 100+ language models.In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics.URL http://arxiv.org/abs/2403.13372.
Appendix A Prompts
A.1 Prompt for Feedback Signals Identification
The following is the full prompt we used for dialogue state tracking and SAT/DSAT classification. In addition, we also prompt GPT-4 to do domain and intent classification. The prompt is adapted from Das etal. (2023) and Lin etal. (2024b).
## LABEL DEFINITION ##
{
"valid_preceding_topical_relation_labels": [
{
"label": "YES",
"definition": "The current turn has **some or any** topical/subtopical relation to the preceding conversation context."
},
{
"label": "NO",
"definition": "The current turn has **absolutely no** topical/subtopical relation to the preceding conversation context OR is the first turn in the conversation, marking the beginning of a new dialogue segment."
}
],
"valid_domain_labels": [
"AI MACHINE LEARNING AND DATA SCIENCE",
"ASTROLOGY",
"BIOLOGY AND LIFE SCIENCE",
"BUSINESS AND MARKETING",
"CAREER AND JOB APPLICATION",
"CLOTHING AND FASHION",
"COOKING FOOD AND DRINKS",
"CRAFTS",
"CULTURE AND HISTORY",
"CYBERSECURITY",
"DATING FRIENDSHIPS AND RELATIONSHIPS",
"DESIGN",
"EDUCATION",
"ENTERTAINMENT",
"ENVIRONMENT AGRICULTURE AND ENERGY",
"FAMILY PARENTING AND WEDDINGS",
"FINANCE AND ECONOMICS",
"GAMES",
"GEOGRAPHY AND GEOLOGY",
"HEALTH AND MEDICINE",
"HOUSING AND HOMES",
"HUMOR AND SARCASM",
"LANGUAGE",
"LAW AND POLITICS",
"LITERATURE AND POETRY",
"MANUFACTURING AND MATERIALS",
"MATH LOGIC AND STATISTICS",
"MUSIC AND AUDIO",
"NEWS",
"PETS AND ANIMALS",
"PHILOSOPHY",
"PHYSICS CHEMISTRY AND ASTRONOMY",
"PRODUCTIVITY",
"PSYCHOLOGY AND EMOTIONS",
"RELIGION AND MYTHOLOGY",
"SHIPPING AND DELIVERY",
"SHOPPING AND GIFTS",
"SMALL TALK",
"SOCIAL MEDIA",
"SOFTWARE AND WEB DEVELOPMENT",
"SPORTS AND FITNESS",
"TAXATION",
"TECHNOLOGY",
"TIME AND DATES",
"TRANSPORTATION AUTOMOTIVE AND AEROSPACE",
"TRAVEL",
"VISUAL ARTS AND PHOTOGRAPHY",
"WEATHER",
"WRITING JOURNALISM AND PUBLISHING",
"OTHER"
],
"valid_intent_labels": [
{
"label": "INTENT:1-INFORMATION_SEEKING",
"definition": "The user wants to find factual information or answers to specific questions."
},
{
"label": "INTENT:2-ANALYSIS",
"definition": "The user asks analytical or conceptual questions about a complex topic or problem. The user’s questions require some degree of reasoning, interpretation, argumentation, comparison, and/or data processing."
},
{
"label": "INTENT:3-CREATION",
"definition": "The user asks the agent to either generate original content or translate existing content into new content based on specified criteria or constraints."
},
{
"label": "INTENT:4-OPEN-ENDED_DISCOVERY",
"definition": "The user wants to casually chat or play with the agent out of curiosity, boredom, or humor, OR the user’s intent is so unclear/underspecified that it’s impossible to categorize in any of the other intent classes. The user mainly treats the agent as a conversation or chitchat partner, and none of the other intent categories can be assigned."
}
],
"valid_satisfaction_labels": [
{
"label": "Gratitude",
"definition": "The user thanks or compliments the AI agent for its responses"
},
{
"label": "Learning",
"definition": "The user learns something new or useful by indicating curiosity and satisfaction with the information provided"
},
{
"label": "Compliance",
"definition": "The user follows the AI agent’s suggestions or instructions when applicable"
},
{
"label": "Praise",
"definition": "The user uses positive feedback words (e.g., excellent, amazing) or emojis, indicating enthusiasm and enjoyment of the conversation"
},
{
"label": "Personal_Details",
"definition": "The user shares more personal details or opinions with the AI agent when satisfied with its responses"
},
{
"label": "Humor",
"definition": "The user jokes with or challenges the AI agent in a friendly manner when suitable"
},
{
"label": "Acknowledgment",
"definition": "The user acknowledges or confirms that they understood or agreed with the AI agent’s explanations when relevant"
},
{
"label": "Positive_Closure",
"definition": "The user ends the conversation on a positive note without asking for more information or assistance"
},
{
"label": "Getting_There",
"definition": "The user acknowledges that the model’s response is getting better or has merit but is not fully satisfied. Appropriate dissatisfaction criteria may need to be checked as well when Getting_There presents"
},
{
"label": "N/A",
"definition": "The user utterance of the turn does NOT match the definition of any other valid satisfaction labels"
}
],
"valid_dissatisfaction_labels": [
{
"label": "Negative_Feedback",
"definition": "The user explicitly expresses dissatisfaction, frustration, annoyance, or anger with the AI agent’s response or behavior"
},
{
"label": "Revision",
"definition": "The user explicitly asks the AI agent to revise its previous response or repeatedly asks similar questions"
},
{
"label": "Factual_Error",
"definition": "The user points out the AI agent’s factual mistakes, inaccuracies, or self-contradiction in its information or output"
},
{
"label": "Unrealistic_Expectation",
"definition": "The user has unrealistic expectations of what the AI agent can do and does not accept its limitations or alternatives"
},
{
"label": "No_Engagement",
"definition": "The user does not respond to the AI agent’s questions, suggestions, feedback requests, etc."
},
{
"label": "Ignored",
"definition": "The user implies that their query was ignored completely or that the response did not address their intent/goal at all"
},
{
"label": "Lower_Quality",
"definition": "The user perceives a decline in quality of service compared to previous experience with other agents/tools, etc."
},
{
"label": "Insufficient_Detail",
"definition": "The user wants more specific/useful information than what is provided by the AI agent"
},
{
"label": "Style",
"definition": "The user feels that there is a mismatch between their preferred style (e.g. bullet point vs paragraph, formal vs casual, short vs long, etc.) and what is provided by the AI agent"
},
{
"label": "N/A",
"definition": "The user utterance of the turn does NOT match the definition of any other valid dissatisfaction labels"
}
],
"valid_state_labels": [
{
"label": "FEEDBACK",
"definition": "The user utterance of the turn contains a comment or evaluation or judgement of the previous turn’s agent response"
},
{
"label": "REFINEMENT",
"definition": "The user utterance of the turn is a repetition or refinement of unclear/underspecified instruction given in the previous turn’s user utterance"
},
{
"label": "NEWTOPIC",
"definition": "The user utterance of the turn is either the first turn of the conversation or is not related in terms of topic or task to its previous turn, introducing a new topic or task"
},
{
"label": "CONTINUATION",
"definition": "The user utterance of the turn is a topical or logical continuation of the previous turn"
}
]
}
## TASK ##
You are given a dialogue between a user and an agent comprised of turns starting with T. For each turn, solely based on the turn’s User utterance, you must carefully analyze the conversation and answer the following questions by replacing $instruction$ with correct answers in JSON format.- Summarize the user utterance in sentences
- Analyze the user utterance’s relation with the previous turn and output an appropriate label from the “valid_preceding_topical_relation_labels” list.
- Analyze the user utterance’s domain and output an appropriate label from the “valid_domain_labels” list. If preceding_topical_relation is YES, the domain label must be consistent with the preceding turn’s domain label.
- Analyze the user utterance’s intent and output an appropriate label from the ”valid_intent_labels” list.
- Analyze the user utterance’s satisfaction with respect to the previous turn’s AI response and output all applicable labels from the “valid_satisfaction_labels” list.
- Analyze the user utterance’s dissatisfaction with respect to the previous turn’s AI response and output all applicable labels from the “valid_dissatisfaction_labels” list.
- Analyze the user utterance’s state and output an appropriate label from the “valid_state_labels” list.
## OUTPUT FORMAT ##
The length and turn order of the output list must match the length and turn order of the input list. The sample output format is given as follow:[{
"T-$turn number$": {
"summary": "$turn summary in sentence$",
"preceding_topical_relation": "$an appropriate valid preceding topical relation label$",
"domain": "$an appropriate valid domain label$",
"intent": "INTENT:$an appropriate valid intent label$",
"satisfaction": [$a comma separated string list of applicable valid satisfaction label(s)$],
"dissatisfaction": [$a comma separated string list of applicable valid dissatisfaction label(s)$],
"state": "$an appropriate valid state label$"
}
}]
## INPUT ##
#D1#
## OUTPUT ##
A.2 Prompt for Preference Data Construction
The following is the prompt for constructing preference data.
# Conversation between User and AI
history
# Instruction
What are the user’s query and preferences? The query should be the user’s first attempt before providing any feedbacks to the model. Only output the turn id. The preference should always be based on user’s feedbacks and in complete sentences. Generate your answer in json format like
[{
"query": turn id,
"preferences": [preference 1, preference 2, ...]
}]
A.3 Prompt for User-guided Evaluation
The following is the prompt for user-guided evaluation. We borrow the WB-Reward prompt from WildBench (Lin etal., 2024a).
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by two AI models. We will provide you with the user query and a pair of AI-generated responses (Response A and B). You should first read the user query and the conversation history carefully for analyzing the task, and then evaluate the quality of the responses based on and rules provided below.
# Conversation between User and AI
## History
{history}
## Current User Query
{query}
## Response A
{response_a}
## Response B
{response_b}
# Evaluation
## Checklist
{checklist}
Please use this checklist to guide your evaluation, but do not limit your assessment to the checklist.
## Rules
You should compare the above two responses based on your analysis of the user queries and the conversation history. You should first write down your analysis and the checklist that you used for the evaluation, and then provide your assessment according to the checklist. There are five choices to give your final assessment: [“A++”, “A+”, “A=B”, “B+”, “B++”], which correspond to the following meanings:
- ‘A++’: Response A is much better than Response B.
- ‘A+’: Response A is only slightly better than Response B.
- ‘A=B’: Response A and B are of the same quality. Please use this choice sparingly.
- ‘B+’: Response B is only slightly better than Response A.
- ‘B++’: Response B is much better than Response A.
## Output Format
First, please output your analysis for each model response, and then summarize your assessment to three aspects: “reason A=B”, “reason A B”, and “reason B A”, and finally make your choice for the final assessment. Please provide your evaluation results in the following json format by filling in the placeholders in []:
{
"analysis of A": "[analysis of Response A]",
"analysis of B": "[analysis of Response B]",
"reason of A=B": "[where Response A and B perform equally well]",
"reason of A>B": "[where Response A is better than Response B]",
"reason of B>A": "[where Response B is better than Response A]",
"choice": "[A++ or A+ or A=B or B+ or B++]"
}
Appendix B SAT and DSAT
B.1 Detailed SAT and DSAT Criteria
The detailed definitions of SAT and DSAT can be found in Table 4 and Table 5.
Keyword | Definition |
---|---|
Gratitude | The user thanks or compliments the AI agent for its responses. |
Learning | The user learns something new or useful by indicating curiosity and satisfaction with the information provided. |
Compliance | The user follows the AI agent’s suggestions or instructions when applicable. |
Praise | The user uses positive feedback words (e.g., excellent, amazing) or emojis, indicating enthusiasm and enjoyment of the conversation. |
Personal Details | The user shares more personal details or opinions with the AI agent when satisfied with its responses. |
Humor | The user jokes with or challenges the AI agent in a friendly manner when suitable. |
Acknowledgment | The user acknowledges or confirms that they understood or agreed with the AI agent’s explanations when relevant. |
Positive Closure | The user ends the conversation on a positive note without asking for more information or assistance. |
Getting There | The user acknowledges that the model’s response is getting better or has merit but is not fully satisfied. |
Keyword | Definition |
---|---|
Negative Feedback | The user explicitly expresses dissatisfaction, frustration, annoyance, or anger with the AI agent’s response or behavior. |
Revision | The user explicitly asks the AI agent to revise its previous response or repeatedly asks similar questions. |
Factual Error | The user points out the AI agent’s factual mistakes, inaccuracies, or self-contradiction in its information or output. |
Unrealistic Expectation | The user has unrealistic expectations of what the AI agent can do and does not accept its limitations or alternatives. |
No Engagement | The user does not respond to the AI agent’s questions, suggestions, feedback requests, etc. |
Ignored | The user implies that their query was ignored completely or that the response did not address their intent/goal at all. |
Lower Quality | The user perceives a decline in quality of service compared to previous experience with other agents/tools, etc. |
Insufficient Detail | The user wants more specific/useful information than what is provided by the AI agent. |
Style | The user feels that there is a mismatch between their preferred style and what is provided by the AI agent. |
B.2 SAT and DSAT Annotation
GPT-4’s performances on SAT and DSAT classification can be found in table 6. GPT-4 demonstrates strong performance in classifying SAT (satisfaction) signals, with high accuracy at 91.7% and balanced precision and recall, both around 73%. The Cohen’s Kappa of 68.5% reflects substantial agreement with human annotators. For DSAT (dissatisfaction) signals, GPT-4 achieves a precision of 83.3%, with a recall of 48.4%, leading to an F1 score of 61.2% and a Cohen’s Kappa of 50.4%. These metrics indicate that GPT-4 is effective at recognizing both SAT and DSAT signals. For human annotation, we utilized a web-based annotation tool named Potato (Pei etal., 2022). The interface is shown in Figure 3.
Accuracy | Precision | Recall | F1 | GPT-Human | Human-Human | |
---|---|---|---|---|---|---|
SAT | 91.7 | 73.2 | 73.6 | 73.4 | 68.5 | 70.0 |
DSAT | 81.8 | 83.3 | 48.4 | 61.2 | 50.4 | 54.1 |
Appendix C Implementation Details
We found that hyperparameter tuning is crucial for achieving optimal performance in preference optimization. Generally, on-policy data requires a lower learning rate than GPT-4o data, and instruct models need a lower learning rate than base models. Specifically, Mistral and Gemma (Team etal., 2024) require a lower learning rate than Phi 3 and LLaMA 3. Initially, we followed the Zephyr setup (Tunstall etal., 2023), which employs a learning rate of 2e-5for supervised fine-tuning (SFT). However, we found that our models quickly collapsed, failing to generate sensible outputs after just a few dozen iterations. After conducting a grid search on the hyperparameters for both SFT and DPO training, we discovered that while it is acceptable to use a larger learning rate for training base models, a much smaller learning rate is required for instruct models, likely due to the various annealing techniques applied during the post-training process (Parmar etal., 2024). We also explored NLL regularization (Liu etal., 2024a) with a regularization strength of 0.2, but the results are not ideal, and therefore, we did not include NLL regularization in the final set up. We trained all the models using LLaMA Factory (Zheng etal., 2024), a unified efficient LLM finetuning framework. The following is the hyperparameters we used in our final experiment.
SFT Training.
For SFT training, we trained all the models for 1 epoch with a batch size of 128, a learning rate of 5e-6, a linear warm-up ratio of 0.1, and a cosine learning rate scheduler. Better results may be achievable by decreasing the learning rate for Mistral. Additionally, it is recommended to use a higher learning rate (e.g., 2e-5) if you are fine-tuning from the base models.
DPO Training.
For DPO training, we trained all the models for 1 epoch with a batch size of 32, a learning rate of 5e-7, and . All other hyperparameters remained the same as in the SFT training.