Jiaqi Zhang, Chen Gao, Liyuan Zhang, Yong Li*, Hongzhi Yin*
Tsinghua University, University of Queensland
*Corresponding authors: Yong Li <liyong07@tsinghua.edu.cn>, Hongzhi Yin <h.yin1@uq.edu.au>
Abstract
Recent advances in embodied agents with multimodal perception and reasoning capabilities, built on large vision-language models (LVLMs), enable them to autonomously interact with either real or cyber worlds, helping people make intelligent decisions in complex environments. However, current works are typically optimized with golden action trajectories or ideal task-oriented solutions toward a definitive goal. This paradigm considers few user-oriented factors, which may explain the performance drop in a wide range of personal assistant applications. To address this, we propose Chain-of-User-Thought (COUT), a novel embodied reasoning paradigm that follows a chain of thought from basic action thinking to explicit and implicit personalized preference thinking, so as to incorporate personalized factors into autonomous agent learning. The main challenges of achieving COUT are: 1) defining embodied personalized tasks, 2) building an embodied environment that epitomizes personalized preference, and 3) modeling embodied personalized actions. To target COUT, we introduce SmartAgent, an agent framework that perceives cyber environments and reasons about personalized requirements by: 1) interacting with the GUI to access an item pool, 2) generating users' explicit requirements implied by previous actions, and 3) recommending items to fulfill users' implicit requirements. To demonstrate SmartAgent's capabilities, we also create a brand-new dataset, SmartSpot, that offers a full-stage, personalized, action-involved environment. To the best of our knowledge, our work is the first to formulate the COUT process, serving as a preliminary attempt towards embodied personalized agent learning. Our extensive experiments on SmartSpot illuminate SmartAgent's functionality across a series of embodied and personalized sub-tasks. We will release code and data upon paper notification at https://github.com/tsinghua-fib-lab/SmartAgent.
1 Introduction
Embodied artificial intelligence[9] is considered a crucial stride toward Artificial General Intelligence (AGI)[10]. Powered by recent advances in large multi-modal models, embodied agents have been built to behave like real humans, capable of perceiving, reasoning, and acting in their surroundings in both real and cyber worlds. The enthusiasm for deploying such humanoid capabilities is evident in various tasks, including autonomous robotics[7, 2], game AI[44, 35], smart device assistants[30, 17], and smart cities[12, 42]. Many of these scenarios require embodied agents to do more than follow instructions and execute actions like emotionless robots; rather, they are expected to serve as personal assistants attuned to human preferences. For instance, in smart device assistance, agents struggle to personalize responses to ambiguous user queries such as providing music recommendations, even though they fully understand the operation logic of a music player. Generally speaking, a fully functional embodied agent requires personalized perceptual capabilities, thereby enabling a comprehensive agent-environment-user triadic perception of the world, similar to JARVIS (short for "Just a Rather Very Intelligent System"), the fictional artificial assistant created by Iron Man in the Marvel Comics, which makes personalized intelligent decisions for him.
However, this personalized consideration is absent from current embodied agent works, whose optimization normally relies on golden action trajectories[31, 6] or ideal task-oriented solutions[8, 22]. Although these fixed paths can effectively accomplish task goals, they only train embodied agents to be rigid task-oriented problem solvers, overlooking the multiple valid approaches that often exist as user-oriented indicators. Furthermore, the practical environment may exhibit unpredictable behavior, such as when new functions are involved or unseen scenes are observed. In these cases, task-oriented agents are not flexible enough even to capture dynamic changes in basic task goals[23, 34], let alone discern changes in user preference. As a result, such training paradigms actually restrict the learning of user-oriented perceptual capabilities, which could explain the lower performance in many embodied personalized scenarios. For example, in the daily usage of smart devices shown in Fig. 1, there is no golden action line to access a music pool; rather, diverse paths imply different user intents[30]. That is, such embodied agents are designed only to use the product (a music App on mobile phones) and are unable to reason about users' music preferences or extract the user intentions behind usage habits and collected behaviors.
In this work, we propose a novel reasoning paradigm, Chain-of-User-Thought (COUT), which extends embodied agents from task-only optimization to personalization-oriented optimization. We summarize the COUT process as training agents with progressive thinking: from basic embodied behaviors, gradually to explicit requirement reasoning, and finally to high-level implicit personalization understanding, as shown in Fig. 1. Note that the chain of user thoughts is a general paradigm and thus can be flexibly adapted to various cyber environments to enhance personalized reasoning. However, there exist several critical challenges to achieving COUT, as follows.
- •
First, the learning task of embodied personalized agents has not been systematically defined, because the task goals are often ambiguous user queries, which go beyond existing works that assume explicit task instructions.
- •
Second, there is a lack of suitable datasets and benchmarks to support research on COUT. The commonly used datasets generally do not include personalized features.
- •
Third, the modeling of personalized features is underexplored, largely due to the absence of clear task definition and supportive training environments.
To support COUT research, we collect and construct the first embodied AI benchmark with explicit personalization-related evaluations, SmartSpot. It comprises five single channels and two multi-channel scenarios to simulate complex real-world environments, featuring a total of 144 episodes and over 1,400 steps. To address the above challenges, we introduce SmartAgent, the first embodied personalized agent. It takes visual observations of GUI screenshots and textual instructions as task input and generates multi-step thoughts. Specifically, SmartAgent undergoes a two-stage training process: an embodiment stage and a personalization stage, as illustrated in Fig. 3.

In the embodiment stage, the agent takes GUI action and item pool screenshots as visual inputs, along with other textual contexts. After a series of encoding processes, these multimodal tokens are fed into a Perceiver model to generate specific GUI actions, regarded as Thought #1. Based on these initial GUI thoughts, a Reasoner model then infers Thought #2, identifying the user's potential underlying requirements in a short textual output. Finally, in the personalization stage, powered by the deeper-level Thought #2, the same Perceiver directly outputs the recommendation result as "Yes" or "No", designated as Thought #3.
We quantitatively evaluate SmartAgent's functionality on comprehensive embodied and personalized sub-tasks, including GUI Grounding, Autonomous GUI Operation, Underlying Reasoning, Personalized Recommendation, and Zero-shot Reasoning, as illustrated in Fig. 2. The results indicate that (i) through the three-step COUT process with efficient LoRA tuning, SmartAgent achieves the first full-stage embodied personalized reasoning; (ii) SmartAgent delivers performance comparable to state-of-the-art task-specific models on the fundamental GUI Grounding and advanced Autonomous GUI Operation tasks, generating accurate action commands; (iii) SmartAgent shows proficiency in reasoning about explicit underlying intentions, effectively bridging surface-level operations to more implicit-level user needs; (iv) SmartAgent excels at leveraging these preceding thoughts to uncover users' implicit item requirements; (v) SmartAgent manifests zero-shot reasoning capabilities across new channels, a hallmark of a well-established embodied agent. We also present qualitative evaluations to illustrate SmartAgent's effectiveness and proficiency in personal assistant scenarios. Our contributions are summarized as follows:
- •
We take a pioneering step in formulating a novel reasoning paradigm, COUT, for embodied AI, which introduces personalized concerns that have not been addressed in previous embodied AI research. To support COUT, we construct the SmartSpot evaluation benchmark as the first case environment designed to support further research.
- •
We develop the first embodied personalized agent, SmartAgent, built upon a meticulous two-stage training process that effectively implements the three-step COUT thinking process. SmartAgent is capable of performing a range of tasks, from basic embodied functions to both explicit- and implicit-level personalized reasoning.
- •
Our rigorous experiments demonstrate the outstanding performance of SmartAgent on both overall evaluations and individual sub-tasks, showing notable embodied personalized reasoning and zero-shot reasoning capabilities.
2 Related Work
Embodied AI with Large Vision-Language Models. Embodied AI marks a significant transformation from traditional artificial intelligence, which relies on static datasets, to autonomous agents that learn through interactions with their environments. Thanks to the rapid development of LVLMs, embodied agents can now process purely visual and textual observations as input. Current efforts for training such embodied agents can be categorized into two main directions, considering fixed and dynamic environments respectively[30]. Many existing approaches follow fixed environments, comparing agents' action trajectories to pre-collected human demonstrations[20, 43, 28, 26, 52, 50]. LEO[20] presents a generalist agent capable of interacting with 3D environments for various vision-language tasks, following ground-truth solutions. GAIA[28] introduces a fixed dataset that evaluates agents' ability to interact with live web environments. Towards more realistic evaluation, emerging research has introduced dynamic environments where agents learn interactively from mistakes to probe the boundaries of systems[2, 16, 30, 35, 13, 41]. CRADLE[35] tests a General Computer Control setting in real-world AAA games, serving as a milestone towards AGI in the digital world. EmbodiedCity[13] introduces the first platform with a high-quality 3D environment based on a real city, providing systematic benchmark tasks for embodied intelligence.
Though the above training environments inspire promising embodied capabilities on various tasks, they primarily optimize agents as ideal problem solvers. This approach remains unsuitable for many personal assistant scenarios where user-oriented considerations are critical for embodied decision-making. SmartAgent is the first work that aims to address this by integrating both embodied and personalized capabilities.
GUI Navigation. Automatic execution of user instructions on smart devices such as smartphones and desktops, i.e., Graphic User Interface (GUI) Navigation, is an advanced task, as it requires agents to perceive, reason, and finally make decisions. As a result, GUI navigation is one of the typical scenarios where the personalized intention underlying an instruction can greatly influence embodied behavior. Early research in this field usually relies on intermediate observations such as GUI layout annotations or text-only HTML[36, 53, 22]. Recently, a number of works have utilized LVLMs' multimodal knowledge and emergent zero-shot embodied abilities to perform GUI Navigation[50, 45, 5, 37, 17, 27, 3]. SeeClick[3] presents an LVLM-based agent that relies purely on interface screenshots, enhanced with GUI grounding-specific pre-training. COAT[52] introduces the AITZ dataset with semantic prompts, equipping agents with fine-grained step-wise reasoning, and demonstrates effectiveness on three classic benchmarks: AiTW[31], MiniWob[33], and Mind2Web[6]. Despite the advances of the above premier GUI navigators, they are primarily designed to respond to instructions with clear ground-truth outcomes that can be accurately evaluated through screenshots. In practice, however, user queries often appear as ambiguous instructions, a hallmark of personal assistant tasks that previous works have not addressed. In this paper, we take the GUI Navigation task as a case embodied environment to demonstrate SmartAgent's effectiveness in analyzing ambiguous user thoughts.
Personalized Recommendation. Recommendation, a classic task in data mining and information retrieval, specializes in modeling user preferences based on previous interactions. Recommendation models have found wide-ranging applications, such as e-commerce websites[18], search engines[29], streaming services[51], and advertising systems[15]. These applications inherently integrate closely with personal smart devices as interfaces, allowing them to collect changes in users' historical interactions[46, 48]. Existing studies typically represent this personalized preference through sequences of item IDs[24, 32, 21, 47] or richer item content[49, 51, 11, 4]. With the enhancement of LVLMs, modeling users' multimodal behavior has emerged as a new research hotspot in this field[25, 39, 14, 40]. This multimodal capability allows personalized recommendation at more diverse levels, including screen-operation-level behavior that was previously unobservable in back-end item sequences. In this paper, we integrate the multimodal recommendation task into an embodied GUI case to enable the principles of COUT, creating the first embodied environment with explicit personalized evaluation.
3 Chain-of-User-Thought (COUT)
3.1 Definition
Chain-of-User-Thought (COUT) is a reasoning paradigm in which an agent controls smart devices based on both the task goal and the user's personalized preference:

$$a_t = \pi\left(o_t, g, u\right), \quad a_t \in \mathcal{A}, \tag{1}$$

where $\mathcal{A}$ is the action space, $o_t$ is the observation at step $t$, $g$ is the task goal, and $u$ is the user's personalized preference. This differs from existing work, which relies solely on the task goal:

$$a_t = \pi\left(o_t, g\right). \tag{2}$$

Specifically, the agent is required to generate $a_t$ through a progressive reasoning chain: from the basic embodied behavior level, gradually to the deeper explicit personalized reasoning level, and finally to the high-level implicit personalized reasoning level. This process requires embodied agents to develop a multi-faceted understanding of users' personalized preferences.
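To make the progressive chain concrete, the following minimal Python sketch shows how the three thought levels could be queried in sequence from a single LVLM. The `lvlm_generate` callable and the prompt wording are illustrative assumptions, not SmartAgent's released interface.

```python
# Minimal sketch of one COUT reasoning step (illustrative only).
# `lvlm_generate` is a hypothetical LVLM call: (image, prompt) -> text.

def cout_step(lvlm_generate, screenshot, instruction, history):
    # Thought #1: basic embodied behavior (a concrete GUI action).
    thought1 = lvlm_generate(
        image=screenshot,
        prompt=f"Instruction: {instruction}\nPrevious actions: {history}\n"
               "Predict the next GUI action.")

    # Thought #2: explicit personalized reasoning summarized from the trace.
    thought2 = lvlm_generate(
        image=screenshot,
        prompt=f"Actions so far: {history + [thought1]}\n"
               "Summarize the user's explicit underlying requirement.")

    # Thought #3: implicit personalized reasoning (recommend or not).
    thought3 = lvlm_generate(
        image=screenshot,
        prompt=f"Underlying requirement: {thought2}\n"
               "Should this item be recommended? Answer Yes or No.")

    return thought1, thought2, thought3
```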
Table 1: SmartSpot data statistics.

| Scenario | Channel | #Episodes | #Steps: GUI | #Steps: Item Pool | #Steps: RS | Instruction Mean | Underlying Mean |
|---|---|---|---|---|---|---|---|
| Single-channel | FOOD | 20 | 7.00 | 1.00 | 4.00 | 12.90 | 27.50 |
| Single-channel | HOTEL | 20 | 11.00 | 1.00 | 4.00 | 14.80 | 35.45 |
| Single-channel | FLIGHT | 20 | 9.20 | 1.00 | 4.50 | 19.95 | 24.25 |
| Single-channel | MOVIE | 20 | 8.00 | 1.00 | 7.00 | 13.45 | 34.15 |
| Single-channel | MEDICINE | 20 | 6.55 | 1.00 | 4.20 | 13.15 | 19.00 |
| Multi-channel | TRAVEL1 | 12 | 17.30 | 1.00 | 4.00 | 23.10 | 48.30 |
| Multi-channel | TRAVEL2 | 10 | 21.33 | 1.00 | 4.00 | 23.67 | 55.25 |
3.2 Components
As illustrated in Eq. 1, we formulate the COUT process in a common cyber-world case, personal assistance on smart devices, in terms of three stages of thoughts: a) Thought #1 GUI Navigation, b) Thought #2 Underlying Reasoning, and c) Thought #3 Personalized Recommendation. Thought #1 denotes the surface-level thought for basic embodied behavior on the device GUI, e.g., "This action clicks a button named [Button_name]." Thought #2 denotes the deeper-level thought for explicit user preference, e.g., "It seems the user needs items with [Restriction1], [Restriction2], and [Restriction3]." Thought #3 denotes the high-level thought for implicit user preference, e.g., "I recommend [item#1] from the item pool."
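For illustration, the three thoughts can be viewed as one structured record per COUT step. The dataclass below is a hypothetical container whose example values copy the phrasings above; it is not a fixed output schema of SmartAgent.

```python
from dataclasses import dataclass

# Illustrative container for the three thought levels of one COUT step.
@dataclass
class CoutThoughts:
    gui_navigation: str        # Thought #1: surface-level embodied action
    underlying_reasoning: str  # Thought #2: explicit user preference
    recommendation: str        # Thought #3: implicit user preference

example = CoutThoughts(
    gui_navigation="This action clicks a button named [Button_name].",
    underlying_reasoning="It seems the user needs items with [Restriction1], "
                         "[Restriction2], and [Restriction3].",
    recommendation="I recommend [item#1] from the item pool.",
)
```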
Challenges. There are several key challenges to achieving COUT. The first is that task goals are often ambiguous user queries, lacking definitive goals matched to the observations. For example, existing embodied agents are trained either to touch a specific object with robotic arms in 3D space or to click a particular button on a 2D screen. However, such golden targets are not present in user queries, as personalized preference is typically subjective and nonverbal. How to define the personalized task in embodied environments and how to evaluate it remain unknown and challenging. Second, the deficiency of supportive data for COUT research poses a considerable challenge. The commonly used datasets are typically collected from task-oriented demonstrations. This gap highlights the need for more comprehensive datasets that can better facilitate advancements in COUT research. Third, due to the lack of clear task definitions and the absence of supportive training environments, methods for analyzing personalized features have not been thoroughly explored.
4 The SmartSpot Benchmark
Given the scarcity of training data for embodied agents that explicitly captures the personalized analysis highlighted in the COUT paradigm, we propose to construct a novel benchmark named SmartSpot.
4.1 Dataset Summary
We collect SmartSpot data from Meituan, a well-known Internet service platform in China that offers a variety of life services, including food recommendations, hotel bookings, online flight ticket sales, etc. To create more practical personal scenes, we develop SmartSpot with two scenarios: the single-channel scenario, which focuses on one type of service, and the multi-channel scenario, which combines two single channels. We select five of the most frequently used single channels: Food, Hotel, Flight, Movie, and Medicine. For the multi-channel scenarios, we pair Flight and Food as Travel1, and Flight and Hotel as Travel2. These combinations reflect more complicated and practical situations, such as traveling to a destination and booking hotels or restaurants. The data in every scenario are GUI action episodes, each containing several steps to complete an instruction. Specifically, each episode consists of three groups of steps, corresponding to the #Steps columns in Tab. 1: the "GUI" steps denote a series of GUI actions, such as entering and completing a search bar, to access an item pool; the "Item Pool" step denotes the page of the found item pool; and the "RS" steps denote the detail pages of each item awaiting recommendation. Each step contains a GUI screenshot, a ground-truth action (possibly with a bounding box or textual value), a list of previous actions, and an episodic instruction with a corresponding underlying requirement. In total, SmartSpot covers seven scenarios that present a wealth of personal assistant tasks, supporting broader embodied personalized research. The full data statistics are shown in Tab. 1.
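As an illustration of how one SmartSpot step could be laid out, the dictionary below mirrors the fields described above (screenshot, ground-truth action with optional bounding box or text value, previous actions, instruction, and underlying requirement). The key names and example values are hypothetical and may differ from the released files.

```python
# Hypothetical JSON-like layout for one SmartSpot step (illustrative only).
step_example = {
    "channel": "FLIGHT",                      # one of the seven scenarios
    "group": "GUI",                           # "GUI" | "Item Pool" | "RS"
    "screenshot": "episodes/flight_003/step_2.png",
    "instruction": "...",                     # episodic, possibly ambiguous query
    "underlying_requirement": "...",          # paired explicit ground truth
    "previous_actions": ["click(0.21, 0.88)", "type('Beijing')"],
    "ground_truth_action": {
        "action_type_id": 4,                  # click (see Action Space, Sec. 8)
        "bbox": [0.18, 0.42, 0.35, 0.47],     # present for click actions
        "text": None,                         # present for type actions
    },
}
```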
Action Space. SmartSpot provides common human-GUI interaction operations. Following AiTW[31], each action type is assigned an action_type_id as its ground-truth label. We detail the full action space in Sec. 8 of the Appendix.

4.2 Dataset Collection & Analysis
We curate SmartSpot following real-life usage of the platform. The process begins with generating pairs of user instructions and their underlying requirements. To ensure consistency, we establish intention seeds to generate both simultaneously. Specifically, we recruit annotators experienced with this platform to select 2-3 significant attributes per channel as seeds for searching specific item pools. For example, "[recently played]" in Fig. 1 is one seed result, which could instead be any other option that users can click on. By creating multiple permutations of intention seeds, we collect a diverse batch of instruction pairs while ensuring there are no duplicates. Then, according to these instruction pairs, the annotators perform episodic GUI operations, capturing screenshots and ground-truth actions along with their bounding boxes at each step. All episodes are completed based on the annotators' personal usage habits on the platform. In total, we gather 144 episodes with over 1,400 steps.
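The seed-permutation idea can be sketched as follows; the seed strings and instruction templates are invented placeholders, shown only to illustrate how distinct instruction pairs are produced without duplicates.

```python
from itertools import permutations

# Sketch of combining intention seeds into instruction / requirement pairs.
# The seed values below are placeholders, not SmartSpot annotations.
seeds = ["economy class", "departure before noon", "low price"]

instruction_pairs = set()
for k in (2, 3):                               # 2-3 significant attributes per channel
    for combo in permutations(seeds, k):
        instruction = f"Find me a flight ({', '.join(combo)})."
        underlying = "The user needs a flight with " + ", ".join(combo) + "."
        instruction_pairs.add((instruction, underlying))   # set() removes duplicates
```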
5 SmartAgent Approach
5.1 Overview
Given a user instruction to complete, the agent navigates through an episode comprising three groups of steps: searching for an item pool, finding the item pool, and making item recommendations. At time step $t$, the agent receives a screenshot observation $o_t$. Then, the agent takes an action $a_t$ according to its current assets $(o_t, h_t)$, where $h_t = \{a_1, \dots, a_{t-1}\}$ denotes the historical actions. The chosen action signifies either a GUI command, a signal that an item pool has been found, or a recommendation result for items within that pool. We detail the action space in Sec. 4.
The primary design principles of SmartAgent are two-fold: i) it should process multimodal inputs of high-resolution screenshot images and textual instructions, and produce embodied actions along with textual thoughts as output; ii) it should handle ambiguous instructions as reasoning goals at all stages of COUT. We therefore transform all data of different modalities into a single token sequence, illustrated below:
$$x = \big[\underbrace{v_1, \dots, v_m}_{\text{visual tokens}},\ \underbrace{w_1, \dots, w_n}_{\text{textual tokens}}\big]. \tag{3}$$
Using this representation, we formulate the learning of SmartAgent as a GPT-style auto-regressive approach, in line with[20]. For instance, in Fig. 1, given a smartphone screenshot and the user's instruction "Can you give me some music recommendations?", we craft a query prompt such as: "Please generate the next action according to the <screenshot> and <instruction>". Next, we detail the training process, evaluation, and detailed settings.
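A minimal sketch of how such a query could be serialized into the single token sequence of Eq. (3) is given below, assuming a Qwen-VL-style `<img>` tag and a Hugging Face tokenizer; the exact tags and prompt template used by SmartAgent may differ.

```python
# Hedged sketch of query serialization (tag names and template are assumptions).
def build_query(tokenizer, screenshot_path, instruction, history):
    prompt = (
        f"Picture: <img>{screenshot_path}</img>\n"            # visual tokens
        f"Instruction: {instruction}\n"
        f"Previous actions: {'; '.join(history) or 'None'}\n"
        "Please generate the next action according to the "
        "<screenshot> and <instruction>."
    )
    # Textual tokens are appended after the image tokens into one sequence.
    return tokenizer(prompt, return_tensors="pt")
```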
5.2 Agent Training
To achieve COUT reasoning, we divide each episode into the following embodied and personalized stages to train SmartAgent successively. Specifically, the SmartAgent backbone LVLM functions in two roles: a Perceiver, trained in our environment to predict actions, and a Reasoner, which utilizes the original LVLM to generate thoughts.

Embodied Stage. This stage aims to complete ambiguous instructions and find the item pool. The agent takes only GUI action and item pool screenshots as visual input. As stated in Sec. 5.1, the multimodal assets are first fed into the Perceiver to predict the specific embodied actions, referred to as Thought #1 in COUT. Subsequently, the Reasoner infers step-wise action thoughts and summarizes them into an underlying requirement, i.e., Thought #2. Thought #2 indicates the user's intention explicitly reflected in this stage. For instance, as illustrated in Fig. 1, for a user inquiring about music, Thought #2 may include specific constraints not present in the original instruction, such as "Justin Bieber". As a result, the underlying requirement serves as a key intermediate output, offering a clearer representation of user intentions for the subsequent personalized stage.
Personalized Stage. With the fine-grained underlying requirement from Thought #2, this stage focuses on making personalized recommendations. The same Perceiver model takes item screenshots as visual input and determines whether a recommendation is warranted, responding with a textual "Yes" or "No", which is designated as Thought #3.
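The two-stage schedule can be summarized by the following pseudocode-style sketch; `fit_step` and `summarize` are placeholder names for the Perceiver and Reasoner roles described above, not functions from the released training script.

```python
# Illustrative two-stage training schedule (placeholder interfaces).
def train_smartagent(perceiver, reasoner, episodes):
    # Embodied stage: supervise Thought #1 on GUI and item-pool steps.
    for ep in episodes:
        for step in ep.gui_and_item_pool_steps:
            perceiver.fit_step(image=step.screenshot,
                               context=step.context,
                               target=step.ground_truth_action)   # Thought #1
        # The Reasoner (original LVLM) summarizes the trace into Thought #2.
        ep.underlying = reasoner.summarize(ep.action_trace)

    # Personalized stage: supervise Thought #3 on item-detail (RS) steps.
    for ep in episodes:
        for step in ep.rs_steps:
            perceiver.fit_step(image=step.screenshot,
                               context=ep.underlying,
                               target=step.recommendation_label)  # "Yes"/"No"
    return perceiver
```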
5.3 Implementation Details
We choose Qwen-VL[1] as our backbone LVLM, which encodes visual inputs at a high resolution of 448×448. The training of SmartAgent starts from continual pre-training on the SeeClick base model[3] for basic GUI grounding abilities. We train SmartAgent for around 30 epochs. During training, we apply LoRA[19] to fine-tune both the visual encoder and the LLM. We utilize AdamW as the optimizer, starting with a learning rate of 3e-5 and a global batch size of 14. All training is conducted on two NVIDIA A100 GPUs. More details can be found in the Appendix.
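For reference, a minimal LoRA + AdamW setup consistent with the hyper-parameters above might look as follows, using the Hugging Face peft library; the LoRA rank, alpha, and target modules are assumptions not specified in the paper, and fine-tuning the visual encoder would additionally require listing its projection layers.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Sketch of a LoRA + AdamW setup (lr 3e-5 as above); rank, alpha,
# and target modules below are assumptions, not the paper's settings.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn", "attn.c_proj"],   # assumed attention projections
)
model = get_peft_model(model, lora_cfg)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-5)
```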
6 Evaluation
We demonstrate SmartAgent's capabilities through comprehensive evaluations of overall abilities in Sec. 6.1 and a full spectrum of sub-tasks encompassing GUI Grounding in Sec. 6.2, Autonomous GUI Operation in Sec. 6.3, Explicit & Implicit Personalized Reasoning in Sec. 6.4, and Zero-shot Reasoning in Sec. 6.5. We also report more insights in Sec. 6.6.
Metrics. Following most of the settings in[3], we compute the cross-entropy loss for Thought #1 and Thought #3 against their ground-truth actions, and semantic similarity for Thought #2 against the ground-truth underlying requirement. A bounding box may be included in the ground-truth action to verify whether a click action hits the target. We therefore evaluate SmartAgent using the metrics below, in terms of embodied action metrics (following Mind2Web[6]) for Thought #1, and personalized preference metrics for Thought #2 and Thought #3 (see the sketch after this list):
- •
Element Accuracy (Ele.Acc): the accuracy of the predicted coordinate against the ground truth for click and type actions.
- •
Step Success Rate (SSR): the rate of steps in which both the action type and the value are correctly predicted.
- •
Explicit Preference Accuracy (Exp.Acc): the semantic similarity between the predicted underlying requirement and the ground truth.
- •
Implicit Preference Accuracy (Imp.Acc): the accuracy of the predicted item recommendations against the ground truth.
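The following sketch illustrates how each metric could be computed per step; the semantic-similarity backbone for Exp.Acc is an assumption, since the paper only states that semantic similarity is used.

```python
# Minimal metric sketches (coordinates and boxes assumed normalized to [0, 1]).
def element_accuracy(pred_xy, gt_bbox):
    # Ele.Acc: predicted click point falls inside the ground-truth box.
    x, y = pred_xy
    x0, y0, x1, y1 = gt_bbox
    return float(x0 <= x <= x1 and y0 <= y <= y1)

def step_success(pred_type, pred_value, gt_type, gt_value):
    # SSR: both the action type and its value must match.
    return float(pred_type == gt_type and pred_value == gt_value)

def explicit_accuracy(pred_text, gt_text, sim_model):
    # Exp.Acc: semantic similarity; `sim_model` is a hypothetical
    # sentence-embedding scorer returning a value in [0, 1].
    return sim_model.similarity(pred_text, gt_text)

def implicit_accuracy(pred_labels, gt_labels):
    # Imp.Acc: fraction of "Yes"/"No" item decisions matching the ground truth.
    hits = sum(p == g for p, g in zip(pred_labels, gt_labels))
    return hits / len(gt_labels)
```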
Table 2: Overall results on SmartSpot.

| Methods | Ele.Acc | SSR | Exp.Acc | Imp.Acc |
|---|---|---|---|---|
| Qwen2-VL | 0.04 | 0.03 | / | / |
| Qwen-VL | 0.61 | 0.64 | / | / |
| SeeClick | 0.61 | 0.40 | / | / |
| SmartAgent | 0.64 | 0.50 | 0.71 | 0.24 |
6.1 Embodied Personalized Reasoning
We first investigate the comprehensive capabilities of SmartAgent in handling embodied tasks and personalized reasoning, primarily using SmartSpot for validation. Specifically, we categorize the tasks in SmartSpot into simpler single-channel tasks and more complicated scenarios that mimic real multi-channel interactions. We exclude the MEDICINE channel to evaluate zero-shot ability later. We select SeeClick[3] (the current SOTA GUI agent based on pure visual observation), along with Qwen-VL[1] and Qwen2-VL[38], as our baselines.
Results & analysis. Tab. 2 shows SmartAgent's average performance across all channels. It achieves the best result on the Ele.Acc metric and ranks second on SSR, indicating strong foundational embodied perception ability. SmartAgent infers 71% of users' underlying requirements, which in turn leads to a 24% accuracy in item recommendations. More importantly, Fig. 4 highlights that SmartAgent consistently outperforms the baselines in all multi-channel scenarios, whereas specialized GUI agents and general LLMs show slightly weaker embodied outputs. Overall, these evaluations confirm SmartAgent's comprehensive capabilities on both embodied and personalized tasks. Qwen2-VL produced unexpectedly poor results and is not discussed further.
6.2 GUI Grounding

An important consideration is whether training with personalized capabilities leads to catastrophic forgetting of pre-trained embodied abilities. In this section, we evaluate SmartAgent on the renowned GUI grounding benchmark ScreenSpot to assess its foundational perception of raw GUI data. The comparative baselines are divided into two main categories: GUI specialist models and general LLMs.
Results & analysis. As shown in Appendix Tab. 5, SmartAgent achieves second-best results on most metrics and even secures the top position on Desktop Icon/Widget. This indicates that training with personalized capabilities not only preserves foundational embodied abilities but can also enhance proficiency in operations that involve user intent, which is the primary achievement of the COUT paradigm.
6.3 Autonomous GUI Operation
Automated execution of human instructions is a foundational capability of embodied agents. In this section, we validate SmartAgent’s basic abilities to handle GUI action episodes autonomously using the classic GUI agent benchmark Mind2Web.
Results & analysis. We transfer the SmartAgent trained on SmartSpot to Mind2Web. As shown in Appendix Tab. 6, it generally achieves second place among the pure vision-based baselines across the three downstream scenarios. This indicates that SmartAgent possesses strong embodied generalization capabilities, enabling seamless integration into various downstream tasks.
6.4 Explicit & Implicit Personalized Reasoning
In this section, we present a case study to demonstrate SmartAgent’s performance in explicit user underlying reasoning and implicit item recommendation tasks. As shown in Fig.5, SmartAgent correctly predicted the action of clicking on the economy class option in the visual observation. This prediction is reflected in the summarization of the user’s underlying requirements. Leveraging this textual representation of explicit needs, SmartAgent made recommendations for the last two flight options, effectively addressing the user’s implicit requirements.
6.5 Zero-shot Reasoning
Zero-shot perception is the ultimate goal for embodied agents, enabling them to truly learn from interactions with their environment. In this section, we use the MEDICINE channel in SmartSpot to evaluate SmartAgent's zero-shot performance in unseen scenarios.
Results & analysis. As shown in Tab. 3, SmartAgent surprisingly exceeds, on the Exp.Acc metric, its average performance achieved through fine-tuning on SmartSpot. This demonstrates SmartAgent's robustness in interpreting users' explicit intentions and indicates a preliminary zero-shot reasoning capability.
Table 3: Zero-shot results on the MEDICINE channel.

| Type | Ele.Acc | SSR | Exp.Acc | Imp.Acc |
|---|---|---|---|---|
| Zero-shot | 0.04 | 0.08 | 0.77 | 0.14 |
| Full-stage | 0.64 | 0.50 | 0.71 | 0.24 |
6.6 More Insights into COUT
In this section, we provide deeper insights into COUT reasoning by comparing the two-stage training setting with an end-to-end training setting (without underlying thought generation). As shown in Tab. 4, SmartAgent with end-to-end training exhibits slightly better embodied performance, but is weaker at perceiving users' underlying requirements. We therefore argue that balancing embodied perception of changing environments with reliable personal service is crucial for future COUT research.
Table 4: Two-stage vs. end-to-end training on SmartSpot.

| Training | Ele.Acc | SSR | Exp.Acc | Imp.Acc |
|---|---|---|---|---|
| Two-stage | 0.64 | 0.50 | 0.71 | 0.24 |
| End-to-End | 0.67 | 0.53 | 0.66 | 0.31 |
7 Conclusion and Future Work
In this paper, we introduce a novel embodied reasoning paradigm, COUT, which for the first time defines an embodied personalized task. We establish a clear definition and the components of the COUT paradigm and analyze its challenges. To address these challenges, we propose SmartAgent, which instantiates COUT through a two-stage training process from essential GUI reasoning to high-level user thought reasoning. To benchmark this progress, we create SmartSpot, the first embodied AI benchmark featuring explicit personalization evaluations. Results on SmartSpot demonstrate the effectiveness and proficiency of SmartAgent on full-stage embodied personalized reasoning tasks. Furthermore, SmartAgent showcases the key capability of zero-shot embodied reasoning, highlighting its potential for efficient adaptation to new scenarios.
References
- Bai etal. [2023]Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023.
- Barreiros etal. [2022]JoseA Barreiros, Artemis Xu, Sofya Pugach, Narahari Iyengar, Graeme Troxell, Alexander Cornwell, Samantha Hong, Bart Selman, and RobertF Shepherd.Haptic perception using optoelectronic robotic flesh for embodied artificially intelligent agents.Science Robotics, 7(67):eabi6745, 2022.
- Cheng etal. [2024a]Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu.Seeclick: Harnessing gui grounding for advanced visual gui agents.arXiv preprint arXiv:2401.10935, 2024a.
- Cheng etal. [2024b]Yu Cheng, Yunzhu Pan, Jiaqi Zhang, Yongxin Ni, Aixin Sun, and Fajie Yuan.An image dataset for benchmarking recommender systems with raw pixels.In Proceedings of the 2024 SIAM International Conference on Data Mining (SDM), pages 418–426. SIAM, 2024b.
- Deng etal. [2024a]Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, etal.Mobile-bench: An evaluation benchmark for llm-based mobile agents.arXiv preprint arXiv:2407.00993, 2024a.
- Deng etal. [2024b]Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su.Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36, 2024b.
- Driess etal. [2023]Danny Driess, Fei Xia, MehdiSM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, etal.Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023.
- Drouin etal. [2024]Alexandre Drouin, Maxime Gasse, Massimo Caccia, IssamH Laradji, Manuel DelVerme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, etal.Workarena: How capable are web agents at solving common knowledge work tasks?arXiv preprint arXiv:2403.07718, 2024.
- Duan etal. [2022]Jiafei Duan, Samson Yu, HuiLi Tan, Hongyuan Zhu, and Cheston Tan.A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022.
- Duéñez-Guzmán etal. [2023]EdgarA Duéñez-Guzmán, Suzanne Sadedin, JaneX Wang, KevinR McKee, and JoelZ Leibo.A social path to human-like artificial intelligence.Nature Machine Intelligence, 5(11):1181–1188, 2023.
- Fu etal. [2024]Junchen Fu, Fajie Yuan, Yu Song, Zheng Yuan, Mingyue Cheng, Shenghui Cheng, Jiaqi Zhang, Jie Wang, and Yunzhu Pan.Exploring adapter-based transfer learning for recommender systems: Empirical studies and practical insights.In Proceedings of the 17th ACM International Conference on Web Search and Data Mining. ACM, 2024.
- Gao etal. [2024a]Chen Gao, Fengli Xu, Xu Chen, Xiang Wang, Xiangnan He, and Yong Li.Simulating human society with large language model agents: City, social media, and economic system.In Companion Proceedings of the ACM on Web Conference 2024, pages 1290–1293, 2024a.
- Gao etal. [2024b]Chen Gao, Baining Zhao, Weichen Zhang, Jinzhu Mao, Jun Zhang, Zhiheng Zheng, Fanhang Man, Jianjie Fang, Zile Zhou, Jinqiang Cui, etal.Embodiedcity: A benchmark platform for embodied agent in real-world city environment.arXiv preprint arXiv:2410.09604, 2024b.
- Geng etal. [2022]Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang.Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5).In Proceedings of the 16th ACM Conference on Recommender Systems, pages 299–315, 2022.
- Guo etal. [2017]Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He.Deepfm: a factorization-machine based neural network for ctr prediction.arXiv preprint arXiv:1703.04247, 2017.
- He etal. [2024]Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu.Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919, 2024.
- Hong etal. [2024]Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, etal.Cogagent: A visual language model for gui agents.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024.
- Hou etal. [2024]Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley.Bridging language and items for retrieval and recommendation.arXiv preprint arXiv:2403.03952, 2024.
- Hu etal. [2021]EdwardJ Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, etal.Lora: Low-rank adaptation of large language models.In International Conference on Learning Representations, 2021.
- Huang etal. [2023]Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang.An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023.
- Kang and McAuley [2018]Wang-Cheng Kang and Julian McAuley.Self-attentive sequential recommendation.In 2018 IEEE international conference on data mining (ICDM), pages 197–206. IEEE, 2018.
- Kim etal. [2024]Geunwoo Kim, Pierre Baldi, and Stephen McAleer.Language models can solve computer tasks.Advances in Neural Information Processing Systems, 36, 2024.
- Kim etal. [2022]Hyounghun Kim, Aishwarya Padmakumar, Di Jin, Mohit Bansal, and Dilek Hakkani-Tur.On the limits of evaluating embodied agent model generalization using validation sets.arXiv preprint arXiv:2205.09249, 2022.
- Koren etal. [2009]Yehuda Koren, Robert Bell, and Chris Volinsky.Matrix factorization techniques for recommender systems.Computer, 42(8):30–37, 2009.
- Li etal. [2023]Ruyu Li, Wenhao Deng, Yu Cheng, Zheng Yuan, Jiaqi Zhang, and Fajie Yuan.Exploring the upper limits of text-based collaborative filtering using large language models: Discoveries and insights.arXiv preprint arXiv:2305.11700, 2023.
- Li etal. [2020]Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge.Mapping natural language instructions to mobile ui action sequences.arXiv preprint arXiv:2005.03776, 2020.
- Ma etal. [2024]Xinbei Ma, Zhuosheng Zhang, and Hai Zhao.Coco-agent: A comprehensive cognitive mllm agent for smartphone gui automation.arXiv preprint arXiv:2402.11941v3, 2024.
- Mialon etal. [2023]Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom.Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023.
- Moskalenko etal. [2020]Oleksii Moskalenko, Denis Parra, and Diego Saez-Trumper.Scalable recommendation of wikipedia articles to editors using representation learning.arXiv preprint arXiv:2009.11771, 2020.
- Rawles etal. [2024a]Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, etal.Androidworld: A dynamic benchmarking environment for autonomous agents.arXiv preprint arXiv:2405.14573, 2024a.
- Rawles etal. [2024b]Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap.Androidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36, 2024b.
- Rendle etal. [2012]Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme.Bpr: Bayesian personalized ranking from implicit feedback.arXiv preprint arXiv:1205.2618, 2012.
- Shi etal. [2017]Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang.World of bits: An open-domain platform for web-based agents.In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017.
- Srivastava etal. [2022]Sanjana Srivastava, Chengshu Li, Michael Lingelbach, Roberto Martín-Martín, Fei Xia, KentElliott Vainio, Zheng Lian, Cem Gokmen, Shyamal Buch, Karen Liu, etal.Behavior: Benchmark for everyday household activities in virtual, interactive, and ecological environments.In Conference on robot learning, pages 477–490. PMLR, 2022.
- Tan etal. [2024]Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, etal.Towards general computer control: A multimodal agent for red dead redemption ii as a case study.arXiv preprint arXiv:2403.03186, 2024.
- Wang etal. [2023]Bryan Wang, Gang Li, and Yang Li.Enabling conversational interaction with mobile ui using large language models.In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2023.
- Wang etal. [2024a]Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang.Mobile-agent: Autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158, 2024a.
- Wang etal. [2024b]Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, etal.Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024b.
- Wei etal. [2024]Wei Wei, Xubin Ren, Jiabin Tang, Qinyong Wang, Lixin Su, Suqi Cheng, Junfeng Wang, Dawei Yin, and Chao Huang.Llmrec: Large language models with graph augmentation for recommendation.In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 806–815, 2024.
- Wu etal. [2024]Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, etal.A survey on large language models for recommendation.World Wide Web, 27(5):60, 2024.
- Xie etal. [2024]Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, TohJing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, etal.Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.arXiv preprint arXiv:2404.07972, 2024.
- Xu etal. [2023]Fengli Xu, Jun Zhang, Chen Gao, Jie Feng, and Yong Li.Urban generative intelligence (ugi): A foundational platform for agents in embodied city environment.arXiv preprint arXiv:2312.11813, 2023.
- Yan etal. [2023]An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, etal.Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation.arXiv preprint arXiv:2311.07562, 2023.
- Yang etal. [2025]Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Haoran Tan, Chencheng Jiang, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, etal.Octopus: Embodied vision-language programmer from environmental feedback.In European Conference on Computer Vision, pages 20–38. Springer, 2025.
- Yang etal. [2023]Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu.Appagent: Multimodal agents as smartphone users.arXiv preprint arXiv:2312.13771, 2023.
- Yin etal. [2024]Hongzhi Yin, Liang Qu, Tong Chen, Wei Yuan, Ruiqi Zheng, Jing Long, Xin Xia, Yuhui Shi, and Chengqi Zhang.On-device recommender systems: A comprehensive survey.arXiv preprint arXiv:2401.11441, 2024.
- Yuan etal. [2019]Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, JoemonM Jose, and Xiangnan He.A simple convolutional generative network for next item recommendation.In Proceedings of the twelfth ACM international conference on web search and data mining, pages 582–590, 2019.
- Yuan etal. [2023a]Wei Yuan, Hongzhi Yin, Fangzhao Wu, Shijie Zhang, Tieke He, and Hao Wang.Federated unlearning for on-device recommendation.In Proceedings of the sixteenth ACM international conference on web search and data mining, pages 393–401, 2023a.
- Yuan etal. [2023b]Zheng Yuan, Fajie Yuan, Yu Song, Youhua Li, Junchen Fu, Fei Yang, Yunzhu Pan, and Yongxin Ni.Where to go next for recommender systems? id-vs. modality-based recommender models revisited.In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2639–2649, 2023b.
- Zhan and Zhang [2023]Zhuosheng Zhan and Aston Zhang.You only look at screens: Multimodal chain-of-action agents.arXiv preprint arXiv:2309.11436, 2023.
- Zhang etal. [2024a]Jiaqi Zhang, Yu Cheng, Yongxin Ni, Yunzhu Pan, Zheng Yuan, Junchen Fu, Youhua Li, Jie Wang, and Fajie Yuan.Ninerec: A benchmark dataset suite for evaluating transferable recommendation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024a.
- Zhang etal. [2024b]Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang.Android in the zoo: Chain-of-action-thought for gui agents.arXiv preprint arXiv:2403.02713, 2024b.
- Zheng etal. [2023]Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An.Synapse: Trajectory-as-exemplar prompting with memory for computer control.In The Twelfth International Conference on Learning Representations, 2023.
Supplementary Material
Table 5: GUI grounding accuracy on ScreenSpot.

| LVLMs | GUI Specific | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
|---|---|---|---|---|---|---|---|---|
| GPT-4V | ✗ | 22.6% | 24.5% | 20.2% | 11.8% | 9.2% | 8.8% | 16.2% |
| MiniGPT-v2 | ✗ | 8.4% | 6.6% | 6.2% | 2.9% | 6.5% | 3.4% | 5.7% |
| Qwen-VL | ✗ | 9.5% | 4.8% | 5.7% | 5.0% | 3.5% | 2.4% | 5.2% |
| Fuyu | ✓ | 41.0% | 1.3% | 33.0% | 3.6% | 33.9% | 4.4% | 19.5% |
| CogAgent | ✓ | 67.0% | 24.0% | 74.2% | 20.0% | 70.4% | 28.6% | 47.4% |
| SeeClick | ✓ | 78.0% | 52.0% | 72.2% | 30.0% | 55.7% | 32.5% | 53.4% |
| SmartAgent | ✓ | 77.3% | 49.8% | 72.7% | 32.1% | 57.4% | 24.8% | 52.4% |
Table 6: Autonomous GUI operation results on Mind2Web, reporting Ele.Acc / Op.F1 / SSR (%) for the Cross-Task (CT), Cross-Website (CW), and Cross-Domain (CD) splits.

| Methods | Pure Visual | CT Ele.Acc | CT Op.F1 | CT SSR | CW Ele.Acc | CW Op.F1 | CW SSR | CD Ele.Acc | CD Op.F1 | CD SSR |
|---|---|---|---|---|---|---|---|---|---|---|
| MindAct (gen) | ✗ | 20.2 | 52.0 | 17.5 | 13.9 | 44.7 | 11.0 | 14.2 | 44.7 | 11.9 |
| MindAct | ✗ | 55.1 | 75.7 | 52.0 | 42.0 | 65.2 | 38.9 | 42.1 | 66.5 | 39.6 |
| GPT-3.5-Turbo | ✗ | 20.3 | 56.6 | 17.4 | 19.3 | 48.8 | 16.2 | 21.6 | 52.8 | 18.6 |
| GPT-4 | ✗ | 41.6 | 60.6 | 36.2 | 35.8 | 51.1 | 30.1 | 37.1 | 46.5 | 26.4 |
| Qwen-VL | ✓ | 15.9 | 86.7 | 13.3 | 13.2 | 83.5 | 9.2 | 14.1 | 84.3 | 12.0 |
| SeeClick | ✓ | 28.3 | 87.0 | 25.5 | 21.4 | 80.6 | 16.4 | 23.2 | 84.8 | 20.8 |
| SmartAgent | ✓ | 24.5 | 80.2 | 20.6 | 18.9 | 74.9 | 15.0 | 20.3 | 77.9 | 17.1 |
8 Action Space of SmartSpot
SmartSpot offers common GUI actions for daily usage. Following AiTW[31], we assign an action_type_id to each action for model prediction. The full action space is listed below, with a compact summary of the id mapping after the list:
- •
click(x,y): 4. A click action at (x,y), where each value lies in [0,1] and indicates the position ratio relative to the image's width and height.
- •
type("text"):3. An action that types text at (x,y).
- •
scroll(direction): Scroll actions for the screen, scroll up/down/left/right are assigned with 1, 0, 8, and 9.
- •
complete: 10. An action indicating that the item pool has been reached.
- •
recommendation("Yes/No"):2. An action of recommending an item, as either “Yes” or “No”.
- •
BACK:5. An action for returning to the previous step.
- •
HOME:6. An action for returning to the homepage.
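For convenience, the id assignments above can be collected into a single enum; this is a summary sketch for readers, not part of the released code.

```python
from enum import IntEnum

# Action-type ids as listed above (ground-truth labels in SmartSpot).
class ActionType(IntEnum):
    SCROLL_DOWN = 0
    SCROLL_UP = 1
    RECOMMENDATION = 2   # "Yes" / "No" for an item
    TYPE = 3             # type("text")
    CLICK = 4            # click(x, y), x and y normalized to [0, 1]
    BACK = 5
    HOME = 6
    SCROLL_LEFT = 8
    SCROLL_RIGHT = 9
    COMPLETE = 10        # the item pool has been reached
```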
9 More Training Details
Following the approach of[3], we intuitively present numerical coordinates as natural language, without additional tokenization or pre-/post-processing. We train SmartAgent for 15 epochs in each of the embodied and personalized stages. Embodied baselines are trained for 15 rounds. Results on the ScreenSpot and Mind2Web benchmarks are obtained via direct testing and 10 epochs of training, respectively, as in [3].
10 More Results
This section presents SmartAgent's performance on basic embodied sub-tasks. The results of GUI grounding on the ScreenSpot benchmark are presented in Tab. 5, while the results of autonomous GUI operation on the Mind2Web benchmark are shown in Tab. 6.