Agent S is an open-source framework for autonomous computer use. Through its Agent-Computer Interface (ACI), it lets users hand off complex tasks and bridges the gap between human intent and computer execution. This guide walks through the installation, configuration, usage, and underlying architecture of Agent S, covering both Agent S1 and Agent S2 and the differences between them. Feedback is crucial to our development, and we encourage users to share their experiences and suggestions.
Understanding Agent S: A Paradigm Shift in Computer Interaction
Traditional computer interaction often requires users to navigate complex interfaces, execute multiple commands, and struggle with the limitations of direct manipulation. Agent S offers a radical alternative. By employing advanced AI techniques, Agent S acts as an intelligent intermediary, learning from past interactions to perform tasks efficiently and autonomously. Imagine issuing high-level commands like "Prepare a presentation using these documents" and having Agent S handle the entire process, from gathering information to formatting slides and launching the presentation software.
This is achieved through a combination of powerful language models, sophisticated grounding mechanisms, and a robust ACI. The framework's open-source nature fosters collaboration and innovation, allowing developers worldwide to contribute and expand its capabilities. Agent S is not merely an automation tool; it's a step towards a more intuitive and seamless human-computer interaction experience.
Agent S1 and Agent S2: A Comparison
While sharing the core philosophy of autonomous computer interaction, Agent S1 and Agent S2 differ in their capabilities and underlying architectures:
Agent S1 (paper: ICLR 2025): Agent S1 is the initial iteration of the framework and lays the foundation for the features found in Agent S2. Compared with its successor, it may be more limited in the complexity of tasks it can handle and in execution speed.
Agent S2: Agent S2 builds upon Agent S1 with significant improvements in performance, efficiency, and the sophistication of its ACI. It handles complex tasks better, offers a more streamlined user experience, and achieves a considerably higher success rate on benchmarks such as OSWorld (see below).
Agent S2's Success Rate (%) on the OSWorld Full Test Set Using Screenshot Input Only: (Insert table or chart here displaying the success rate data)
This data highlights the significant advancements achieved in Agent S2 compared to its predecessor.
Installation and Setup: A Step-by-Step Guide
Before delving into the functionality of Agent S, let's walk through the installation and configuration process. The steps may vary slightly depending on your operating system and chosen dependencies.
Important Considerations:
- Linux Users and Conda: If you're on a Linux machine, creating a conda environment might interfere with `pyatspi`. Currently, there's no clean solution; installation should proceed without using conda or any virtual environment.
- UI-TARS Grounding Model: Agent S2 leverages the UI-TARS grounding model (7B-DPO or 72B-DPO for optimal performance). While not strictly required, UI-TARS significantly enhances the framework's capabilities. It can be hosted locally or on Hugging Face Inference Endpoints. Alternatives like Claude can also be used.
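Before installing on Linux, it can help to confirm that the interpreter is not running inside conda or a virtual environment. The check below is a minimal sketch and not part of Agent S: `in_isolated_env` is a hypothetical helper name, and it assumes the `CONDA_PREFIX` and `VIRTUAL_ENV` variables that conda and venv set on activation, plus the standard `sys.prefix` comparison.

```python
import os
import sys


def in_isolated_env(environ=None, prefix=None, base_prefix=None):
    """Return True if the interpreter appears to run inside conda or a venv."""
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # conda sets CONDA_PREFIX and venv/virtualenv set VIRTUAL_ENV when activated.
    if environ.get("CONDA_PREFIX") or environ.get("VIRTUAL_ENV"):
        return True
    # Inside a venv, sys.prefix points at the environment, not the base install.
    return prefix != base_prefix


if __name__ == "__main__":
    if in_isolated_env():
        print("conda or virtual environment detected; on Linux this may interfere with pyatspi.")
    else:
        print("No conda or virtual environment detected.")
```

The optional arguments exist only so the check can be exercised with fake values; in normal use, call `in_isolated_env()` with no arguments.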
1. Package Installation:
Follow the appropriate instructions for your system to install the Agent S package. (Provide specific commands or links to installation guides here).
2. Setting API Keys and Environment Variables:
Agent S requires access to various Large Language Model (LLM) APIs. You can set your API keys in your `.bashrc` (Linux) or `.zshrc` (macOS) file:
```bash
export OPENAI_API_KEY="your_openai_api_key"
export OLLAMA_API_URL="http://host.docker.internal:11434"  # Adjust port as needed
export GROQ_API_KEY="your_groq_api_key"
export ANTHROPIC_API_KEY="your_anthropic_api_key"
# ... other API keys ...
```
Alternatively, set these variables within your Python script.
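As a rough illustration of the in-script route, the sketch below copies keys into `os.environ` before the agent is created. `set_api_keys` is a hypothetical helper written for this guide, and the key values are placeholders.

```python
import os


def set_api_keys(keys, environ=None):
    """Copy non-empty API keys into the environment; return the names that were set."""
    environ = os.environ if environ is None else environ
    applied = []
    for name, value in keys.items():
        if value:  # skip entries that were left empty
            environ[name] = value
            applied.append(name)
    return applied


if __name__ == "__main__":
    # Placeholder values; replace with real keys before launching the agent.
    set_api_keys({
        "OPENAI_API_KEY": "your_openai_api_key",
        "ANTHROPIC_API_KEY": "your_anthropic_api_key",
    })
```

Values set this way only last for the current Python process, which keeps keys out of shell startup files but means the call has to run before any code that reads them.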
Agent S supports several LLM providers and backends, including Azure OpenAI, Anthropic, Gemini, OpenRouter, and vLLM inference. Refer to `models.md` for a complete list and further instructions.
3. Web Knowledge Retrieval with Perplexica:
For enhanced performance, Agent S integrates with Perplexica for web knowledge retrieval. This requires setting up a Perplexica instance:
- Prerequisites: Ensure Docker Desktop is installed and running.
- Configuration: Navigate to the project directory, rename `sample.config.toml` to `config.toml`, and fill in the necessary API keys (OpenAI, Ollama, Groq, Anthropic) as described in Step 2. The `SIMILARITY_MEASURE` parameter usually defaults appropriately; only modify it if needed.
- Docker Compose: Execute `docker-compose up -d` from the directory containing `docker-compose.yaml`.
- Perplexica URL: Export the Perplexica API URL. This URL, determined by your `config.toml` file, is used for API interaction. (Example: `export PERPLEXICA_URL="http://localhost:8000"`).
- Customization: For advanced configuration, refer to the Perplexica Search API Documentation and Perplexica Repository for detailed instructions on tailoring the API to your specific needs. You may modify the URL and request parameters in `agent_s/query_perplexica.py`.
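To make the role of `PERPLEXICA_URL` concrete, here is a minimal sketch that builds (but does not send) a search request from that variable. `build_search_request` is a hypothetical helper, and the payload fields `query` and `focusMode` are assumptions about the Perplexica Search API; confirm them against the Perplexica documentation before relying on them.

```python
import json
import os
import urllib.request


def build_search_request(query, base_url=None):
    """Build (without sending) a JSON POST request for a Perplexica search."""
    base_url = base_url or os.environ.get("PERPLEXICA_URL")
    if not base_url:
        raise ValueError("PERPLEXICA_URL is not set")
    # Assumed payload shape; confirm field names in the Perplexica Search API docs.
    payload = {"query": query, "focusMode": "webSearch"}
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Once the Perplexica container is running, the returned object can be passed to `urllib.request.urlopen` to perform the actual call.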
4. Running Agent S2:
With the environment configured, you can run Agent S2. The recommended configuration utilizes Claude 3.7 with extended thinking and UI-TARS-72B-DPO. If resources are limited, UI-TARS-7B-DPO offers a viable alternative with minimal performance degradation.
Running with a Specific Model:
```bash
python -m gui_agents.s2.cli_app --model gpt-4o
```
Using a Custom Endpoint:
You can specify a custom endpoint using either Configuration 1 or Configuration 2:
Configuration 1:
```bash
python -m gui_agents.s2.cli_app --endpoint_provider openai --endpoint_url <your_custom_url>
```
Configuration 2 (takes precedence):
```bash
python -m gui_agents.s2.cli_app --endpoint_url <your_custom_url>
```
This will launch Agent S2's interactive prompt, allowing you to enter queries and interact with the agent. Consult `models.md` for the list of supported models.
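If you prefer to launch the CLI from a script, the invocations above can be assembled programmatically. This is a sketch under the assumption that only the three flags shown in this guide (`--model`, `--endpoint_provider`, `--endpoint_url`) are needed; `build_cli_command` is a hypothetical helper, not part of the package.

```python
import sys


def build_cli_command(model=None, endpoint_provider=None, endpoint_url=None):
    """Return the argv list for launching the Agent S2 CLI with the given options."""
    cmd = [sys.executable, "-m", "gui_agents.s2.cli_app"]
    if model:
        cmd += ["--model", model]
    if endpoint_provider:
        cmd += ["--endpoint_provider", endpoint_provider]
    if endpoint_url:
        cmd += ["--endpoint_url", endpoint_url]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is launched by accident.
    print(" ".join(build_cli_command(model="gpt-4o")))
```

Passing the returned list to `subprocess.run` starts the interactive prompt; building an argument list rather than a shell string avoids quoting problems with custom URLs.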
Agent S2's Architecture: A Deep Dive
Agent S2's functionality hinges on a sophisticated interplay between several key components:
1. The Main Agent (AgentS2):
The `AgentS2` class forms the core of the framework. It manages user interaction, orchestrates task execution, and leverages the grounding agent to translate high-level commands into executable code.
2. The Grounding Agent (OSWorldACI):
`OSWorldACI` plays a crucial role in translating Agent S2's actions into executable Python code. This ensures the agent's commands can directly interact with the computer's operating system and applications.
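The idea of grounding, turning an abstract action into Python source, can be illustrated with a toy example. Nothing here is taken from `OSWorldACI`: `Action` and `ToyGrounder` are hypothetical names, the action format is invented for illustration, and `pyautogui` appears only as a plausible target for the generated code, which this sketch never executes.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """A high-level action as a planner might emit it (hypothetical format)."""
    kind: str
    target: tuple = ()
    text: str = ""


class ToyGrounder:
    """Stand-in for a grounding agent: turns actions into Python source strings."""

    def to_code(self, action):
        if action.kind == "click":
            x, y = action.target
            return f"import pyautogui; pyautogui.click({x}, {y})"
        if action.kind == "type":
            return f"import pyautogui; pyautogui.write({action.text!r})"
        raise ValueError(f"unsupported action: {action.kind}")
```

Keeping the translation step separate from planning means the main agent can reason in terms of intent while the grounding layer owns the screen coordinates and the concrete calls.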
3. Inference Loop and Knowledge Base:
Agent S2 runs a continuous inference loop and updates its internal knowledge base during execution. The initial knowledge base, specific to the platform and agent version (e.g., Agent S1 or S2 for Linux), is downloaded during initialization. Knowledge bases are stored as assets in GitHub Releases. You can download the knowledge base programmatically using code like this:
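```python
# Example: download Agent S2's knowledge base for Linux from release tag v0.2.2.
# The helper's exact name and import path are not shown in this guide; check the repository before use.
download_knowledge_base("linux", "s2", "v0.2.2", "kb_data")
```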
This dynamic knowledge base allows Agent S2 to adapt to changing contexts and refine its actions over time.
4. Engine Parameters:
The framework uses distinct engine parameters (`engine_params` for the main agent and `engine_params_for_grounding` for the grounding agent). This allows for customization of the LLMs used for different aspects of the interaction, supporting Claude, GPT series models, and Hugging Face Inference Endpoints.
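As an illustration, the two parameter sets might be plain dictionaries built like this. The key names (`engine_type`, `model`, `endpoint_url`) and the provider strings are assumptions made for this sketch, and `make_engine_params` is a hypothetical helper; check `models.md` for the fields the framework actually expects.

```python
def make_engine_params(engine_type, model=None, endpoint_url=None):
    """Build an engine-parameter dict, omitting fields that were not provided."""
    params = {"engine_type": engine_type}
    if model:
        params["model"] = model
    if endpoint_url:
        params["endpoint_url"] = endpoint_url
    return params


# Main agent: a hosted LLM (model name is a placeholder).
engine_params = make_engine_params("anthropic", model="<model_name>")
# Grounding agent: a self-hosted endpoint (URL is a placeholder).
engine_params_for_grounding = make_engine_params(
    "huggingface", endpoint_url="<your_endpoint_url>"
)
```

Keeping the two dictionaries separate is what lets a large general model do the planning while a smaller specialized model handles grounding.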
5. Code Example: A Glimpse Inside cli_app.py
The inference loop's implementation can be examined in `gui_agents/s2/cli_app.py`. This file provides insights into how the agent processes user queries, interacts with the grounding agent, and updates its knowledge base. (Provide a snippet of relevant code here to illustrate the inference loop).
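The general shape of such a loop (observe, predict, execute, repeat) can be sketched independently of the real implementation. The code below is not taken from `cli_app.py`: `StubAgent`, `run_loop`, the `"DONE"` sentinel, and the step budget are all hypothetical stand-ins chosen for illustration.

```python
class StubAgent:
    """Hypothetical agent that finishes after a fixed number of steps."""

    def __init__(self, steps_needed):
        self.steps_needed = steps_needed
        self.history = []  # stands in for knowledge accumulated during the run

    def predict(self, instruction, observation):
        self.history.append(observation)
        if len(self.history) >= self.steps_needed:
            return "DONE"
        return f"action-{len(self.history)}"


def run_loop(agent, instruction, observe, execute, max_steps=15):
    """Observe, predict, execute; return the number of steps taken."""
    for step in range(1, max_steps + 1):
        action = agent.predict(instruction, observe())
        if action == "DONE":
            return step
        execute(action)
    return max_steps


if __name__ == "__main__":
    executed = []
    steps = run_loop(StubAgent(3), "open editor", lambda: "screenshot", executed.append)
    print(steps, executed)
```

A step budget like `max_steps` is a common safeguard in agent loops, since it bounds how long the agent can act if it never decides the task is finished.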
Deployment in OSWorld
For deployment within the OSWorld environment, follow the instructions provided in the OSWorld Deployment documentation.
Acknowledgements and Citation
We express our gratitude to Tianbao Xie for creating OSWorld and fostering discussions on computer usage challenges. We also acknowledge Yujia Qin and Shihao Liang for insightful discussions on UI-TARS.
If you find this codebase helpful, please cite:
(Provide citation information here)
Safety Precautions and Disclaimer
Important Warning: Agent S will directly execute Python code to control your computer. Use with extreme caution and ensure you understand the potential risks before using it. We are not responsible for any unintended consequences. Always test in a safe and controlled environment.
This detailed guide provides a comprehensive overview of Agent S, from installation to advanced usage. We encourage users to explore the framework's capabilities and contribute to its ongoing development. Remember to consult the documentation and various resources linked throughout this guide for more detailed information.