I vibe coded a web agent

Contents

1 Foreword
2 Architecture
3 How does it work
3.1 Expose the capabilities of the web agent as MCP tools
3.2 MCP tool description is prompt
3.3 Delegate the task orchestration to AI
3.4 Use DOM hierarchy to convey page context and locate the target element
4 Cost
5 Potentials
6 Summary
7 References
8 Source code

1 Foreword

In a previous article, I introduced the web agent I built with Puppeteer and Anthropic's Claude. It is still a fun project, but its cons outweigh its pros. In this article, I rebuild it with a new approach that reaches the same goal without those drawbacks, by leveraging the DOM hierarchy and MCP (Model Context Protocol).

2 Architecture

3 How does it work

Looking back at the previous solution, its most fragile parts fall into two areas:

Complex and unstable prompts

As I mentioned, the prompts are not stable: there is no guarantee that different models will respond consistently to the same prompt. It is impractical to modify the prompts every time the model is updated or changed.

Leveraging screenshot to understand the page and locate the target element

Cost aside, a very long page requires multiple SCROLL actions, and it is a big challenge for the AI to understand the whole page context from that many screenshots at once.

In the new solution, these two issues are tackled by MCP tools and the DOM hierarchy respectively. Let's take the same example and dive into the new solution.

Go to https://npmjs.com and then jump into the "Pricing" tab, tell me what the pricing is for each category in Pricing page

3.1 Expose the capabilities of the web agent as MCP tools

All the capabilities of the web agent first need to be exposed as MCP tools. This way, both users and the MCP client can easily identify what each tool is and what it does.
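As a rough illustration, here is a minimal sketch of what "a capability exposed as a tool" could look like. It uses plain TypeScript objects instead of a real MCP SDK, so `ToolDefinition`, `tools`, and `listTools` are hypothetical names, and the descriptions are made up for this example. Only the tool names CLICK and FILL come from this article.

```typescript
// Hypothetical, SDK-free shape of a tool: a name, a description the model
// reads, and a JSON-Schema-like declaration of the inputs it accepts.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: "string" | "number" | "boolean"; description: string }>;
    required: string[];
  };
}

// Assumed tool list; the real agent may declare its tools differently.
const tools: ToolDefinition[] = [
  {
    name: "CLICK",
    description: "Click the interactive element identified by `ref` in the latest page snapshot.",
    inputSchema: {
      type: "object",
      properties: {
        ref: { type: "string", description: "Element reference, e.g. s1e7" },
      },
      required: ["ref"],
    },
  },
  {
    name: "FILL",
    description: "Type `value` into the input element identified by `ref`.",
    inputSchema: {
      type: "object",
      properties: {
        ref: { type: "string", description: "Element reference, e.g. s1e10" },
        value: { type: "string", description: "Text to type into the element" },
      },
      required: ["ref", "value"],
    },
  },
];

// What a client would see when it asks which tools exist.
function listTools(): { name: string; description: string }[] {
  return tools.map(({ name, description }) => ({ name, description }));
}
```

Because each capability is a separate entry, a client can call CLICK or FILL on its own without running a whole workflow.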

3.2 MCP tool description is prompt

The MCP client uses the MCP tool description to understand what a tool can do and what inputs it requires. Through my investigation, I found that the tool description is effectively the equivalent of the prompt in the previous solution, with the following benefits:

Fine-grained control over the tool's capabilities

You don't need to go through the whole workflow just to use one specific capability of the web agent.

Much easier to maintain and debug

When you test whether a tool's description takes effect, you can concentrate on that tool alone and adjust it until it works as expected.

Stronger input validation and type checking

One cause of unstable AI responses is that the inputs are not properly validated. With MCP, each tool declares the inputs it accepts, and the MCP client checks them before they are sent to the MCP server, so invalid or wrongly typed inputs are not passed to the server by mistake.
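To make the idea concrete, here is a small hand-rolled sketch of checking a tool input against a declared schema. It is not how any particular MCP SDK does it; `InputSchema`, `validateInput`, and `fillSchema` are hypothetical names, and a real setup would normally rely on the schema library that ships with the SDK in use.

```typescript
type FieldType = "string" | "number" | "boolean";

// Simplified, assumed schema shape: field name -> expected primitive type.
interface InputSchema {
  properties: Record<string, FieldType>;
  required: string[];
}

// Returns a list of problems; an empty list means the input is acceptable.
function validateInput(schema: InputSchema, input: Record<string, unknown>): string[] {
  const errors: string[] = [];
  // Every required field must be present.
  for (const key of schema.required) {
    if (!(key in input)) errors.push(`missing required field "${key}"`);
  }
  // Every provided field must be declared and have the declared type.
  for (const key of Object.keys(input)) {
    const expected = schema.properties[key];
    const value = input[key];
    if (expected === undefined) {
      errors.push(`unexpected field "${key}"`);
    } else if (typeof value !== expected) {
      errors.push(`field "${key}" should be ${expected}, got ${typeof value}`);
    }
  }
  return errors;
}

// Assumed schema for a FILL-like tool.
const fillSchema: InputSchema = {
  properties: { ref: "string", value: "string" },
  required: ["ref", "value"],
};

const validErrors = validateInput(fillSchema, { ref: "s1e10", value: "puppeteer" });
const invalidErrors = validateInput(fillSchema, { ref: 10 });
```

Here `validErrors` is empty, while `invalidErrors` reports both the missing `value` field and the wrongly typed `ref`, so the bad call can be rejected before any browser action runs.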

3.3 Delegate the task orchestration to AI

This is an extra benefit of MCP that I was not aware of before. When you send a prompt to the MCP client, the MCP tools are discovered automatically, and the AI understands which tools are available and how to use them, as long as the AI model supports the "tools" feature. This means you don't need to worry about orchestration; the AI model takes care of it for you. This is a big improvement over the previous solution, where you had to orchestrate the workflow manually with loops and sometimes even recursion.

3.4 Use DOM hierarchy to convey page context and locate the target element

We still need to collect the interactive elements on the page through Puppeteer, but instead of highlighting them on the page, we only keep them as references that help us locate the target element later on.

The DOM hierarchy, on the other hand, replaces the screenshots to convey the page context and the elements the user can interact with. It isn't the same as the page's full HTML tree; building it requires the following considerations:

Collect all the interactive elements on the page, such as links, buttons, inputs, etc.

Sometimes an element is just a div with a role attribute that indicates its purpose; we also collect these elements by following the up-to-date web specification. Conversely, if the website's HTML isn't semantic enough, we will not be able to collect the interactive elements properly.
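The collection step can be sketched roughly as follows. To stay self-contained, the example walks a simplified `PageNode` tree instead of a real Puppeteer page; `PageNode`, `isInteractive`, `collectInteractive`, and the tag and role lists are assumptions for illustration, not the lists the real agent uses.

```typescript
// Simplified stand-in for a DOM element (hypothetical shape).
interface PageNode {
  tag: string;
  role?: string;
  label?: string;
  children: PageNode[];
}

// Assumed subsets: natively interactive tags and common ARIA widget roles.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);
const INTERACTIVE_ROLES = new Set(["button", "link", "menuitem", "checkbox", "radio", "tab", "textbox"]);

// An element counts as interactive by its tag or, for a generic div/span, by its role.
function isInteractive(node: PageNode): boolean {
  return INTERACTIVE_TAGS.has(node.tag) || (node.role !== undefined && INTERACTIVE_ROLES.has(node.role));
}

// Depth-first walk that returns interactive nodes in document order.
function collectInteractive(root: PageNode): PageNode[] {
  const found: PageNode[] = [];
  const visit = (node: PageNode): void => {
    if (isInteractive(node)) found.push(node);
    node.children.forEach((child) => visit(child));
  };
  visit(root);
  return found;
}

const sampleNav: PageNode = {
  tag: "nav",
  children: [
    { tag: "div", role: "menuitem", label: "Pricing", children: [] }, // found via role
    { tag: "div", label: "npm", children: [] },                       // plain div, skipped
    { tag: "a", label: "Sign In", children: [] },                     // found via tag
  ],
};

const interactiveLabels = collectInteractive(sampleNav).map((n) => n.label);
```

The plain div without a role is skipped, which is exactly the failure mode on pages whose HTML is not semantic: a clickable div with no role would be missed too.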

Trim the unnecessary nodes to keep the hierarchy concise

Imagine a link that wraps a complex subtree, with its label on a leaf node. The algorithm needs to understand the relationship between the parent node and the leaf node, and trim the unnecessary nodes between them. Otherwise, we would have to capture the whole DOM hierarchy, which is too large for the AI model to process and has the same overhead problem as the screenshot approach.
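One possible way to trim such a tree is sketched below. This is an assumed algorithm, not necessarily the one the agent implements: an interactive node absorbs all text in its subtree as its label, and wrapper nodes without text of their own are removed so their children move up a level. `RawNode`, `TrimmedNode`, `textOf`, and `trim` are hypothetical names.

```typescript
interface RawNode {
  tag: string;
  interactive: boolean;
  text?: string;
  children: RawNode[];
}

interface TrimmedNode {
  tag: string;
  interactive: boolean;
  label: string;
  children: TrimmedNode[];
}

// Concatenate all text found in a subtree, in document order.
function textOf(node: RawNode): string {
  const parts = [node.text ?? "", ...node.children.map(textOf)];
  return parts.filter((p) => p.length > 0).join(" ");
}

// Returns the trimmed replacement(s) for a node:
// - interactive node: kept, labelled with its subtree text, inner wrappers dropped
//   (simplification: interactive elements nested inside it are dropped as well)
// - node with its own text: kept, with trimmed children
// - bare wrapper: removed, its trimmed children are hoisted to the parent
function trim(node: RawNode): TrimmedNode[] {
  if (node.interactive) {
    return [{ tag: node.tag, interactive: true, label: textOf(node), children: [] }];
  }
  const children = node.children.reduce<TrimmedNode[]>((acc, child) => acc.concat(trim(child)), []);
  if (node.text !== undefined && node.text.length > 0) {
    return [{ tag: node.tag, interactive: false, label: node.text, children }];
  }
  return children;
}

// A link whose label sits two wrappers deep, itself wrapped in a bare div.
const rawLink: RawNode = {
  tag: "div",
  interactive: false,
  children: [
    {
      tag: "a",
      interactive: true,
      children: [
        {
          tag: "div",
          interactive: false,
          children: [{ tag: "span", interactive: false, text: "Learn about Pro", children: [] }],
        },
      ],
    },
  ],
};

const trimmedLink = trim(rawLink);
```

Four raw nodes collapse into a single `a` node labelled "Learn about Pro", which is the kind of reduction that keeps the hierarchy small enough to send to the model.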

Display interactive elements with basic information on the trimmed DOM

Once the interactive elements are collected, basic information such as the element's reference id, status (disabled or not), and label is displayed in the hierarchy so the AI can locate the target element. A well-formatted DOM hierarchy is crucial for the AI to understand the page context.

The following is an example of the DOM hierarchy that will be sent to the AI model:

- Page Snapshot
- document [ref=s1e1]:
  - link "skip to content" [ref=s1e2]
    - /url: #main
  - link "skip to package search" [ref=s1e3]
    - /url: #search
  - link "skip to sign in" [ref=s1e4]
    - /url: #signin
  - span "❤"
  - list "Nav Menu"
    - listitem
      - menuitem "Pro" [ref=s1e5]
        - /url: /products/pro
    - listitem
      - menuitem "Teams" [ref=s1e6]
        - /url: /products/teams
    - listitem
      - menuitem "Pricing" [ref=s1e7]
        - /url: /products
    - listitem
      - menuitem "Documentation" [ref=s1e8]
        - /url: https://docs.npmjs.com
  - span "npm"
  - link "Npm" [ref=s1e9]
    - /url: /
  - form
    - input "Search packages" [ref=s1e10]
    - button "Search" [ref=s1e11]
  - link "Sign Up" [ref=s1e12]
    - /url: /signup
  - link "Sign In" [ref=s1e13]
    - /url: /login
  - heading "Build amazing things" [level=1]:
  - div "We're GitHub, the company behind the npm Registry and npm CLI. We offer those to the community for free, but our day job is building and selling useful tools for developers like you."
  - heading "Take your JavaScript development up a notch" [level=2]:
  - div "Get started today for free, or step up to npm Pro to enjoy a premium JavaScript development experience, with features like private packages."
  - link "Sign up for free" [ref=s1e14]
    - /url: /signup
  - link "Learn about Pro" [ref=s1e15]
    - /url: /products/pro
  - img
  - heading "Bring the best of open source to you, your team, and your company" [level=2]:
  - div "Relied upon by more than 17 million developers worldwide, npm is committed to making JavaScript development elegant, productive, and safe. The free npm Registry has become the center of JavaScript code sharing, and with more than two million packages, the largest software registry in the world. Our other tools and services take the Registry, and the work you do around it, to the next level."
  - heading "Footer" [level=2]:
  - link "Visit npm GitHub page" [ref=s1e16]
    - /url: https://github.com/npm
  - link "GitHub" [ref=s1e17]
    - /url: https://github.com
  - heading "Support" [level=3]:
  - list
    - listitem
      - link "Help" [ref=s1e18]
        - /url: https://docs.npmjs.com
    - listitem
      - link "Advisories" [ref=s1e19]
        - /url: https://github.com/advisories
    - listitem
      - link "Status" [ref=s1e20]
        - /url: http://status.npmjs.org/
    - listitem
      - link "Contact npm" [ref=s1e21]
        - /url: /support
  - heading "Company" [level=3]:
  - list
    - listitem
      - link "About" [ref=s1e22]
        - /url: /about
    - listitem
      - link "Blog" [ref=s1e23]
        - /url: https://github.blog/tag/npm/
    - listitem
      - link "Press" [ref=s1e24]
        - /url: /press
  - heading "Terms & Policies" [level=3]:
  - list
    - listitem
      - link "Policies" [ref=s1e25]
        - /url: /policies/
    - listitem
      - link "Terms of Use" [ref=s1e26]
        - /url: /policies/terms
    - listitem
      - link "Code of Conduct" [ref=s1e27]
        - /url: /policies/conduct
    - listitem
      - link "Privacy" [ref=s1e28]
        - /url: /policies/privacy
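A snapshot in roughly this format could be rendered from a trimmed tree as sketched below. The sketch is simplified: it ignores details such as `[level=...]` on headings, and it assumes that a ref like `s1e7` means "snapshot 1, element 7", which is my reading of the format rather than a documented rule. `SnapshotNode` and `renderSnapshot` are hypothetical names.

```typescript
interface SnapshotNode {
  role: string;
  label?: string;
  url?: string;
  interactive: boolean;
  children: SnapshotNode[];
}

// Renders the tree as an indented list. Only interactive nodes get a [ref=...]
// marker, and the returned map lets a later CLICK or FILL resolve a ref to its node.
function renderSnapshot(
  root: SnapshotNode,
  snapshotId = 1,
): { text: string; refs: Map<string, SnapshotNode> } {
  const lines: string[] = [];
  const refs = new Map<string, SnapshotNode>();
  let counter = 0;

  const visit = (node: SnapshotNode, depth: number): void => {
    const indent = "  ".repeat(depth);
    let line = `${indent}- ${node.role}`;
    if (node.label !== undefined) line += ` "${node.label}"`;
    if (node.interactive) {
      counter += 1;
      const ref = `s${snapshotId}e${counter}`; // assumed ref naming scheme
      refs.set(ref, node);
      line += ` [ref=${ref}]`;
    }
    lines.push(line);
    if (node.url !== undefined) lines.push(`${indent}  - /url: ${node.url}`);
    node.children.forEach((child) => visit(child, depth + 1));
  };

  visit(root, 0);
  return { text: lines.join("\n"), refs };
}

const navMenu: SnapshotNode = {
  role: "list",
  label: "Nav Menu",
  interactive: false,
  children: [
    {
      role: "listitem",
      interactive: false,
      children: [
        { role: "menuitem", label: "Pricing", url: "/products", interactive: true, children: [] },
      ],
    },
  ],
};

const navSnapshot = renderSnapshot(navMenu);
```

Keeping the `refs` map on the server side means the model only has to send back a short ref string, and the server can map it to the actual element without re-querying by label.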

So the whole workflow looks like this:

1. The user sends a prompt to the MCP client.
2. The MCP client discovers the MCP tools and picks the relevant ones based on the prompt.
3. The MCP client builds the tool input by analyzing the tool description and input definition, and sends it to the MCP server.
4. The requested tool in the MCP server receives the input and executes the action (through Puppeteer), such as clicking a button or filling a form.
5. Once the action is taken, the up-to-date DOM hierarchy is sent to the MCP client along with the action result.
6. The MCP client determines whether the task is complete or whether it needs another round of steps 2 to 6.
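The loop in this workflow can be sketched with a scripted stand-in for the model and a fake tool server. In reality the MCP client and the AI model own this loop and the calls are asynchronous; `runTask`, `Model`, `ToolServer`, `scriptedModel`, and `fakeServer` are hypothetical names used only to show the control flow.

```typescript
interface ToolCall {
  tool: string;
  input: Record<string, string>;
}

interface ToolResult {
  result: string;
  snapshot: string; // up-to-date DOM hierarchy returned with every action
}

// The model either asks for a tool call or declares the task finished.
type ModelDecision =
  | { kind: "call"; call: ToolCall }
  | { kind: "done"; answer: string };

type Model = (prompt: string, history: ToolResult[]) => ModelDecision;
type ToolServer = (call: ToolCall) => ToolResult;

function runTask(prompt: string, decide: Model, callTool: ToolServer, maxRounds = 10): string {
  const history: ToolResult[] = [];
  for (let round = 0; round < maxRounds; round++) {
    // The model picks a tool and builds its input, or decides the task is complete.
    const decision = decide(prompt, history);
    if (decision.kind === "done") return decision.answer;
    // The server executes the action and returns the result plus a fresh snapshot.
    history.push(callTool(decision.call));
  }
  throw new Error(`task not finished after ${maxRounds} rounds`);
}

// Scripted stand-in for the model: click "Pricing" first, then answer.
const scriptedModel: Model = (_prompt, history) =>
  history.length === 0
    ? { kind: "call", call: { tool: "CLICK", input: { ref: "s1e7" } } }
    : { kind: "done", answer: `finished after ${history.length} tool call(s)` };

// Fake server: pretends the click worked and returns a tiny snapshot.
const fakeServer: ToolServer = (call) => ({
  result: `${call.tool} ok`,
  snapshot: '- heading "Pricing" [level=1]',
});

const taskAnswer = runTask("Open the Pricing tab", scriptedModel, fakeServer);
```

The `maxRounds` guard is my own addition for the sketch; it simply stops a model that never declares the task complete.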

4 Cost

Cost for tools

The MCP server doesn't cost much, except for the ANALYZE tool, which is used to analyze the page. The other tools, such as CLICK, FILL, and SUBMIT, are not AI-related. Since not every prompt requires ANALYZE, this shouldn't be a concern.

Cost for chat

It depends on your plan with the AI client. For example, if you are on Claude Desktop with the free plan, it costs you nothing.

5 Potentials

Web Testing: Automated testing of web applications

Data Extraction: Scraping and analyzing web content

UI Automation: Automating repetitive web tasks

Accessibility Testing: Analyzing page accessibility

Performance Monitoring: Capturing page performance metrics

AI-Assisted Browsing: Intelligent web navigation and analysis

The following scenarios have been tested and work well with this new solution:

Scenario 1

Go to https://npmjs.com and click the Pricing tab, tell me what the pricing is for each category in Pricing page

Scenario 2

My name is xxx, a fullstack developer, I'm looking for a job with an annual salary of 200K. I got a job application from my preferred company (https://docs.google.com/forms/d/e/1FAIpQLScmUIF_AC67QMy0LjA9TFF7slcFJjZuppoG7JBc7T_e4jOfEQ/viewform). Now I need you to go to this job application and help me to submit this form based on my background. Just a heads-up, this form contains many required fields that need to be filled before it is submitted.

Scenario 3

go to this Project page(https://www.realestate.com.au/project/emerald-grove-jordan-springs-600045364), take a look at all the content in this page except the form, map and video. When you see "show more" button, you need to click it to ensure you read the complete content, this page is very informative please give me a feature summary about this project.

Scenario 4

please navigate to this survey https://docs.google.com/forms/d/1KQeKNz8Iu8Gayt_gF1VY5y2Q27Zbc6uGpLIpTaky6pw/edit and it contains many questions about my company health, each question is a section, please select one positive option until you fill out all the sections, then get it submitted.

6 Summary

In this article, I have introduced an improved approach to building a web agent by leveraging the DOM hierarchy and MCP (Model Context Protocol). The new architecture, the MCP tool-based workflow, and the DOM hierarchy parsing are explained in detail.

The key improvements over the previous solution address the two major pain points: the instability of complex prompts and the cost of screenshot-based page analysis. By using MCP tools with fine-grained descriptions, the web agent achieves better maintainability and stronger input validation. The DOM hierarchy approach eliminates the need for expensive screenshot analysis while providing accurate page context and element location capabilities.

However, this approach has its own limitations. Its effectiveness depends heavily on the semantic quality of the website's HTML structure. Poorly structured websites may not expose enough interactive elements for the algorithm to collect, which limits the web agent's capabilities. Besides, although focusing only on the interactive elements already reduces the complexity of the DOM hierarchy, cost is still a problem when the page is highly interactive or complex.

Overall, this solution represents a more practical and cost-effective approach compared to the screenshot-based method, making it more suitable for production environments where token costs and reliability are important considerations.

7 References

I was inspired by Browser MCP. Although it is not fully open-sourced yet, its idea of leveraging the DOM hierarchy is amazing: it significantly reduces the complexity of the web agent and improves its maintainability.

8 Source code

https://github.com/kkkkkxiaofei/web-agent/tree/main/mcp