A tool to "Select an element in the page to inspect it"
Is your feature request related to a problem? Please describe.
"Select an element in the page to inspect it" could arguably be the most used Chrome DevTools "tool".
This lets the (human) programmer focus on just the part of the DOM they are interested in. In contrast, most tools we put in the hands of AI agents today return the full DOM. The AI is then forced to filter out all the irrelevant parts first, which fills up the context window extremely fast and slows down the agent loop considerably.
Describe the solution you'd like
If this MCP server could implement something like "Select an element in the page to inspect it", we could give AI agents the same efficient method of inspecting the DOM.
I would imagine the AI could send an annotated screenshot as input, with a suitably coloured bounding box indicating the region of the page it wants to inspect.
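To make this concrete, here is a rough sketch of the tool contract I have in mind; every name and field below is a placeholder for illustration, not an existing API.

```ts
// Hypothetical tool contract; all names and fields are placeholders.
interface InspectRegionRequest {
  // Viewport coordinates of the bounding box the agent drew on the screenshot.
  region: { x: number; y: number; width: number; height: number };
  // Optional cap on subtree depth, to keep the response small.
  maxDepth?: number;
}

interface InspectRegionResponse {
  // Serialized DOM subtree rooted at the element that best matches the region.
  outerHTML: string;
  // A stable reference (e.g. a snapshot uid) usable in follow-up tool calls.
  uid?: string;
}

// The server would resolve the region to a DOM node and return only that subtree.
declare function inspectRegion(
  request: InspectRegionRequest
): Promise<InspectRegionResponse>;
```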
Describe alternatives you've considered
Alternative solutions include a tool that allows the AI to specify a CSS selector, with the tool returning the DOM rooted at the corresponding element (instead of the whole page). But this does mean the selector must be identified somehow first, which may not be easy. Working with a screenshot and placing a bounding box on the region of interest could be markedly easier in certain cases.
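For comparison, here is a minimal sketch of that selector-based alternative, using only standard DOM APIs evaluated in the page; the selector in the example is made up for illustration.

```ts
// Runs in the page context: return only the subtree rooted at the matched element.
function getSubtreeHTML(selector: string): string | null {
  const element = document.querySelector(selector);
  return element ? element.outerHTML : null;
}

// Example: inspect just the container of a misbehaving button.
// '#delete-button-container' is an illustrative selector.
console.log(getSubtreeHTML('#delete-button-container') ?? 'No element matched');
```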
Additional context
The "Select an element in the page to inspect it" tool in Chrome DevTools:
Do you mean that you can select an element/artifact in Chrome DevTools and use it as context in your MCP client?
@natorion No, I didn't mean me, as a human user, selecting. I meant giving the AI agent the ability to select DOM elements by position (as opposed to by a CSS selector).
Understood, so the human would then use the selected element to continue debugging on their own in DevTools?
@natorion No, the idea is that, this way, the AI agent can use its vision to select the part of the DOM it is interested in.
What we have today is the AI agent first reading the DOM structure of the entire page (this can be massive and could use a lot of tokens very quickly) and then trying to figure out the subtree of interest.
But that is not how a human would typically debug, because the human can use their vision and the "Select an element in the page to inspect it" tool in Chrome DevTools to quickly inspect just the part of the page they are interested in. The request here is to give AI agents the same capability so that they can be more effective (in autonomous operation).
@natorion I see the confusion here. When I say the "Select an element in the page to inspect it" tool, I do NOT mean the AI agent interacting with the Chrome DevTools front end to select a DOM element for the human.
What I mean is: what if we gave the AI agent a "virtual mouse pointer", if you will, so that it can use its vision to select DOM elements?
Hope that clarifies the intent here.
Hmmm, in what prompts would you expect your MCP client to make use of, let's say, a select_element_visually(imageData, position) function?
Consider,
The "Delete" button is not aligned correctly on http://localhost:3000/my-complex-page. Could you fix it?
The agent is likely to first attempt to "look" at the page, issuing some command that returns a dump of the page's DOM. But this could result in a lot of token usage (and potentially the context window filling up).
So, instead, if I could say:
The "Delete" button is not aligned correctly on http://localhost:3000/my-complex-page. Could you fix it?
The page is complex and you might want to select the button's container visually to inspect its DOM.
With this, the agent will hopefully take the cue and "look" at the page by issuing a screenshot command, leveraging its vision capability, and then follow it up with "select_element_visually".
To summarise, in cases where the page has a large number of DOM nodes, vision-based selection could give the agent a method that scales irrespective of the DOM-level complexity.
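Roughly, the tool-call sequence I am hoping for would look like the sketch below. take_screenshot exists today; select_element_visually is the hypothetical tool from this request, and callTool merely stands in for however the MCP client invokes server tools.

```ts
// A sketch of the hoped-for agent flow, not an existing API.
declare function callTool(name: string, args: object): Promise<unknown>;

async function inspectDeleteButton(): Promise<unknown> {
  // Step 1: the agent "looks" at the page using its vision.
  await callTool('take_screenshot', {});

  // Step 2: it picks the region around the "Delete" button and asks for
  // just that subtree (coordinates are made up for illustration).
  return callTool('select_element_visually', {
    region: { x: 640, y: 412, width: 180, height: 48 },
  });
}
```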
In this case you would still need to match the (marked) screenshot to the actual page, which could very well have changed visually in the meantime (consider elements slightly shading or flickering, a video element, or an auto-scrolling carousel).
The IDs we generated or the A11y tree should be more resilient.
Assuming the visual representation of the page is relevant and take_snapshot is not enough, it might be more useful to be a good MCP server and help the client make the decision: add another parameter to take_screenshot called "overlayUIDs", which overlays UIDs visually, like a map, over the screenshot. This way the MCP client knows where things on the page are located. I have no clue, though, whether MCP clients nowadays are smart enough to figure that out, or what resolution you would need to provide a good enough map.
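From the client's side, such a call might look something like this; "overlayUIDs" is only a proposal here, not an existing parameter, and callTool stands in for however the client invokes server tools.

```ts
// Illustrative only: the "overlayUIDs" parameter does not exist today.
declare function callTool(name: string, args: object): Promise<unknown>;

async function screenshotWithUidMap(): Promise<unknown> {
  // The returned image would have each element's uid (the same uids that
  // take_snapshot reports) drawn over it, so the client can map "the thing
  // in the top-right corner" to a concrete uid for follow-up calls.
  return callTool('take_screenshot', { overlayUIDs: true });
}
```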
Actually, the overlay idea is a good alternative — it reminds me of the well-known Set-of-Marks prompting method.
Staying on topic, though: I hadn't really thought through how the client could communicate to the Chrome DevTools MCP which element it wants to select.
I'm curious how the "Select an element in the page to inspect it" feature in Chrome DevTools actually works. What's involved in translating the mouse pointer's position into the corresponding DOM element?
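As far as I can tell (I have not checked the DevTools source), the likely building blocks are document.elementFromPoint in the page and the DOM.getNodeForLocation command in the Chrome DevTools Protocol, roughly along these lines:

```ts
// In-page version: resolve viewport coordinates to the innermost element there.
function elementAt(x: number, y: number): Element | null {
  return document.elementFromPoint(x, y);
}

console.log(elementAt(200, 300)?.outerHTML.slice(0, 200));

// Outside the page, the DevTools Protocol exposes DOM.getNodeForLocation, which
// takes viewport coordinates and returns a backendNodeId for the node at that
// point (roughly: cdpSession.send('DOM.getNodeForLocation', { x, y })).
```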
My thinking is that, in all scenarios where a human benefits from inspecting the DOM with that tool, an AI agent could potentially benefit as well.
I think this is an example of the difference in behavior between humans and AI agents. A human is a very visual entity that reacts quickly to what is on a screen. A typical MCP client will use a model that is not optimized for speed and image processing, and will have a harder time figuring out what is on the screen in a timely manner. Converting visual information into structured information that is less time-dependent (the map proposal above) would actually be a better input mechanism.
This is just mental gymnastics at this point though; no clue if it's actually useful.
@natorion Really appreciate you looking into this. I'm glad I got a chance to explain my thinking too.
Let's collect feedback if something like this would be useful for more people.
I would just add my use case to this ticket as well, coming more from a design than an engineering perspective (knowledge-wise).
Using prompts like
Check my current selection in the browser, help me replace that with my current selection in Figma
or
Help me change the background color of the current selection in the browser
would help a lot where my CSS skills fail to help me convey the exact thing I want to change.
My current workaround is to use the MCP, make the changes in DevTools with the help of the built-in AI Assistant in DevTools, then ask Cursor to inspect the changes made and implement them. But this approach is a lot of back and forth; I would just like to work 100% in Cursor and use the MCPs for it instead.
A good library addressing this problem is https://github.com/zh-lx/code-inspector. It can be configured to copy the file path of the selected component to the clipboard, which you can then paste into Claude or Cursor.
I think using selectors is good enough. If you have the code, just add some ids (by humans or models) and models can easily locate the elements. For external pages, models are also smart enough to use tag names and even grepping to generate the selector. They just need to be told to do so.
I’m surprised this topic is only three weeks old. When I first heard about MCP for devtools, the first thing I thought of was giving an AI agent the ability to look at a screenshot of a page, click with a virtual inspect cursor, and instantly get the HTML subtree under the cursor — exactly the way it works for a human in inspect mode.
I think the speed of development and “convergence to results” will increase by an order of magnitude once AI gets interactive feedback in the form of image–virtual cursor–HTML subtree.
It will also bring a huge boost in the efficiency of various data crawlers, since there will no longer be a need to manually calculate the effective XPath leading to the desired content for mass parsing of similar pages (after all, web pages have been generated using templating engines for the past 30 years).
We are actually working on sharing context in DevTools with the MCP client :-). See https://github.com/ChromeDevTools/chrome-devtools-mcp/issues/129
take_snapshot already gives you a tree representation of the page. Can you elaborate?
#129 would work as well, but as the title of this issue says, using the already existing tool "Select an element in the page to inspect it" would be my go-to.
Looks like Cursor is coming to the rescue: Cursor 2.0 with a built-in browser.
Whoa! Here is a screenshot from their demo video: https://cursor.com/blog/2-0
As the OP, let me point out that the original ask was to also let agents do this kind of "select an element" to avoid context bloat. But this is an important piece as well. Wonderful!
Relevant PR https://github.com/ChromeDevTools/chrome-devtools-mcp/pull/486.
If you select an element in DevTools on the actuated page, the information about which element is selected is now also passed to the AI agent. Try using "use the selected element in devtools" or similar to reference it in your prompt.
It's the same "Select an element" tool; can you elaborate on what is missing for you?
For inspection, I think this is related to https://github.com/ChromeDevTools/chrome-devtools-mcp/issues/268. I just commented there about my work on providing the DOM, CSS rules, and computed styles.
For selecting an element, I still believe evaluating DOM methods like .querySelector, .children, and .parentNode is the best way for models. They cannot really understand images at the pixel level because everything is tokenized.
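For example, a small script like the one below, evaluated in the page, lets a model walk outward from a known element without ever pulling in the whole DOM; the starting selector is illustrative.

```ts
// Summarize an element's immediate neighbourhood using standard DOM traversal.
function describeNeighbourhood(selector: string): string[] {
  const element = document.querySelector(selector);
  if (!element) return [];
  const lines: string[] = [];
  if (element.parentElement) {
    lines.push(`parent: <${element.parentElement.tagName.toLowerCase()}>`);
  }
  for (const child of Array.from(element.children)) {
    lines.push(`child: <${child.tagName.toLowerCase()}> "${child.textContent?.trim().slice(0, 40)}"`);
  }
  return lines;
}

console.log(describeNeighbourhood('#delete-button'));
```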