Cannot extract link targets (anymore)
Since upgrading from 1.13.1 to ^2.0.0, Stagehand no longer follows instructions to extract the hrefs of visible anchor tags. Looking at the inference logs, it appears the LLM is now passed only the link text and not the URL, so instructions to grab the URL and start crawling won't work.
Even explicitly annotating the zod model as postUrl: z.string().describe("The relative URL of the href"), has no effect - it just extracts this field as the display name of the link.
The use-case here is to retrieve a list of forum posts and their URLs, without necessarily using stagehand to perform the navigation on those links.
hey! yeah we changed how extract works to make it much faster, but this meant trimming down the content we give to the LLM. @seanmcguire12 is working on adding links back in #655
Love the speed boost the new version gave, and looking forward to seeing this implemented :)
hey @JosXa! link extraction is available on the alpha release now if you want to test it out! Within your schema, you'll need to define your link/url field with the following zod type: z.string().url() for it to work. In your case, it would be postUrl: z.string().url()
@seanmcguire12 I've tried the latest alpha release, but it always fails to validate the zod schema when I add a z.string().url()
Hey @alexdotpink thanks for flagging this! could you post a snippet so I can try to reproduce? Please include as much detail as possible (url, model, schema, instruction, etc) Thanks!
Yes, I can confirm the error on 2.2.0-alpha-8f0f97bc491e23ff0078c802aaf509fd04173c37.
Full error:
400 Invalid schema for response_format 'Extraction': In context=(..., 'postUrl'), 'format' is not permitted.
at _StagehandPage.<anonymous> (C:\Users\josch\.kenv\node_modules\.pnpm\@browserbasehq+stagehand@2._3b4d582bac000913703aa3c79db2112b\node_modules\@browserbasehq\stagehand\dist\index.js:3611:15)
at Generator.throw (<anonymous>)
at rejected (C:\Users\josch\.kenv\node_modules\.pnpm\@browserbasehq+stagehand@2._3b4d582bac000913703aa3c79db2112b\node_modules\@browserbasehq\stagehand\dist\index.js:70:29)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
It only happens when the zod string is annotated as .url(). Zod version is ^3.24.2.
Here's a somewhat minimal repro:
import { Stagehand } from "@browserbasehq/stagehand"
import { z } from "zod"

const stagehand = new Stagehand({
  apiKey: key,
  env: "LOCAL",
  modelName: "gpt-4o-mini", // also tried 4o
})
await stagehand.init()
await stagehand.page.goto("https://github.com/browserbase/stagehand/issues/651")
const Schema = z.object({
  commentDate: z.string(),
  commentAuthor: z.string(),
  // The `describe` makes no difference
  permalink: z.string().url().describe("The absolute permalink URL of the comment"),
})
const answer = await stagehand.page.extract({
  instruction: "Find all comments under this GitHub issue",
  schema: z.object({ comments: z.array(Schema) }),
})
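The "'format' is not permitted" message suggests the model provider rejected a JSON Schema `format` keyword, which is presumably what `.url()` turns into when the zod schema is serialized (an assumption; the thread does not show the generated schema). As a rough sketch of that idea only, and not how Stagehand handles it, a hypothetical `stripFormat` helper could remove string-valued `format` keywords from an already-generated JSON Schema before it is sent:

```typescript
// Hypothetical helper for illustration; not part of Stagehand or zod.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Recursively copy a JSON Schema, dropping every string-valued "format" keyword.
// A property that merely happens to be *named* "format" has an object value,
// so it is kept.
function stripFormat(schema: Json): Json {
  if (Array.isArray(schema)) {
    return schema.map(stripFormat);
  }
  if (schema !== null && typeof schema === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, value] of Object.entries(schema)) {
      if (key === "format" && typeof value === "string") continue;
      out[key] = stripFormat(value);
    }
    return out;
  }
  return schema;
}

// Assumed shape of the schema generated for `postUrl: z.string().url()`.
const postSchema: Json = {
  type: "object",
  properties: {
    postUrl: { type: "string", format: "uri" },
  },
  required: ["postUrl"],
};

console.log(JSON.stringify(stripFormat(postSchema), null, 2));
```

The trade-off is that the provider no longer sees any URL hint, so the value would still need to be validated on the client after extraction.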
hey @JosXa, i wasn't able to repro the schema validation issue you ran into, but i was getting empty strings with gpt-4o and 4o-mini, which means they are unable to find the correct element based on the inputs they are given.
I also found that adjusting the prompt/model worked for me.
Going to look deeper into this, but for now, here is the script that worked for me:
import { Stagehand } from "@/dist";
import { z } from "zod";

async function example() {
  const stagehand = new Stagehand({
    env: "LOCAL",
    modelName: "gemini-2.0-flash",
    modelClientOptions: {
      apiKey: process.env.GOOGLE_API_KEY,
    },
  });
  await stagehand.init();
  await stagehand.page.goto(
    "https://github.com/browserbase/stagehand/issues/651",
  );
  const Schema = z.object({
    commentDate: z.string(),
    commentAuthor: z.string(),
    // The `describe` makes no difference
    permalink: z
      .string()
      .url()
      .describe("The absolute permalink URL of the comment"),
  });
  const answer = await stagehand.page.extract({
    instruction:
      "Find all comments under this GitHub issue, and their corresponding permalink URL",
    schema: z.object({ comments: z.array(Schema) }),
  });
  console.log(JSON.stringify(answer, null, 2));
}

(async () => {
  await example();
})();
here is the output:
{
  "comments": [
    {
      "commentAuthor": "kamath",
      "commentDate": "last week",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2798001226"
    },
    {
      "commentAuthor": "JosXa",
      "commentDate": "last week",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2798356391"
    },
    {
      "commentAuthor": "seanmcguire12",
      "commentDate": "5 days ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2811684060"
    },
    {
      "commentAuthor": "alexdotpink",
      "commentDate": "2 days ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2817203425"
    },
    {
      "commentAuthor": "seanmcguire12",
      "commentDate": "11 hours ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2818925293"
    },
    {
      "commentAuthor": "JosXa",
      "commentDate": "3 hours ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2819775125"
    },
    {
      "commentAuthor": "JosXa",
      "commentDate": "3 hours ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2819777638"
    }
  ]
}
Did the "and their corresponding permalink URL" make a big difference for you?
@JosXa yes the prompt + gemini model seemed to be the difference maker here!
I believe it has an issue with relative URLs. When the link is <a href="./hello-world">, it's unable to construct a full url even when given the domain name in the describe context 🤔
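One way to sidestep this is to let the model return whatever href it sees and resolve it in code afterwards. A minimal sketch, assuming the URL of the page the link was found on is known; `resolveHref` is a made-up name, and it relies only on the standard `URL` constructor:

```typescript
// Hypothetical helper: resolve a possibly relative href against the page it came from.
// Returns null when the input cannot be parsed as a URL at all.
function resolveHref(href: string, pageUrl: string): string | null {
  try {
    return new URL(href, pageUrl).href;
  } catch {
    return null;
  }
}

// example.com stands in for the real forum domain.
console.log(resolveHref("./hello-world", "https://example.com/forum/"));
// prints "https://example.com/forum/hello-world"
console.log(resolveHref("/t/42", "https://example.com/forum/index.html"));
// prints "https://example.com/t/42"
```

With this approach the schema field would be a plain string, and the absolute URL is built after extraction rather than by the LLM.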
Version 2.2.0 still seems unable to extract links.
hey @darasus! yeah there was an issue on the latest version with link extraction, so we've merged various patches/improvements today that should help (you can test it out on the alpha release if you like). Do you mind posting a snippet here so that I can try to reproduce?
Hey @seanmcguire12, this should be sufficient to reproduce. Also tried with latest alpha.
const url = `https://linkedin.com/mynetwork/invite-connect/connections/`;
await ctx.page.goto(url, {
  waitUntil: "load",
});
const { recentConnections } = await ctx.page.extract({
  instruction: "Find first 10 connections, and their corresponding permalink URL",
  useTextExtract: true,
  schema: z.object({
    recentConnections: z.array(
      z.object({
        firstName: z.string(),
        lastName: z.string(),
        profileUrl: z.string().url(),
        connectedDate: z.string(),
      }),
    ),
  }),
});
FWIW, I notice a correlation with having "type": "module" set in package.json. With that set, using z.string().url() results in:
Full error:
400 Invalid schema for response_format 'Extraction': In context=(..., 'postUrl'), 'format' is not permitted.
at _StagehandPage.<anonymous> (C:\Users\josch\.kenv\node_modules\.pnpm\@browserbasehq+stagehand@2._3b4d582bac000913703aa3c79db2112b\node_modules\@browserbasehq\stagehand\dist\index.js:3611:15)
at Generator.throw (<anonymous>)
at rejected (C:\Users\josch\.kenv\node_modules\.pnpm\@browserbasehq+stagehand@2._3b4d582bac000913703aa3c79db2112b\node_modules\@browserbasehq\stagehand\dist\index.js:70:29)
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
@darasus you'll have to remove useTextExtract: true (or set it to false) for this to work. Will make sure we update our docs once we officially announce this feature.
@adamalfredsson this was one of the errors we patched on friday! if you could try it out again (on alpha version) and let me know how it works that would be great!
I believe it has an issue with relative URLs. When the link is
<a href="./hello-world">, it's unable to construct a full url even when given the domain name in the describe context 🤔
I have this exact issue... the extracted links are wrong, always slightly hallucinated, no matter what model or how detailed the instruction. Examples don't help either.
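Until this is sorted out, one possible guard against hallucinated links is to check each extracted URL against the hrefs that actually exist on the page. A rough sketch, assuming the page's hrefs have been collected separately (for instance by reading the `href` attribute of every `a[href]` element in the browser; that step is not shown). `splitByPresence` is a hypothetical name, not a Stagehand API:

```typescript
// Hypothetical check: separate extracted links that match a real href on the page
// from those that do not. Both sides are resolved against pageUrl before comparing,
// so "./a" and "https://example.com/a" count as the same link.
function splitByPresence(
  extracted: string[],
  pageHrefs: string[],
  pageUrl: string,
): { confirmed: string[]; suspect: string[] } {
  const normalize = (href: string): string | null => {
    try {
      return new URL(href, pageUrl).href;
    } catch {
      return null;
    }
  };

  // Absolute form of every href really present in the DOM.
  const known = new Set<string>();
  for (const href of pageHrefs) {
    const abs = normalize(href);
    if (abs !== null) known.add(abs);
  }

  const confirmed: string[] = [];
  const suspect: string[] = [];
  for (const link of extracted) {
    const abs = normalize(link);
    if (abs !== null && known.has(abs)) {
      confirmed.push(abs);
    } else {
      suspect.push(link); // likely invented or altered by the model
    }
  }
  return { confirmed, suspect };
}

// Made-up data for illustration.
const result = splitByPresence(
  ["https://example.com/posts/1", "https://example.com/posts/one"],
  ["./posts/1", "/posts/2"],
  "https://example.com/",
);
console.log(result);
```

Anything in `suspect` can then be dropped or retried instead of being crawled.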