Cannot extract link targets (anymore)

Open JosXa opened this issue 9 months ago • 11 comments

Since the upgrade from 1.13.1 to ^2.0.0, Stagehand does not follow instructions to extract the hrefs of visible anchor tags anymore. Looking at the inference logs, it appears the LLM is only passed the link text, but not the URL anymore, so instructions to grab the URL and start crawling won't work.

Even explicitly annotating the zod model as postUrl: z.string().describe("The relative URL of the href"), has no effect - it just extracts this field as the display name of the link.

The use-case here is to retrieve a list of forum posts and their URLs, without necessarily using stagehand to perform the navigation on those links.

JosXa avatar Apr 09 '25 10:04 JosXa

hey! yeah we changed how extract works to make it much faster, but this meant trimming down the content we give to the LLM. @seanmcguire12 is working on adding links back in #655

kamath avatar Apr 11 '25 20:04 kamath

Love the speed boost the new version gave, and looking forward to seeing this implemented :)

JosXa avatar Apr 12 '25 01:04 JosXa

hey @JosXa! link extraction is available on the alpha release now if you want to test it out! Within your schema, you'll need to define your link/url field with the following zod type: z.string().url() for it to work. In your case, it would be postUrl: z.string().url()

seanmcguire12 avatar Apr 17 '25 04:04 seanmcguire12

@seanmcguire12 I've tried the latest alpha release, but it always fails to validate the zod schema when I add a z.string().url()

alexdotpink avatar Apr 20 '25 14:04 alexdotpink

Hey @alexdotpink thanks for flagging this! could you post a snippet so I can try to reproduce? Please include as much detail as possible (url, model, schema, instruction, etc) Thanks!

seanmcguire12 avatar Apr 21 '25 16:04 seanmcguire12

Yes, I can confirm the error on 2.2.0-alpha-8f0f97bc491e23ff0078c802aaf509fd04173c37.

Full error:
400 Invalid schema for response_format 'Extraction': In context=(..., 'postUrl'), 'format' is not permitted.
    at _StagehandPage.<anonymous> (C:\Users\josch\.kenv\node_modules\.pnpm\@browserbasehq+stagehand@2._3b4d582bac000913703aa3c79db2112b\node_modules\@browserbasehq\stagehand\dist\index.js:3611:15)
    at Generator.throw (<anonymous>)
    at rejected (C:\Users\josch\.kenv\node_modules\.pnpm\@browserbasehq+stagehand@2._3b4d582bac000913703aa3c79db2112b\node_modules\@browserbasehq\stagehand\dist\index.js:70:29)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)

It only happens when the zod string is annotated as .url(). Zod version is ^3.24.2.

JosXa avatar Apr 22 '25 00:04 JosXa
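
The `'format' is not permitted` message points at the JSON Schema sent as `response_format`. A zod string marked `.url()` is commonly serialized with a `format` keyword (for example `"format": "uri"`), which the endpoint rejected here. As a rough sketch of one possible workaround, assuming you can touch the JSON Schema before it is sent, the keyword can be stripped recursively. `stripFormatKeyword` is a hypothetical helper for illustration only. It is not part of Stagehand or zod, and not necessarily how the maintainers patched this.

```typescript
// Hypothetical helper (not a Stagehand or zod API): remove every JSON Schema
// `format` keyword while keeping object properties that happen to be named "format".
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isJsonObject(value: Json): value is { [key: string]: Json } {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

function stripFormatKeyword(schema: Json): Json {
  if (Array.isArray(schema)) {
    return schema.map((item) => stripFormatKeyword(item));
  }
  if (!isJsonObject(schema)) {
    return schema; // primitives and null pass through unchanged
  }
  const out: { [key: string]: Json } = {};
  for (const [key, value] of Object.entries(schema)) {
    if (key === "format") {
      continue; // drop the keyword the endpoint complained about
    }
    if (key === "properties" && isJsonObject(value)) {
      // Keys under `properties` are field names, not keywords: keep them all.
      const props: { [key: string]: Json } = {};
      for (const [name, subSchema] of Object.entries(value)) {
        props[name] = stripFormatKeyword(subSchema);
      }
      out[key] = props;
    } else {
      out[key] = stripFormatKeyword(value);
    }
  }
  return out;
}

// Example input shaped like the schema in the error (illustrative values).
const cleaned = stripFormatKeyword({
  type: "object",
  properties: {
    postUrl: { type: "string", format: "uri", description: "URL of the post" },
  },
  required: ["postUrl"],
});
console.log(JSON.stringify(cleaned, null, 2));
```

Dropping the keyword also drops the URL hint the model would have seen, so validating the returned data against the original zod schema afterwards is still worthwhile.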

Here's a somewhat minimal repro:

import { Stagehand } from "@browserbasehq/stagehand";
import { z } from "zod";

const stagehand = new Stagehand({
  apiKey: key,
  env: "LOCAL",
  modelName: "gpt-4o-mini", // also tried 4o
})

await stagehand.init()

await stagehand.page.goto("https://github.com/browserbase/stagehand/issues/651")

const Schema = z.object({
  commentDate: z.string(),
  commentAuthor: z.string(),
  // The `describe` makes no difference
  permalink: z.string().url().describe("The absolute permalink URL of the comment"),
})

const answer = await stagehand.page.extract({
  instruction: "Find all comments under this GitHub issue",
  schema: z.object({ comments: z.array(Schema) }),
})

JosXa avatar Apr 22 '25 00:04 JosXa

hey @JosXa, i wasn't able to repro the schema validation issue you ran into, but i was getting empty strings with gpt-4o and 4o-mini, which means they are unable to find the correct element based on the inputs they are given.

I also found that adjusting the prompt/model worked for me.

Going to look deeper into this, but for now, here is the script that worked for me:

import { Stagehand } from "@/dist";
import { z } from "zod";

async function example() {
  const stagehand = new Stagehand({
    env: "LOCAL",
    modelName: "gemini-2.0-flash",
    modelClientOptions: {
      apiKey: process.env.GOOGLE_API_KEY,
    },
  });

  await stagehand.init();

  await stagehand.page.goto(
    "https://github.com/browserbase/stagehand/issues/651",
  );

  const Schema = z.object({
    commentDate: z.string(),
    commentAuthor: z.string(),
    // The `describe` makes no difference
    permalink: z
      .string()
      .url()
      .describe("The absolute permalink URL of the comment"),
  });

  const answer = await stagehand.page.extract({
    instruction:
      "Find all comments under this GitHub issue, and their corresponding permalink URL",
    schema: z.object({ comments: z.array(Schema) }),
  });
  console.log(JSON.stringify(answer, null, 2));
}

(async () => {
  await example();
})();

here is the output:


{
  "comments": [
    {
      "commentAuthor": "kamath",
      "commentDate": "last week",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2798001226"
    },
    {
      "commentAuthor": "JosXa",
      "commentDate": "last week",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2798356391"
    },
    {
      "commentAuthor": "seanmcguire12",
      "commentDate": "5 days ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2811684060"
    },
    {
      "commentAuthor": "alexdotpink",
      "commentDate": "2 days ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2817203425"
    },
    {
      "commentAuthor": "seanmcguire12",
      "commentDate": "11 hours ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2818925293"
    },
    {
      "commentAuthor": "JosXa",
      "commentDate": "3 hours ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2819775125"
    },
    {
      "commentAuthor": "JosXa",
      "commentDate": "3 hours ago",
      "permalink": "https://github.com/browserbase/stagehand/issues/651#issuecomment-2819777638"
    }
  ]
}

seanmcguire12 avatar Apr 22 '25 03:04 seanmcguire12

Did the "and their corresponding permalink URL" make a big difference for you?

JosXa avatar Apr 22 '25 10:04 JosXa

Did the "and their corresponding permalink URL" make a big difference for you?

@JosXa yes the prompt + gemini model seemed to be the difference maker here!

seanmcguire12 avatar Apr 23 '25 19:04 seanmcguire12

I believe it has an issue with relative URLs. When the link is <a href="./hello-world">, it's unable to construct a full url even when given the domain name in the describe context 🤔

JosXa avatar Apr 24 '25 06:04 JosXa
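
For the relative-URL case, one option that does not depend on the model is to resolve the extracted href against the page URL after extraction, using the standard `URL` constructor. This is a minimal sketch under that assumption. `resolveHref` and the `forum.example.com` URLs are made up for illustration and are not part of Stagehand.

```typescript
// Hypothetical post-processing step: resolve an href the way a browser would,
// relative to the URL of the page it was found on.
function resolveHref(href: string, pageUrl: string): string | null {
  try {
    return new URL(href, pageUrl).toString();
  } catch {
    return null; // href or pageUrl could not be parsed as a URL
  }
}

const pageUrl = "https://forum.example.com/topics/";

// Path-relative link, as in <a href="./hello-world">
console.log(resolveHref("./hello-world", pageUrl));
// https://forum.example.com/topics/hello-world

// Root-relative link
console.log(resolveHref("/users/42", pageUrl));
// https://forum.example.com/users/42

// Already absolute links pass through unchanged
console.log(resolveHref("https://other.example.org/a", pageUrl));
// https://other.example.org/a
```

With this approach the schema field would have to accept relative strings (plain `z.string()`), and the absolute URL is built in code rather than by the LLM.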

Version 2.2.0 still seems unable to extract links.

darasus avatar May 01 '25 14:05 darasus

hey @darasus! yeah there was an issue on the latest version with link extraction, so we've merged various patches/improvements today that should help (you can test it out on the alpha release if you like). Do you mind posting a snippet here so that I can try to reproduce?

seanmcguire12 avatar May 03 '25 01:05 seanmcguire12

Hey @seanmcguire12, this should be sufficient to reproduce. Also tried with latest alpha.

const url = `https://linkedin.com/mynetwork/invite-connect/connections/`;

await ctx.page.goto(url, {
  waitUntil: "load",
});

const { recentConnections } = await ctx.page.extract({
  instruction: "Find first 10 connections, and their corresponding permalink URL",
  useTextExtract: true,
  schema: z.object({
    recentConnections: z.array(
      z.object({
        firstName: z.string(),
        lastName: z.string(),
        profileUrl: z.string().url(),
        connectedDate: z.string(),
      }),
    ),
  }),
});

darasus avatar May 03 '25 07:05 darasus

FWIW – I notice a correlation with having the package.json "type": "module" set, which while using z.string().url() results in:

Full error:
400 Invalid schema for response_format 'Extraction': In context=(..., 'postUrl'), 'format' is not permitted.
    at _StagehandPage.<anonymous> (C:\Users\josch\.kenv\node_modules\.pnpm\@browserbasehq+stagehand@2._3b4d582bac000913703aa3c79db2112b\node_modules\@browserbasehq\stagehand\dist\index.js:3611:15)
    at Generator.throw (<anonymous>)
    at rejected (C:\Users\josch\.kenv\node_modules\.pnpm\@browserbasehq+stagehand@2._3b4d582bac000913703aa3c79db2112b\node_modules\@browserbasehq\stagehand\dist\index.js:70:29)
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)

adamalfredsson avatar May 03 '25 12:05 adamalfredsson

@darasus you'll have to remove useTextExtract: true (or set it to false) for this to work. Will make sure we update our docs once we officially announce this feature.

seanmcguire12 avatar May 06 '25 01:05 seanmcguire12

@adamalfredsson this was one of the errors we patched on friday! if you could try it out again (on alpha version) and let me know how it works that would be great!

seanmcguire12 avatar May 06 '25 01:05 seanmcguire12

I believe it has an issue with relative URLs. When the link is <a href="./hello-world">, it's unable to construct a full url even when given the domain name in the describe context 🤔

I have this exact issue ... the extracted links are always wrong or slightly hallucinated, no matter what model or how detailed the instruction ... examples don't help either.

reactsaas avatar Jul 24 '25 16:07 reactsaas
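
One way to at least detect hallucinated links is to compare what the model returned against the hrefs that really exist on the page. The sketch below assumes you have already collected the raw `href` values of the page's anchors by some other means (how you gather them is up to you and is not shown). `checkExtractedLinks` and the sample URLs are hypothetical and not part of Stagehand.

```typescript
// Hypothetical check: split extracted links into those that match a real
// anchor on the page and those that do not (possible hallucinations).
function toAbsolute(url: string, pageUrl: string): string | null {
  try {
    return new URL(url, pageUrl).toString();
  } catch {
    return null;
  }
}

function checkExtractedLinks(
  extracted: string[],
  pageHrefs: string[],
  pageUrl: string,
): { verified: string[]; suspect: string[] } {
  // Normalize every real href to an absolute URL for comparison.
  const known = new Set<string>();
  for (const href of pageHrefs) {
    const abs = toAbsolute(href, pageUrl);
    if (abs !== null) known.add(abs);
  }

  const verified: string[] = [];
  const suspect: string[] = [];
  for (const link of extracted) {
    const abs = toAbsolute(link, pageUrl);
    if (abs !== null && known.has(abs)) {
      verified.push(abs);
    } else {
      suspect.push(link); // keep the original string for debugging
    }
  }
  return { verified, suspect };
}

const result = checkExtractedLinks(
  ["https://forum.example.com/topics/hello-world", "https://forum.example.com/topics/helo-world"],
  ["./hello-world", "/posts/7"],
  "https://forum.example.com/topics/",
);
console.log(result);
```

Anything that lands in `suspect` can be retried, dropped, or flagged, rather than silently crawled.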