twitter-scraper

Unknown subtask ArkoseLogin

Open karashiiro opened this issue 1 year ago • 12 comments

Not sure if this is the issue or if there are multiple separate issues right now, but I ran the scraper tests just now and couldn't log in:

    Unknown subtask ArkoseLogin

      109 |         next = await this.handleSuccessSubtask(next);
      110 |       } else {
    > 111 |         throw new Error(`Unknown subtask ${next.subtask.subtask_id}`);
          |               ^
      112 |       }
      113 |     }
      114 |     if ('err' in next) {

      at TwitterUserAuth.login (src/auth-user.ts:111:15)
      at Scraper.login (src/scraper.ts:410:5)
      at getScraper (src/test-utils.ts:55:5)
      at Object.<anonymous> (src/tweets.test.ts:331:19)

The login flow involves a bunch of task messaging back and forth, and I haven't seen this task name before, so it's possible that the login flow changed, in which case a bunch of APIs will stop working. Still need to figure out how exactly this changed, though.

Originally posted by @karashiiro in https://github.com/the-convocation/twitter-scraper/issues/113#issuecomment-2541984595

karashiiro, Dec 20 '24 02:12

Actually, when I tried debugging this just now, I manually logged into the account I was using for unit tests to compare against the real login flow for that particular account in the browser console. It was the same as what it had been before it broke, and then when I ran the unit tests again they all worked as expected.

I wonder if this means this issue was some kind of suspicious account check, and logging in through the website cleared it? At any rate, there doesn't seem to be any new fundamental issue with the scraper. I do still want to understand why this alternate auth path happened before I logged in on the website, though.

Originally posted by @karashiiro in https://github.com/the-convocation/twitter-scraper/issues/113#issuecomment-2543309960

karashiiro, Dec 20 '24 02:12

@danbednarski I won't be able to look into this for a couple of weeks (until 1/2) because I'll be traveling with very little downtime, but the comments I copied to this issue are my best understanding of what's going on so far.

karashiiro, Dec 20 '24 02:12

i'm sure you know this already but this seems like a semi-randomized arkose labs challenge (something like a recaptcha.) maybe twitter-scraper could provide the facilities for a user to hook into 2captcha.com or something similar?

catdevnull, Jan 23 '25 00:01

To further confirm what everyone is thinking, I cycle between 3 accounts with the scraper. All three had this issue and had a captcha challenge when signing out and back in.

nvsd, Jan 27 '25 14:01

Can this subtask be supported in any form, including manual verification?

DiamondHunters, Mar 23 '25 18:03

I'm not sure if this can be handled entirely in the library (desktop-Node vs server-Node vs browser handling might be too different), but it should be supported somehow. At a glance the options are:

  • 2Captcha (Node-only, requires an API key from them)
  • Capsolver (SDK is Node-only, browser requires an extension, requires an API key from them)
  • Manual handling (different in all environments)
  • Probably more?

and of those, it's not clear which of them work or how consistently.

Would it work if the library exposed an optional subtask handler? Thinking about offering an extension point here that expects a function:

    type SubtaskHandler = (subtaskId: string, previousResponse: TwitterUserAuthFlowResponse) => Promise<TwitterUserAuthFlowRequest>

and then consumers can register handlers like so:

    const arkoseLoginHandler: SubtaskHandler = async (subtaskId, previousResponse) => {
      const subtaskRequest = await /* whatever handling is needed */;
      return subtaskRequest;
    };

    scraper.registerAuthSubtaskHandler('ArkoseLogin', arkoseLoginHandler);

Then it can be handled in an appropriate platform-specific way using whichever CAPTCHA API (or manual verification) makes the most sense, at the cost of pushing it out to consumers to handle. It'd also support overriding handling for existing subtasks in case anyone has a good reason to do so.
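
To make the proposal concrete, here is a minimal sketch of how such a registry could dispatch handlers inside the flow loop. It is not the library's implementation: `FlowResponse`, `FlowRequest`, `AuthFlow`, and the `send` transport are hypothetical stand-ins for illustration, and only the `SubtaskHandler` shape and the `registerAuthSubtaskHandler` name come from the proposal above.

```typescript
// Stand-in shapes for illustration only; the real library types carry more fields.
interface FlowResponse {
  flowToken: string;
  subtaskId: string | null; // null once the flow has no further subtask
}

interface FlowRequest {
  flowToken: string;
  subtaskId: string;
  payload: Record<string, unknown>;
}

type SubtaskHandler = (
  subtaskId: string,
  previousResponse: FlowResponse,
) => Promise<FlowRequest>;

// Hypothetical class; it only demonstrates the dispatch idea.
class AuthFlow {
  private readonly handlers = new Map<string, SubtaskHandler>();

  // Registering under an id that already exists replaces the earlier handler,
  // which is what allows overriding the handling of existing subtasks.
  registerAuthSubtaskHandler(subtaskId: string, handler: SubtaskHandler): void {
    this.handlers.set(subtaskId, handler);
  }

  // `send` is an assumed transport: it posts one request and resolves
  // with the next response in the flow.
  async run(
    first: FlowResponse,
    send: (request: FlowRequest) => Promise<FlowResponse>,
  ): Promise<void> {
    let next = first;
    while (next.subtaskId !== null) {
      const handler = this.handlers.get(next.subtaskId);
      if (!handler) {
        // Same failure mode as the original report when nothing is registered.
        throw new Error(`Unknown subtask ${next.subtaskId}`);
      }
      next = await send(await handler(next.subtaskId, next));
    }
  }
}
```

With this shape, a consumer that registers an `ArkoseLogin` handler gets called when that subtask appears, and any subtask without a handler still fails loudly rather than being skipped.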

karashiiro, Mar 29 '25 20:03

I think that generally makes sense, although it is worth noting that at least in 2captcha's case, the SDK isn't really needed: https://github.com/catdevnull/flybondi.fail/blob/b547f4b05c90a1e26ed7662beeea23ca3224321b/trigger/scrap-airfleets.ts#L185

but i agree that twitter-scraper probably shouldn't implement it by itself
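
For anyone wondering what "no SDK" could look like, here is a rough sketch of a submit-and-poll loop over plain HTTP. The endpoint paths (`in.php`, `res.php`), the `method=funcaptcha` and `publickey` parameters, and the `CAPCHA_NOT_READY` status are written from memory of 2Captcha's legacy HTTP API and should be checked against their current docs. It does not claim to mirror the linked file. `solveFunCaptcha`, `FetchLike`, and every parameter name are made up for this example, and the Arkose public key has to be supplied by the caller.

```typescript
// Minimal subset of fetch that the sketch relies on, so a stub can be injected.
type FetchLike = (url: string) => Promise<{ json(): Promise<unknown> }>;

interface TwoCaptchaReply {
  status: number; // 1 on success, 0 otherwise
  request: string; // task id, solved token, or an error/status code
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: submits a FunCaptcha (Arkose) task, then polls for the token.
async function solveFunCaptcha(
  apiKey: string,
  publicKey: string, // Arkose public key of the target site (caller-supplied)
  pageUrl: string,
  fetchFn: FetchLike,
  pollMs = 5000,
  maxPolls = 24,
): Promise<string> {
  const submit = new URLSearchParams({
    key: apiKey,
    method: 'funcaptcha',
    publickey: publicKey,
    pageurl: pageUrl,
    json: '1',
  });
  const submitRes = await fetchFn(`https://2captcha.com/in.php?${submit.toString()}`);
  const created = (await submitRes.json()) as TwoCaptchaReply;
  if (created.status !== 1) {
    throw new Error(`2Captcha rejected the task: ${created.request}`);
  }

  const poll = new URLSearchParams({
    key: apiKey,
    action: 'get',
    id: created.request,
    json: '1',
  });
  for (let i = 0; i < maxPolls; i++) {
    await sleep(pollMs);
    const pollRes = await fetchFn(`https://2captcha.com/res.php?${poll.toString()}`);
    const reply = (await pollRes.json()) as TwoCaptchaReply;
    if (reply.status === 1) return reply.request; // solved token
    if (reply.request !== 'CAPCHA_NOT_READY') {
      throw new Error(`2Captcha error: ${reply.request}`);
    }
  }
  throw new Error('Timed out waiting for 2Captcha');
}
```

Passing the global `fetch` as `fetchFn` should work in environments that provide it. How the returned token is then fed back into the login flow is a separate question that this sketch does not answer.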

catdevnull, Mar 30 '25 00:03

Implemented in v0.16.0 with some minor adjustments, docs are here. PR has more info too - give it a go and let me know if that's sufficient.

karashiiro, Apr 06 '25 19:04

For some reason I got this error for the first time on my VPS, but when I use the same credentials to log in on the browser, or run the same code on my PC, it logs in fine. Only one of my several accounts has this problem each time it logs in now. Is this likely an IP block or something? Could a proxy fix it? It doesn't make sense that it's only happening on that one account, and only on my VPS.

pkdev08, Apr 27 '25 06:04

@pkdev08 The API is monitoring where logins occur from, and often only pops a login challenge when logging in from somewhere different for the first time. It may have detected your local IP as your "primary" IP and decided to only raise a challenge on your VPS. While automating the process via the subtask hook is an option, I think you can also handle this as a one-off thing yourself.

What I've personally found works is just solving the challenge from the destination IP manually once, at which point it won't send another challenge until your login IP changes again. You might be able to somehow proxy through your VPS to accomplish this.

If you don't already have a way to easily proxy through your VPS, I'd suggest trying a workflow along these lines (I do this to split-tunnel through my work's VPN):

  1. Use ssh -D <any available local port number> your-vps-hostname to create a tunnel through your VPS
  2. Use FoxyProxy (the extension is free, you don't need to buy their VPN) to hook up your browser to your tunnel
  3. Manually log in to Twitter and solve the challenge
  4. Retry your code

karashiiro, Apr 27 '25 15:04

Thanks. This seemed to work. Hopefully it won't be a persistent issue. We'll have to see. From what I saw, all the captcha API providers seem to charge; I didn't see any free options.

pkdev08, Apr 27 '25 20:04

@karashiiro Hey, do you know if there's any way to get the data for age-restricted tweets? I noticed those don't return any data.

pkdev08, Aug 07 '25 20:08