Avoid `webDriver->quit()` on `__destruct()` when scraping *remote* websites

Open ThomasLandauer opened this issue 3 years ago • 8 comments

When connecting to remote webpages, I'm sometimes getting this exception:

Curl error thrown for http DELETE to /session/5db262bc-961f-4cbf-9983-8d602f00d89a Operation timed out after 30000 milliseconds with 0 bytes received

And at the bottom of Symfony's exception page:

Curl error thrown for http POST to /session/5db262bc-961f-4cbf-9983-8d602f00d89a/url with params: {"url":"https://www.example.com"} Operation timed out after 30001 milliseconds with 0 bytes received

As far as I can see, the cause is this: when done, Panther tries to clean up, and `Client::quit()` calls `$this->webDriver->quit()`. From this I'm guessing:

  • Some servers just respond with 5xx. Possible side effect: After doing this "forbidden" request repeatedly, I might get blocked.
  • Some don't send a response at all. Side effect: Panther waits for 30 seconds (=general timeout), i.e. my command hangs.

So the solution looks pretty clear to me: Don't send that request remotely ;-)

So the first question towards a PR would be: Do you want an automatic check, or rather some user-configurable option to suppress this cleanup?
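
To make the user-configurable variant concrete, here is a minimal sketch. Everything in it is hypothetical: `FakeDriver` and `SketchClient` are stand-ins rather than Panther or php-webdriver classes, and `quit_on_destruct` is an invented option name. It only shows a destructor that skips the cleanup request when the user opts out.

```php
<?php
// Hypothetical stand-in for the WebDriver; not the php-webdriver class.
final class FakeDriver
{
    public int $quitCalls = 0;

    public function quit(): void
    {
        // In the real driver, this is where the DELETE /session/{id} request goes out.
        $this->quitCalls++;
    }
}

// Hypothetical client; only the cleanup logic is modelled.
final class SketchClient
{
    private ?FakeDriver $driver;
    private bool $quitOnDestruct;

    public function __construct(FakeDriver $driver, array $options = [])
    {
        $this->driver = $driver;
        // 'quit_on_destruct' is an invented option name; it defaults to the current behaviour.
        $this->quitOnDestruct = $options['quit_on_destruct'] ?? true;
    }

    public function quit(): void
    {
        if (null !== $this->driver) {
            $this->driver->quit();
            $this->driver = null;
        }
    }

    public function __destruct()
    {
        // Only clean up automatically when the user has not opted out.
        if ($this->quitOnDestruct) {
            $this->quit();
        }
    }
}

$driver = new FakeDriver();
$client = new SketchClient($driver, ['quit_on_destruct' => false]);
unset($client); // the destructor runs, but no quit request is sent
```

With an option like this, an explicit `$client->quit()` would still be available for callers who want the cleanup; only the implicit call from the destructor is suppressed.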

Related: https://github.com/symfony/panther/issues/169 (don't know if it's really the same, or some Docker-related problem)

ThomasLandauer avatar Apr 20 '21 15:04 ThomasLandauer

I run 30 instances of Panther in parallel using different ports, and each of them connects to a different proxy. I often get that error and I'm not sure why.

trbsi avatar May 10 '21 19:05 trbsi

@ThomasLandauer which client do you use for scraping? Curl? Chrome? Firefox?

Mepcuk avatar May 31 '21 13:05 Mepcuk

Firefox.

ThomasLandauer avatar May 31 '21 13:05 ThomasLandauer

It may be related to the bug I tried to fix in https://github.com/symfony/panther/pull/425. However, I didn't manage to get this patch working, and I'm not sure when I'll have the time to work on it again.

Help welcome on this one (yes, destructors are hard to deal with).

dunglas avatar May 31 '21 20:05 dunglas

I think I'm experiencing the same issue (i.e. I get the DELETE error when using Panther on remote sites).

Is there any fix suggested? Or where should I look to try and patch it myself?

Maybe we could add some setting on Client to tell it not to call $this->webDriver->close() from Client::close()?

gravitiq-cm avatar Aug 18 '22 08:08 gravitiq-cm

Hi @gravitiq-cm, as explained previously, this error is most likely caused by the bug I tried to fix in https://github.com/symfony/panther/pull/425

Unfortunately, I didn't find the time to finish it and it's at the very end of my todo list. Help on fixing this would be much appreciated!

dunglas avatar Aug 19 '22 18:08 dunglas

I think a valid solution would be to allow users to define a different class to use instead of the hard-coded RemoteWebDriver.

For example, change from:

```php
/**
 * @throws \RuntimeException
 */
public function start(): WebDriver
{
    // ...
    return RemoteWebDriver::create(...);
}
```

to:

```php
/**
 * @throws \RuntimeException
 */
public function start(): WebDriver
{
    // ...
    $webDriverClass = $this->options['web_driver_class'];
    return $webDriverClass::create(...);
    // (or could use call_user_func_array() if preferred style)
}
```

Users could then create a custom class that extends RemoteWebDriver and adds their own customisations. RemoteWebDriver looks extensible: it has no private methods, so it would be easy to extend and then (in this case) override CustomRemoteWebDriver::quit() so that it does not try to delete the session.
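
As a rough illustration of the idea, here is a self-contained sketch. `SketchRemoteWebDriver` is a simplified stand-in for RemoteWebDriver, not the real class, and `startDriver()` is a hypothetical helper standing in for `start()`. The sketch assumes the real `create()` instantiates the class it is called on (late static binding); if it does not, the subclass would also need to override `create()`.

```php
<?php
// Simplified stand-in for RemoteWebDriver; not the php-webdriver class.
class SketchRemoteWebDriver
{
    /** @var string[] Requests that would be sent to the WebDriver server. */
    public array $requests = [];

    protected function __construct()
    {
    }

    public static function create(): self
    {
        // Late static binding: instantiates whichever class create() was called on.
        return new static();
    }

    public function quit(): void
    {
        $this->requests[] = 'DELETE /session/{id}';
    }
}

// User-defined subclass that skips the session deletion.
class CustomRemoteWebDriver extends SketchRemoteWebDriver
{
    public function quit(): void
    {
        // Intentionally send nothing.
    }
}

// Hypothetical helper: picks the driver class from the options.
function startDriver(array $options): SketchRemoteWebDriver
{
    $webDriverClass = $options['web_driver_class'] ?? SketchRemoteWebDriver::class;

    // Reject classes that do not extend the expected base class.
    if (!is_a($webDriverClass, SketchRemoteWebDriver::class, true)) {
        throw new \InvalidArgumentException('web_driver_class must extend '.SketchRemoteWebDriver::class);
    }

    return $webDriverClass::create();
}

$default = startDriver([]);
$default->quit(); // records the DELETE request

$custom = startDriver(['web_driver_class' => CustomRemoteWebDriver::class]);
$custom->quit(); // records nothing
```

The `is_a()` check with its third argument set to `true` accepts a class name string, so a misconfigured option fails early instead of at the first method call.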

gravitiq-cm avatar Aug 20 '22 10:08 gravitiq-cm

IMHO it would be better to fix the bug for everybody without asking the user to write custom code.

dunglas avatar Aug 20 '22 11:08 dunglas