scrapy-splash

Migrate scrapy to headless-chrome?

Open fbuchinger opened this issue 7 years ago • 6 comments

A few weeks ago, the Chromium project announced Headless Chromium as a new, clean way to open websites in a non-UI server context.

The announcement had quite an impact on the headless-browser scene and resulted in the resignation of the PhantomJS maintainer.

Since the current WebKit engine in Splash dates back to 2013, I wanted to ask whether there are any plans to port Splash to headless Chrome.

fbuchinger, Apr 25 '17 16:04

Will take a lot of work, I guess.

bufrr, May 27 '17 02:05

WebKit has been upgraded to a much more recent version in Splash master (~mid-2016 Safari), and will be upgraded further (to WebKit trunk) in the future, thanks to https://github.com/annulen/webkit. You can use the scrapinghub/splash:master Docker image to try the changes, or wait for the Splash 3.0 release.

Switching to Headless Chromium would be a huge change indeed. We don't have the engineering resources to make this switch in the near future. Also, it may be easier to create a separate Scrapy + Headless Chromium integration project.

Switching to Headless Chromium has both advantages and disadvantages; it seems there are more advantages. But some Splash features can't be implemented in Headless Chromium, AFAIK. For example, per-request proxy options are impossible, if I'm not mistaken. This feature is nice to have, e.g. for Crawlera integration, to avoid using Crawlera for static resources.
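For readers unfamiliar with the per-request proxy feature mentioned above, here is a minimal sketch of how it might look from a Python client. It assumes a Splash instance at `http://localhost:8050` and uses the `/execute` endpoint with the Lua calls `splash:on_request` and `request:set_proxy` as described in the Splash scripting docs. The helper name `build_execute_request`, the `proxy_host`/`proxy_port` argument names, and the extension-based static-resource filter are all hypothetical, chosen only for illustration.

```python
import json
import urllib.request

# Lua script for Splash's /execute endpoint. splash:on_request registers a
# callback that runs before each outgoing request, including sub-resources.
LUA_SOURCE = """
function main(splash)
  splash:on_request(function(request)
    local url = request.url
    -- Illustrative filter (assumption): treat a few extensions as static.
    local is_static = url:find("%.css") or url:find("%.png") or url:find("%.jpg")
    if not is_static then
      -- Only non-static requests go through the proxy.
      request:set_proxy{host = splash.args.proxy_host, port = splash.args.proxy_port}
    end
  end)
  assert(splash:go(splash.args.url))
  return splash:html()
end
"""


def build_execute_request(page_url, proxy_host, proxy_port,
                          splash_url="http://localhost:8050"):
    """Build (but do not send) a POST request for Splash's /execute endpoint."""
    payload = {
        "lua_source": LUA_SOURCE,
        "url": page_url,
        # Extra keys are exposed to the Lua script via splash.args.
        "proxy_host": proxy_host,
        "proxy_port": proxy_port,
    }
    return urllib.request.Request(
        splash_url + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it, which requires a running Splash instance; the proxy host and port would be whatever your proxy provider gives you.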

kmike, Jun 27 '17 20:06

Thanks! Will try the master container to see if I can get around my scraping issues.

fbuchinger, Jun 28 '17 10:06

Got the following error when trying out the master Docker image:

$ docker pull scrapinghub/splash:master
master: Pulling from scrapinghub/splash
75c416ea735c: Pulling fs layer
c6ff40b6d658: Pulling fs layer
a7050fc1f338: Pulling fs layer
f0ffb5cf6ba9: Waiting
be232718519c: Waiting
02e48393bcae: Waiting
a699b90bbc99: Waiting
41da8db2bf8f: Waiting
ba57071e497d: Waiting
55c87f8bb02f: Waiting
error pulling image configuration: Get https://dseasb33srnrn.cloudfront.net/registry-v2/docker/registry/v2/blobs/sha256/b3/b3f69a08d665f155a61dad4b436c4112f758036e2e5a1d4f97658707829b0d48/data?Expires=1498738729&Signature=BUj4fCBuoG2MDqovD89-hQ4UarCvnxIKG7qce0gkS6TC67GLSSR6fw2E1R7anC1iCyiaiA44tIniU0mtA1~HAVhlHjC73iQc3Zj45ZStlPdSpOutmc4YEsOum33hbxG1Hox53J0CYatrXkOsHyzLqgyKXeU45QVab-Q7Kt2lVrE_&Key-Pair-Id=APKAJECH5M7VWIS5YZ6Q: read tcp 10.0.2.15:46376->13.32.28.215:443: read: connection reset by peer

fbuchinger, Jun 29 '17 12:06

Could you try it again? It looks like a temporary issue: either a Docker Hub issue or a network issue.
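If a transient failure like this shows up in an automated setup, retrying the pull a few times is one option. The sketch below is only an illustration: the helper name `pull_with_retry`, the attempt count, and the linear backoff are assumptions, and it simply shells out to `docker pull` and checks the exit code.

```python
import subprocess
import time


def pull_with_retry(image, attempts=3, delay=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Run `docker pull <image>`, retrying when the exit code is non-zero.

    Returns the number of the attempt that succeeded. `run` and `sleep`
    are injectable so the logic can be exercised without Docker.
    """
    for attempt in range(1, attempts + 1):
        result = run(["docker", "pull", image])
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Linear backoff: wait a little longer after each failure.
            sleep(delay * attempt)
    raise RuntimeError(f"docker pull {image} failed after {attempts} attempts")
```

For example, `pull_with_retry("scrapinghub/splash:master")` would try the pull up to three times before giving up.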

kmike, Jun 29 '17 12:06

We've now successfully tested Splash 3.0 and are really impressed: the execution time of our scraping jobs (running layoutstats.js on ~120 URLs) dropped from approx. 75 minutes to just 25 minutes :-) Taking screenshots also seems to work more reliably now. Big kudos to you and the guys behind the "Chromium 2016" port!

fbuchinger, Jul 16 '17 19:07