skrape.it Add a fetcher that uses a real Chrome browser to download the html


Open johanoskarsson opened this issue 11 months ago • 1 comment

Adds a new Fetcher that uses a real Chrome browser to fetch the HTML. This solved a problem where none of the existing fetchers could fetch a page that was partially generated by JavaScript. (I assume the page requires a modern, real browser for some reason that I did not investigate further.)

This change uses the cdt-java-client library (https://github.com/kklisura/chrome-devtools-java-client) to launch and communicate with a Chrome browser. However, due to a breaking change in Chrome that has not been fixed in this library, I am using a fork with that one patch applied: io.fluidsonic.mirror:cdt-java-client:4.0.0-fluidsonic-1. Hopefully the change gets merged back into the main library.

WIP warning: I figured I would publish this PR in its current state in case it helps anyone else. However, it does not yet fulfill all the expectations of a fetcher: it returns only the body, not the correct HTTP status and so on. There is a Network class that can probably be used to extract those.
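
A rough sketch of how the missing status could be paired with the body, assuming the browser reports one event per received response that carries a resource type and a status code. Every name here (`ResponseEvent`, `FetchedPage`, `StatusCollector`) is hypothetical and made up for illustration; none of them come from skrape.it or cdt-java-client, and the wiring to the real Network class is left out. Picking the first "Document" response is a simplification that would likely need more care for pages with iframes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a "response received" event from the browser.
final class ResponseEvent {
    final String url;
    final String resourceType; // assumed values such as "Document", "Script", "Image"
    final int status;

    ResponseEvent(String url, String resourceType, int status) {
        this.url = url;
        this.resourceType = resourceType;
        this.status = status;
    }
}

// Hypothetical result type: what a complete fetcher would hand back.
final class FetchedPage {
    final int status; // -1 means the status could not be determined
    final String body;

    FetchedPage(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

// Collects response events while the page loads, then pairs a status with the body.
final class StatusCollector {
    private final List<ResponseEvent> events = new ArrayList<>();

    // Would be called from the browser's response callback (wiring not shown).
    void onResponse(ResponseEvent event) {
        events.add(event);
    }

    // Assumption: the first "Document" response belongs to the requested page.
    Optional<Integer> mainDocumentStatus() {
        for (ResponseEvent event : events) {
            if ("Document".equals(event.resourceType)) {
                return Optional.of(event.status);
            }
        }
        return Optional.empty();
    }

    FetchedPage combine(String body) {
        return new FetchedPage(mainDocumentStatus().orElse(-1), body);
    }
}

final class StatusCollectorDemo {
    public static void main(String[] args) {
        StatusCollector collector = new StatusCollector();
        collector.onResponse(new ResponseEvent("https://example.com/", "Document", 404));
        collector.onResponse(new ResponseEvent("https://example.com/app.js", "Script", 200));
        FetchedPage page = collector.combine("<html></html>");
        System.out.println(page.status + " " + page.body.length());
    }
}
```

Keeping the collector separate from the browser code means the status lookup can be tested without launching Chrome; only the callback registration would depend on the client library.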

Mar 05 '24 17:03 johanoskarsson

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.56%. Comparing base (382f21b) to head (475065d).

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #237   +/-   ##
=======================================
  Coverage   89.56%   89.56%           
=======================================
  Files          38       38           
  Lines         986      986           
  Branches       69       69           
=======================================
  Hits          883      883           
  Misses         81       81           
  Partials       22       22


Mar 05 '24 17:03 codecov[bot]
