Headless mode

Open furstenheim opened this issue 8 years ago • 14 comments

The PR is quite large, so it would totally make sense not to merge it.

That being said, the aim of this PR is to make the repo usable in headless mode. That means it is possible to load it as an npm module and scrape programmatically. Right now it uses jsdom to get the window, document and jquery. The next step would be to add another browser that runs in Chrome Headless.

The PR incorporates several things:

  • Using standard.js (mostly habit but it was also good to find global undefined variables)
  • Using a bundler so it was easier to port to node. In particular the selectors are stored in Selectors instead of window
  • Use karma and gulp, which made it easier to run the tests on changes.
  • Window, jquery and document are not accessed globally but instead passed on object creation. That way we can pass a fake window and jquery to the content scraper when needed.
  • Added a jsdom browser, this is similar to chrome popup browser but running in node
  • Added a web browser, this runs jsdom in the devtools in a webworker thus avoiding having a popup to scrape.
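
The point about passing window, document and jquery on object creation can be illustrated with a minimal sketch. Everything here is hypothetical (the TitleScraper class, its methods, and the hand-written fakes are not taken from the fork); it only shows the injection pattern under the assumption that a scraper class reads the page exclusively through the objects it was given.

```javascript
// Hypothetical sketch: names below are illustrative, not the fork's actual API.
class TitleScraper {
  // The browser objects are injected instead of being read from global scope.
  constructor ({ window, document, $ }) {
    this.window = window
    this.document = document
    this.$ = $
  }

  // Uses only the injected objects, so it works with real or fake ones.
  getTitle () {
    return this.$('title', this.document).text()
  }

  getUrl () {
    return this.window.location.href
  }
}

// Tiny hand-written fakes standing in for jsdom and jquery (assumption for the demo).
const fakeDocument = { title: 'Example page' }
const fakeWindow = { document: fakeDocument, location: { href: 'http://example.com/' } }
const fake$ = (selector, context) => ({
  text: () => (selector === 'title' ? context.title : '')
})

const scraper = new TitleScraper({ window: fakeWindow, document: fakeDocument, $: fake$ })
console.log(scraper.getTitle())
console.log(scraper.getUrl())
```

Because the class never touches a global window, the same code could in principle receive a jsdom window in node or the real one in the extension.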

furstenheim avatar May 19 '17 14:05 furstenheim

Nice work! Long term it might make sense to merge your changes, but that would also mean that none of the other open pull requests (9) could be merged. I would like to support headless mode, but it is just too much work for me to merge it into my fork.

jwillmer avatar May 22 '17 12:05 jwillmer

How about implementing the open pull requests into your version?

jwillmer avatar May 23 '17 08:05 jwillmer

Yes, I think that would be feasible. The most delicate part will be the tests, because I'd rather have no jquery in them so that they are more agnostic.

furstenheim avatar May 23 '17 08:05 furstenheim

How about this? https://stackoverflow.com/questions/22810786/add-fixtures-to-jasmine-without-using-jquery-jasmine-is-it-possible

jwillmer avatar May 23 '17 12:05 jwillmer

Actually it is not using jasmine any longer. The tests used an old version of jasmine which wasn't very nice for running async tests (all that runs, waitsFor... was a bit hacky), so I took the time to move to mocha, which is easier for asynchronous tests.

I also changed the test runner. It was a bit odd, having the specRunner.html to load all the dependencies. It was alright for the extension but it wasn't perfect to integrate with node and run automatically on changes so I moved it to karma.

furstenheim avatar May 23 '17 13:05 furstenheim

Right now the tests work as follows: first JSDOMSpec or browserSpec runs to load window, document and jquery, and stores these variables in globals. Then each test loads those variables and passes them to the classes that require them.
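
A rough sketch of that two-phase flow, without any test framework, could look like the following. All names (globals, environmentSpec, BodyReader, bodyReaderTest) are made up for illustration, and the fake document stands in for what jsdom or the browser would provide.

```javascript
// Hypothetical sketch of the two-phase test setup; all names are illustrative.
const globals = {}

// Phase 1: an environment spec (the JSDOMSpec or browserSpec role) runs first
// and stores window, document and jquery in a shared object.
function environmentSpec () {
  const document = { body: { textContent: 'hello' } }
  globals.window = { document }
  globals.document = document
  globals.$ = (selector) => ({
    text: () => (selector === 'body' ? document.body.textContent : '')
  })
}

// A class under test that receives its dependencies on creation.
class BodyReader {
  constructor ({ window, document, $ }) {
    Object.assign(this, { window, document, $ })
  }

  read () {
    return this.$('body').text()
  }
}

// Phase 2: each test loads the stored variables and passes them on.
function bodyReaderTest () {
  const { window, document, $ } = globals
  const reader = new BodyReader({ window, document, $ })
  return reader.read() === 'hello'
}

environmentSpec()
const passed = bodyReaderTest()
console.log(passed ? 'test passed' : 'test failed')
```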

furstenheim avatar May 23 '17 13:05 furstenheim

I automated the client-side (extension) scraping with a GraphQL link to a server, but headless is a much better solution. Is this fork working? If so, is there any documentation about how to get it working? A Chrome headless implementation is indeed a good next step.

grinono avatar Sep 12 '17 21:09 grinono

@grinono yes, it is available in npm and it is easy to use

furstenheim avatar Sep 13 '17 13:09 furstenheim

I started testing with it, but I got lots of errors.

I just changed it to the right package name, from const webscraper = require('webscraper-headless') to import webscraper from 'web-scraper-headless'

But then, when running the example in Node.js, it returns errors. When I fix these I encounter more and more errors. They look like Babel ES6 compiler errors. How do you run this package server side? I'm using Meteor > Node.js.

grinono avatar Sep 15 '17 11:09 grinono

@grinono what kind of errors are you getting? Have you tried requiring the package instead of importing?

furstenheim avatar Sep 15 '17 13:09 furstenheim

I started testing with it, but I got lots of errors.

Can you move this discussion to an issue? So this thread stays on track.

jwillmer avatar Sep 15 '17 15:09 jwillmer

The first errors I get are about the default values declared in function signatures, as shown here: function scrapeJSDOM (sitemapInfo, options = {}). The options = {} is somehow not allowed, but this should be a totally fine declaration in ES6.
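
For reference, default parameter values are part of ES2015, so a syntax error on them usually points at a runtime or build step without ES2015 support; that is only one possible explanation for the errors described here. The sketch below contrasts the ES2015 form with an ES5-compatible equivalent. The function names and bodies are placeholders, not the package's real implementation.

```javascript
// ES2015 form, mirroring the signature quoted above. The body is a placeholder.
function scrapeWithDefault (sitemapInfo, options = {}) {
  return { sitemapInfo, options }
}

// ES5-compatible equivalent: assign the default by hand.
// Like the ES2015 form, the default applies only when the argument is undefined.
function scrapeWithFallback (sitemapInfo, options) {
  if (options === undefined) options = {}
  return { sitemapInfo, options }
}

console.log(scrapeWithDefault('sitemap').options)
console.log(scrapeWithFallback('sitemap').options)
console.log(scrapeWithFallback('sitemap', { delay: 10 }).options)
```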

I can start a new issue, but this code is not officially supported.

grinono avatar Sep 18 '17 10:09 grinono

You can start an issue in the fork

furstenheim avatar Sep 18 '17 10:09 furstenheim

I checked that before, but it's not possible to start an issue in the fork... (Image of fork)

grinono avatar Sep 19 '17 12:09 grinono