web-scraper-chrome-extension Headless mode

The PR is quite large, so it would totally make sense not to merge it.

That being said, the aim of this PR is to make the repo usable in headless mode. That means it can be loaded as an npm module and used to scrape programmatically. Right now it uses jsdom to get the window, document and jQuery. The next step would be to add another browser that runs in Chrome Headless.

The PR incorporates several changes:

Using standard.js (mostly out of habit, but it also helped to find undefined global variables).
Using a bundler, which made it easier to port to Node. In particular, the selectors are stored in Selectors instead of window.
Using karma and gulp, which made it easier to run the tests on changes.
Window, jQuery and document are no longer accessed globally; they are passed in on object creation. That way we can pass a fake window and jQuery to the content scraper when needed.
Added a jsdom browser. This is similar to the Chrome popup browser but runs in Node.
Added a web browser. This runs jsdom in the devtools in a web worker, which avoids needing a popup to scrape.
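
To illustrate the point about passing window, document and jQuery on object creation, here is a minimal sketch. The class name ContentScraper, its methods and the fake objects are hypothetical and are not taken from the fork's code; the sketch only shows the general pattern of injecting the browser objects instead of reading them from globals.

```javascript
// Hypothetical class for illustration; not the fork's actual code.
class ContentScraper {
  // The browser objects are injected instead of being read from globals.
  constructor ({ window, document, $ }) {
    this.window = window
    this.document = document
    this.$ = $
  }

  // Uses only the injected document.
  getTitle () {
    return this.document.title
  }

  // Uses only the injected jQuery-like function.
  countMatches (selector) {
    return this.$(selector).length
  }
}

// Minimal stand-ins for a real window, document and jQuery (assumptions).
// In Node these would come from jsdom; in the extension, from the page.
const fakeDocument = { title: 'Fake page' }
const fakeWindow = { document: fakeDocument }
const fake$ = selector => (selector === 'p' ? [{}, {}] : [])

const scraper = new ContentScraper({
  window: fakeWindow,
  document: fakeDocument,
  $: fake$
})

console.log(scraper.getTitle(), scraper.countMatches('p'))
```

Because the class never touches a global, the same code can be handed a real browser window or a jsdom one without changes.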

May 19 '17 14:05 furstenheim

Nice work! Long term it might make sense to merge your changes, but that would also mean that none of the other open pull requests (9) could be merged. I would like to support headless mode, but it is just too much work for me to merge it into my fork.

May 22 '17 12:05 jwillmer

How about implementing the open pull requests into your version?

May 23 '17 08:05 jwillmer

Yes, I think that would be feasible. The most delicate part will be the tests, because I'd rather have no jQuery in them so that they are more agnostic.

May 23 '17 08:05 furstenheim

How about this? https://stackoverflow.com/questions/22810786/add-fixtures-to-jasmine-without-using-jquery-jasmine-is-it-possible

May 23 '17 12:05 jwillmer

Actually, it is not using jasmine any longer. The tests used an old version of jasmine, which wasn't very nice for running async tests (all that runFor, waitFor... was a bit hacky), so I took the time to move to mocha, which is easier for asynchronous tests.

I also changed the test runner. It was a bit odd to have specRunner.html load all the dependencies. It was alright for the extension, but it wasn't ideal for integrating with Node and running automatically on changes, so I moved it to karma.

May 23 '17 13:05 furstenheim

Right now the tests work as follows: first JSDOMSpec or browserSpec runs to load window, document and jQuery, and stores these variables in globals. Then each test loads those variables and passes them to the classes that require them.
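
The two-step flow described above could look roughly like the following sketch. It uses plain functions and Node's assert module so that it runs on its own; in the actual suite these would be mocha hooks and tests. The global name testBrowser, the TitleReader class and the fake objects are assumptions made for illustration only.

```javascript
const { strictEqual } = require('assert')

// Step 1: a setup spec (JSDOMSpec or browserSpec in the thread) runs first
// and publishes the browser objects. `testBrowser` is a hypothetical name.
function setupFakeBrowser () {
  const document = { title: 'Test page' }
  const window = { document }
  const $ = selector => ({ selector, length: 0 }) // stand-in for jQuery
  globalThis.testBrowser = { window, document, $ }
}

// Hypothetical class under test; it only uses what it is handed.
class TitleReader {
  constructor ({ document }) {
    this.document = document
  }

  read () {
    return this.document.title
  }
}

// Step 2: each test loads the globals and passes them to the class.
function testReadsTitle () {
  const { window, document, $ } = globalThis.testBrowser
  const reader = new TitleReader({ window, document, $ })
  strictEqual(reader.read(), 'Test page')
}

setupFakeBrowser()
testReadsTitle()
```

Swapping the setup function is then enough to run the same tests against a different browser implementation.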

May 23 '17 13:05 furstenheim

I automated the client-side (extension) scraping with a GraphQL link to a server, but headless is a much better solution. Is this fork working? If so, is there any documentation about how to get it working? A Chrome headless implementation is indeed a good next step.

Sep 12 '17 21:09 grinono

@grinono yes, it is available in npm and it is easy to use

Sep 13 '17 13:09 furstenheim

I started testing with this, but I got lots of errors.

I just changed to the right package name, from const webscraper = require('webscraper-headless') to import webscraper from 'web-scraper-headless'

But then, running the example in Node.js, it returns errors. When I fix these, I encounter more and more errors. They look like Babel ES6 compiler errors. How do you run this package server side? I'm using Meteor > Node.js.

Sep 15 '17 11:09 grinono

@grinono what kind of errors are you getting? Have you tried requiring the package instead of importing?

Sep 15 '17 13:09 furstenheim

"I started testing with this, but I got lots of errors."

Can you move this discussion to an issue? So this thread stays on track.

Sep 15 '17 15:09 jwillmer

The first errors I get are about the default values declared in the functions, as in this signature: function scrapeJSDOM (sitemapInfo, options = {}). The options = {} is somehow not allowed, but this should be a totally fine declaration in ES6.

I can start a new issue, but this code is not officially supported.
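
For reference, here is a minimal sketch of what that default parameter does, next to an ES5-style equivalent that a runtime or build step without ES2015 support would accept. The function bodies are placeholders and scrapeJSDOMEs5 is a hypothetical name; neither is the package's actual implementation.

```javascript
// ES2015 form, matching the quoted signature. The body is a placeholder.
function scrapeJSDOM (sitemapInfo, options = {}) {
  return { sitemapInfo, options }
}

// ES5-style equivalent (hypothetical name): the default is applied by hand.
// Like the ES2015 form, it only kicks in when the argument is undefined.
function scrapeJSDOMEs5 (sitemapInfo, options) {
  if (options === undefined) options = {}
  return { sitemapInfo: sitemapInfo, options: options }
}

console.log(scrapeJSDOM('sitemap').options)
console.log(scrapeJSDOMEs5('sitemap').options)
```

If the first form is rejected as a syntax error, the code is most likely being parsed by something that does not understand ES2015 syntax, rather than the declaration itself being wrong.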

Sep 18 '17 10:09 grinono

You can start an issue in the fork

Sep 18 '17 10:09 furstenheim

I checked that before, but it's not possible to start an issue in the fork. [Image of fork]

Sep 19 '17 12:09 grinono