
`knownUrls` processing logic is incorrectly using underscore

Nysosis opened this issue · 3 comments

crawler.js#L178 and crawler.js#L197 use Underscore to determine whether the provided URL is a known URL.

This is incorrect: Underscore's `contains` method doesn't handle plain objects, so the check always returns false. It should use the same form as crawler.js#L203.

Changing it breaks a unit test, though, and I'm not sure how the change will affect the expected flow.

Nysosis avatar Jan 02 '18 15:01 Nysosis

Thanks for reviewing the code. It looks like, at least in the second case, the check always returns false.

  1. In crawler.js#L178, `currentUrlsToCrawl` is an array, and Underscore's `contains` method should work as expected in this case (http://underscorejs.org/#contains).

  2. For the second case, crawler.js#L197:

`_.contains(self.knownUrls, url)`

should most likely be changed to

`_.contains(_.keys(self.knownUrls), url)`

as `self.knownUrls` is an object, and for an object `_.contains` will always return false here.
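To illustrate the failure mode in plain JavaScript (without Underscore itself): for a plain object, `_.contains` effectively searches the object's *values*, while `knownUrls` uses URLs as *keys*. A minimal sketch, assuming `knownUrls` maps each URL to `true` as described in the thread:

```javascript
// Assumption from the discussion: knownUrls maps url -> true.
const knownUrls = { 'http://example.com/': true };
const url = 'http://example.com/';

// Plain-JS equivalent of _.contains(knownUrls, url) for an object:
// the values are booleans, so a URL string is never found.
const brokenCheck = Object.values(knownUrls).includes(url); // false

// Plain-JS equivalent of the proposed fix,
// _.contains(_.keys(knownUrls), url): search the keys instead.
const fixedCheck = Object.keys(knownUrls).includes(url); // true
```

This is why the proposed change to `_.contains(_.keys(self.knownUrls), url)` makes the membership test work.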

Will check this more and fix this/update the tests.

amoilanen avatar Jan 10 '18 22:01 amoilanen

For the first case, there are two ||'d checks on that line: the first, `currentUrlsToCrawl`, is, as you say, an array and will work correctly with `_.contains`; the second is `knownUrls`, so it will always return false.

I can't see any reason to keep `knownUrls` as an object. It looks like everything would work correctly if you switched it over to an array and fixed up the places where it's treated like an object (#L203, for example) to use `_.contains` like `currentUrlsToCrawl`.
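A hypothetical sketch of the array-based alternative discussed above; the function name is illustrative, not the actual crawler.js API:

```javascript
// Sketch only: knownUrls kept as an array instead of an object.
const knownUrls = [];

function rememberUrl(url) {
  // Array#includes replaces the object-key lookup; with knownUrls as
  // an array, _.contains(knownUrls, url) would now also work as intended.
  if (!knownUrls.includes(url)) {
    knownUrls.push(url);
  }
}

rememberUrl('http://example.com/');
rememberUrl('http://example.com/'); // duplicate is ignored
```

One trade-off worth noting: the object-as-map form gives constant-time lookup, whereas `Array#includes` is a linear scan, which could matter for very large crawls.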

As an aside, I wasn't sure whether this repo was still actively developed, and I wanted to make some other changes to it. I've forked the repo and done a load of work on it, as I wanted to integrate it with headless Chrome (as mentioned in #49). I pulled `request` out of being an explicit dependency and created two separate 'pluginable' engines, one for `request` and one for Chromium (via Puppeteer); none of this is up on GitHub yet, just on my machine. I'm happy to look at merging/passing it over. However, I have also converted the whole thing to TypeScript (which is what highlighted this issue to me in the first place), and I don't know your feelings on whether you'd be happy to make that big a change. Pulling out `request` also led to some slight changes, so there are breaking changes to the API.

Nysosis avatar Jan 10 '18 23:01 Nysosis

Yes, you are right; I will check both cases. Most likely we should switch `knownUrls` to be an array and make sure the unit tests pass.

Thank you for pointing out the problematic places and finding these issues.

> I have also converted the whole thing to typescript (which is what highlighted this issue to me in the first place), and I don't know your feelings on whether you'd be happy to make that big a change.

I'm not sure about switching to TypeScript and adding support for headless Chromium with breaking API changes; these are quite large changes, but they could probably be done at least in a separate experimental branch.

I have thought about developing the crawler further, supporting other request engines, and refactoring the implementation. But then it might be a good idea to avoid huge, potentially destabilizing changes on the 'master' branch and to make sure the implementation is well tested, with both unit and end-to-end tests. Once the new implementation is stable and tested, it can be released as a new major version.

From my side, I'm not sure how much time I can dedicate to developing the crawler; this whole process, including releasing a new version, can take at least a few months. I will try to find some time to do this.

If you need something faster and want to add more experimental features or make more changes, then forking the repository is probably a good idea.

amoilanen avatar Jan 14 '18 21:01 amoilanen