docsearch-scraper
DocSearch - Scraper
A user expected the crawler to respect the `<meta name="robots" content="noindex">` meta tag that should tell crawlers to skip a page. We don't honor this tag at all (nor do we honor...
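To make the request concrete, here is a minimal sketch of how a page could be checked before indexing. It assumes the tag in question is the standard robots meta tag carrying a `noindex` directive; `RobotsMetaParser` and `is_noindex` are hypothetical names for illustration, not part of the scraper.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def is_noindex(html):
    """Hypothetical helper: True when the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

A crawler could call `is_noindex(response_body)` and skip record extraction when it returns `True`; where exactly such a check would live in the scraper is left open here.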
I have personally experienced `Ctrl-C` resulting in an incomplete index. The Scrapy documentation for the `spider_closed` signal, https://docs.scrapy.org/en/latest/topics/signals.html#scrapy.signals.spider_closed , mentions that the reason for closing should be `finished` under normal...
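A rough sketch of the idea: a handler that receives the closing reason and only treats the crawl as complete when that reason is `finished`. The function names and the "publish"/"discard" return values are hypothetical; the only Scrapy facts relied on are that `spider_closed` handlers receive `spider` and `reason`, and that an engine shutdown such as `Ctrl-C` is reported as `shutdown`.

```python
def should_publish_index(reason):
    """Hypothetical policy: only a crawl that closed with 'finished' is complete."""
    return reason == "finished"


def on_spider_closed(spider, reason):
    """Handler taking the (spider, reason) arguments that spider_closed provides.

    In a Scrapy project it would typically be registered with
    crawler.signals.connect(on_spider_closed, signal=signals.spider_closed).
    """
    if should_publish_index(reason):
        return "publish"
    # Any other reason (for example "shutdown" after Ctrl-C) means the crawl
    # stopped early, so the partially built index should not replace the old one.
    return "discard"
```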
See #461.
Relates to #459. With this PR, I'm trying to start improving the Docker image structure. The image uses a lot of layers for no obvious reason. Let's try...
Pinning `google-chrome-stable` is not the easiest as versions are removed from time to time as the newer versions usually become the stable ones. I've seen efforts in bumping the Chrome...
Allows the Chrome WebDriver to authenticate using an auth cookie pulled from the `.env` file. This would allow scraping of password-protected documentation.
Situation: when a configuration includes `custom_settings.attributesForFaceting`, the index's `attributesForFaceting` setting no longer includes `tags`. This overrides the `default_settings` defined by the strategy. `tags` defined from `start_urls` are not...
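One possible fix, sketched under the assumption that the desired behavior is to union the two facet lists rather than let the custom list replace the default one. `merge_index_settings` is a hypothetical helper, and the facet names in the usage example are placeholders, not the strategy's actual defaults.

```python
def merge_index_settings(default_settings, custom_settings):
    """Overlay custom settings on the defaults, unioning the facet lists."""
    # Plain keys: the custom value wins, as it does today.
    merged = {**default_settings, **custom_settings}

    # attributesForFaceting: keep every default facet (including "tags")
    # and append custom facets that are not already present.
    facets = list(default_settings.get("attributesForFaceting", []))
    for attribute in custom_settings.get("attributesForFaceting", []):
        if attribute not in facets:
            facets.append(attribute)
    if facets:
        merged["attributesForFaceting"] = facets
    return merged


# Placeholder values for illustration only.
defaults = {"attributesForFaceting": ["language", "tags"]}
custom = {"attributesForFaceting": ["product"]}
settings = merge_index_settings(defaults, custom)
```

With these placeholder inputs, `settings["attributesForFaceting"]` still contains `tags` alongside the custom `product` facet.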
The docker image that's being published for this repository is severely fragmented. As a result, it takes a much longer time to download than it should and consumes a lot...
It would be awesome to throw a page up that just displays a list of the last index times for all the configs. I know that it would help me...
Currently, the scraper assumes it's writing progress messages to an ANSI-compatible terminal. As a result, the progress messages look like this in a CI environment: ``` [94m> DocSearch: [0mhttps://docs.couchbase.com/server/6.0/introduction/intro.html ([93m51...
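A common way to handle this, shown as a sketch rather than the scraper's actual code: keep the color codes only when the output stream is a terminal, and strip them otherwise. `supports_color` and `format_progress` are hypothetical names, and the regular expression only covers the simple color sequences of the kind shown in the log excerpt.

```python
import io
import re
import sys

# Matches SGR color sequences such as ESC[94m and ESC[0m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def supports_color(stream):
    """True when the stream reports that it is attached to a terminal."""
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


def format_progress(message, stream=None):
    """Return the message unchanged for terminals, without colors otherwise."""
    stream = sys.stdout if stream is None else stream
    if supports_color(stream):
        return message
    return ANSI_ESCAPE.sub("", message)


# A StringIO buffer stands in for a non-terminal CI log.
ci_log = io.StringIO()
plain = format_progress("\x1b[94m> DocSearch: \x1b[0mdone", ci_log)
```

Here `plain` is the message without any escape sequences, which is what a CI log viewer would display cleanly.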