docsearch-scraper issues

Should the crawler respect the <meta name="robots" content="noindex,nofollow">?

10

A user expected the crawler to respect the `<meta name="robots" content="noindex,nofollow">` meta tag, which tells crawlers to skip a page. We don't honor this tag at all (nor do we honor...

pixelastic

enhancement
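A minimal sketch of what honoring the tag could look like, using only the Python standard library. The names `RobotsMetaParser`, `robots_directives`, and `should_index` are hypothetical and not part of the scraper; the sketch assumes the page HTML is already available as a string.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives declared in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def should_index(html):
    """Hypothetical policy: skip pages marked noindex (or none)."""
    directives = robots_directives(html)
    return "noindex" not in directives and "none" not in directives
```

A crawler could call `should_index(page_html)` before pushing records for a page, and consult `"nofollow" in robots_directives(page_html)` before following its links.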

Partial index promotion due to Scrapy spider signals not being handled

I have personally experienced `Ctrl-C` resulting in an incomplete index. The Scrapy documentation for the `spider_closed` signal (https://docs.scrapy.org/en/latest/topics/signals.html#scrapy.signals.spider_closed) mentions that the closing reason should be `finished` under normal...

ProTip
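The idea can be sketched as a small guard that records the close reason and only allows promotion after a normal finish. `IndexPromotionGuard` and its methods are hypothetical names, not the scraper's code; the handler takes `spider` and `reason`, the arguments the `spider_closed` signal provides.

```python
class IndexPromotionGuard:
    """Remembers why the crawl ended and decides whether promotion is safe."""

    def __init__(self):
        self.close_reason = None

    def on_spider_closed(self, spider, reason):
        # Scrapy passes reason="finished" on normal completion; an engine
        # shutdown such as Ctrl-C is reported as "shutdown".
        self.close_reason = reason

    def may_promote(self):
        # Only promote the temporary index when the crawl completed normally.
        return self.close_reason == "finished"


# Wiring it up in a real crawl (requires Scrapy, so shown as a comment):
#   from scrapy import signals
#   crawler.signals.connect(guard.on_spider_closed,
#                           signal=signals.spider_closed)
```

With this in place, the step that moves the temporary index over the live one would be skipped whenever `may_promote()` returns `False`.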

Allow custom ports for `start_urls`

See #461.

vjpr

Update: use less layers in Docker image

Relates to #459. With this PR, I'm trying to start improving the Docker image structure. The image uses a lot of layers for no obvious reason. Let's try...

ArtFlag

refactor(Dockerfile): use latest stable chrome with matching driver

Pinning `google-chrome-stable` is not easy: versions are removed from time to time as newer versions become the stable ones. I've seen efforts in bumping the Chrome...

rubda

Added auth cookie support in .env

Allows chrome webdriver to authenticate using an auth cookie pulled from the .env file. This would allow for scraping of password-protected documentation.

canyonrobins

[core] Setting attributesForFaceting from the config overrides Algolia settings

2

# Situation

When a configuration includes `custom_settings.attributesForFaceting`, the index's `attributesForFaceting` setting no longer includes `tags`. This overrides the `default_settings` defined by the strategy. `tags` defined from `start_urls` are not...

s-pace

bug

help wanted

need_fix_test
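One possible fix is to merge the two lists instead of letting the custom value replace the default. The sketch below is an assumption about how that could work, not the scraper's implementation: `merge_settings` is a hypothetical helper, and the example defaults are illustrative.

```python
FACETING_KEY = "attributesForFaceting"


def merge_settings(default_settings, custom_settings):
    """Merge settings, unioning attributesForFaceting instead of replacing it."""
    merged = dict(default_settings)
    merged.update(custom_settings)
    if FACETING_KEY in default_settings and FACETING_KEY in custom_settings:
        # Keep the defaults first, then append custom attributes not yet present.
        combined = list(default_settings[FACETING_KEY])
        for attribute in custom_settings[FACETING_KEY]:
            if attribute not in combined:
                combined.append(attribute)
        merged[FACETING_KEY] = combined
    return merged


# Illustrative values only; the real defaults come from the strategy.
defaults = {FACETING_KEY: ["tags"], "hitsPerPage": 20}
custom = {FACETING_KEY: ["version"]}
settings = merge_settings(defaults, custom)
```

Here `settings["attributesForFaceting"]` contains both `tags` and `version`, while every other custom key still overrides its default.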

Optimize docker image

7

The docker image that's being published for this repository is severely fragmented. As a result, it takes a much longer time to download than it should and consumes a lot...

mojavelinux

bug

enhancement

help wanted

Track Last Index Time

It would be awesome to throw a page up that just displays a list of the last index times for all the configs. I know that it would help me...

GoPro16

enhancement

Output CI-friendly progress messages

4

Currently, the scraper assumes it's writing progress messages to an ANSI-compatible terminal. As a result, the progress messages look like this in a CI environment:

```
[94m> DocSearch: [0mhttps://docs.couchbase.com/server/6.0/introduction/intro.html ([93m51...
```

mojavelinux

bug

enhancement

help wanted
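A common approach is to emit color codes only when the output stream is a terminal. The following is a minimal sketch under that assumption; `format_progress` is a hypothetical helper, and the regular expression only covers the color (SGR) escape sequences.

```python
import re
import sys

# Matches ANSI SGR (color/style) sequences such as "\x1b[94m" and "\x1b[0m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def format_progress(message, stream=None):
    """Return the message unchanged for terminals, without colors otherwise."""
    stream = sys.stdout if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    if callable(isatty) and isatty():
        return message
    # Not a terminal (for example a CI log file): drop the escape sequences.
    return ANSI_SGR.sub("", message)
```

Progress lines would then be printed with `print(format_progress(line))`, so CI logs receive plain text while interactive terminals keep their colors.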
