fix: `EnqueueStrategy.All` erroring with links using unsupported protocols

Open stefansundin opened this issue 1 year ago • 2 comments

This changes `EnqueueStrategy.All` to filter out URLs that are neither http nor https (`mailto:` links were causing the crawler to error).
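
For illustration, here is a minimal sketch of the kind of protocol filter described above. It is not the actual patch: `isHttpUrl` and `filterHttpUrls` are hypothetical helper names, and the sketch only assumes the standard WHATWG `URL` class available in Node.js and browsers.

```typescript
// Hypothetical helpers for illustration only; not part of crawlee's API.
const SUPPORTED_PROTOCOLS = new Set(['http:', 'https:']);

// Returns true only for absolute URLs whose protocol is http or https.
function isHttpUrl(candidate: string): boolean {
    try {
        return SUPPORTED_PROTOCOLS.has(new URL(candidate).protocol);
    } catch {
        // `new URL` throws on strings that are not absolute URLs.
        return false;
    }
}

// Drops mailto:, tel:, javascript: and any other non-http(s) links
// before they would be handed to the request queue.
function filterHttpUrls(urls: string[]): string[] {
    return urls.filter(isHttpUrl);
}

const links = [
    'https://example.com/a',
    'mailto:someone@example.com',
    'http://example.com/b',
    'tel:+123456789',
];
console.log(filterHttpUrls(links));
```

Filtering on the parsed `protocol` rather than a string prefix avoids false matches on case variations or leading whitespace, since the `URL` parser normalizes both.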

Let me know if there's a better fix or if you want me to change something.

Thanks!

Request failed and reached maximum retries. Error: Received one or more errors
    at _ArrayValidator.handle (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/ArrayValidator.ts:102:17)
    at _ArrayValidator.parse (/path/to/project/node_modules/@sapphire/shapeshift/src/validators/BaseValidator.ts:103:2)
    at RequestQueueClient.batchAddRequests (/path/to/project/node_modules/@crawlee/src/resource-clients/request-queue.ts:340:36)
    at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:238:46)
    at RequestQueue.addRequests (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:304:22)
    at attemptToAddToQueueAndAddAnyUnprocessed (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:302:42)
    at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_provider.ts:319:37)
    at RequestQueue.addRequestsBatched (/path/to/project/node_modules/@crawlee/src/storages/request_queue.ts:309:22)
    at enqueueLinks (/path/to/project/node_modules/@crawlee/src/enqueue_links/enqueue_links.ts:384:2)
    at browserCrawlerEnqueueLinks (/path/to/project/node_modules/@crawlee/src/internals/browser-crawler.ts:777:21)

stefansundin avatar Mar 22 '24 00:03 stefansundin

@stefansundin do you plan to finish this? I'd rather not merge such a change without any added tests.

B4nan avatar Mar 27 '24 12:03 B4nan

Hi @B4nan. I started writing a test but I had some more important work come up that took priority.

I may be able to finish it next week.

If you prefer, we can close this PR and open an issue instead.

stefansundin avatar Mar 27 '24 16:03 stefansundin