typesense-docsearch-scraper
Use Port in `start_urls`
Description
We currently have a site set up in the scraper config that is hosted on a non-standard HTTP/HTTPS port (3000). When `start_urls` is set to a hostname with a port, e.g. `http://my-host:3000/`, the scraper fails with an error message suggesting that it does not accept domains with ports. It looks like the old Algolia scraper configs used to support ports, so I assume this is related to an update to the Scrapy package used in this forked solution.
Steps to reproduce
- Build and run a Docusaurus site locally, serving on `http://localhost:3000`
- Update the DocSearch config to set `start_urls`: `"start_urls": ["http://localhost:3000/"]`
- Run the DocSearch scraper
Expected Behavior
- Site is scraped and uploaded to Typesense server
Actual Behavior
Error returned from scraper:
PortWarning: allowed_domains accepts only domains without ports. Ignoring entry localhost:3000 in allowed_domains.
warnings.warn(message, PortWarning)
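The warning says that the `localhost:3000` entry is dropped from `allowed_domains`. One possible workaround, sketched below, is to pass only the port-free hostname. This assumes the scraper derives `allowed_domains` from the hosts in `start_urls`, which is not confirmed here. The helper name `allowed_domains_from` is hypothetical and is not part of the scraper or of Scrapy.

```python
from urllib.parse import urlsplit


def allowed_domains_from(start_urls):
    """Return port-free hostnames for a list of start URLs (hypothetical helper)."""
    domains = []
    for url in start_urls:
        # .hostname drops the port and lowercases the host;
        # .netloc would keep the ":3000" suffix that triggers the warning.
        host = urlsplit(url).hostname
        if host and host not in domains:
            domains.append(host)
    return domains


print(allowed_domains_from(["http://localhost:3000/"]))  # prints ['localhost']
```

With this approach the start URL keeps its port, and only the domain filter entry is stripped.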
Metadata
Typesense Version:
Docker images:
- typesense/typesense:0.24.1
- typesense/docsearch-scraper:0.6.0
OS: Linux
`typesense-docsearch-scraper` has all the commits from `algolia-docsearch-scraper` up to Dec 22, 2020. I don't see any updates in the Algolia scraper since then where this port limitation was addressed...
Also, I still see that error message about ports not being allowed in `allowed_domains` in the master branch of Scrapy, so this limitation still exists as of today.
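To check a config for entries that would hit this limitation before running the scraper, something like the following could be used. It is only an approximation based on the wording of the warning: it assumes an entry counts as "having a port" when it ends in a colon followed by digits. The pattern and the function name `split_port_entries` are illustrative and are not taken from Scrapy's source.

```python
import re

# Assumed pattern: an entry ending in ":<digits>" is treated as carrying a port.
PORT_PATTERN = re.compile(r":\d+$")


def split_port_entries(allowed_domains):
    """Separate entries that would likely be ignored from those that would be kept."""
    kept, ignored = [], []
    for domain in allowed_domains:
        if PORT_PATTERN.search(domain):
            ignored.append(domain)
        else:
            kept.append(domain)
    return kept, ignored


kept, ignored = split_port_entries(["localhost:3000", "example.com"])
print(kept, ignored)
```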
So I'm surprised to see a config in the docsearch scraper configs repo with a port number!
Any update on that? I'm facing the same issue, and I don't understand whether I'm able to test Typesense locally.