typesense-docsearch-scraper

Use Port in `start_urls`

Open JasonWhall opened this issue 1 year ago • 2 comments

Description

We have a site configured in the scraper config that is hosted on a non-standard HTTP/HTTPS port (3000). When start_urls contains a hostname with a port, e.g. http://my-host:3000/, the scraper fails with an error message suggesting it does not accept domains with ports. The old Algolia scraper configs appear to have supported ports, so I assume this is related to an update to the Scrapy package used in this fork.

Steps to reproduce

  • Build and run a Docusaurus site locally, serving on http://localhost:3000
  • Update the DocSearch config so that start_urls points at it: "start_urls": ["http://localhost:3000/"]
  • Run the DocSearch scraper

Expected Behavior

  • Site is scraped and uploaded to Typesense server

Actual Behavior

Error returned from scraper:

PortWarning: allowed_domains accepts only domains without ports. Ignoring entry localhost:3000 in allowed_domains.
  warnings.warn(message, PortWarning)
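
To illustrate the distinction the warning draws, here is a minimal Python sketch of how a host can be separated from its port using only the standard library. It assumes, without confirmation from this thread, that the allowed_domains entry is derived from the host part of each start_urls entry; the helper name allowed_domain_for is hypothetical and not part of the scraper or Scrapy.

```python
from urllib.parse import urlparse


def allowed_domain_for(start_url: str) -> str:
    """Hypothetical helper: return the host of a URL without its port."""
    parsed = urlparse(start_url)
    # parsed.netloc keeps the port ("localhost:3000"), which is the form
    # the PortWarning complains about; parsed.hostname drops the port.
    return parsed.hostname or ""


# Example start_urls taken from the issue description.
start_urls = ["http://localhost:3000/", "http://my-host:3000/"]

# Assumption: a port-free entry per host is what allowed_domains expects.
allowed_domains = sorted({allowed_domain_for(url) for url in start_urls})
print(allowed_domains)  # ['localhost', 'my-host']
```

Whether the scraper can be changed to build allowed_domains this way is not established in this thread; the sketch only shows that the port-free form is available from the same URL.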

Metadata

Typesense Version:

Docker images:

  • typesense/typesense:0.24.1
  • typesense/docsearch-scraper:0.6.0

OS: Linux

JasonWhall avatar May 26 '23 16:05 JasonWhall

typesense-docsearch-scraper has all the commits from algolia-docsearch-scraper up to Dec 22, 2020. I don't see any updates in the algolia scraper since then where this port limitation was addressed...

I also still see that error message about ports not being allowed in allowed_domains in the master branch of Scrapy, so this limitation still exists as of today.

So I'm surprised to see a config in the docsearch scraper configs repo with a port number!

jasonbosco avatar May 26 '23 17:05 jasonbosco

Any update on this? I'm facing the same issue, and I don't understand whether I'm able to test Typesense locally.

noghartt avatar Oct 28 '23 01:10 noghartt