[BUG] Relative Links do not get scraped

Open ape-nq opened this issue 1 year ago • 2 comments

Describe the bug

When a web scraper is used as a document loader, many relative links on the scraped page are not found.

To Reproduce

  1. Open a Flowise chatflow.
  2. Add Puppeteer or another web scraper from the document loaders and try to scrape a website that uses relative links.
  3. Set the base URL to https://docs.readthedocs.io/en/stable/about/index.html
  4. In Manage Links, click fetch URLs.
  5. The page reached through a relative link, https://docs.readthedocs.io/en/stable/tutorial/index.html, is missing from the results.

Expected behavior

All relative links should be found.
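To illustrate the expected behavior, here is a minimal sketch of resolving relative `href` values against the page URL with the standard WHATWG `URL` constructor. It is not the code from Flowise or from the fix; `resolveLinks` is a hypothetical helper, and the sample `href` values are assumptions about what such a page might contain.

```typescript
// Hypothetical helper for illustration only; not Flowise's implementation.
// Resolves raw href values against the URL of the page they were found on.
function resolveLinks(pageUrl: string, hrefs: string[]): string[] {
    const resolved = new Set<string>()
    for (const href of hrefs) {
        try {
            // The two-argument URL constructor resolves relative references
            // ("../x", "/x", "x.html") against the base, and leaves absolute URLs as-is.
            const url = new URL(href, pageUrl)
            // Skip non-web schemes such as mailto: or javascript:
            if (url.protocol !== 'http:' && url.protocol !== 'https:') continue
            // Drop the fragment so in-page anchors do not create duplicates.
            url.hash = ''
            resolved.add(url.href)
        } catch (e) {
            // new URL throws on input it cannot parse; skip invalid links.
        }
    }
    return Array.from(resolved)
}

// Assumed sample hrefs, resolved against the base URL from the report.
const pageUrl = 'https://docs.readthedocs.io/en/stable/about/index.html'
const found = resolveLinks(pageUrl, ['../tutorial/index.html', '/en/stable/index.html', '#section', 'mailto:someone@example.com'])
console.log(found)
```

With this approach, a relative reference such as `../tutorial/index.html` resolves to https://docs.readthedocs.io/en/stable/tutorial/index.html, while unparseable or non-HTTP links are skipped rather than returned as invalid entries.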

Flow

Exported flow to help reproduce the problem: Relative Link Repro Chatflow.json

Setup

  • Installation [e.g. docker, npx flowise start, yarn start]
  • Flowise Version: 1.5.0
  • OS: tested with Docker on Linux, but the behavior should be the same on any OS
  • Browser: tested with Firefox and Chrome

ape-nq avatar Feb 16 '24 11:02 ape-nq

I fixed this and tested the fix on my system.

PR: https://github.com/FlowiseAI/Flowise/pull/1740

ape-nq avatar Feb 16 '24 11:02 ape-nq

With release version 1.5.0, some invalid links are returned and https://docs.readthedocs.io/en/stable/tutorial/index.html is not found: Screenshot from 2024-02-20 13-21-20

With the patched version, https://docs.readthedocs.io/en/stable/tutorial/index.html shows up and no invalid links are returned: Screenshot from 2024-02-20 13-27-49

ape-nq avatar Feb 20 '24 12:02 ape-nq