firecrawl
Limits include filtered out paths
The crawl limit is applied before the paths are filtered out, so links that are later excluded still count toward the limit.
base url: test.com
limit: 2
included links: ["/pages/*"]
links on test.com in order:
[
"/home",
"/imprint",
"/about",
"pages/1",
"pages/2",
"pages/3"
]
expected links to be crawled: ["pages/1","pages/2"]
links currently crawled: []
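To make the reported ordering problem concrete, here is a minimal sketch that reproduces the two behaviours on the example above. It is not Firecrawl's actual code: the function names, the use of glob matching via `fnmatch`, and the stripping of leading slashes (the example mixes "/pages/*" with "pages/1") are all assumptions made for illustration.

```python
from fnmatch import fnmatch


def normalize(path):
    # Assumption: leading slashes are ignored when comparing paths,
    # since the example mixes "/pages/*" with "pages/1".
    return path.lstrip("/")


def matches(link, include_patterns):
    # Hypothetical matcher: glob-style comparison against each include pattern.
    return any(fnmatch(normalize(link), normalize(p)) for p in include_patterns)


def limit_then_filter(links, include_patterns, limit):
    # Reported behaviour: the limit is consumed by links that are later dropped.
    limited = links[:limit]
    return [link for link in limited if matches(link, include_patterns)]


def filter_then_limit(links, include_patterns, limit):
    # Expected behaviour: only links that pass the filter count toward the limit.
    filtered = [link for link in links if matches(link, include_patterns)]
    return filtered[:limit]


links = ["/home", "/imprint", "/about", "pages/1", "pages/2", "pages/3"]
include = ["/pages/*"]

print(limit_then_filter(links, include, 2))  # []
print(filter_then_limit(links, include, 2))  # ['pages/1', 'pages/2']
```

With limit 2, the first variant only ever looks at "/home" and "/imprint", which is why nothing is returned.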
Thanks for spotting and fixing that! Will take a look at it soon :)
I'm unable to reproduce the issue.
Tested with:
base url: mendable.ai
limit: 2
included only paths: /blog/*
results: [https://www.mendable.ai/blog/august2023update, https://www.mendable.ai/blog/building-copilots]
@red545 can you send us a real example?
@rafaelsideguide
base url: https://www.newimmobilien.at/sie-suchen-eine-immobilie
limit: 5
included only paths: objektdetail/*
results: anywhere from 0 to 2 results on any given run.
But the issue might go even deeper. I am trying to scrape the following URL:
https://www.newimmobilien.at/objektdetail/13521532?from=166590
which results in "Page not Found",
even though I can sometimes crawl the page. Sadly, this is not deterministic.
I wonder if it has anything to do with the query params.
@rafaelsideguide
I tried reproducing it again, to no avail.
I checked the logs, and it's been working great since this morning. Maybe it's a concurrency issue?
Same for me. This seems to be fixed now.