firecrawl limits include filtered out paths

limits include filtered out paths

Open red545 opened this issue 10 months ago • 4 comments

The crawl limit is applied before the paths are filtered out, so links that the filter later removes still count toward the limit.

base url: test.com
limit: 2
included links: ["/pages/*"]

links on test.com in order:

[
"/home",
"/imprint",
"/about",
"pages/1",
"pages/2",
"pages/3"
]

expected links to be crawled: ["pages/1","pages/2"]

current links that are crawled: []
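
The difference between the two orderings can be shown with a minimal sketch. This is not Firecrawl's actual code: the function names `limit_then_filter`, `filter_then_limit`, and `normalize` are hypothetical, glob matching is approximated with Python's `fnmatch`, and a leading slash is added before matching on the assumption that "pages/1" and "/pages/*" are meant to refer to the same path.

```python
from fnmatch import fnmatchcase


def normalize(path: str) -> str:
    # Assumption: "pages/1" and "/pages/1" denote the same path.
    return "/" + path.lstrip("/")


def matches(path: str, includes: list[str]) -> bool:
    # True if the path matches at least one include pattern.
    return any(fnmatchcase(normalize(path), normalize(p)) for p in includes)


def limit_then_filter(links: list[str], includes: list[str], limit: int) -> list[str]:
    # Reported behaviour: cut the list first, then drop non-matching links.
    return [link for link in links[:limit] if matches(link, includes)]


def filter_then_limit(links: list[str], includes: list[str], limit: int) -> list[str]:
    # Expected behaviour: drop non-matching links first, then cut the list.
    return [link for link in links if matches(link, includes)][:limit]


links = ["/home", "/imprint", "/about", "pages/1", "pages/2", "pages/3"]
includes = ["/pages/*"]

print(limit_then_filter(links, includes, 2))  # []
print(filter_then_limit(links, includes, 2))  # ['pages/1', 'pages/2']
```

With the limit applied first, only "/home" and "/imprint" survive the cut and neither matches the include pattern, which gives the empty result reported above.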

Apr 21 '24 19:04 red545

Thanks for spotting and fixing that! Will take a look at it soon :)

Apr 21 '24 21:04 nickscamara

I'm unable to reproduce the issue.

Tested with:
base url: mendable.ai
limit: 2
included only paths: /blog/*
results: [https://www.mendable.ai/blog/august2023update, https://www.mendable.ai/blog/building-copilots]

@red545 can you send us a real example?

Apr 22 '24 14:04 rafaelsideguide

@rafaelsideguide

base url: https://www.newimmobilien.at/sie-suchen-eine-immobilie
limit: 5
included only paths: objektdetail/*

results: anywhere from 0 to 2 results on any given run.

But the issue might go even deeper. I am trying to scrape the following URL: https://www.newimmobilien.at/objektdetail/13521532?from=166590, which sometimes results in "Page not Found" and sometimes crawls successfully. Sadly, this is not deterministic.

Apr 22 '24 14:04 red545

I wonder if it has anything to do with the query params.

@rafaelsideguide

Apr 22 '24 15:04 nickscamara

I tried reproducing it again, to no avail.

I checked the logs, and it's been working great since this morning. Maybe it's a concurrency issue?

Apr 24 '24 00:04 calebpeffer

Same for me. This seems to be fixed now.

Apr 25 '24 07:04 red545
