firecrawl [Feat] Improve Handling of Sitemap Structure for URLs with Include Paths

Open calebpeffer opened this issue 6 months ago • 0 comments

A customer reached out about https://www.clinikally.com/blogs/news. They were trying to crawl it with the following parameters:

Include only paths: blogs/news/*

The results were inconsistent, sometimes giving 9 links and sometimes closer to 30. However, it worked much more consistently (and gave many more links) when I switched the base URL to https://www.clinikally.com/

I think the issue here is that this website's sitemap structure is a tree. It has a parent sitemap at the root (https://www.clinikally.com/sitemap.xml) that directs crawlers to sitemap.xml.

I can only assume the issue was that the crawler wasn't "going backward" on the URL to find the parent sitemap; instead, it was trying to crawl the page recursively, which fails because it's paginated.
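To make "going backward" concrete, here is a minimal sketch of how candidate sitemap locations could be generated by walking up the URL path. It is not Firecrawl's actual code; the function name `candidate_sitemap_urls` is hypothetical, and it assumes each level would expose its sitemap as `sitemap.xml`.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_sitemap_urls(url: str) -> list[str]:
    """Return sitemap.xml candidates from the deepest path level up to the root.

    Hypothetical helper for illustration only; assumes sitemaps are
    named sitemap.xml at every level.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Start at the full path and drop one trailing segment per step.
    for depth in range(len(segments), -1, -1):
        prefix = "/".join(segments[:depth])
        path = f"/{prefix}/sitemap.xml" if prefix else "/sitemap.xml"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for candidate in candidate_sitemap_urls("https://www.clinikally.com/blogs/news"):
    print(candidate)
```

For the URL from this report, the last candidate is the root sitemap, which is the one the crawl apparently never reached when started from /blogs/news.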

You could classify this as a user error, but from what I've seen, users will default to pasting in the URL that they want to crawl, i.e., https://www.clinikally.com/blogs/news, instead of going to the base URL and using includePaths.

Solutions?

Maybe we always "truncate the URL" to the base, then on our end set an includesOnlyPath for links like /blogs/news?
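A rough sketch of that idea, assuming the include pattern uses the same glob form as in the example above (`blogs/news/*`). The function name `split_crawl_target` is made up for illustration and is not part of Firecrawl.

```python
from urllib.parse import urlsplit, urlunsplit


def split_crawl_target(url: str) -> tuple[str, list[str]]:
    """Split a pasted URL into (base URL, include-path patterns).

    Hypothetical helper: the base is the site root, and the original
    path becomes a single glob-style include pattern.
    """
    parts = urlsplit(url)
    # Truncate to scheme + host so the root sitemap can be discovered.
    base = urlunsplit((parts.scheme, parts.netloc, "/", "", ""))
    path = parts.path.strip("/")
    # No pattern when the user already pasted the site root.
    include = [f"{path}/*"] if path else []
    return base, include


base, include = split_crawl_target("https://www.clinikally.com/blogs/news")
print(base)     # https://www.clinikally.com/
print(include)  # ['blogs/news/*']
```

One open question with this approach: a pattern like `blogs/news/*` may or may not match the listing page /blogs/news itself, depending on how the matcher treats the trailing segment, so that page might need its own pattern.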

Aug 12 '24 18:08 calebpeffer
