firecrawl
[Feat] Improve Handling of Sitemap Structure for URLs with Include Paths
A customer reached out about https://www.clinikally.com/blogs/news. They were trying to crawl it with the following parameters:
Include only paths: blogs/news/*
The results were inconsistent, sometimes giving 9 links and sometimes closer to 30. However, it worked much more consistently (and gave many more links) when I switched the base URL to https://www.clinikally.com/
I think the issue here is that this website's sitemap structure is a tree. It has a parent sitemap at the root (https://www.clinikally.com/sitemap.xml) that directs crawlers to sitemap.xml.
I can only assume the issue was that the crawler wasn't "going backward" on the URL to find the parent sitemap; instead, it was trying to crawl the page recursively, which fails because it's paginated.
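To make "going backward" concrete, here is a minimal sketch of how candidate sitemap locations could be generated from the pasted URL, walking from the deepest path up to the site root. The function name `candidate_sitemap_urls` is hypothetical, and the sketch assumes each level would name its sitemap `sitemap.xml`; it is not how Firecrawl currently discovers sitemaps.

```python
from urllib.parse import urlsplit


def candidate_sitemap_urls(url: str) -> list[str]:
    """List possible sitemap.xml locations, deepest path first, root last.

    Hypothetical helper: assumes a sitemap, if present, is called
    sitemap.xml at each directory level.
    """
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    segments = [s for s in parts.path.split("/") if s]

    candidates = []
    # Drop one trailing path segment per step until only the origin is left.
    for depth in range(len(segments), -1, -1):
        prefix = "/".join(segments[:depth])
        if prefix:
            candidates.append(f"{origin}/{prefix}/sitemap.xml")
        else:
            candidates.append(f"{origin}/sitemap.xml")
    return candidates


for candidate in candidate_sitemap_urls("https://www.clinikally.com/blogs/news"):
    print(candidate)
```

A crawler could try these in order and stop at the first one that exists, so the root sitemap is still found when the user pastes a deep URL.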
You could classify this as a user error, but from what I've seen, users will default to pasting in the URL that they want to crawl, i.e., https://www.clinikally.com/blogs/news, instead of going to the base URL and using includePaths.
Solutions?
Maybe we always "truncate the URL" to the base, then on our end set an includesOnlyPath for links like /blogs/news?
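A rough sketch of that idea, assuming we derive the include pattern from whatever path the user pasted. The helper name `split_crawl_target` is made up for illustration, and the `<path>/*` pattern shape simply mirrors the `blogs/news/*` filter the customer used; the real option name and matching rules may differ.

```python
from urllib.parse import urlsplit


def split_crawl_target(url: str) -> tuple[str, list[str]]:
    """Return (base_url, include_paths) for a user-supplied crawl URL.

    Hypothetical helper: truncates the URL to the site root and turns the
    original path into an include pattern.
    """
    parts = urlsplit(url)
    base_url = f"{parts.scheme}://{parts.netloc}/"
    path = parts.path.strip("/")
    # No path means the user already gave the site root, so no filter is needed.
    include_paths = [f"{path}/*"] if path else []
    return base_url, include_paths


base_url, include_paths = split_crawl_target("https://www.clinikally.com/blogs/news")
print(base_url)       # https://www.clinikally.com/
print(include_paths)  # ['blogs/news/*']
```

One thing to decide if we go this way: whether the pattern should also match the listing page itself (/blogs/news), since `blogs/news/*` alone may only match pages underneath it.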