headless-chrome-crawler
[Feature Request] Add support for multiple sitemaps
What is the current behavior? I don't believe the crawler handles sitemaps that are broken out into multiple sitemap files. This is common on large sites, since a single sitemap is limited to 50k URLs. See Simplify multiple sitemap management.
A good example is NASA: https://www.nasa.gov/sitemap.xml
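For context, a large site typically publishes a sitemap index whose `<sitemap><loc>` entries point at the child sitemaps (each capped at 50k URLs). Here is a minimal sketch, not the crawler's internal code, of expanding such an index into its child sitemap URLs; it assumes Node 18+ for the global fetch and uses a crude regex where a real implementation would use an XML parser:

```js
// Minimal sketch: expand a sitemap index (e.g. https://www.nasa.gov/sitemap.xml)
// into the list of child sitemap URLs it references.
// Assumes Node 18+ (global fetch); a real implementation should use an XML parser.
async function expandSitemapIndex(sitemapUrl) {
  const xml = await (await fetch(sitemapUrl)).text();
  // A plain sitemap uses <urlset>; an index wraps child sitemaps in <sitemapindex>.
  if (!/<sitemapindex[\s>]/i.test(xml)) return [sitemapUrl];
  // Each <sitemap><loc>…</loc></sitemap> entry is itself a sitemap of up to 50k URLs.
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/gi)].map(m => m[1]);
}

expandSitemapIndex('https://www.nasa.gov/sitemap.xml').then(console.log);
```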
What is the expected behavior? Successfully crawl large sites via sitemap(s)
What is the motivation / use case for changing the behavior? Large enterprise sites not being crawled via sitemap
Also, I am having trouble getting the crawler to pick up my site's sitemap.xml. After digging around the code, I realized that this crawler requires the sitemap to be declared in robots.txt (https://www.sitemaps.org/protocol.html#submit_robots). I don't think most sites do this, so the crawler probably has limited success in actually finding sitemaps. Using the NASA example again, https://www.nasa.gov/robots.txt doesn't declare sitemap.xml in their robots.txt, and neither was I.
OK... after a lot of time debugging, it appears that robots-parser will in fact handle multiple sitemaps just fine. My problem was simply that the Sitemap: directive was not in my robots.txt.
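For anyone else who hits this, here is a small sketch of how sitemap discovery works through the robots-parser package (the domain and sitemap URLs are placeholders). Each Sitemap: line in robots.txt is returned by getSitemaps(), which is why multiple sitemaps already work once the directive is present:

```js
// Sketch only: how Sitemap: directives in robots.txt are exposed by robots-parser.
// The domain and sitemap URLs below are placeholders.
const robotsParser = require('robots-parser');

const robotsTxt = [
  'User-agent: *',
  'Allow: /',
  // Multiple Sitemap: lines are allowed by the protocol.
  'Sitemap: https://example.com/sitemap-pages.xml',
  'Sitemap: https://example.com/sitemap-news.xml',
].join('\n');

const robots = robotsParser('https://example.com/robots.txt', robotsTxt);
console.log(robots.getSitemaps());
// => [ 'https://example.com/sitemap-pages.xml', 'https://example.com/sitemap-news.xml' ]
```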
So I guess the feature request is now about somehow accounting for robots.txt files that don't list any sitemaps, perhaps by manually checking for /sitemap.xml on the server, since this is a standardized location?
@NickStees
Sorry to keep you waiting so long. Thanks for a good issue and a thorough investigation into it.
After learning the protocol for sitemaps, I can now safely say that there is no standard for the file location. No official documents, including sitemaps.org, define it, so this Stack Overflow question helped me the most.
People conventionally place sitemap.xml in the root folder, but it's not a standard. There are two ways search engines find sitemaps:
- Locations written in robots.txt
- Locations submitted through each search engine's webmaster submission form
There is no way for us to know the sitemap locations submitted to search engines, so the only approach available is the current one: finding the locations written in robots.txt.
It's true that most users place sitemap.xml in the root folder, but there is no guarantee that the information there is right. That's why Scrapy, for example, only trusts sitemaps listed in robots.txt.
I'd like to keep this issue open until this feature is supported.
Maybe it could be useful to force a check for sitemap.xml in the root folder when the followSitemapXml option is true. If the file doesn't exist, the crawler discards the result, but if it exists, it parses the file and continues crawling the links that were found.
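A rough sketch of that fallback (purely illustrative, not existing crawler behavior): probe the conventional root location only when robots.txt declares no sitemaps, and keep the guess only when the file actually exists. Assumes Node 18+ (global fetch) and a robots-parser instance:

```js
// Hypothetical fallback, not current crawler behavior: only used when
// followSitemapXml is true and robots.txt lists no Sitemap: directive.
// Assumes Node 18+ (global fetch); `robots` is a robots-parser instance.
async function discoverSitemaps(origin, robots) {
  const declared = robots.getSitemaps();
  if (declared.length > 0) return declared;           // trust robots.txt first
  const guess = new URL('/sitemap.xml', origin).href; // conventional location only
  const res = await fetch(guess, { method: 'HEAD' });
  return res.ok ? [guess] : [];                        // discard if it doesn't exist
}
```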
@BubuAnabelas
Yes, it would be useful, but I believe it should not be the default behavior.
Since no one states that it's the right sitemap, it may be wrong, outdated, or simply a copy from somewhere else.
There is no rule that sitemaps should be named sitemap.xml anyway.
@yujiosaka Thanks for digging into this so much. I always assumed it was a standard to name it sitemap.xml, but yes, it looks like that's not required. I think most CMSs go with this convention by default, so I assume it's probably popular.
Maybe in the crawler configuration we could manually specify an array of sitemaps?
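Something like this, for example; the sitemaps option is hypothetical and does not exist in headless-chrome-crawler today, while the surrounding launch/queue/onIdle/close calls follow the documented usage:

```js
// Illustrative only: the `sitemaps` option below is hypothetical and not part of
// headless-chrome-crawler today; the rest follows the documented API.
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    followSitemapXml: true, // existing option: follow sitemaps found via robots.txt
    onSuccess: result => console.log(result), // same callback shape as in the README
  });
  await crawler.queue({
    url: 'https://example.com/',
    // Hypothetical: supply sitemap locations explicitly instead of relying on robots.txt.
    sitemaps: [
      'https://example.com/sitemap.xml',
      'https://example.com/sitemap-news.xml',
    ],
  });
  await crawler.onIdle();
  await crawler.close();
})();
```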
One other thing I encountered was that in robots.txt the sitemap URL has to be fully qualified. So I could not run the crawler on a dev/test server, since the dev robots.txt (a static file) always pointed to the production sitemap.xml URL. Not a biggie, just thought it might be interesting to know.
Thanks for working on such a handy tool!