SitemapParser

Line-separated sitemaps

Open spekulatius opened this issue 4 years ago • 3 comments

Hey @vipnytt,

I'm wondering what it would take to parse line-separated sitemaps by default. Could the type be guessed based on the content-type of the response, or maybe from failing to parse the XML? Keen to hear your thoughts.

Cheers, Peter

spekulatius, Sep 29 '20 11:09

Hi @spekulatius

You can always enable it by setting ['strict'=>false] as the 2nd parameter of the constructor, as shown in the example.

Text files have been disabled by default to avoid possible false results. The format isn't standardized in any way, but there are some guidelines available. There is some URL validation in place, but currently it won't distinguish between a list of URLs and a text file with some random URLs in the middle of the document.

It's technically possible to tune the parser additionally, for example:

  • Whitelist some content types, e.g. text/plain, for parsing with strict mode enabled (the default behavior)
  • Make sure each line contains a URL
  • Check that the size of the file, as well as the number of lines, is within the guideline limits
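
The three checks above could be combined roughly as follows. This is only a language-neutral sketch in Python, not code from the library (which is PHP). The names `looks_like_url_list`, `is_http_url` and `ALLOWED_CONTENT_TYPES` are hypothetical, and the limits are the 50k-line and 50 MB guideline figures mentioned in this thread.

```python
from urllib.parse import urlparse

# Guideline limits as discussed in this thread (assumed values).
MAX_LINES = 50_000
MAX_BYTES = 50 * 1024 * 1024

# Hypothetical whitelist; the library does not define this constant.
ALLOWED_CONTENT_TYPES = {"text/plain"}


def is_http_url(line: str) -> bool:
    """Return True if the line is a single absolute http(s) URL."""
    if not line or any(ch.isspace() for ch in line):
        return False
    try:
        parts = urlparse(line)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def looks_like_url_list(body: bytes, content_type: str) -> bool:
    """Heuristically decide whether a response is a plain-text URL list."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ALLOWED_CONTENT_TYPES:
        return False
    if len(body) > MAX_BYTES:
        return False
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or len(lines) > MAX_LINES:
        return False
    # Every non-empty line must be a URL; one stray sentence rejects the file.
    return all(is_http_url(line) for line in lines)
```

With these rules, a text file that merely mentions a few URLs between other text is rejected, because every non-empty line has to be a URL on its own.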

What do you think?

JanPetterMG, Oct 01 '20 14:10

Hey @JanPetterMG

This doesn't sound too bad. I guess most sitemaps hit the 50k-line limit before the 50 MB limit. One thing that is still a bit unclear to me: how do you know whether an entry in the sitemap list is a page or a link to another sitemap file (due to splitting)? Did I miss that part?

Also, is there support for gzipped files?

Also, could you point me in the right direction on where to start looking in the code for the required changes? This would help in getting started.

Cheers, Peter

spekulatius, Oct 05 '20 08:10

  • GZip files are supported. ref. /src/SitemapParser.php#L191
  • parseString is the place to start digging into the code. /src/SitemapParser.php#L341. I must admit, the current code isn't designed to be this flexible. It's doable, but you might also want to handle it in a new class instead. It's up to you and what you prefer.
  • Sitemaps vs. regular URLs are currently differentiated by checking the file extension: .xml/.xml.gz for sitemaps, while everything else is considered a regular URL. I guess there are no community guidelines when it comes to splitting txt files (that's why XML sitemaps are the way to go), but you might also want to include .txt/.txt.gz files? Most are going to be HTML files (or similar) anyway.
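
To illustrate the extension check described in the last point, here is a small sketch in Python rather than the library's PHP. The function name `is_sitemap_url` and the `include_text` switch are hypothetical; only the extensions themselves come from this thread.

```python
from urllib.parse import urlparse

# Extensions named in this thread; the .txt pair is the suggested addition.
XML_SITEMAP_SUFFIXES = (".xml", ".xml.gz")
TEXT_SITEMAP_SUFFIXES = (".txt", ".txt.gz")


def is_sitemap_url(url: str, include_text: bool = False) -> bool:
    """Guess from the path extension whether a URL points at another sitemap."""
    # Only the path is inspected, so query strings do not affect the result.
    path = urlparse(url).path.lower()
    suffixes = XML_SITEMAP_SUFFIXES
    if include_text:
        suffixes += TEXT_SITEMAP_SUFFIXES
    return path.endswith(suffixes)
```

Anything that does not match is treated as a regular URL. Keeping the .txt extensions behind a switch mirrors the strict default: a .txt entry is only followed as a nested sitemap when the caller opts in.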

JanPetterMG, Oct 05 '20 09:10