mdBook: Add sitemap generation support to HTML renderer

Fixes #1491

Mergeable as-is, but there is optional work:

[ ] <lastmod> from .md file modification time?
[ ] Check that the generated file is under 50 MiB, as per the spec
[ ] Warnings if the site URL e.g. has parameters, since .join() overwrites them
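
The last two optional items could be sketched roughly as follows. This is only an illustration, not code from the PR: `MAX_SITEMAP_BYTES`, `sitemap_within_size_limit`, and `site_url_warning` are hypothetical names, and the URL check is a plain string test rather than a call into the URL parser the renderer actually uses.

```rust
/// Size limit for one uncompressed sitemap file (50 MiB = 52,428,800 bytes).
/// Hypothetical constant name; not taken from the PR.
const MAX_SITEMAP_BYTES: usize = 50 * 1024 * 1024;

/// Returns true if the serialized sitemap fits within the size limit.
fn sitemap_within_size_limit(xml: &str) -> bool {
    xml.len() <= MAX_SITEMAP_BYTES
}

/// Returns a warning message if the site URL carries a query string or
/// fragment, since joining a page path onto it would discard them.
/// Assumption: a simple character check is good enough for a warning.
fn site_url_warning(site_url: &str) -> Option<String> {
    if site_url.contains('?') || site_url.contains('#') {
        Some(format!(
            "site URL `{}` has a query string or fragment; joining page paths will discard it",
            site_url
        ))
    } else {
        None
    }
}
```

A caller would presumably log the returned message as a warning and keep building, rather than fail the build.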

I will work on them if maintainers think they're a good thing.

Jul 30 '21 16:07 ISSOtm

Any news on this one?

Feb 13 '22 20:02 avivace

Just rebased the PR.

Feb 15 '22 19:02 ISSOtm

The feature will be very helpful, also for me. Hope this will be merged!

Feb 21 '22 03:02 sjkim04

Looking forward to this feature! :D

Mar 03 '22 08:03 billy1624

Any update? 🙂

Apr 18 '22 18:04 hervyqa

Is there anything I can do to help get this merged?

Oct 15 '22 14:10 brettchalupa

Can you provide some more context on sitemaps in general? What benefits does it have? Shouldn't crawlers be able to crawl the summary index? I'm not very familiar with them, so if you could provide some general information, that would be helpful.

This looks like it needs some tests.

Oct 15 '22 21:10 ehuss

Search engines like Google, Yahoo, Bing, etc. usually use a sitemap XML file to discover every page of a website. This helps the site get indexed more easily, which matters for SEO.

Oct 16 '22 00:10 hervyqa

I believe a site map hints to a search engine what to index and what not to index. So unnecessary or outdated files can be skipped if they are not in the map.

Site maps are also more reliable with respect to page ranking these days, because things like a nav bar pointing to every page have made simply counting backlinks an unreliable metric.

Oct 16 '22 01:10 schungx

Well, the info about why this would be useful has been said above 😛

What should be tested? I can only think of comparing the sitemap that would be output from the test book against a reference, but I believe that would produce a bunch of false positives.

Oct 16 '22 08:10 ISSOtm

@ISSOtm with regards to what should be tested, just from checking out the PR, here are my thoughts:

generate_sitemap tests:
- Passing in a BookItems collection and verifying that the file is created and has the expected contents for those items (this is an integration test of sorts, since this function calls out to write_sitemap)
- The destination already existing and an error being returned
- Passing in a site_url without a trailing slash and ensuring it works as expected

xml_escapes tests:
- Unit tests that pass in a variety of strings to ensure they're escaped as expected
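
To make the trailing-slash case concrete: when a relative path is resolved against a base URL that lacks a trailing slash, standard URL resolution replaces the last path segment instead of appending to it. Below is a minimal sketch of a helper that handles both forms. `page_url` is a hypothetical name and plain string handling is an assumption; the PR itself may do this differently.

```rust
/// Hypothetical helper: builds a page URL from the site URL and a page path,
/// producing the same result whether or not `site_url` ends with a slash.
fn page_url(site_url: &str, page_path: &str) -> String {
    // Strip the slashes at the seam so exactly one is inserted below.
    let base = site_url.trim_end_matches('/');
    let path = page_path.trim_start_matches('/');
    format!("{}/{}", base, path)
}
```

A test for this case would call the function with and without the trailing slash and assert that both produce the same URL.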

I'm happy to collaborate or help on any of the tests if you want. 😄

Also just want to say +1 to this comment you left: // TODO: lastmod from src file modification time? I think that makes sense and would be useful.

Oct 16 '22 15:10 brettchalupa

Thanks @brettchalupa!

I'd be happy if you could give me some strings that would be useful for testing xml_escapes.

I'll begin writing the rest of the tests and the lastmod TODO this weekend/next week.

Oct 22 '22 10:10 ISSOtm

@ISSOtm here are some strings for that unit test that might be helpful (even if it's unlikely some of these may get passed through in the grand scheme of things, it's probably okay to be extra thorough):

"https://example.com/foo/bar.html"
"https://Bücher.example/foo/bar"
"https://'<foo'/>.com/foo?bar=1&biz=100" -- you know, more permutations that exercise what's intended to be escaped 😂

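As a rough idea of what these strings would exercise, here is a minimal escaping sketch. `xml_escape` is a hypothetical stand-in, since the PR's actual xml_escapes implementation isn't shown in this thread. It replaces the five characters with predefined XML entities and leaves everything else, including non-ASCII text, untouched.

```rust
/// Hypothetical stand-in for the PR's `xml_escapes`: replaces the five
/// characters that have predefined XML entities and copies the rest as-is.
fn xml_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}
```

With this sketch, the first two strings would pass through unchanged, and the third would have its quotes, angle brackets, and ampersand replaced.
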
Oct 26 '22 14:10 brettchalupa

Any movement on this? It is still causing issues with Google indexing.

May 03 '23 22:05 andymac4182

Any news here?

Oct 25 '23 22:10 avivace

Sorry, I've been dedicating my time elsewhere, as contribution to mdBook is very difficult due to lack of available maintainer time.

I'd be happy to let someone else take this PR to the finish line in my stead.

Oct 26 '23 21:10 ISSOtm
