Add sitemap generation support to HTML renderer

Open ISSOtm opened this issue 3 years ago • 14 comments

Fixes #1491

Mergeable as-is, but there is optional work:

  • [ ] <lastmod> from .md file modification time?
  • [ ] Check that the generated file is under 50 MiB, as per the spec
  • [ ] Warn if the site URL has, e.g., query parameters, since .join() overwrites them

I will work on them if maintainers think they're a good thing.

ISSOtm avatar Jul 30 '21 16:07 ISSOtm
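
The size check in the optional-work list could look roughly like the sketch below. The sitemaps.org protocol caps a single sitemap file at 50,000 URLs and 52,428,800 bytes (50 MiB) uncompressed. The helper name `check_sitemap_limits` and its signature are assumptions made for illustration, not code from this PR.

```rust
/// Per-file limits from the sitemaps.org protocol.
const MAX_SITEMAP_BYTES: usize = 52_428_800; // 50 MiB, uncompressed
const MAX_SITEMAP_URLS: usize = 50_000;

/// Hypothetical helper (not from the PR): checks a rendered sitemap
/// against the protocol limits and describes the first violation found.
fn check_sitemap_limits(xml: &str, url_count: usize) -> Result<(), String> {
    // `str::len` is the length in bytes, which is what the limit is about.
    if xml.len() > MAX_SITEMAP_BYTES {
        return Err(format!(
            "sitemap is {} bytes, over the {} byte limit",
            xml.len(),
            MAX_SITEMAP_BYTES
        ));
    }
    if url_count > MAX_SITEMAP_URLS {
        return Err(format!(
            "sitemap lists {} URLs, over the {} URL limit",
            url_count, MAX_SITEMAP_URLS
        ));
    }
    Ok(())
}
```

A renderer could call this just before writing the file and turn an `Err` into a warning rather than a hard failure, since an oversized sitemap is still a usable build output.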

Any news on this one?

avivace avatar Feb 13 '22 20:02 avivace

Just rebased the PR.

ISSOtm avatar Feb 15 '22 19:02 ISSOtm

The feature will be very helpful, also for me. Hope this will be merged!

sjkim04 avatar Feb 21 '22 03:02 sjkim04

Looking forward to this feature! :D

billy1624 avatar Mar 03 '22 08:03 billy1624

Any update? 🙂

hervyqa avatar Apr 18 '22 18:04 hervyqa

Is there anything I can do to help get this merged?

brettchalupa avatar Oct 15 '22 14:10 brettchalupa

Can you provide some more context on sitemaps in general? What benefits does it have? Shouldn't crawlers be able to crawl the summary index? I'm not very familiar with them, so if you could provide some general information, that would be helpful.

This looks like it needs some tests.

ehuss avatar Oct 15 '22 21:10 ehuss

Search engines such as Google, Yahoo, and Bing typically use an XML sitemap to discover every page of a website. This makes the site easier for search engines to index, so it is mainly an SEO concern.

hervyqa avatar Oct 16 '22 00:10 hervyqa

I believe a site map hints to a search engine what to index and what not to index. So unnecessary or outdated files can be skipped if they are not in the map.

Site maps are also more reliable wrt page ranking these days because things like a nav bar pointing to all the pages made simply counting the number of backlinks unreliable as a metric.

schungx avatar Oct 16 '22 01:10 schungx
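
For readers unfamiliar with the format, here is a minimal sketch of what a sitemap contains: an XML `urlset` with one `<url><loc>…</loc></url>` entry per page. The function `render_sitemap` is hypothetical and is not the PR's implementation. It assumes the URLs passed in are already absolute and XML-escaped.

```rust
/// Hypothetical helper (not from the PR): renders a minimal sitemap.
/// Assumes every URL is absolute and already XML-escaped.
fn render_sitemap(urls: &[&str]) -> String {
    let mut xml = String::from("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    // Namespace defined by the sitemaps.org protocol, version 0.9.
    xml.push_str("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    for url in urls {
        xml.push_str("  <url><loc>");
        xml.push_str(url);
        xml.push_str("</loc></url>\n");
    }
    xml.push_str("</urlset>\n");
    xml
}
```

Optional per-entry fields such as `<lastmod>` would go inside each `<url>` element next to `<loc>`.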

Well, the info about why this would be useful has been said above 😛

What should be tested? I can only think of comparing the sitemap that would be output from the test book against a reference, but I believe that would produce a bunch of false positives.

ISSOtm avatar Oct 16 '22 08:10 ISSOtm

@ISSOtm with regards to what should be tested, just from checking out the PR, here are my thoughts:

  • generate_sitemap tests:
    • BookItems collection passed in to verify the file is created and has the expected contents for those items passed in — this is an integration test of sorts since this function calls out to write_sitemap
    • The destination already existing and an error being returned
    • passing in a site_url without a trailing slash and ensuring it works as expected
  • xml_escapes unit tests by passing in a variety of strings to ensure they're escaped as expected

I'm happy to collaborate or help on any of the tests if you want. 😄

Also just want to say +1 to this comment you left: // TODO: lastmod from src file modification time? I think that makes sense and would be useful.

brettchalupa avatar Oct 16 '22 15:10 brettchalupa
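
The trailing-slash case matters because relative URL resolution (RFC 3986) replaces the last path segment of the base. Joining `page.html` onto `https://example.com/book` resolves to `https://example.com/page.html`, not `https://example.com/book/page.html`. One way to guard against that, sketched here with a hypothetical helper that is not taken from the PR, is to normalize the base before joining:

```rust
/// Hypothetical helper (not from the PR): makes sure a base URL ends with
/// `/` so that joining a relative path appends to it instead of replacing
/// its last segment. Assumes the URL has no query string or fragment.
fn ensure_trailing_slash(site_url: &str) -> String {
    if site_url.ends_with('/') {
        site_url.to_string()
    } else {
        format!("{}/", site_url)
    }
}
```

A test for `generate_sitemap` could then pass `site_url` both with and without the trailing slash and assert that the resulting `<loc>` entries are identical.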

Thanks @brettchalupa!

I'd be happy if you could give me some strings that would be useful for testing xml_escapes.

I'll begin writing the rest of the tests and the lastmod TODO this weekend/next week.

ISSOtm avatar Oct 22 '22 10:10 ISSOtm

@ISSOtm here are some strings for that unit test that might be helpful (some of these are unlikely to come up in practice, but it's probably okay to be extra thorough):

  • "https://example.com/foo/bar.html"
  • "https://Bücher.example/foo/bar"
  • "https://'<foo'/>.com/foo?bar=1&biz=100" -- you know, more permutations that exercise what's intended to be escaped 😂

brettchalupa avatar Oct 26 '22 14:10 brettchalupa
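
As a rough illustration of how the suggested strings could be exercised, the sketch below escapes the five characters XML predefines entities for. The PR's actual `xml_escapes` function is not shown in this thread, so `xml_escape` here is a stand-in with an assumed signature, and the real function may behave differently.

```rust
/// Stand-in for the PR's `xml_escapes` (signature assumed): replaces the
/// five characters that have predefined XML entities and leaves everything
/// else, including non-ASCII text, untouched.
fn xml_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}

/// The three strings suggested above, paired with the output this
/// stand-in produces for each.
fn escape_cases() -> Vec<(&'static str, &'static str)> {
    vec![
        ("https://example.com/foo/bar.html", "https://example.com/foo/bar.html"),
        ("https://Bücher.example/foo/bar", "https://Bücher.example/foo/bar"),
        (
            "https://'<foo'/>.com/foo?bar=1&biz=100",
            "https://&apos;&lt;foo&apos;/&gt;.com/foo?bar=1&amp;biz=100",
        ),
    ]
}
```

A unit test would loop over `escape_cases()` and compare `xml_escape(input)` with the expected string. Whether a non-ASCII host such as `Bücher.example` should also be converted to Punycode is a separate question from XML escaping and is not handled here.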

Any movement on this? It is still causing issues with Google indexing.

andymac4182 avatar May 03 '23 22:05 andymac4182

Any news here?

avivace avatar Oct 25 '23 22:10 avivace

Sorry, I've been dedicating my time elsewhere, as contribution to mdBook is very difficult due to lack of available maintainer time.

I'd be happy to let someone else take this PR to the finish line in my stead.

ISSOtm avatar Oct 26 '23 21:10 ISSOtm