mdBook
Add sitemap generation support to HTML renderer
Fixes #1491
Mergeable as-is, but there is optional work:
- [ ] `<lastmod>` from `.md` file modification time?
- [ ] Check that the generated file is under 50 MiB, as per the spec
- [ ] Warnings if the site URL e.g. has parameters, since `.join()` overwrites them
I will work on them if maintainers think they're a good thing.
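For reference, the second and third optional items could be sketched roughly as below. This is only an illustration, not code from the PR: `check_sitemap_size` and `site_url_warning` are hypothetical names, and the second function assumes that `.join()` discards a query string or fragment on the base URL, as described above.

```rust
/// Upper bound on an uncompressed sitemap file (50 MiB), per the sitemap spec.
const MAX_SITEMAP_BYTES: usize = 50 * 1024 * 1024;

/// Hypothetical check: errors if the serialized sitemap is above the limit.
fn check_sitemap_size(xml: &str) -> Result<(), String> {
    // `str::len` returns the UTF-8 byte length, which is what the limit counts.
    let len = xml.len();
    if len > MAX_SITEMAP_BYTES {
        Err(format!(
            "sitemap is {} bytes, above the {}-byte limit",
            len, MAX_SITEMAP_BYTES
        ))
    } else {
        Ok(())
    }
}

/// Hypothetical check: returns a warning if the site URL carries a query
/// string or fragment, which joining page paths onto it would drop.
fn site_url_warning(site_url: &str) -> Option<String> {
    if site_url.contains('?') || site_url.contains('#') {
        Some(format!(
            "site URL `{}` has a query or fragment that will be lost",
            site_url
        ))
    } else {
        None
    }
}
```

Both checks work on plain strings, so they could run before any file is written.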
Any news on this one?
Just rebased the PR.
The feature will be very helpful, also for me. Hope this will be merged!
Looking forward to this feature! :D
Any update? 🙂
Is there anything I can do to help get this merged?
Can you provide some more context on sitemaps in general? What benefits does it have? Shouldn't crawlers be able to crawl the summary index? I'm not very familiar with them, so if you could provide some general information, that would be helpful.
This looks like it needs some tests.
Search engines like Google, Yahoo, Bing, etc. usually use the sitemap XML to crawl every page of a website. This makes it easier for search engines to index the site, which matters for SEO.
I believe a site map hints to a search engine what to index and what not to index. So unnecessary or outdated files can be skipped if they are not in the map.
Site maps are also more reliable wrt page ranking these days because things like a nav bar pointing to all the pages made simply counting the number of backlinks unreliable as a metric.
Well, the info about why this would be useful has been said above 😛
What should be tested? I can only think of comparing the sitemap that would be output from the test book against a reference, but I believe that would produce a bunch of false positives.
@ISSOtm with regards to what should be tested, just from checking out the PR, here are my thoughts:
- `generate_sitemap` tests:
  - `BookItem`s collection passed in to verify the file is created and has the expected contents for those items passed in. This is an integration test of sorts, since this function calls out to `write_sitemap`
  - The destination existing already and error being returned
  - passing in a `site_url` without a trailing slash and ensuring it works as expected
- `xml_escapes` unit tests by passing in a variety of strings to ensure they're escaped as expected
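
To make the trailing-slash case concrete, here is a minimal sketch of what such a test could exercise. It does not use the PR's real signatures: `page_url` and `render_sitemap` are hypothetical stand-ins that take plain string paths instead of `BookItem`s, and the sketch assumes the paths need no XML escaping.

```rust
/// Hypothetical helper: joins a page path onto the site URL, tolerating a
/// missing (or repeated) trailing slash on `site_url`.
fn page_url(site_url: &str, page_path: &str) -> String {
    let base = site_url.trim_end_matches('/');
    let path = page_path.trim_start_matches('/');
    format!("{}/{}", base, path)
}

/// Hypothetical stand-in for the sitemap writer: renders one `<url>` entry
/// per page path. Paths are assumed to need no XML escaping in this sketch.
fn render_sitemap(site_url: &str, page_paths: &[&str]) -> String {
    let mut xml = String::from("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    xml.push_str("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
    for path in page_paths {
        xml.push_str(&format!(
            "  <url><loc>{}</loc></url>\n",
            page_url(site_url, path)
        ));
    }
    xml.push_str("</urlset>\n");
    xml
}
```

A test could then assert that `render_sitemap("https://example.com/book", ...)` and `render_sitemap("https://example.com/book/", ...)` produce identical output, instead of comparing against a full reference file.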
I'm happy to collaborate or help on any of the tests if you want. 😄
Also just want to say +1 to this comment you left: // TODO: lastmod from src file modification time?
I think that makes sense and would be useful.
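
As a rough idea of how that could look using only the standard library, here is a sketch. The function names are hypothetical and not from the PR. It formats the date as `YYYY-MM-DD` in UTC, which is one of the forms `<lastmod>` accepts, and the day-to-date conversion is the well-known civil-from-days calculation rather than anything mdBook-specific.

```rust
use std::fs;
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

/// Converts days since 1970-01-01 into (year, month, day) in the
/// proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // March-based month, 0..=11
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the following civil year.
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// Formats a timestamp as a `YYYY-MM-DD` date (UTC).
/// Returns `None` for timestamps before the Unix epoch.
fn lastmod_date(modified: SystemTime) -> Option<String> {
    let secs = modified.duration_since(UNIX_EPOCH).ok()?.as_secs();
    let (y, m, d) = civil_from_days((secs / 86_400) as i64);
    Some(format!("{:04}-{:02}-{:02}", y, m, d))
}

/// Hypothetical helper: `<lastmod>` value for a source file, or `None` if the
/// file or its modification time is unavailable.
fn file_lastmod(path: &Path) -> Option<String> {
    let modified = fs::metadata(path).ok()?.modified().ok()?;
    lastmod_date(modified)
}
```

Returning `Option` lets the writer simply omit `<lastmod>` when no modification time is available. One caveat: file modification times are often reset by a fresh checkout, so they may not reflect when the content last changed.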
Thanks @brettchalupa!
I'd be happy if you could give me some strings that would be useful for testing `xml_escapes`.
I'll begin writing the rest of the tests and the `lastmod` TODO this weekend/next week.
@ISSOtm here are some strings for that unit test that might be helpful (even if it's unlikely some of these may get passed through in the grand scheme of things, it's probably okay to be extra thorough):
- `"https://example.com/foo/bar.html"`
- `"https://Bücher.example/foo/bar"`
- `"https://'<foo'/>.com/foo?bar=1&biz=100"` -- you know, more permutations that exercise what's intended to be escaped 😂
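
For illustration, here is how those strings would behave with a typical escaping helper. The `xml_escape` function below is a hypothetical stand-in that replaces the five predefined XML entities. The PR's `xml_escapes` may escape a different set, so the expected values would need adjusting to match.

```rust
/// Hypothetical escaping helper (not the PR's implementation): replaces the
/// five characters that have predefined XML entities and copies everything
/// else, including non-ASCII characters, unchanged.
fn xml_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&apos;"),
            _ => out.push(c),
        }
    }
    out
}
```

With this helper the first two strings come back unchanged, and the third becomes `https://&apos;&lt;foo&apos;/&gt;.com/foo?bar=1&amp;biz=100`.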
Any movement on this? It is still causing issues with Google indexing.
Any news here?
Sorry, I've been dedicating my time elsewhere, as contribution to mdBook is very difficult due to lack of available maintainer time.
I'd be happy to let someone else take this PR to the finish line in my stead.