sitemap_excludes support for regular expressions
Some automatically generated pages, such as those under the _modules/ path, contain source code that should preferably be excluded from the sitemap. To my understanding, it is not possible to exclude these using sitemap_excludes without adding every single page to it manually. Is there a convenient way to exclude all of the pages contained in a directory?
Unfortunately not at this time. I can look at adding regex support to sitemap_excludes.
As a workaround until then, you might be able to dynamically generate the list of files in conf.py.
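A rough sketch of what that workaround could look like, assuming the _modules/ pages come from sphinx.ext.viewcode and mirror the Python package layout (one page per module, plus an index page). The helper name viewcode_pages and the package path in the usage comment are hypothetical, and the list may contain pages that viewcode never actually writes, which should be harmless for an exclude list:

```python
from pathlib import Path


def viewcode_pages(package_dir, prefix="_modules"):
    """Guess the generated page path for every module under package_dir.

    Assumption: each module maps to <prefix>/<dotted name as path>.html,
    and a package's __init__.py maps to the page for the package itself.
    """
    package_dir = Path(package_dir)
    root = package_dir.parent
    pages = [f"{prefix}/index.html"]
    for py_file in sorted(package_dir.rglob("*.py")):
        parts = py_file.relative_to(root).with_suffix("").parts
        if parts[-1] == "__init__":
            parts = parts[:-1]
        pages.append(f"{prefix}/{'/'.join(parts)}.html")
    return pages


# Hypothetical usage in conf.py (adjust the path to your own package):
# sitemap_excludes = ["search.html", "genindex.html"] + viewcode_pages("../src/mypackage")
```

If the pages are produced by a different extension, the path mapping above would need to be adapted to whatever layout that extension uses.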
I'm also interested in this for excluding _modules/ paths. @Koen1999 did you find a solution by dynamically generating the list of files?
@jdillard if adding full regex support is going to be a lot of extra complexity, would you be open to supporting a * character at the end of a path, meaning "exclude any paths of which this path is a prefix"? If you think that would be an acceptable resolution, I'm happy to try opening a PR.
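To make the proposed semantics concrete, here is a minimal sketch of trailing-* prefix matching. It is only an illustration of the idea, not sphinx-sitemap's code, and the function name is_excluded is made up for this example:

```python
def is_excluded(page, excludes):
    """Return True if page matches any entry in excludes.

    An entry ending in "*" excludes every page that starts with the
    text before the star; any other entry must match the page exactly.
    """
    for pattern in excludes:
        if pattern.endswith("*"):
            if page.startswith(pattern[:-1]):
                return True
        elif page == pattern:
            return True
    return False


excludes = ["search.html", "_modules/*"]
print(is_excluded("_modules/pkg/a.html", excludes))  # True
print(is_excluded("index.html", excludes))           # False
```

Existing exact-match entries keep working unchanged, so this would be backwards compatible for anyone whose paths contain no literal star.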
I have not applied the workaround, mainly because the pages are generated by some other module that I am not very familiar with. The workaround he proposed seems feasible though.
I think the solution you propose seems fine, especially considering stars are not common characters in URLs. Alternatively, you could change sitemap_excludes to be a list of two types: strings (no regular expression), and compiled regular expressions using re.
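For the alternative, a small sketch of how a mixed list of plain strings and compiled patterns could be checked. This is an assumption about how such a feature might behave, not something sphinx-sitemap supports today, and matches_any is a hypothetical name:

```python
import re


def matches_any(page, excludes):
    """Return True if page is excluded by a plain string or a compiled regex.

    Plain strings must equal the page path exactly; compiled patterns are
    applied with match(), so they are anchored at the start of the path.
    """
    for entry in excludes:
        if isinstance(entry, re.Pattern):
            if entry.match(page):
                return True
        elif page == entry:
            return True
    return False


excludes = ["search.html", re.compile(r"_modules/")]
print(matches_any("_modules/pkg/a.html", excludes))  # True
print(matches_any("index.html", excludes))           # False
```

Checking the type of each entry means users who never need regular expressions can keep writing ordinary strings.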
I think regex has the potential to make configuration more complicated (a lot of people struggle with the syntax), and I'm not sure the added control it gives is worth it in this case.
I went with a simpler wildcard approach, if you want to give thoughts and/or test it out here: https://github.com/jdillard/sphinx-sitemap/pull/113
I would have tried full glob patterns, but the doc paths are stored as strings, so going glob would take some work.