no option to skip processing of sitemaps from robots.txt?
Sitemap processing is a good feature, but in some cases it's undesirable, and there seems to be no way to turn it off.
Consider a scenario where you want to download an isolated subset of pages that link to each other but don't link to the rest of the site, and aren't confined to a specific subdirectory. For example, /test.html, which links to /test.txt and nothing else.
Trying to do something like this:
wget2-master --no-robots --robots=off -m --no-parent https://skyqueen.cc/test.html
The expectation is that it just downloads test.html and test.txt. Instead, it pulls robots.txt, gets the sitemaps from it, downloads those sitemaps, and consequently starts downloading the whole site.
The --no-parent option doesn't do anything here because test.html is in the root of the site.
The --no-robots and --robots=off options do not actually stop robots.txt from being downloaded and scanned for sitemaps.
I know it's old, but https://manpages.debian.org/testing/wget2/wget2.1.en.html implies that robots.txt is only scanned for sitemaps when the robots option is on, which does not seem to be the case.
Maybe there could be two separate options (obey robots.txt: yes/no; process sitemaps from robots.txt: yes/no) so that all four of these scenarios could be accommodated (see the sketch after the list):
- obey robots.txt and process sitemaps contained in it (default behavior)
- obey robots.txt but don't process sitemaps from it
- disobey robots.txt but process sitemaps from it
- disobey robots.txt and don't process sitemaps from it (in this case it should probably skip downloading robots.txt entirely unless a page links to it)
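
Roughly what I have in mind, keeping the existing --robots switch and using a hypothetical --sitemaps switch purely for illustration:

# default: obey robots.txt and follow the sitemaps it lists
wget2 -m https://skyqueen.cc/test.html
# obey robots.txt but don't follow its sitemaps (hypothetical --sitemaps switch)
wget2 --sitemaps=off -m https://skyqueen.cc/test.html
# ignore the robots.txt rules but still follow its sitemaps
wget2 --robots=off --sitemaps=on -m https://skyqueen.cc/test.html
# ignore robots.txt entirely: no rules, no sitemaps, ideally no fetch of it at all
wget2 --robots=off --sitemaps=off -m https://skyqueen.cc/test.html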
I also tried using --reject and --reject-regex in various combinations to block the download of robots.txt and/or the sitemap.txt it references; however, these appear to be handled as a special case that can't be rejected.
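
For illustration, the attempts looked roughly like this (exact patterns varied; robots.txt and the sitemaps it lists were fetched regardless):

# try to exclude robots.txt and any sitemap files by pattern
wget2 --robots=off --reject "robots.txt,sitemap*" -m --no-parent https://skyqueen.cc/test.html
# same idea with a regex over the URL
wget2 --robots=off --reject-regex "robots\.txt|sitemap" -m --no-parent https://skyqueen.cc/test.html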
Thanks, good finding! I added a new option in https://gitlab.com/gnuwget/wget2/-/merge_requests/518. Will likely merge tomorrow.
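
Once that merge request lands, the original scenario should reduce to something like the line below. The --follow-sitemaps name is an assumption based on the merge request and may differ in the released version, so check wget2 --help on your build:

# assumed option name; should download only test.html and the test.txt it links to
wget2 --robots=off --follow-sitemaps=off -m --no-parent https://skyqueen.cc/test.html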