no option to skip processing of sitemaps from robots.txt?
Sitemap processing is a good feature, but in some cases it's undesirable, and there seems to be no way to turn it off.
Consider a scenario where you want to download an isolated subset of pages that link to each other but don't link to the rest of the site, and aren't confined to a specific subdirectory. For example, /test.html, which links to /test.txt and nothing else.
Trying to do something like this:
wget2-master --no-robots --robots=off -m --no-parent https://skyqueen.cc/test.html
The expectation is that it just downloads test.html and test.txt. Instead, it pulls robots.txt, gets the sitemaps from it, downloads those sitemaps, and consequently starts downloading the whole site.
The --no-parent option doesn't do anything here because test.html is in the root of the site.
The --no-robots and --robots=off options do not actually stop robots.txt from being downloaded and scanned for sitemaps.
I know it's old, but https://manpages.debian.org/testing/wget2/wget2.1.en.html implies that robots.txt is only scanned for sitemaps when the robots option is on, which does not seem to be the case.
Maybe there could be two separate options (obey robots.txt: yes/no; process sitemaps from robots.txt: yes/no) so that all four of these scenarios could be accommodated (see the sketch after the list):
- obey robots.txt and process sitemaps contained in it (default behavior)
- obey robots.txt but don't process sitemaps from it
- disobey robots.txt but process sitemaps from it
- disobey robots.txt and don't process sitemaps from it (in this case it should probably skip downloading robots.txt entirely unless a page links to it)
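
Roughly what I have in mind, keeping the existing --robots switch and using a hypothetical --sitemaps switch purely for illustration:

# default: obey robots.txt and follow the sitemaps it lists
wget2 -m https://skyqueen.cc/test.html
# obey robots.txt but don't follow its sitemaps (hypothetical --sitemaps switch)
wget2 --sitemaps=off -m https://skyqueen.cc/test.html
# ignore the robots.txt rules but still follow its sitemaps
wget2 --robots=off --sitemaps=on -m https://skyqueen.cc/test.html
# ignore robots.txt entirely: no rules, no sitemaps, ideally no fetch of it at all
wget2 --robots=off --sitemaps=off -m https://skyqueen.cc/test.html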
I also tried using --reject and --reject-regex in various combinations to block the download of robots.txt and/or the sitemap.txt it references; however, these appear to be handled as a special case that can't be rejected.
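
For illustration, the attempts looked roughly like this (exact patterns varied; robots.txt and the sitemaps it lists were fetched regardless):

# try to exclude robots.txt and any sitemap files by pattern
wget2 --robots=off --reject "robots.txt,sitemap*" -m --no-parent https://skyqueen.cc/test.html
# same idea with a regex over the URL
wget2 --robots=off --reject-regex "robots\.txt|sitemap" -m --no-parent https://skyqueen.cc/test.html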
Thanks, good finding! I added a new option in https://gitlab.com/gnuwget/wget2/-/merge_requests/518. Will likely merge tomorrow.
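
Once that merge request lands, the original scenario should reduce to something like the line below. The --follow-sitemaps name is an assumption based on the merge request and may differ in the released version, so check wget2 --help on your build:

# assumed option name; should download only test.html and the test.txt it links to
wget2 --robots=off --follow-sitemaps=off -m --no-parent https://skyqueen.cc/test.html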