wget-2-zim

Suggestion: Option to exclude certain web paths from being crawled, written, or both

Open 5000thinmints opened this issue 2 years ago • 2 comments

For example, in my run of http://www.someweb.com I would like to exclude everything under http://www.someweb.com/boringnotes/ from being crawled or written, since there is nothing of interest to me there.

5000thinmints, Jan 18 '23 23:01

Unfortunately this is not possible with wget itself, so the only point of such an option would be for the script to delete the directories afterwards in order to make the ZIM file smaller.

I am not sure if that would be so helpful. What do you think?
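To make the "delete afterwards" idea concrete, here is a minimal sketch of what such a cleanup step could look like. It is not part of wget-2-zim: the function name prune_mirror is hypothetical, and it assumes the mirror was saved under a directory named after the host, with the unwanted paths given relative to that directory.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of wget-2-zim: remove unwanted paths from a
# finished wget mirror before the ZIM file is built.
# Usage: prune_mirror MIRROR_ROOT REL_PATH [REL_PATH...]
prune_mirror() {
    prune_root=$1
    shift
    for prune_rel in "$@"; do
        # Refuse empty, absolute, or parent-escaping paths so that a typo
        # cannot delete anything outside the mirror directory.
        case "$prune_rel" in
            ""|/*|..|../*|*/../*|*/..)
                echo "skipping unsafe path: $prune_rel" >&2
                continue
                ;;
        esac
        # ${prune_root:?} aborts if the root is empty instead of running rm on "/...".
        rm -rf -- "${prune_root:?}/${prune_rel}"
    done
}
```

For the example in this issue, the call would be something like: prune_mirror ./www.someweb.com boringnotes. Note that this only shrinks the output; the pages are still downloaded, and links pointing into the removed directory will be dead in the resulting ZIM file.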

ballerburg9005, Jan 19 '23 03:01

It is somewhat far-fetched, but you could set up Privoxy with "https-inspection" enabled and put "--no-check-certificate -e use_proxy=yes -e http_proxy=127.0.0.1:8118" into the wget command. This way the proxy would be able to read your HTTP requests and you could set it up to block the URL paths you want.

I suppose this is a much more desirable result than deleting the folders afterwards, if you really need it.
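A rough sketch of the proxy approach, under some assumptions: Privoxy is listening on its default address 127.0.0.1:8118, its user action file lives at /etc/privoxy/user.action (this varies by install), and the helper name privoxy_block_entry is made up for illustration. Privoxy URL patterns are written as host/path without the scheme.

```shell
#!/bin/sh
# Hypothetical helper: print a Privoxy action-file entry that blocks the
# given URL patterns. Each pattern is "host/path" with no "http://" prefix.
privoxy_block_entry() {
    printf '{+block{excluded from wget-2-zim crawl}}\n'
    for block_pattern in "$@"; do
        printf '%s\n' "$block_pattern"
    done
}

# Example usage (the file path is an assumption; adjust to your setup):
#   privoxy_block_entry 'www.someweb.com/boringnotes/' \
#       | sudo tee -a /etc/privoxy/user.action
#
# Then add the proxy flags from the comment above to the wget command:
#   wget --no-check-certificate -e use_proxy=yes \
#        -e http_proxy=127.0.0.1:8118 --recursive http://www.someweb.com
```

For a plain http:// site like the one in this issue, the proxy sees the full URL without any certificate tricks; the https-inspection part only matters for https:// sites, where wget additionally reads a separate https_proxy setting.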

ballerburg9005, Jan 29 '23 04:01