wget-2-zim
creates ZIM files for Kiwix from arbitrary websites with wget and some nifty tricks (doesn't need ServiceWorkers)
It would be convenient to be able to specify a download directory when running the script (e.g. with `-d` flag or something), rather than having it download to the directory...
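A minimal sketch of how such a `-d` flag could be parsed with `getopts`; the variable name `DOWNLOADDIR` and the default of the current directory are my own assumptions, not taken from wget-2-zim:

```shell
#!/bin/bash
# Hypothetical option parsing for a download-directory flag.
# DOWNLOADDIR is an assumed variable name, not one from the script itself.
DOWNLOADDIR="$PWD"   # default: current directory, matching today's behavior

while getopts "d:" opt; do
	case "$opt" in
		d) DOWNLOADDIR="$OPTARG" ;;
		*) echo "usage: $0 [-d download_dir] url" >&2; exit 1 ;;
	esac
done
shift $((OPTIND - 1))

mkdir -p "$DOWNLOADDIR"
cd "$DOWNLOADDIR" || exit 1   # wget then writes everything below this directory
```

Since the rest of the script runs relative to the working directory, a single `cd` after parsing would be enough to redirect all of wget's output.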
I've noticed some `index.html` files were missing after scraping a site with your script. Seems the problem is that if wget downloads some ~binary~ files to a directory then a...
`find $DOMAIN -type f \( -name '*.htm*' -or -name '*.php*' \) -exec "$iterscript" "$DOMAIN" '{}' "$EXTERNALURLS" "$WGETREJECT" "$NOOVERREACH" -not -path "./$DOMAIN/wget-2-zim-overreach/*" \;`: https://github.com/ballerburg9005/wget-2-zim/blob/6f83e1125fe3cc09be4e2f1bc2c1fdc42959cc66/wget-2-zim.sh#L209. This line produces no debug output and...
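If I read the predicate order right, two things look off in that line: the `-not -path` test comes after the `-exec`, so it is only evaluated once the exec has already run, and the pattern `./$DOMAIN/...` can never match because find was started from `$DOMAIN`, so its paths begin with `$DOMAIN/`, not `./$DOMAIN/`. A runnable sketch of the reordered command (the demo setup and the use of `/bin/echo` as a stand-in iteration script are mine; the variable names mirror the script's):

```shell
# Demo setup so the corrected command can run stand-alone
# ($DOMAIN, $iterscript etc. mirror the script's variables).
DOMAIN=example.com
iterscript=/bin/echo   # stand-in for the real per-file script
EXTERNALURLS='' WGETREJECT='' NOOVERREACH=''
mkdir -p "$DOMAIN/wget-2-zim-overreach"
printf x > "$DOMAIN/index.html"
printf x > "$DOMAIN/wget-2-zim-overreach/skip.html"

# Suggested reordering: the -not -path filter must precede -exec,
# and the pattern must match find's actual output
# ("$DOMAIN/...", not "./$DOMAIN/...").
find "$DOMAIN" -type f \
	-not -path "$DOMAIN/wget-2-zim-overreach/*" \
	\( -name '*.htm*' -or -name '*.php*' \) \
	-exec "$iterscript" "$DOMAIN" '{}' "$EXTERNALURLS" "$WGETREJECT" "$NOOVERREACH" \;
```

With this ordering the overreach directory is filtered out before the exec fires, and only `example.com/index.html` is handed to the iteration script.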
Suggestion: Option to exclude certain webpaths from being crawled, or at least written, maybe both
For example in my run of http://www.someweb.com I would like to exclude all of http://www.someweb.com/boringnotes/ from being crawled/written since there is nothing of interest to me there.
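wget itself already has switches for this which the script could expose; `--exclude-directories` (`-X`) and `--reject-regex` are real wget options, though how they would be wired into wget-2-zim is my guess:

```shell
# Skip /boringnotes/ during recursive retrieval (real wget flags;
# the integration point in the script is hypothetical).
wget --recursive --no-parent \
	--exclude-directories=/boringnotes \
	http://www.someweb.com/

# Alternatively, reject by regex matched against the full URL:
wget --recursive --no-parent \
	--reject-regex 'boringnotes' \
	http://www.someweb.com/
```

Excluding at crawl time (rather than only at write time) also saves bandwidth, since wget never fetches the excluded subtree at all.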
currently building zim-tools from GitHub requires a version of libzim that Debian does not distribute, so building zim-tools on a Debian machine is not possible (or at least,...
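One workaround is to build a matching libzim from source first and point zim-tools at it. A sketch under the assumption that both openzim projects use their usual Meson/ninja layout; the install prefix and the multiarch pkg-config path are guesses that may need adjusting per Debian release:

```shell
# Build a recent libzim from source, then build zim-tools against it
# instead of Debian's packaged libzim. Paths are assumptions.
git clone https://github.com/openzim/libzim.git
cd libzim
meson setup build --prefix=/usr/local
ninja -C build && sudo ninja -C build install
cd ..

git clone https://github.com/openzim/zim-tools.git
cd zim-tools
# Make pkg-config find the freshly installed libzim first.
export PKG_CONFIG_PATH=/usr/local/lib/x86_64-linux-gnu/pkgconfig:$PKG_CONFIG_PATH
meson setup build
ninja -C build
```

Until Debian ships a newer libzim, pinning an older zim-tools tag that still builds against the packaged libzim would be the other option.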