Exclude domains
An option to exclude domains would be very useful, so that no connection is made to the listed domains and no resources are downloaded from them.
Thank you for your program.
That's such a good idea, thank you for the suggestion! Do you have examples of other CLI tools that offer similar blacklists?
Currently Monolith has no array-type options; I need to find a good format for passing lists to the program.
Possible formats: -B *.domain.com,some-other-domain.com,friendbook.com, or -B not-this.com -B not-that.com, etc. There must be some common way of specifying wildcards in hostnames; perhaps that could be borrowed from iptables.
wget has the --exclude-domains option (it takes comma-delimited values). wget has some really great options that I am sometimes surprised to find missing in similarly popular tools. I do think wget's version is quite limited, though, as it only accepts exact domain names. Having wildcard/regex support would definitely make it better.
Thanks for working on this feature @snshn. Is it possible to have wildcard/regex matching for this?
Also, do you think a complementary whitelist option would be worth adding, so that connections are only made to the specified whitelisted domains/patterns while everything unmatched is automatically blacklisted? This would be especially useful for websites that connect to many unwanted domains, where blacklisting each one would become quite tedious.
Yeah, whitelisting is definitely something that needs to be added.
Not sure about wildcard/regex matching right now; perhaps it could be added later, once I see how other tools have implemented it. Regex could be used for matching the whole URL, not just the domain, but that's likely overkill.
Wget has some similar filtering options if you want to see how they are implemented:
--accept-regex / --reject-regex
--domains / --exclude-domains (wget lacks wildcard/regex for these unfortunately)
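For illustration only, here is a minimal Rust sketch of what reject-regex filtering of URLs could look like, using the regex crate. This is not how wget or Monolith actually implement it, and the pattern and URLs are made up:

```rust
use regex::Regex;

fn main() {
    // Hypothetical reject pattern: skip any asset served from an "ads." host.
    let reject = Regex::new(r"^https?://([^/]+\.)?ads\.[^/]+/").unwrap();

    let urls = [
        "https://example.com/article.html",
        "https://ads.example.com/banner.js",
    ];

    for url in urls.iter() {
        if reject.is_match(url) {
            println!("skipping {}", url);
        } else {
            println!("fetching {}", url);
        }
    }
}
```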
Wildcard matching would be really nice to have, in my opinion. I use regex infrequently, but it does come in handy when you want to match mainly based on the URL.
This is currently in master; it needs a bit more testing before I'm able to release it as part of 2.7.0.
-d .google.com will match subdomains (as well as google.com itself), while -d google.com will only match google.com; -E excludes the listed domains instead of only allowing them, so e.g.:
monolith https://somesite.com/somearticle.html -d .google.com -d .googleadservices.com -E -I -o degoogled.html
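To make the matching rule concrete, here is a small, self-contained Rust sketch of the leading-dot semantics. domain_matches and should_fetch are hypothetical helpers written for this comment, not Monolith's actual functions, and the way -d/-E combine here is just my reading of the description above:

```rust
/// Hypothetical helper: a pattern with a leading dot (".google.com") matches
/// the domain itself and any of its subdomains; without the dot, only an
/// exact match counts.
fn domain_matches(pattern: &str, domain: &str) -> bool {
    if let Some(bare) = pattern.strip_prefix('.') {
        domain == bare || domain.ends_with(pattern)
    } else {
        domain == pattern
    }
}

/// Hypothetical helper: with -E (exclude = true) matched domains are skipped;
/// without it, only matched domains are fetched.
fn should_fetch(filters: &[&str], exclude: bool, domain: &str) -> bool {
    let matched = filters.iter().any(|&f| domain_matches(f, domain));
    if exclude { !matched } else { matched }
}

fn main() {
    let filters = [".google.com", ".googleadservices.com"];
    assert!(!should_fetch(&filters, true, "maps.google.com")); // excluded
    assert!(!should_fetch(&filters, true, "google.com"));      // excluded
    assert!(should_fetch(&filters, true, "somesite.com"));     // allowed
    println!("filter behaves as described");
}
```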
At first it also supported -d .google.com,google.com, but that looks extremely messy and is harder to read than standalone -d's; it's also quite pointless, since the clap CLI-parsing crate now supports multiple occurrences of the same option.
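As a rough sketch of the multiple-occurrence approach, using the clap 2-style API (the option names and help strings here are illustrative, not Monolith's exact definitions):

```rust
use clap::{App, Arg};

fn main() {
    let matches = App::new("monolith-example")
        .arg(
            Arg::with_name("domains")
                .short("d")
                .long("domain")
                .takes_value(true)
                .multiple(true) // allows -d a.com -d b.com instead of a comma-separated list
                .help("Specify domains to use for filtering"),
        )
        .arg(
            Arg::with_name("exclude")
                .short("E")
                .long("exclude-domains")
                .help("Treat the listed domains as a blacklist"),
        )
        .get_matches();

    let domains: Vec<&str> = matches
        .values_of("domains")
        .map(|v| v.collect())
        .unwrap_or_default();
    let exclude = matches.is_present("exclude");

    println!("domains: {:?}, exclude: {}", domains, exclude);
}
```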
Regex, along with support for IP addresses, port numbers, protocols, etc., will come later; I can't delay this feature any longer.
Almost two years later, but this is done. Please see the latest README.md file for instructions. 2.7.0 is out and should be available via package managers within a day or two. Thank you for your patience and for helping get this feature in!