adblock-lean icon indicating copy to clipboard operation
adblock-lean copied to clipboard

Erroneous rogue element detection?

Open jul059 opened this issue 1 year ago • 13 comments

Adblock-lean reports a rogue element in the following blocklist:

Warning: Rogue element: '711線上認證.online' identified originating in blocklist file from: https://raw.githubusercontent.com/elliotwutingfeng/GlobalAntiScamOrg-blocklist/main/global-anti-scam-org-scam-urls-pihole.txt.

It looks to me like an erroneous detection.

jul059 avatar Nov 03 '24 18:11 jul059

It does appear that asian characters are allowable in URL addresses. https://stackoverflow.com/questions/7184802/are-chinese-characters-allowed-entered-in-urls

We have not so far whitelisted these types of characters in adblock-lean and is why this rogue element detection occurs. Alphaumeric characters are allowed via [::alnum::] I'll have a look for an equivalent for asian characters.

Wizballs avatar Nov 03 '24 19:11 Wizballs

I thought that international characters, while showing up as such in the browser, in fact are transparently encoded into ascii characters when dns resolution is performed. This Wikipedia article seems to confirm that: https://en.m.wikipedia.org/wiki/Internationalized_domain_name

Which means that internationalized domain names can be adblocked as-is, in their ascii transcription. Technically we could implement support for additional alphabets, but they are many, so this may get complicated and have a performance impact. For this reason, I think that the correct solution is to use lists that have the internationalized domain names encoded with Punycode.

I'd suggest @jul059 to file a bug report to whoever maintains the list which has this entry.

Edit: if using a different list or fixing this list is not possible, I suppose we could make rogue element detection optional. Since rogue element check safeguards against both corrupted and malicious lists, the user would then need to make sure that the lists they are using are safe and sound.

friendly-bits avatar Nov 03 '24 23:11 friendly-bits

Upon further research, perhaps we could consider changing the regex to support all unicode characters. Apparently this can be done by replacing [[:alnum:]_-] with (\w|[0‐9]|-).

(edit: it's [0-9], not [0_9] as incorrectly represented by Github)

\w is supposed to match "word" character, which supposedly means any letter in either ascii or unicode, and underscore. So this should cover all alphabets (unless I am misunderstanding something). The caveats:

  • \w is non-POSIX. So whether it does or does not work in sed depends on the specific sed implementation. Busybox seems to support this, but we would need to test against Busybox sed in OoenWrt to make sure.
  • This may have performance impact

If this does work but has performance impact, we could make this an optional feature.

The question of whether this is actually needed still stands.

friendly-bits avatar Nov 04 '24 00:11 friendly-bits

international characters, while showing up as such in the browser, in fact are transparently encoded into ascii characters when dns resolution is performed.

I read this same thing also. The Blocklist maintainer could possibly switch to using ascii character.

replacing [[:alnum:]_-] with (\w|[0‐9]|-)

I'll give this a performance/sed test later on tonight. Let's see how that goes and decide from there.

Wizballs avatar Nov 04 '24 02:11 Wizballs

So I have some different script behavior than OP. Running on the latest adblock-lean main (no modification), my test file with entry 華信金融.tw passes adblock-lean rogue check with no issues https://raw.githubusercontent.com/Wizballs/dnsmasqaddresstest/refs/heads/main/test

However, dnsmasq does not like these characters anyway. So that should pretty much be decision made as asian characters cause this error:

Error: The dnsmasq test on the final blocklist failed.
dnsmasq --test errors:
dnsmasq: error at line 15843 of stdin

Error: New blocklist file check failed.

Wizballs avatar Nov 04 '24 11:11 Wizballs

One other thing we could do for this use case is add an option enabling pre-processing of certain lists with the idn2 utility. idn2 can be used to translate unicode to punycode. This requires packages idn2 libidn2 (seems to be available in the repo).

So something like

unicode_blocklist_urls="https://raw.githubusercontent.com/elliotwutingfeng/GlobalAntiScamOrg-blocklist/main/global-anti-scam-org-scam-urls-pihole.txt"

friendly-bits avatar Nov 04 '24 12:11 friendly-bits

I think converting urls to punycode should be up to the list maintainer and not built into adblock-lean. The OP could request that particular blocklist maintainer to do so.

Something else in that list is d+D446own.chaojis0.com I haven't seen urls with a + symbol before. And while a Google search says these are accepted, I can't get this address to resolve. It also has a capital letter which is unusual.

Wizballs avatar Nov 04 '24 20:11 Wizballs

The question is whether including internationalized domain names (IDN's) in blocklists is a normal practice. From a little research, it looks like at least pihole does accept IDN's. I can't tell how widespread the use of these lists is but if it doesn't cost us much effort, we could as well support them in adblock-lean (with help of idn2).

friendly-bits avatar Nov 04 '24 22:11 friendly-bits

I don't think it's widespread use, first I've seen after all these years. Can you make any sense of this address in that same list? I'm trying to understand if the maintainer needs to make some updates since the really only a few idns in that list , and a couple that look like this:

d+D446own.chaojis0.com

Wizballs avatar Nov 05 '24 10:11 Wizballs

Does this issue still need to be worked on?

lynxthecat avatar Dec 02 '24 16:12 lynxthecat

@lynxthecat I think we haven't reached a conclusion on this one. I think it's a matter of policy. In other words, you are the boss, you decide :)

friendly-bits avatar Jan 13 '25 23:01 friendly-bits

So I have some different script behavior than OP. Running on the latest adblock-lean main (no modification), my test file with entry 華信金融.tw passes adblock-lean rogue check with no issues https://raw.githubusercontent.com/Wizballs/dnsmasqaddresstest/refs/heads/main/test

However, dnsmasq does not like these characters anyway. So that should pretty much be decision made as asian characters cause this error:

Error: The dnsmasq test on the final blocklist failed.
dnsmasq --test errors:
dnsmasq: error at line 15843 of stdin

Error: New blocklist file check failed.

Doesn't this answer it for us? Ultimately if dnsmasq is going to reject we're limited by that?

lynxthecat avatar Jan 17 '25 08:01 lynxthecat

Doesn't this answer it for us? Ultimately if dnsmasq is going to reject we're limited by that?

This definitely does mean that those lists can not be used as-is. See further discussion of implementing a conversion mechanism, which is what probably some other adblockers which do accept these lists do.

friendly-bits avatar Jan 17 '25 08:01 friendly-bits