WCF
Investigate normalizing the username / email for SFS processing
This change is not correct.
- The local part of an email address is case sensitive and thus must not be normalized (RFC 5321#2.3.11).
- I however acknowledge that in the real world email addresses tend to be case insensitive and that the risk of a false-positive match is appropriately low.
- For usernames, where normalization is actually viable without ignoring standards, it breaks the matching, because SFS provides usernames in a case-preserving list that is used as-is within the hash database.
- Thus lowercasing the username will make username matching completely non-functional.
- With Unicode in mind, simply lowercasing the value to perform a match is not going to cut it either. The string should also be normalized (probably using NFC normalization).
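To illustrate the last point, here is a minimal sketch of what "normalize, then lowercase" could look like in PHP. The function name `normalize_username` is hypothetical and is not part of WCF or SFS. The sketch assumes the `intl` extension (for `\Normalizer`) and the `mbstring` extension are available, and it uses plain lowercasing rather than full Unicode case folding.

```php
<?php
// Hypothetical helper for illustration only; not an existing WCF or SFS API.
function normalize_username(string $username): string
{
    // NFC composes sequences such as "E" + U+0301 into the single code point "É".
    // Requires the intl extension; returns false for input that is not valid UTF-8.
    $nfc = \Normalizer::normalize($username, \Normalizer::FORM_C);
    if ($nfc === false) {
        throw new \InvalidArgumentException('Username is not valid UTF-8.');
    }

    // Lowercase after composing, so both spellings end up byte-identical.
    return \mb_strtolower($nfc, 'UTF-8');
}

$composed = "\u{00C9}ric";    // "É" as one code point
$decomposed = "E\u{0301}ric"; // "E" followed by a combining acute accent

// The two inputs differ byte-wise, but match after normalization.
var_dump($composed === $decomposed);                                         // bool(false)
var_dump(normalize_username($composed) === normalize_username($decomposed)); // bool(true)
```

Whatever normalization is chosen, the list provider and the consumer would have to apply exactly the same steps, otherwise the hashes will not match.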
In addition, this normalization would need to be appropriately documented / indicated, because performing this kind of normalization effectively creates data out of thin air. The result is not just an alternative representation of the SFS database, but a completely distinct database. This might or might not have legal liability implications in case of a false-positive.
In any case, performing such a normalization would probably not be covered by the stability guarantees of 5.2.
The best solution going forward would probably be for SFS themselves to provide pre-hashed lists with a clear indication of the normalization that needs to be performed to match within those pre-hashed lists.
Originally posted by @TimWolla in https://github.com/WoltLab/WCF/issues/4004#issuecomment-782118449
Include me, happy to add this on my side.
FYI, everything is converted to lower case and hashed that way. Unicode however, well, that's a problem, as many sites that submit Unicode data either won't encode it properly, don't tell me what encoding it is, or just convert it to [insert random charset here]. This really only applies to the username component at the moment, as the email address standard for Unicode is Punycode ASCII.
I won't be supporting case-sensitive email addresses, sorry, due to the nightmare of tracking which domain supports it vs which domain is normalising it. That's just a minefield of pain that I don't have the time to get into. The scope for abuse of this, both in phishing and spam, is just too vast.
I'm not sure there is any email provider supporting that without falling back to Punycode or some form of ASCII encoding.
@stopforumspam Thank you for chiming in. We are currently finalizing WoltLab Suite 5.4. We'll come back to you once development of 5.5 has begun.
@stopforumspam I'd like to make a first proposal for pre-hashed IP addresses, because IP addresses are the simplest case as the encoding is properly defined. My understanding is that you use PHP, that's why I'll give the example in PHP:
<?php
// Returns the hashes for the exact IPv4 address and for its /24 and /16 subnets.
function hash_ipv4(string $ipv4): array
{
    // Binary (4-byte) form, independent of the textual representation.
    $binary32 = \inet_pton($ipv4);
    // Bitwise AND on two strings works byte by byte and zeroes the host part.
    $binary24 = $binary32 & "\xff\xff\xff\x00";
    $binary16 = $binary32 & "\xff\xff\x00\x00";

    // A distinct HMAC key per prefix length keeps the lists apart.
    return [
        \hash_hmac('md5', $binary32, 'com.stopforumspam:ipv4'),
        \hash_hmac('md5', $binary24, 'com.stopforumspam:ipv4:24'),
        \hash_hmac('md5', $binary16, 'com.stopforumspam:ipv4:16'),
    ];
}

// Returns the hashes for the exact IPv6 address and for its /64 to /32 subnets.
function hash_ipv6(string $ipv6): array
{
    // Binary (16-byte) form, independent of the textual representation.
    $binary128 = \inet_pton($ipv6);
    $binary64 = $binary128 & "\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00";
    $binary56 = $binary128 & "\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00";
    $binary48 = $binary128 & "\xff\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00";
    $binary40 = $binary128 & "\xff\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00";
    $binary32 = $binary128 & "\xff\xff\xff\xff\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00";

    return [
        \hash_hmac('md5', $binary128, 'com.stopforumspam:ipv6'),
        \hash_hmac('md5', $binary64, 'com.stopforumspam:ipv6:64'),
        \hash_hmac('md5', $binary56, 'com.stopforumspam:ipv6:56'),
        \hash_hmac('md5', $binary48, 'com.stopforumspam:ipv6:48'),
        \hash_hmac('md5', $binary40, 'com.stopforumspam:ipv6:40'),
        \hash_hmac('md5', $binary32, 'com.stopforumspam:ipv6:32'),
    ];
}
Performing the hashing using this method has several benefits:
- The exact encoding of the IP address is irrelevant, because the binary form is being used. Consider 2001:0db8:85a3:0000:0000:8a2e:0370:7334 vs 2001:db8:85a3::8a2e:370:7334.
- The most commonly used subnets are also provided, because, at least for IPv6, it's not useful to check specific IP addresses, and with hashes the consumer can't derive the subnets themselves.
- I'm using the HMAC construction to ensure the hashes are unique to SFS (i.e. they can't easily be compared against other hashed lists of IP addresses), improving privacy. I'm using MD5 to keep the hashes small; we don't care about the security properties of MD5 here.
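The points above can be checked with a few lines of PHP. This is only an illustrative sketch that reuses the HMAC keys from the proposal. It shows that both spellings of the IPv6 address yield the same binary form and hash, how the string mask zeroes the host part, and that the per-prefix keys keep an exact address ending in .0 distinct from the /24 subnet entry.

```php
<?php
// Two textual encodings of the same IPv6 address.
$long = '2001:0db8:85a3:0000:0000:8a2e:0370:7334';
$short = '2001:db8:85a3::8a2e:370:7334';

// inet_pton() yields the same 16 bytes for both, so the hashes are identical too.
$sameBinary = \inet_pton($long) === \inet_pton($short);
$sameHash = \hash_hmac('md5', \inet_pton($long), 'com.stopforumspam:ipv6')
    === \hash_hmac('md5', \inet_pton($short), 'com.stopforumspam:ipv6');
var_dump($sameBinary, $sameHash); // bool(true) bool(true)

// Bitwise AND on two strings masks byte by byte: the last octet becomes 0.
$subnet24 = \inet_pton('192.168.2.46') & "\xff\xff\xff\x00";
echo \inet_ntop($subnet24), "\n"; // 192.168.2.0

// The same 4 bytes hashed with different keys give different hashes, so the
// single address 192.168.2.0 is not confused with the subnet 192.168.2.0/24.
$exactHash = \hash_hmac('md5', $subnet24, 'com.stopforumspam:ipv4');
$subnetHash = \hash_hmac('md5', $subnet24, 'com.stopforumspam:ipv4:24');
var_dump($exactHash === $subnetHash); // bool(false)
```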
Of course by providing the subnets in the prehashed list some hashes will be duplicated if multiple IP addresses from a single subnet are being provided. I suggest that it's the job of the consumer to deal with this fact (e.g. by merging them themselves). This allows you to generate the file in a simple streaming fashion, reducing computing requirements on your end. The generation of the file could then look like this:
$data = [
    ['ip' => '192.168.2.46', 'hits' => 3, 'lastHit' => date('c', time())],
    ['ip' => '192.168.17.49', 'hits' => 5, 'lastHit' => date('c', time())],
];
foreach ($data as $entry) {
    $hashes = hash_ipv4($entry['ip']);
    foreach ($hashes as $hash) {
        \fputcsv(STDOUT, [$hash, $entry['hits'], $entry['lastHit']]);
    }
}
Resulting in:
76130df5d7c47e4ccbd6eec0e7982dc9,3,2021-06-08T09:09:15+00:00
9eca44f2f22810a3eb03e1401dcb702e,3,2021-06-08T09:09:15+00:00
916d946d4c8c126388678bdb9a32f287,3,2021-06-08T09:09:15+00:00
84061cf307b2512636e79c0ca34ceeea,5,2021-06-08T09:09:15+00:00
4ed706353f82cd0e81ea90191f92d4f2,5,2021-06-08T09:09:15+00:00
916d946d4c8c126388678bdb9a32f287,5,2021-06-08T09:09:15+00:00
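Note that the /16 hash appears twice in this output, once per source address. On the consumer side, merging such duplicates and checking an address could be sketched as follows. All function names here are hypothetical, and the merge policy (sum the hits, keep the most recent lastHit) is only an assumption; the proposal leaves this to the consumer. The hashing mirrors the `hash_ipv4()` proposal with added input validation.

```php
<?php
// Hypothetical consumer-side helpers; names and merge policy are assumptions.

// Same hashing as the proposed hash_ipv4(), with input validation added.
function sfs_ipv4_hashes(string $ipv4): array
{
    if (\filter_var($ipv4, \FILTER_VALIDATE_IP, \FILTER_FLAG_IPV4) === false) {
        throw new \InvalidArgumentException('Not an IPv4 address.');
    }
    $binary = \inet_pton($ipv4);

    return [
        \hash_hmac('md5', $binary, 'com.stopforumspam:ipv4'),
        \hash_hmac('md5', $binary & "\xff\xff\xff\x00", 'com.stopforumspam:ipv4:24'),
        \hash_hmac('md5', $binary & "\xff\xff\x00\x00", 'com.stopforumspam:ipv4:16'),
    ];
}

// Merges rows of [hash, hits, lastHit]: sums the hits, keeps the latest lastHit.
function merge_rows(iterable $rows): array
{
    $merged = [];
    foreach ($rows as [$hash, $hits, $lastHit]) {
        if (!isset($merged[$hash])) {
            $merged[$hash] = ['hits' => 0, 'lastHit' => $lastHit];
        }
        $merged[$hash]['hits'] += (int) $hits;
        if (\strtotime($lastHit) > \strtotime($merged[$hash]['lastHit'])) {
            $merged[$hash]['lastHit'] = $lastHit;
        }
    }

    return $merged;
}

// Returns the most specific matching entry (exact, then /24, then /16) or null.
function lookup_ipv4(array $merged, string $ipv4): ?array
{
    foreach (sfs_ipv4_hashes($ipv4) as $hash) {
        if (isset($merged[$hash])) {
            return $merged[$hash];
        }
    }

    return null;
}

// Rebuild the rows of the example above instead of reading them from a file.
$lastHit = '2021-06-08T09:09:15+00:00';
$rows = [];
foreach ([['192.168.2.46', 3], ['192.168.17.49', 5]] as [$ip, $hits]) {
    foreach (sfs_ipv4_hashes($ip) as $hash) {
        $rows[] = [$hash, $hits, $lastHit];
    }
}

$merged = merge_rows($rows);
echo \count($merged), "\n";                             // 5 (six rows, one shared /16)
echo lookup_ipv4($merged, '192.168.99.1')['hits'], "\n"; // 8 (matched via the /16)
var_dump(lookup_ipv4($merged, '10.0.0.1'));             // NULL
```

Whether a match on a wide subnet such as /16 should count as a hit at all is a policy decision for the consumer; the sketch simply returns the first match.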