SafeText
Apply NFKC normalisation
Otherwise I can fingerprint on diacritic form, ligatures, etc.
I don't know if it also removes the homoglyphs. Might want to look into that.
NFKC does change the appearance of the text a bit if you're using display variants, e.g. blackletter h vs Latin h, but NFC normalisation leaves too many fingerprinting options open.
http://unicode.org/reports/tr15/#Canon_Compat_Equivalence
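For anyone following along, here is a rough sketch of the difference using Python's standard `unicodedata` module. The sample strings and the `normalise` helper are made up for illustration and are not part of SafeText. It also suggests an answer to the homoglyph question: NFKC folds compatibility variants such as ligatures and Fraktur letters, but a Cyrillic "а" is a distinct letter, not a compatibility variant, so it is left alone.

```python
import unicodedata

# Hypothetical sample strings, chosen only to illustrate each case.
samples = {
    "ligature": "\ufb01le",        # U+FB01 LATIN SMALL LIGATURE FI + "le"
    "fraktur": "\U0001d525i",      # U+1D525 MATHEMATICAL FRAKTUR SMALL H + "i"
    "decomposed": "cafe\u0301",    # "e" followed by U+0301 COMBINING ACUTE ACCENT
    "homoglyph": "p\u0430ss",      # U+0430 CYRILLIC SMALL LETTER A in place of Latin "a"
}


def normalise(text, form="NFKC"):
    """Return text normalised to the given Unicode normalisation form."""
    return unicodedata.normalize(form, text)


for name, text in samples.items():
    nfc = normalise(text, "NFC")
    nfkc = normalise(text, "NFKC")
    # ascii() makes invisible differences between the forms visible.
    print(f"{name:10} NFC={ascii(nfc):16} NFKC={ascii(nfkc)}")
```

NFC keeps the ligature and the Fraktur letter intact, NFKC folds both to plain Latin letters, and neither form touches the Cyrillic homoglyph, so homoglyphs would need a separate pass.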
Thanks for this, I'll have to look into it. I'll leave it open until I fix it
https://stackoverflow.com/questions/5258623/remove-special-characters-from-string
I think this method:
>>> unicodedata.normalize('NFKD', source).encode('ascii', 'ignore')
is the simplest and most correct method here. In fact, I think you could just compare the text with the encoded/cleaned version and that would be enough.
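A minimal sketch of that idea, assuming Python 3 (where `encode` returns bytes, so a `decode` step is needed to get text back). The function names `to_ascii` and `has_non_ascii_residue` are hypothetical, not taken from SafeText.

```python
import unicodedata


def to_ascii(source):
    """Decompose with NFKD, then drop every character outside ASCII."""
    # NFKD splits accented letters into base letter + combining mark,
    # so "é" keeps its "e" when the mark is dropped by errors="ignore".
    decomposed = unicodedata.normalize("NFKD", source)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def has_non_ascii_residue(source):
    """Return True if cleaning changed the text, i.e. something was folded or stripped."""
    return to_ascii(source) != source


print(to_ascii("caf\u00e9"))               # accented letter keeps its base letter
print(to_ascii("na\u200bme"))              # zero-width space is dropped
print(has_non_ascii_residue("plain text"))
```

Note that this is lossy for anything without an ASCII decomposition: a Cyrillic "а" is deleted outright rather than replaced, so `"p\u0430ss"` becomes `"pss"`.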
Why would you re-encode as ASCII?
To strip out all non-ASCII chars, just to make sure there is nothing at all that could be used to fingerprint the text.
With ASCII you can still fingerprint on:
- Number of whitespace characters
- Extra/changed characters hidden as typos and/or wrong punctuation (Unicode just expands this option)
And on a bunch of things that are probably out of scope
- Exact numbers used
- Rephrasings
- Restructuring (moving sections, paragraphs, etc around)
Remember, the attacker only needs about log2(number of people with access) bits of identifying changes to survive sanitisation and conversion.
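To put numbers on that estimate, here is a small sketch. The helper name `bits_needed` is hypothetical, and rounding up to whole bits is an assumption added here for illustration.

```python
import math


def bits_needed(people_with_access):
    """Whole bits needed to give each person a distinct fingerprint."""
    if people_with_access < 1:
        raise ValueError("need at least one person with access")
    # log2(n) bits distinguish n recipients; round up to whole bits.
    return math.ceil(math.log2(people_with_access))


for n in (8, 1000, 1_000_000):
    print(n, "recipients ->", bits_needed(n), "bits")
```

So even with a million recipients, about 20 surviving binary choices (a typo here, a swapped comma there) would be enough to single out one copy.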
The number of spaces is easy to spot and also easy to fix, e.g. collapse all runs of spaces to a single one. Typos could be dealt with, but I agree it is hard to do automatically.
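Something along these lines would do the collapsing. It is only a sketch: `collapse_spaces` is a hypothetical name, and keeping line breaks while stripping leading and trailing whitespace on each line is a design choice assumed here, not something SafeText is known to do.

```python
import re


def collapse_spaces(text):
    """Collapse each run of whitespace within a line to a single space."""
    cleaned = []
    for line in text.splitlines():
        # \s also matches tabs and Unicode spaces such as U+00A0,
        # so those are folded into a plain space as well.
        cleaned.append(re.sub(r"\s+", " ", line).strip())
    return "\n".join(cleaned)


print(collapse_spaces("one   two\t\tthree \nfour\u00a0\u00a0five"))
```

This removes the whitespace-count channel within lines; the number of blank lines between paragraphs is still preserved and would need its own rule.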
I think it's about lowering the probability, not removing the possibility of such an attack altogether.
Reading through this:
- I'll add the normalization, looks pretty useful,
- As for the comments about reencoding as ASCII - I'm going to agree with @Visgean in that we want to remove anything 'non-ascii'. This would be a concern if the tool were to be used with other languages, but really I'm centering it around the Latin character set.
- @cmcaine You raise valid points, it's just easier to clean once all the "questionable" characters have been removed. And in regards to your last 3 bullet points, you are entirely correct. However, I'm trying to address the issue of fingerprinting in text - not fingerprinting through language patterns/word choice. In the future, I may try to add something that swaps out words with synonyms, but that's down the road.
Thanks for the feedback, I appreciate it. I'll try to get around to implementing things within the next few days. And of course, feel free to submit a pull request!