SafeText Apply NFKC normalisation

Otherwise I can fingerprint on diacritic form, ligatures, etc.

I don't know if it also removes the homoglyphs. Might want to look into that.
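A quick check suggests it does not, at least for cross-script homoglyphs. This is a minimal sketch using only the standard library's `unicodedata`; the `spoofed` string is a made-up example, not something from SafeText.

```python
import unicodedata

latin = "paypal"
# Hypothetical spoofed string: the second character is
# CYRILLIC SMALL LETTER A (U+0430), which looks like Latin "a".
spoofed = "p\u0430ypal"

normalised = unicodedata.normalize("NFKC", spoofed)

# NFKC leaves the Cyrillic letter alone, because it has no
# compatibility decomposition to the Latin letter.
print(normalised == spoofed)            # True
print(normalised == latin)              # False
print(unicodedata.name(normalised[1]))  # CYRILLIC SMALL LETTER A
```

So homoglyphs from other scripts would need separate handling, for example a confusables table or a script whitelist.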

NFKC does change the appearance of the text a bit if you're using display variants, e.g. blackletter h vs. Latin h, but NFC normalisation permits too many fingerprinting options.

http://unicode.org/reports/tr15/#Canon_Compat_Equivalence
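To illustrate the difference, here is a small sketch comparing NFC and NFKC on three sample strings. The samples and the `normalise_all` helper are my own illustrations, not part of SafeText.

```python
import unicodedata

# Illustrative samples only.
samples = {
    "decomposed e-acute": "e\u0301",  # e + COMBINING ACUTE ACCENT
    "fi ligature": "\ufb01",          # LATIN SMALL LIGATURE FI
    "fraktur h": "\U0001d525",        # MATHEMATICAL FRAKTUR SMALL H
}

def normalise_all(form):
    """Apply one normalisation form to every sample."""
    return {label: unicodedata.normalize(form, text)
            for label, text in samples.items()}

nfc = normalise_all("NFC")
nfkc = normalise_all("NFKC")

for label in samples:
    print(label, ascii(nfc[label]), ascii(nfkc[label]))
```

Both forms compose the accent into a single code point, but only NFKC folds the ligature to "fi" and the Fraktur letter to a plain "h". Under NFC those two survive and can carry a fingerprint.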

Jan 01 '18 12:01 cmcaine

Thanks for this, I'll have to look into it. I'll leave it open until I fix it

Jan 02 '18 05:01 DavidJacobson

https://stackoverflow.com/questions/5258623/remove-special-characters-from-string

I think this method:

>>> unicodedata.normalize('NFKD', source).encode('ascii', 'ignore')

is the simplest and most correct method here. In fact, I think you could just compare the text with the encoded/cleaned version and that would be OK.
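As a rough sketch of that idea (the names `to_ascii` and `was_changed` are hypothetical, and in Python 3 the result of `encode` is `bytes`, so it is decoded back to `str` here):

```python
import unicodedata

def to_ascii(source: str) -> str:
    """Decompose with NFKD, then drop every code point that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", source)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def was_changed(source: str) -> bool:
    """True if cleaning altered the text, i.e. it held non-ASCII content."""
    return to_ascii(source) != source

# Sample with a diaeresis, an acute accent, an en dash and an fi ligature.
text = "na\u00efve caf\u00e9 \u2013 \ufb01ne"
print(repr(to_ascii(text)))  # 'naive cafe  fine'
print(was_changed(text))     # True
```

Note that `'ignore'` deletes characters with no ASCII decomposition (the en dash above) rather than substituting them, which is why the output has a double space.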

Jan 02 '18 19:01 Visgean

Why would you re-encode as ASCII?

Jan 02 '18 19:01 cmcaine

To strip out all non-ASCII chars, just to make sure there is nothing at all that could be used to fingerprint the text.

Jan 02 '18 21:01 Visgean

With ASCII you can still fingerprint on:

Number of whitespace characters
Extra/changed characters hidden as typos and/or wrong punctuation (Unicode just expands this option)

And on a bunch of things that are probably out of scope

Exact numbers used
Rephrasings
Restructuring (moving sections, paragraphs, etc around)

Remember the attacker only needs about log2(number of people with access) bits of identifying changes to survive any sanitisation and conversion.
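To put numbers on that, a small sketch (the function name `bits_needed` and the recipient counts are illustrative assumptions):

```python
def bits_needed(recipients: int) -> int:
    """Smallest number of bits that gives every recipient a unique mark.

    Equal to ceil(log2(recipients)) for recipients >= 1, computed with
    integer arithmetic to avoid floating-point rounding.
    """
    if recipients < 1:
        raise ValueError("recipients must be at least 1")
    return (recipients - 1).bit_length()

for n in (8, 1000, 1_000_000):
    print(n, bits_needed(n))
# 8 3
# 1000 10
# 1000000 20
```

Ten surviving yes/no differences, say ten places where a comma could be present or absent, would be enough to tell a thousand recipients apart.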

Jan 02 '18 21:01 cmcaine

Number of spaces is easy to spot and also easy to fix, e.g. collapse all spaces to a single one. Typos could be dealt with, but I agree it is hard to do automatically.
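Something along these lines, as a minimal sketch (`collapse_whitespace` is a hypothetical helper name). Note that it also flattens tabs and line breaks, which may or may not be what you want:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(collapse_whitespace("a  b\t\tc \n d"))  # a b c d
```

With `str` patterns in Python 3, `\s` also matches Unicode whitespace such as the no-break space, so those runs are collapsed too.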

I think it's about lowering the probability, not removing the possibility of such an attack altogether.

Jan 02 '18 22:01 Visgean

Reading through this:

I'll add the normalization, looks pretty useful,
As for the comments about reencoding as ASCII - I'm going to agree with @Visgean in that we want to remove anything 'non-ascii'. This would be a concern if the tool were to be used with other languages, but really I'm centering it around the Latin character set.
@cmcaine You raise valid points, it's just easier to clean once all the "questionable" characters have been removed. And in regards to your last 3 bullet points, you are entirely correct. However, I'm trying to address the issue of fingerprinting in text - not fingerprinting through language patterns/word choice. In the future, I may try to add something that swaps out words with synonyms, but that's down the road.

Thanks for the feedback, wanted to say I appreciate it. I'll try to get around to implementing things within the next few days. And of course, feel free to submit a pull request!

Jan 03 '18 06:01 DavidJacobson
