
Use NFC Unicode normalization in datasets/na.py

Open oadams opened this issue 6 years ago • 6 comments

oadams avatar Mar 20 '18 04:03 oadams

I think this is a very good idea, so if you want me to implement it, let me know by assigning this ticket to me.

shuttle1987 avatar Mar 23 '18 06:03 shuttle1987

No worries, I'll do this.

oadams avatar Mar 25 '18 01:03 oadams

Perhaps we can add the easy-first label if this hasn't already been done?

shuttle1987 avatar Aug 10 '18 10:08 shuttle1987

Sure.

oadams avatar Aug 11 '18 15:08 oadams

I really need to remember to do this over in the Web API. I suspect this will save us from running into a bunch of bugs?

shuttle1987 avatar Sep 15 '18 07:09 shuttle1987

Just one bug really. Everything will still work, but the model will treat two equivalent symbols as distinct. That means a tad more data sparsity for those labels, so the model might underperform slightly. It's good practice to follow, but it shouldn't break anything if you don't.

oadams avatar Sep 17 '18 13:09 oadams
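
For anyone picking this up, here is a minimal sketch of the effect described above, using only the standard library's `unicodedata.normalize`. The helper name `normalize_label` and the sample labels are hypothetical and are not taken from `datasets/na.py`; the point is only that a precomposed character and its decomposed equivalent count as two labels until both are put into NFC.

```python
import unicodedata


def normalize_label(label: str) -> str:
    """Return the label in NFC form (hypothetical helper, not from the repo)."""
    return unicodedata.normalize("NFC", label)


# Two visually identical spellings of "é":
composed = "\u00e9"      # single precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# Without normalization the label set treats them as different symbols.
raw_labels = {composed, decomposed}

# After NFC normalization they collapse into a single symbol.
nfc_labels = {normalize_label(label) for label in (composed, decomposed)}

print(len(raw_labels), len(nfc_labels))
```

Applying the same normalization to transcriptions at load time and to any input coming through the Web API would keep the label inventory consistent between the two.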