persephone
Use NFC Unicode normalization in datasets/na.py
I think this is a very good idea. If you'd like me to implement it, let me know by assigning this ticket to me.
No worries, I'll do this.
Perhaps we can add the easy-first label if this hasn't already been done?
Sure.
I really need to remember to do this over in the Web API. I suspect it will save us from running into a bunch of bugs?
Just one bug really. Everything will still work, but the model will treat two equivalent symbols as distinct. As a result there'll be a bit more data sparsity for those labels, and the model might underperform slightly. So it's good practice to follow, but nothing should break if you don't.
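For anyone picking this up, here is a minimal sketch of the issue using Python's standard `unicodedata` module. The helper name `normalize_label` is hypothetical and not taken from datasets/na.py; where exactly the call belongs in the preprocessing is for whoever implements this to decide.

```python
import unicodedata


def normalize_label(text: str) -> str:
    """Return text in NFC form so canonically equivalent sequences compare equal.

    Hypothetical helper for illustration; not part of the existing code base.
    """
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

# Without normalization, the two renderings of the same symbol are
# different strings, so they would end up as two separate labels.
assert composed != decomposed

# After NFC normalization both collapse to the same string.
assert normalize_label(composed) == normalize_label(decomposed)
```

Applying the same normalization to every transcription before the label set is built should keep equivalent symbols from being split into separate labels.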