John Bauer
The probably unsatisfying answer is that all of the brackets in AnCora are round. We could probably detect that and teach the tokenizer to treat square brackets the same as round ones. I'll...
I made some changes to the training to hopefully capture [] if the dataset doesn't already have [] in it. Also, we were in fact attempting to augment the ellipses,...
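The bracket augmentation could look something like this (a hypothetical sketch, not the actual training code: the function name and rate parameter are made up for illustration). The idea is to copy a fraction of the training sentences with round brackets swapped for square ones, so a tokenizer trained on AnCora, which only contains round brackets, also sees square ones:

```python
import random

def augment_brackets(sentences, rate=0.1, seed=42):
    """Hypothetical sketch: append copies of some bracketed sentences
    with round brackets replaced by square brackets, so the tokenizer
    sees both forms during training."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for sent in sentences:
        if ("(" in sent or ")" in sent) and rng.random() < rate:
            augmented.append(sent.replace("(", "[").replace(")", "]"))
    return augmented

# With rate=1.0 every bracketed sentence gets a square-bracket copy
print(augment_brackets(["hola (mundo)"], rate=1.0))
```

A low rate keeps the original distribution dominant while still exposing the model to the rarer bracket style.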
If you are using the dev branch, there is now a Spanish tokenizer which processes [], at least on the example you gave above. I'm not sure there will be...
In the next few weeks we are unlikely to fix this; at the end of the month I can dust off some old work for expanding the lemmas that our Spanish tools can...
17K isn't that bad, but it will definitely be missing many verbs. Ideally the tokenizer would pick up the correct pattern, but apparently not in this case... The solution we...
I tried your code example; thanks for including it. However, my experience on Linux is different from yours: there, it hangs with no errors or warnings whatsoever. This is with...
For reference, here is what we trained on: https://github.com/UniversalDependencies/UD_Arabic-PADT

Let's start from some random sentence in the text file you sent:

```
ﺍﻧﺍ ﺳﻣﻋﺗ ؛ﻣﻫﺍ ﻛﺍﻧﺗ ﺑﺗﺣﻛﻯ ؛ﻟﺷﻳﺭﻳﻧ ﻭ ؛ﺷﻓﻳﻋﺓ...
```
A brief look at arabic_reshaper suggests it turns "no form" text into text with the proper presentation forms. So basically we would need to either

- convert all of...
Perhaps the easiest thing to do would be to put everything through an unshaper, such as the one in this Stack Overflow question: https://stackoverflow.com/questions/33718144/do-arabic-characters-have-different-unicode-code-points-based-on-position-in-str
It's not perfect, but if I start from the `SHAPING` map described in that Stack Overflow post and reverse it:

```
>>> unshape = {}
>>> for x in SHAPING:
...
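A simpler sketch that avoids hand-reversing the `SHAPING` map: Unicode NFKC normalization in the standard library already folds Arabic presentation forms (U+FB50–U+FDFF, U+FE70–U+FEFF) back to their base letters. Note that NFKC also normalizes other compatibility characters (ligatures, fullwidth forms, etc.), which may or may not be what you want:

```python
import unicodedata

def unshape(text):
    """Fold Arabic presentation-form characters back to their base
    letters via NFKC compatibility normalization. Ligatures such as
    lam-alef expand into their component letters."""
    return unicodedata.normalize("NFKC", text)

# Isolated-form alef (U+FE8D) normalizes to plain alef (U+0627)
print(unshape("\uFE8D") == "\u0627")  # True
```

This would let the shaped text from the file above be converted into the "no form" encoding that the PADT training data uses.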