John Bauer
The probably unsatisfying answer is that all of the brackets in AnCora are round. We could probably detect that and teach the tokenizer to treat square brackets the same as round ones. I'll...
I made some changes to the training to hopefully capture [] if the dataset doesn't already have [] in it. Also, we were in fact attempting to augment the ellipses,...
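The bracket augmentation could look something like this (a hypothetical sketch, not the actual training code: the function name and rate parameter are made up for illustration). The idea is to copy a fraction of the training sentences with round brackets swapped for square ones, so a tokenizer trained on AnCora, which only contains round brackets, also sees square ones:

```python
import random

def augment_brackets(sentences, rate=0.1, seed=42):
    """Hypothetical sketch: append copies of some bracketed sentences
    with round brackets replaced by square brackets, so the tokenizer
    sees both forms during training."""
    rng = random.Random(seed)
    augmented = list(sentences)
    for sent in sentences:
        if ("(" in sent or ")" in sent) and rng.random() < rate:
            augmented.append(sent.replace("(", "[").replace(")", "]"))
    return augmented

# With rate=1.0 every bracketed sentence gets a square-bracket copy
print(augment_brackets(["hola (mundo)"], rate=1.0))
```

A low rate keeps the original distribution dominant while still exposing the model to the rarer bracket style.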
If you are using the dev branch, there is now a Spanish tokenizer which processes [], at least on the example you gave above. I'm not sure there will be...
In the next few weeks we are unlikely to fix this; at the end of the month I can dust off some old work for expanding the lemmas that our Spanish tools can...
17K isn't that bad, but it will definitely be missing many verbs. Ideally the tokenizer would pick up the correct pattern, but apparently not in this case... The solution we...
I tried your code example; thanks for including it. However, my experience on Linux is different from yours: there, it hangs with no errors or warnings whatsoever. This is with...
For reference, here is what we trained on: https://github.com/UniversalDependencies/UD_Arabic-PADT

Let's start from some random sentence in the text file you sent:

```
ﺍﻧﺍ ﺳﻣﻋﺗ ؛ﻣﻫﺍ ﻛﺍﻧﺗ ﺑﺗﺣﻛﻯ ؛ﻟﺷﻳﺭﻳﻧ ﻭ ؛ﺷﻓﻳﻋﺓ...
```
A brief look at arabic_reshaper suggests it turns "no form" text into text with the proper presentation forms. So basically we would need to either

- convert all of...
Perhaps the easiest thing to do would be to put everything through an unshaper, such as the one in this Stack Overflow question: https://stackoverflow.com/questions/33718144/do-arabic-characters-have-different-unicode-code-points-based-on-position-in-str
It's not perfect, but if I start from the `SHAPING` map described in that Stack Overflow post and reverse it:

```
>>> unshape = {}
>>> for x in SHAPING:
...
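A simpler sketch that avoids hand-reversing the `SHAPING` map: Unicode NFKC normalization in the standard library already folds Arabic presentation forms (U+FB50–U+FDFF, U+FE70–U+FEFF) back to their base letters. Note that NFKC also normalizes other compatibility characters (ligatures, fullwidth forms, etc.), which may or may not be what you want:

```python
import unicodedata

def unshape(text):
    """Fold Arabic presentation-form characters back to their base
    letters via NFKC compatibility normalization. Ligatures such as
    lam-alef expand into their component letters."""
    return unicodedata.normalize("NFKC", text)

# Isolated-form alef (U+FE8D) normalizes to plain alef (U+0627)
print(unshape("\uFE8D") == "\u0627")  # True
```

This would let the shaped text from the file above be converted into the "no form" encoding that the PADT training data uses.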