Amir Plivatsky
> In the language learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased...
I'm finally trying to implement handling of capitalized words by the dict. I encountered a problem: how to implement a downcasing rule in general. In English it is simple, and is...
I implemented the main part of the pseudo-morphology capitalization. There were several options for how and where to make the driving definition, and I chose to put it in `4.0.regex`...
> But is it superfluous? In almost all cases, if a word attaches with Wd, then it should have been capitalized...(or already was capitalized). I tried to always use the...
> The 'f' and the 'l' flags would not be needed.

I don't understand how you can do it without an indication like 'f'. This flag indicates that the regex...
> Allowing two links between words -- I'm pretty sure this would massively break the existing dicts. It will be interesting to see where and how... I always thought multiple...
For splitting units and any other continuous morphology I proposed a better way - defining (in the dict) token boundaries (which side of a token must have whitespace) either by...
> Yes. I'm not sure how. Ideally, one could say something like "if regex ran, then insert A+ on the first token and insert B- on the second token". Since...
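
The rule quoted above ("if regex ran, then insert A+ on the first token and insert B- on the second token") can be illustrated with a minimal Python sketch. It is only an illustration, not the Link Grammar regex-file syntax or its tokenizer: the function name `split_with_connectors`, the unit pattern, and the result format are all hypothetical, and `A+`/`B-` are the placeholder connector names from the quote.

```python
import re

def split_with_connectors(word, pattern, first_conn="A+", second_conn="B-"):
    """Split `word` into two tokens using a regex with two capture groups.

    Hypothetical helper: if the regex matches the whole word, return the two
    tokens, each paired with the connector that the rule would insert on it.
    Return None when the regex does not match, so no split is made.
    """
    m = re.fullmatch(pattern, word)
    if m is None:
        return None
    return [(m.group(1), first_conn), (m.group(2), second_conn)]

# Assumed example pattern: a number immediately followed by a unit.
UNIT_PATTERN = r"(\d+)(kg|km|m)"

print(split_with_connectors("10kg", UNIT_PATTERN))
print(split_with_connectors("hello", UNIT_PATTERN))
```

The connectors are attached only when the regex actually matched, which is the "if regex ran" condition in the quote.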
> Where is this? one of the other issues? I'm sorry, I'm having trouble reading and responding to everything.

In the LG group (the "zero knowledge tokenizer"), in other issues...
> However, if it is in the dict, it then looks up =B=, and so on in a backtracking fashion.

Note this fix in my previous post.
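
For readers unfamiliar with the "backtracking fashion" of lookup mentioned in the quote, here is a generic sketch in Python. It is an assumption about the general technique, not the actual Link Grammar code: the function `split_word` and the toy dictionary are hypothetical. It tries a prefix, and only if that prefix is in the dictionary does it go on to look up the remainder; when the remainder cannot be split, it backs up and tries a shorter prefix.

```python
def split_word(word, dictionary):
    """Return a list of dictionary tokens that concatenate to `word`, or None.

    Hypothetical sketch of backtracking lookup: prefixes are tried longest
    first, and a prefix is kept only if the rest of the word can also be
    split into dictionary tokens.
    """
    if word == "":
        return []
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in dictionary:
            rest = split_word(word[end:], dictionary)
            if rest is not None:
                return [prefix] + rest
            # The remainder failed: fall through and try a shorter prefix.
    return None

# Toy dictionary (assumed): "abc" matches first, but "d" is not an entry,
# so the search backtracks to "ab" + "cd".
toy_dict = {"abc", "ab", "cd"}
print(split_word("abcd", toy_dict))  # ['ab', 'cd']
```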