Amir Plivatsky comments

Results: 369 comments by Amir Plivatsky

Capitalization is a kind-of pseudo-morphology

> In the language learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased...

Capitalization is a kind-of pseudo-morphology

I'm finally trying to implement handling of capitalized words by the dict. I encountered a problem: how to implement a downcasing rule in general. In English it is simple, and is...
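A minimal sketch of what a simple English-style downcasing rule could look like, assuming the rule only lowercases the first character of a token and then tries both forms against the dictionary. The names `downcase_first`, `lookup_candidates`, and `toy_dict` are hypothetical illustrations, not part of link-grammar's API or dict format. Note that Python's `str.lower()` is locale-independent, so a sketch like this does not cover languages whose case mapping differs from the default one.

```python
def downcase_first(token: str) -> str:
    """Lowercase only the first character of a token (hypothetical helper)."""
    return token[:1].lower() + token[1:]


def lookup_candidates(token: str, dictionary: set[str]) -> list[str]:
    """Return the forms of `token` that are present in a toy dictionary.

    The original form is tried first, then the downcased form
    (only if downcasing actually changes the token).
    """
    candidates = [token]
    lowered = downcase_first(token)
    if lowered != token:
        candidates.append(lowered)
    return [c for c in candidates if c in dictionary]


# Toy dictionary (an assumption for illustration only).
toy_dict = {"the", "Paris", "dog"}

print(lookup_candidates("The", toy_dict))    # ['the']
print(lookup_candidates("Paris", toy_dict))  # ['Paris']
print(lookup_candidates("Xyz", toy_dict))    # []
```

Keeping the original form in the candidate list means proper nouns such as "Paris" are still found as written, while a sentence-initial "The" falls back to "the".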

Capitalization is a kind-of pseudo-morphology

I implemented the main part of the pseudo-morphology capitalization. There were several options for how and where to make the driving definition, and I chose to put it in `4.0.regex`...

Capitalization is a kind-of pseudo-morphology

> But is it superfluous? In almost all cases, if a word attaches with Wd, then it should have been capitalized...(or already was capitalized). I tried to always use the...

Capitalization is a kind-of pseudo-morphology

> The 'f' and the 'l' flags would not be needed.

I don't understand how you can do it without an indication like 'f'. This flag indicates that the regex...

Capitalization is a kind-of pseudo-morphology

> Allowing two links between words -- I'm pretty sure this would massively break the existing dicts. It will be interesting to see where and how... I always thought multiple...

Capitalization is a kind-of pseudo-morphology

For splitting units and any other continuous morphology I proposed a better way - defining (in the dict) token boundaries (which side of a token must have whitespace) either by...

Capitalization is a kind-of pseudo-morphology

> Yes. I'm not sure how. Ideally, one could say something like "if regex ran, then insert A+ on the first token and insert B- on the second token". Since...

Capitalization is a kind-of pseudo-morphology

> Where is this? One of the other issues? I'm sorry, I'm having trouble reading and responding to everything.

In the LG group (the "zero knowledge tokenizer"), in other issues...

Capitalization is a kind-of pseudo-morphology

> However, if it is in the dict, it then looks up =B=, and so on in a backtracking fashion.

Note this fix in my previous post.