Ceyda Cinarel (재이다)

Results 47 comments of Ceyda Cinarel (재이다)

https://hsivonen.fi/string-length/ "🤦🏼‍♂️","🤦🏼","💖", "💘", "💝", "💞", "❣️", "✨". I think converting between JS and Python lengths can solve this. There are too many emojis and strange-width characters when working with multiple languages...
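A rough sketch of what converting between the two kinds of offsets could look like, assuming JS offsets are UTF-16 code units and Python offsets are code points. The helper names `js_to_py_index` and `py_to_js_index` are made up for illustration and do not come from any library mentioned here.

```python
def js_to_py_index(text: str, js_index: int) -> int:
    """Map a JS offset (UTF-16 code units) to a Python offset (code points).

    An offset that falls inside a surrogate pair is rounded up to the
    next code point boundary.
    """
    units = 0
    for py_index, ch in enumerate(text):
        if units >= js_index:
            return py_index
        # Code points above U+FFFF take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


def py_to_js_index(text: str, py_index: int) -> int:
    """Map a Python offset (code points) to a JS offset (UTF-16 code units)."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:py_index])


# "a", then U+1F496 (sparkling heart), then "b"
sample = "a\U0001F496b"
print(py_to_js_index(sample, 2))  # offset of "b" as JS sees it
print(js_to_py_index(sample, 3))  # the same position as Python sees it
```

Both helpers walk the string from the start, so for long texts with many offsets it would probably be better to build the mapping once and reuse it.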

Here is another thing: https://user-images.githubusercontent.com/15624271/219418599-f0879a98-fa39-4fd7-8499-f9ddd58d54c2.mov I understand why it happens, but not how to fix it 🤣. Just playing around in a notebook you can see why...

Yes, JavaScript uses UTF-16 code units to calculate string lengths, while Python's `len()` on a `str` counts code points (UTF-8 byte counts only apply once the string is encoded to `bytes`). The key concepts to understand are **Unicode code points**, **graphemes**, and **UTF-16** encoding. I meant...
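To see the different counts side by side in a notebook, something like the following should work. It is only a minimal sketch using the standard library; `length_report` is a hypothetical helper name, and the emoji is written with escapes so the code points are explicit.

```python
def length_report(text: str) -> dict:
    """Return the length of `text` under three different counting rules."""
    return {
        # What Python's len() reports for a str.
        "code_points": len(text),
        # What JavaScript's .length reports: UTF-16 code units (2 bytes each).
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Size of the string once encoded as UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Facepalm + skin tone modifier + zero-width joiner + male sign + variation selector
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
print(length_report(facepalm))
# {'code_points': 5, 'utf16_units': 7, 'utf8_bytes': 17}
```

A human reads this as one character (one grapheme), yet none of the three counts is 1.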

Also learned that in JS, if you use spread syntax, you can get the number of code points accurately (same as Python): ``` [..."🤦🏼‍♂️"].length ```

But that is how Python counts too! It counts code points. How humans perceive a single _letter_ (A, B, C, etc.; you can think of this as the _grapheme_) and how a single _grapheme_ is...

I would strongly suggest not straying from the norm of using `len()` and `list()` on the Python side (i.e. counting code points), because that is basically how most tokenization libraries work (transformers, spacy......

+1 for "CTRL-Z-like return to the previous state"

👍 Anyway, wasn't expecting changing something as fundamental as `.from_pretrained` to be reasonable or easy 😅 @sgugger While working on this I realized a couple of things. Will make separate...