
Tokenizer

Open eduardoleao052 opened this issue 2 years ago • 5 comments

Have you been able to get good results with your tokenization? I've been using a regex like yours to tokenize some texts for my decoder transformer, and the vocabulary size seems to blow up! I think that's because it tokenizes at the word level, so maybe there's no escaping a larger vocab size.
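
To make the vocabulary growth concrete, here is a minimal sketch comparing a word-level regex tokenizer with a character-level one. The pattern `WORD_PATTERN` and the helper names are assumptions for illustration only; this is not the regex used in this repository.

```python
import re

# Hypothetical word-level pattern (an assumption, not the repo's regex):
# a run of word characters, or a single punctuation mark.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]")


def word_vocab(text: str) -> set[str]:
    """Return the set of distinct word-level tokens in `text`."""
    return set(WORD_PATTERN.findall(text))


def char_vocab(text: str) -> set[str]:
    """Return the set of distinct characters in `text`."""
    return set(text)


sample = "The tokenizer tokenizes tokens. Tokenizing tokenized text adds new tokens!"
words = word_vocab(sample)
chars = char_vocab(sample)

# Every inflection ("tokenizer", "tokenizes", "tokenized", ...) is its own
# word-level entry, while the character vocabulary is capped by the alphabet.
print(f"word-level vocab: {len(words)}, char-level vocab: {len(chars)}")
```

On a real corpus the word-level set keeps growing as new word forms appear, whereas the character-level set levels off quickly, which is one reason subword schemes sit in between the two.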

eduardoleao052 avatar Dec 11 '23 14:12 eduardoleao052

I don't know much about text pre-processing or transformers (I studied them years ago), but I think OpenAI's tiktoken library is the way to go for tokenisation.

RahulBhalley avatar Mar 01 '24 14:03 RahulBhalley

I see, I am trying to study tokenization a bit more lately, thanks for the tiktoken tip! If you don't mind me asking, what have you moved on to in terms of interests after learning about transformers and such?

eduardoleao052 avatar Mar 01 '24 15:03 eduardoleao052

I have moved on to the production side of deep learning for freelance projects, so I am relying on pre-trained models only. I know it's wrong to just build upon what others have built without studying it, but it's a lot less stressful and frees up more time than trying to keep up with all the new stuff in detail. @eduardoleao052

RahulBhalley avatar Mar 02 '24 12:03 RahulBhalley

That's cool! I guess it's natural to want to move on to the practical side of things after studying something from a theoretical standpoint.

eduardoleao052 avatar Mar 02 '24 22:03 eduardoleao052

Yes. 🙂

RahulBhalley avatar Mar 03 '24 02:03 RahulBhalley