Daniil Gavrilov
Is it really important to be able to remove some tokens if you could just preprocess your training data so that certain characters are never merged together after...
We agree that this could be considered a downside of the BPE algorithm. However, we believe such cases will be quite rare. Furthermore, removing some...
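To make the preprocessing idea from the question concrete, here is a minimal sketch (the protected character set is a made-up example): since BPE merges never cross word boundaries, surrounding a character with spaces guarantees it is never merged with its neighbours.

```
# A minimal sketch of the preprocessing the question describes.
# The protected character set below is a hypothetical example.
PROTECTED = {"@", "#"}  # characters we never want merged with neighbours

def isolate_protected(line: str) -> str:
    # Surround protected characters with spaces; BPE merges happen
    # only inside a word, so these stay single-character tokens.
    return "".join(f" {ch} " if ch in PROTECTED else ch for ch in line)

print(isolate_protected("email@host tag#42"))  # 'email @ host tag # 42'
```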
The issue is that you have to store ABC and DEF if you want to encode some text later. Consider looking at the function https://github.com/VKCOM/YouTokenToMe/blob/c2ab3c86c07918dd0f9ef1e0445e6c79f504a64a/youtokentome/cpp/bpe.cpp#L1528
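To illustrate why with a toy re-implementation (not YouTokenToMe's actual code): encoding new text replays the stored merges in the order they were learned, so without ABC and DEF in the model the encoder cannot reproduce them.

```
# Toy BPE encoder: replays a stored merge table over a new string.
merges = [("A", "B"), ("AB", "C"), ("D", "E"), ("DE", "F")]  # learned ABC, DEF

def encode(word: str) -> list:
    tokens = list(word)
    for left, right in merges:  # apply merges in the order they were learned
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens

print(encode("ABCDEF"))  # ['ABC', 'DEF'] -- unrecoverable without the stored merges
```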
Hi @alexbalandi! Right now, you can't use an external vocabulary to define your BPE model. We plan to support converting other subword formats into the yttm format in the future, but it...
Could you please provide more information about the issue? I've tested yttm BPE-dropout in the Python REPL and got a different subword tokenization on each run:

```
>>> for _ in...
```
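For reference, here is a self-contained sketch of that check (the file names and vocab size are made up for the example). With dropout_prob > 0, merges are randomly skipped during encoding, so repeated encodes of the same sentence can differ:

```
import youtokentome as yttm

# Train a small model on some text file (paths and sizes are placeholders).
yttm.BPE.train(data="train.txt", vocab_size=200, model="model.bin")
bpe = yttm.BPE(model="model.bin")

# With dropout, each encode samples a different segmentation.
for _ in range(3):
    print(bpe.encode(["hello world"],
                     output_type=yttm.OutputType.SUBWORD,
                     dropout_prob=0.3))
```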
What do you mean by "dies after a while"? There are no restrictions on the nature of the word embeddings: you just have to save them in an appropriate file and...
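For example, here is a minimal sketch assuming the common word2vec-style text format (a header line with the vocabulary size and dimension, then one word and its vector per line) works for your setup:

```
import numpy as np

vocab = ["hello", "world"]
vectors = np.random.rand(len(vocab), 4)  # toy embeddings

# word2vec-style text format: "<vocab_size> <dim>" header, then
# "<word> <float> <float> ..." per line.
with open("embeddings.txt", "w") as f:
    f.write(f"{len(vocab)} {vectors.shape[1]}\n")
    for word, vec in zip(vocab, vectors):
        f.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")
```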