Using YouTokenToMe with pre-defined vocab and embeddings

Open alexbalandi opened this issue 3 years ago • 2 comments

I want to use YouTokenToMe for fast id encoding, but I need to do it with the embeddings from https://nlp.h-its.org/bpemb/, which come with a pre-defined vocab. Right now I don't see an out-of-the-box way to make a YouTokenToMe model work with a pre-defined vocab. Are there any plans to implement something like a build_from_vocab classmethod? If not, can I get some starting points on how to do it myself? The model file looks a bit obscure to me, so I can't easily get started on building my own model file from the vocab I have.

alexbalandi, Feb 16 '21 09:02

Hi @alexbalandi!

Right now, you can't use an external vocab to define your BPE model. We plan to support converting other subword formats into the yttm format in the future, but that seems to be somewhat hard to implement.

kefirski, Feb 16 '21 09:02

Thank you for the quick answer! Can I at least get some pointers on where to look, so I could try to make an ad hoc solution myself? For example, what does each line in the .model file from your tutorial mean? I could try to read the source code, but I'm not proficient in C++, and honestly any code-free (or at least pseudo-code) explanation of how your model gets loaded from the file and how it works would help.

alexbalandi, Feb 16 '21 10:02
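
As a stopgap while no converter exists, the ids can be computed in plain Python straight from the vocab, without going through a YouTokenToMe model file at all. The sketch below is only an illustration and makes several assumptions: the vocab is an ordered list of subwords with "▁" marking word starts, a lower position means the piece was merged earlier, and greedy merging by that rank is close enough to what the original tokenizer does. It does not read or write the yttm .model format, it will be much slower than YouTokenToMe, and the names `build_ranks`, `encode_word` and `encode` are made up for this example.

```python
from typing import Dict, List

# Word-start marker used by SentencePiece-style vocabularies (assumed here).
WORD_START = "\u2581"


def build_ranks(pieces: List[str]) -> Dict[str, int]:
    """Map each subword to its position in the vocab; the position doubles as its id."""
    return {piece: idx for idx, piece in enumerate(pieces)}


def encode_word(word: str, ranks: Dict[str, int], unk_id: int) -> List[int]:
    """Greedy BPE: repeatedly merge the adjacent pair whose concatenation ranks best."""
    symbols = list(WORD_START + word)
    while len(symbols) > 1:
        best_pos, best_rank = -1, None
        for i in range(len(symbols) - 1):
            rank = ranks.get(symbols[i] + symbols[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_pos, best_rank = i, rank
        if best_rank is None:
            break  # no adjacent pair forms a known subword
        merged = symbols[best_pos] + symbols[best_pos + 1]
        symbols[best_pos:best_pos + 2] = [merged]
    # Symbols that are not in the vocab fall back to the unknown id.
    return [ranks.get(symbol, unk_id) for symbol in symbols]


def encode(text: str, ranks: Dict[str, int], unk_id: int = 0) -> List[int]:
    """Split on whitespace and encode each word independently."""
    ids: List[int] = []
    for word in text.split():
        ids.extend(encode_word(word, ranks, unk_id))
    return ids


if __name__ == "__main__":
    # Toy vocab for illustration only; a real one would be read from the vocab file.
    toy_vocab = ["<unk>", "\u2581l", "ow", "\u2581low", "er",
                 "\u2581", "l", "o", "w", "e", "r"]
    print(encode("low lower", build_ranks(toy_vocab)))
```

Because each id is the row index in the vocab, it can be used directly to index the matching embedding matrix, provided the embedding rows are stored in the same order as the vocab. Any preprocessing applied when the embeddings were trained (lowercasing, for example) would need to be applied to the text before calling `encode`.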