
tokenization issue for code

Open brando90 opened this issue 1 year ago • 7 comments

Is this still a bug for tokenization? I want to use this model for code. Thanks!

brando90 avatar Jun 27 '23 19:06 brando90

If you are talking about the fast tokenizer, it was fixed in the main branch of transformers. AFAIK it hasn't been included in a tagged release yet.

gjmulder avatar Jun 28 '23 07:06 gjmulder
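To see why a tokenizer bug matters specifically for code, here is a minimal, self-contained sketch of the failure mode (a toy illustration only, not the actual transformers tokenizer or its fix): a tokenizer that collapses whitespace cannot round-trip source code, where indentation is significant, while one that preserves whitespace runs can.

```python
import re

def lossy_tokenize(text):
    # Splits on whitespace and drops it, like a tokenizer that
    # normalizes runs of spaces -- fine for prose, bad for code.
    return text.split()

def lossy_detokenize(tokens):
    return " ".join(tokens)

def lossless_tokenize(text):
    # Keeps whitespace runs as their own tokens so decoding round-trips.
    return re.findall(r"\s+|\S+", text)

def lossless_detokenize(tokens):
    return "".join(tokens)

code = "def f():\n    return 1"
assert lossy_detokenize(lossy_tokenize(code)) != code        # indentation lost
assert lossless_detokenize(lossless_tokenize(code)) == code  # round-trips
```

The lossy variant turns the two-line function into `"def f(): return 1"`, which is no longer valid Python, which is why whitespace handling in the tokenizer matters for code generation.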

Probably a duplicate of #40?

gjmulder avatar Jun 29 '23 05:06 gjmulder

Check out our OpenLLaMA v2 model, which has a new tokenizer and is pretrained with a lot of code. The official release of that will happen very soon.

young-geng avatar Jul 07 '23 07:07 young-geng

Can we use the old models, or how does this work? Do we just load the old model with the new tokenizer?

brando90 avatar Jul 07 '23 18:07 brando90

@brando90 The v2 model is a completely different one trained on a new data mixture, so you'll need to load the new weights too.

young-geng avatar Jul 07 '23 18:07 young-geng
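In practice this means loading both the weights and the tokenizer from the same v2 repo (the `openlm-research/open_llama_7b_v2` id linked earlier in the thread), never mixing the v1 tokenizer with v2 weights or vice versa. A hedged sketch, assuming the standard transformers Auto classes; the loading call is wrapped in a function here rather than executed, since it needs `pip install transformers sentencepiece` plus a network connection:

```python
MODEL_ID = "openlm-research/open_llama_7b_v2"

def load_v2():
    # Both tokenizer and weights come from the same v2 repo, so they
    # are guaranteed to match; shown as a sketch, not run here.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    return tokenizer, model
```

Loading the v1 checkpoint with the v2 tokenizer would silently produce garbage, because the vocabularies differ.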

Got it. Thanks!

I will assume OpenLLaMA v1 is basically unusable for code generation (my use case) and use only v2.

Thanks!

brando90 avatar Jul 07 '23 21:07 brando90

@brando90 Yeah. I imagine you probably want to use v2 almost always since it is a better model overall.

young-geng avatar Jul 07 '23 21:07 young-geng