Add Cohere's Command-R
https://txt.cohere.com/command-r/ https://huggingface.co/CohereForAI/c4ai-command-r-v01
I don't think the architecture needs any changes to support this
I thought the same about Gemma 😄.
This model requires custom modeling and tokenizer classes. So, it might not be that straightforward to implement.
Do you see any specific differences in the modeling?
I posted it without even looking at the code. I mean, why would anyone provide custom code if it were identical to what's already in transformers?
So, after a really quick scan of the modeling code, I found a couple of interesting details:
- They have a `logits_scale` that is applied to the `lm_head` output: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/2a6d259c29bd319c3bdb8dd88b8d59b8c303c318/modeling_cohere.py#L1164
- The forward method in `CohereDecoderLayer` does things a bit differently: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/2a6d259c29bd319c3bdb8dd88b8d59b8c303c318/modeling_cohere.py#L689-L709 It's definitely a `parallel_residual` + `shared_attention_norm` (see the sketch after this list), but none of this is mentioned in the config file. Since it's just a matter of setting the config properly, it shouldn't give us any problems.
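For reference, a minimal sketch of what those two details amount to. This is not the Cohere code; the module layout, argument names, and the `logits_scale` value here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Sketch of a decoder layer with a parallel residual and a single
    pre-norm shared by both branches, in the spirit of the linked
    CohereDecoderLayer forward."""

    def __init__(self, hidden_size, attn, mlp):
        super().__init__()
        # one norm feeds both branches (shared_attention_norm)
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.attn = attn
        self.mlp = mlp

    def forward(self, hidden_states):
        residual = hidden_states
        normed = self.input_layernorm(hidden_states)
        # both branches read the same normed input and are added to the
        # residual in parallel (parallel_residual), not sequentially
        return residual + self.attn(normed) + self.mlp(normed)

def scaled_logits(hidden_states, lm_head, logits_scale):
    # the lm_head output is multiplied by a constant scalar before softmax
    return lm_head(hidden_states) * logits_scale

# toy usage with stand-in branches and an illustrative scale value
block = ParallelResidualBlock(64, attn=nn.Linear(64, 64), mlp=nn.Linear(64, 64))
h = block(torch.randn(2, 10, 64))
logits = scaled_logits(h, nn.Linear(64, 1000, bias=False), logits_scale=0.0625)
```

So on the inference side this is one extra multiply on the logits and a reordering of when the residual additions happen, both of which can be driven by config flags.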
There may be more that I've missed.
I think you've missed the `rotate_half` part, while the tokenizer is the same as llama's:
```python
import torch

def rotate_half(x):
    # Split into even- and odd-indexed channels, then rotate the
    # interleaved pairs: (x1, x2) -> (-x2, x1)
    x1 = x[..., ::2]
    x2 = x[..., 1::2]
    rot_x = torch.stack([-x2, x1], dim=-1).flatten(-2)
    return rot_x
```
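For comparison, the llama-style `rotate_half` in transformers splits the head dimension into two contiguous halves instead of interleaving even/odd channels, so the two rotations are not interchangeable; as I understand it, converting between the conventions needs a permutation of the Q/K projection weights:

```python
import torch

def rotate_half_llama(x):
    # llama-style: rotate the two contiguous halves of the head dim
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```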
https://github.com/ggerganov/llama.cpp/pull/6033#issuecomment-1993657166