RWKV-LM
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, sa...
Is it possible to use a BPE tokenizer instead of rwkv_vocab_v20230424 in the next model? I tried the RWKV model on Thai. It looks good, but it is very slow because Thai is...
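One way to see the slowdown concretely is to compare token counts for the same Thai sentence under the repo's vocab and a BPE tokenizer. The sketch below is only illustrative: the tokenizer path and class name (`tokenizer/rwkv_tokenizer.py`, `TRIE_TOKENIZER`, `rwkv_vocab_v20230424.txt`) are assumed from the repo layout, and `gpt2` is just a stand-in BPE that is not Thai-tuned either.

```python
# Rough sketch: compare token counts for Thai text under the RWKV world
# tokenizer vs. a generic BPE tokenizer. Paths and class names are assumptions.
from tokenizer.rwkv_tokenizer import TRIE_TOKENIZER   # assumed location inside this repo
from transformers import AutoTokenizer                # any BPE tokenizer, for comparison

thai = "ภาษาไทยเขียนติดกันโดยไม่เว้นวรรคระหว่างคำ"    # Thai is written without spaces between words

rwkv_tok = TRIE_TOKENIZER("tokenizer/rwkv_vocab_v20230424.txt")  # assumed vocab path
bpe_tok = AutoTokenizer.from_pretrained("gpt2")                  # stand-in BPE, not Thai-tuned

print("rwkv_vocab_v20230424:", len(rwkv_tok.encode(thai)), "tokens")
print("BPE (gpt2):", len(bpe_tok.encode(thai)), "tokens")
```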
To bring more awareness and adoption to RWKV, would it be possible to get benchmark scores on the Hugging Face LLM leaderboard, or on the model cards themselves (for RWKV-6 and...
RWKV_TimeMix operates along the sequence dimension. During training, the training data is usually concatenated end to end, so the individual sequences need to be separated and processed independently. FlashAttention, for example, accepts an input marking where each sequence starts, but RUN_CUDA does not seem to have one. How is this handled?
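For illustration only (this is not how the repo's RUN_CUDA kernel works): one hedged workaround is to reset the recurrent state whenever a document-separator token is seen, as in the sketch below; `step_fn` and `sep_id` are hypothetical names.

```python
# Illustration only, not the actual RUN_CUDA kernel: reset the recurrent state
# at every document separator so concatenated documents do not leak state.
# step_fn and sep_id are hypothetical names.
import torch

def rnn_forward_with_resets(step_fn, init_state, tokens, sep_id=0):
    """step_fn(state, token) -> (new_state, output); state is reset after each sep_id."""
    state, outputs = init_state, []
    for tok in tokens.tolist():
        state, out = step_fn(state, tok)
        outputs.append(out)
        if tok == sep_id:            # the next document starts from a fresh state
            state = init_state
    return torch.stack(outputs)
```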
# Fix broken `accumulate_grad_batches` argument in v5 trainer
While trying to fine-tune some of the RWKV-7-Pile models, I found that the `accumulate_grad_batches` argument sent to the main trainer file had...
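For reference, a minimal sketch of how gradient accumulation is usually handed to the PyTorch Lightning Trainer; the other argument values below are placeholders, not the repo's actual train.py flags.

```python
# Sketch of how accumulate_grad_batches is normally forwarded to the PyTorch
# Lightning Trainer; the other argument values are placeholders, not the
# repo's real train.py defaults.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,
    precision="bf16",
    accumulate_grad_batches=4,   # effective batch size = micro batch size * 4
)
```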
rwkv_v7_demo.py sets args.vocab_size = 50304, but the checkpoint (0.1B) actually has 65536. Loading it fails with:
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RWKV: Missing key(s) in state_dict: "blocks.0.att.v0", "blocks.0.att.v1", "blocks.0.att.v2". size...
I'd like to ask where this error is coming from.
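A quick, hedged way to confirm the vocab mismatch is to read the embedding shape straight from the checkpoint before building the model; the filename below is a placeholder, and `emb.weight` is assumed to be the embedding key in the state dict.

```python
# Sketch: read the real vocab size from the checkpoint before setting
# args.vocab_size. The filename is a placeholder; 'emb.weight' is assumed
# to be the embedding key in the state dict.
import torch

sd = torch.load("your-rwkv7-checkpoint.pth", map_location="cpu")
vocab_size, n_embd = sd["emb.weight"].shape
print(vocab_size, n_embd)   # e.g. 65536 for world-vocab checkpoints, 50304 for Pile-style vocab
```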
Hello, could you explain what head_size_divisor means and how it relates to head_size? Also, in self.ln_x = nn.GroupNorm(H, C, eps=(1e-5) * (self.head_size_divisor ** 2)) # !!! notice eps value !!!, why is eps defined this way? Thanks.
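Not an official answer, but one way to read that eps: with default affine parameters, GroupNorm(x / D, eps=1e-5) equals GroupNorm(x, eps=1e-5 * D**2), so scaling eps by head_size_divisor**2 normalizes x as if it had been pre-divided by D. A minimal numerical check of that equivalence:

```python
# Numerical check (not official docs): with default affine parameters,
# GroupNorm(x / D, eps=1e-5) == GroupNorm(x, eps=1e-5 * D**2), so the scaled
# eps normalizes x as if it had been pre-divided by head_size_divisor.
import torch
import torch.nn as nn

H, C, D = 8, 512, 8                         # heads, channels, head_size_divisor
gn_plain = nn.GroupNorm(H, C, eps=1e-5)
gn_scaled = nn.GroupNorm(H, C, eps=(1e-5) * (D ** 2))

x = torch.randn(4, C) * 100                 # large activations, e.g. the wkv output
print(torch.allclose(gn_plain(x / D), gn_scaled(x), atol=1e-4))   # True
```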
Part of the final output should be o_t = r_t @ S_t^T, but it looks like the S computed in the diagram is not transposed, right? (The code is correct.) @BlinkDL
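To make the transpose question concrete, here is a small sketch assuming the convention that the state S accumulates outer products v_i k_i^T (rows indexed by the value dim, columns by the key dim); under that convention the readout is o_t = r_t @ S_t^T, and the transpose disappears only if S is stored in the transposed (k v^T) layout instead.

```python
# Sketch (assuming the state accumulates outer products S = sum_i v_i k_i^T):
# under that convention the readout needs the transpose, o_t = r_t @ S_t^T.
import torch

d = 64
ks = torch.randn(10, d)                                   # keys seen so far
vs = torch.randn(10, d)                                   # values seen so far
S = sum(torch.outer(v, k) for v, k in zip(vs, ks))        # S = sum_i v_i k_i^T

r = torch.randn(d)
o1 = r @ S.T                                              # readout with the transpose
o2 = sum((k @ r) * v for k, v in zip(ks, vs))             # sum_i (k_i . r) v_i
print(torch.allclose(o1, o2, atol=1e-4))                  # True
```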
Added context about data.