Huang Haiduo
> Hi @haiduo, you could check these papers: https://arxiv.org/pdf/1502.01852.pdf, https://arxiv.org/pdf/1606.05340.pdf, https://arxiv.org/pdf/1611.01232.pdf, all of which analyze training dynamics for centered weights. I am not sure how to analyze weights with...
> Hi @haiduo, thanks again for your interest. For b=4, it maps [-1, 1] to [0, 1], to {0, 1, ..., 15}, to {0.5, 1.5, ..., 15.5}, to {1/32, 3/32,...
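Read literally, that mapping chain for b=4 could be sketched as below. This is only my reading of the quoted steps, not the authors' code; in particular, how the endpoint x = 1 is handled (the `clamp`) is my assumption.

```python
import torch

def quantize(x: torch.Tensor, b: int = 4) -> torch.Tensor:
    """Sketch of the chain: [-1, 1] -> [0, 1] -> {0, ..., 2^b - 1}
    -> {0.5, ..., 2^b - 0.5} -> {1/2^(b+1), 3/2^(b+1), ...}."""
    n = 2 ** b                             # 16 levels for b = 4
    u = (x.clamp(-1, 1) + 1) / 2           # [-1, 1] -> [0, 1]
    q = (u * n).floor().clamp(max=n - 1)   # [0, 1] -> {0, 1, ..., 15}
    m = q + 0.5                            # -> {0.5, 1.5, ..., 15.5} (bin midpoints)
    return m / n                           # -> {1/32, 3/32, ..., 31/32}
```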
I have solved it! Look at the following: the bug stems from https://github.com/Lightning-AI/lit-llama/blob/da71adea0970d6d950fb966d365cfb428aef8298/lit_llama/model.py#L130. I managed to fix it with `from transformers.utils import is_torch_bf16_gpu_available` and `dtype=torch.bfloat16 if is_torch_bf16_gpu_available() else torch.float16,`
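For context, the fix boils down to choosing the dtype at runtime instead of hardcoding bfloat16, which only newer GPUs (Ampere and later) support. A minimal sketch; the exact call site in model.py may differ:

```python
import torch
from transformers.utils import is_torch_bf16_gpu_available

# Use bfloat16 only when the current GPU actually supports it;
# otherwise fall back to float16.
dtype = torch.bfloat16 if is_torch_bf16_gpu_available() else torch.float16
print(dtype)
```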
> Have you figured it out? To me, line 428 appears to be another form of equation (3) in the paper. `out[..., :-1]` is equal to QK^TV, and `out[..., -1:]` is...
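If I read the quote correctly, this is the common trick of appending a ones column to V so that a single matmul yields both the unnormalized attention output and its normalizer in the last channel. A minimal sketch under that assumption (the actual line 428 may differ, and this omits the usual max-subtraction for numerical stability):

```python
import torch

def attn_fused_normalizer(q, k, v):
    # Append a column of ones to v; after the matmul, the last channel
    # holds the row-wise sum of the unnormalized attention weights.
    ones = torch.ones(*v.shape[:-1], 1, dtype=v.dtype, device=v.device)
    v_ext = torch.cat([v, ones], dim=-1)
    w = torch.exp(q @ k.transpose(-2, -1))   # unnormalized weights
    out = w @ v_ext                          # fused numerator + denominator
    return out[..., :-1] / out[..., -1:]     # equals softmax(QK^T) V

q, k, v = (torch.randn(2, 4, 8) for _ in range(3))
ref = torch.softmax(q @ k.transpose(-2, -1), dim=-1) @ v
assert torch.allclose(attn_fused_normalizer(q, k, v), ref, atol=1e-5)
```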
> The base model has a layer normalization (layernorm) layer before the LM head. Since the feature sequence has already been normalized, we do not use layer normalization. It is...
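As a concrete reading of that design choice: in a typical decoder, the final hidden states pass through a LayerNorm before the LM head, so an extra head that consumes those already-normalized features can skip its own norm. A minimal sketch of my understanding (class names are hypothetical):

```python
import torch.nn as nn

class BaseHead(nn.Module):
    """Base model: normalize hidden states, then project to the vocabulary."""
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, h):
        return self.lm_head(self.norm(h))

class ExtraHead(nn.Module):
    """Head fed the post-norm feature sequence: no second LayerNorm needed."""
    def __init__(self, hidden: int, vocab: int):
        super().__init__()
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, h_normed):
        return self.lm_head(h_normed)
```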
Let me try to answer this. There should be no need to expand the total tokens, but you need to ensure that the total tokens of the modified...
After reading your comment, I find this phenomenon very interesting, so I tried it just now and found that the output logits of the draft model and target model are...
> > Thank you both for your careful observation; these details are very helpful.
> >
> > I wonder whether it is possible to change the comparison of two float values...
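Assuming the suggestion is to replace exact equality between floats with a tolerance-based check (my reading of the truncated quote), a minimal sketch:

```python
import torch

a = torch.tensor([1.0000001])
b = torch.tensor([1.0000002])

# Exact equality is brittle for floats produced by different kernels or dtypes.
print(torch.equal(a, b))                            # False
# A tolerance-based comparison is usually what is wanted instead.
print(torch.allclose(a, b, rtol=1e-5, atol=1e-6))   # True
```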