Chong Ruan comments

Results 16 comments of Chong Ruan

The attention mechanism is not the original attention mechanism in the paper

[This issue](https://github.com/pytorch/tutorials/issues/87) also mentioned this. @spro Please fix it quickly. This mistake has been in the tutorial for 2 years.

Why not use a partial factorization?

@kimiyoung To make it clearer, let us walk through a concrete example. Assume the original sentence is 12345678 and the permutation is 12367845. The last two tokens, i.e., 4 and...
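The rest of this comment is truncated, so the following is only a minimal sketch of the setup it describes, not the commenter's own code. It assumes the usual permutation language modeling convention, in which each predicted token is conditioned on the tokens that precede it in the factorization order. The function name `prediction_contexts` and the parameter `num_targets` are hypothetical.

```python
def prediction_contexts(order, num_targets):
    """Pair each of the last `num_targets` tokens in `order` with its context.

    Assumption: a target token may only see the tokens that come before it
    in the factorization order `order`, not in the original sentence order.
    """
    cut = len(order) - num_targets
    return [(order[i], order[:i]) for i in range(cut, len(order))]


# Permutation from the comment above: original sentence 12345678,
# factorization order 12367845, last two tokens used as targets.
order = [1, 2, 3, 6, 7, 8, 4, 5]
for target, context in prediction_contexts(order, num_targets=2):
    print(f"predict {target} given {context}")
```

Under that assumption, the second target's context contains the first target as well as the six leading tokens.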

Fine-tuning Script

> @RERV It seems as if swift does not support finetuning of the vision encoder (it seems that way from my quick glance over the source code, I hope I'm...

Fine-tuning Script

> @soloice Hi, I see, thanks. Would it be possible to just release the backprop code of the vision encoder, no framework around it, no clustering, just a starting point...

@Jintao-Huang Could you kindly confirm [whether swift can be used to fine-tune the visual encoder](https://github.com/deepseek-ai/DeepSeek-VL/issues/6#issuecomment-1992731315)? If so, how? If not, what's the simplest way to support it?

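Is the finetune script full-parameter fine-tuning? What is the minimum GPU memory and RAM required?

> > 1. It is full-parameter. > > 2. For 33B, 80G of GPU memory is generally needed, but with pp parallelism (which is slower), 40G of GPU memory also works. > > @guoday Hi, when I fine-tune 1.3B, both of my GPUs reach 30G, and when I try to fine-tune 6.7B the GPU memory blows up immediately (even batch_size=4 fails). Why is the consumption so high? It is strange. Is there a parameter I need to set? Thanks.

Which parallelism strategy are you using?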

Batching

> Is this code "optimal" for batched inference and preprocessing?

Nope. It's just a toy demo, not meant for production use.

Remove Ray for the dependency

> @lanking520 Thanks for your comment. We indeed use NCCL for cross-GPU tensor communication. However, in vLLM, we also need to pass several metadata ("control messages") from the scheduler to...

In some cases, the model repeats the last one or two sentences

We have also noticed this problem and are working on a fix.

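In some cases, the model repeats the last one or two sentences

> > We have also noticed this problem and are working on a fix. > > What are the possible causes? After fine-tuning the model on some data, almost all of the output is repeated characters.

Generally this means the training was insufficient. It can also happen when the SFT dataset is too small.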