lynn
> Did you solve the problem? I met the same problem. It turned out I had downloaded the wrong llama model: I downloaded it from ModelScope, but downloading it from Hugging Face works...
> > Guys, I have a question. Setup: pre-training Qwen3-32B with packing=false and a cutoff length of 2k; every sample is guaranteed to be at most 2k tokens, say 1000. So the input_ids handed to the model for each sample have length 1001 (eos appended at the end), i.e. shorter than 2k.
> >
> > Questions: 1. Do I need to manually append pad_token to input_ids to pad each sample to 2k? 2. Do I need to manually add the labels and attention_mask columns, with labels matching input_ids for the first 1001 positions and set to IGNORE_INDEX afterwards, and attention_mask set to 1 for the first 1001 positions and 0 afterwards?
>
> Neither 1 nor 2 is needed; they are filled in automatically: https://github.com/hiyouga/LLaMA-Factory/blob/main/src/llamafactory/data/processor/supervised.py#L32

Is this token something we need to add ourselves when building the dataset, or does LF append it?
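For illustration only, here is a minimal sketch of what that automatic right-padding looks like conceptually. This is not LLaMA-Factory's actual collator; the function name `pad_batch`, the `max_len=2048` default, and `IGNORE_INDEX = -100` are assumptions for the example.

```python
# Minimal sketch of right-padding a supervised batch; NOT LLaMA-Factory's
# actual implementation. Assumes IGNORE_INDEX = -100 and a pad_token_id
# taken from the tokenizer.
IGNORE_INDEX = -100

def pad_batch(samples, pad_token_id, max_len=2048):
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for input_ids in samples:
        seq_len = len(input_ids)          # e.g. 1001 = 1000 tokens + eos
        pad_len = max_len - seq_len
        batch["input_ids"].append(input_ids + [pad_token_id] * pad_len)
        # real tokens attend (1), padding does not (0)
        batch["attention_mask"].append([1] * seq_len + [0] * pad_len)
        # padded positions are excluded from the loss via IGNORE_INDEX
        batch["labels"].append(input_ids + [IGNORE_INDEX] * pad_len)
    return batch

# Example: a single 1001-token sample padded to 2048
example = [list(range(1000)) + [2]]       # 2 standing in for eos_token_id
out = pad_batch(example, pad_token_id=0)
assert len(out["input_ids"][0]) == 2048
```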
A similar problem occurred to me: `import selective_scan_cuda` fails with

`ImportError: /home/aiscuser/.conda/envs/mamba/lib/python3.11/site-packages/selective_scan_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c1021throwNullDataPtrErrorEv`

Environment: python=3.11, pytorch==2.3.0, torchvision==0.18.0, torchaudio==2.3.0, pytorch-cuda=12.1
The commands below solved this problem for me (note: `cd` into the cloned repo before installing):

`git clone https://github.com/state-spaces/mamba.git`
`cd mamba`
`pip install . --no-build-isolation`
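A quick way to check that the rebuild fixed the symbol mismatch is to re-run the import that previously failed (a minimal check, assuming the same environment as above):

```python
# Sanity check after rebuilding mamba from source: this is the import that
# previously raised the undefined-symbol ImportError, so a clean import means
# the extension was compiled against the installed torch ABI.
import torch
import selective_scan_cuda  # noqa: F401

print(torch.__version__)
print("selective_scan_cuda imported successfully")
```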
It's short, no more than 150 words; maybe that explains your insights. Does Mamba2 solve this problem, i.e. this kind of performance degradation compared to the Transformer? I haven't tried Mamba2 yet.
> The transition from Mamba1 to Mamba2 does not show significant improvements for short sequence lengths. As seen in the attached image, Mamba2 still performs slower than Transformers for shorter...
> The FP16/BF16 **1979 TFLOPS** defined in the H200 spec is with sparsity, so I think the actual MFU should be `420/(1979/2)=42.45%`

Could you explain what `sparsity` means?
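For the record, the arithmetic above works out like this (a small sketch; the 1979 TFLOPS sparse peak and the 420 TFLOPS achieved figure are taken from the comment above, and halving the sparse peak for dense workloads is the assumption being made):

```python
# Worked example of the MFU correction discussed above.
# The spec-sheet FP16/BF16 figure includes structured sparsity, so dense
# workloads are compared against half of it.
sparse_peak_tflops = 1979.0      # H200 spec number (with sparsity)
dense_peak_tflops = sparse_peak_tflops / 2
achieved_tflops = 420.0          # measured throughput from the thread

mfu = achieved_tflops / dense_peak_tflops
print(f"MFU vs dense peak: {mfu:.2%}")   # ~42.45%
```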
> Please refer to https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/

Thanks for the reference!
> No problem.
>
> 1. This seems a bit unlikely tbh. Have you ensured that `mamba-ssm` and `causal-conv1d` are installed? Maybe set `config.use_cache=False` during training at least. Otherwise, I'm...
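For reference, a minimal sketch of the `config.use_cache=False` suggestion above, assuming the Hugging Face `transformers` Mamba classes; the checkpoint name `state-spaces/mamba-130m-hf` is only an example:

```python
# Sketch: disable the cache during training, per the advice above.
# Assumes a transformers version with Mamba support; the checkpoint name
# is illustrative only.
from transformers import AutoTokenizer, MambaForCausalLM

model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")

# use_cache only helps incremental decoding; turning it off during training
# keeps the model on the training code path.
model.config.use_cache = False

inputs = tokenizer("Hello Mamba", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```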
> @Lynnzake At least in HF, you guarantee that the inference path is avoided entirely, and in that case the code opts for the fused path, i.e. a kernel for conv combined...