Jack Chen
Maybe it's not caused by using multi-GPU. You can use the cuda-memcheck tool to find out more details about this error: run `cuda-memcheck python your-program.py` and it will log more...
Any updates on this? Missing your ViT example~ @Taka152 @godweiyang
I use gcc 7.2, torch 1.8, CUDA 11.2. Hope it helps.
Max sequence length: 8836. I patched these lines of code and it seems to work, in `void launch_attn_softmax_bw`:

```cuda
} else if (to_len
```
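For readers following along, here is a minimal, self-contained sketch (not the actual lightseq source; the kernel/launcher names, branch boundaries, and `ITERATIONS` values below are assumptions) of the dispatch pattern such a patch extends: the launcher branches on `to_len` and picks a kernel instantiation whose compile-time per-thread iteration count covers that length, so supporting 8836 tokens means adding a branch whose `ITERATIONS * WARP_SIZE` is at least 8836.

```cuda
// Illustrative sketch only, not lightseq's real kernels. It shows the
// "branch on to_len, pick a templated instantiation" pattern that the
// patch above extends with one more branch for longer sequences.
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

constexpr int WARP_SIZE = 32;

// Dummy kernel: ITERATIONS is the number of elements each thread of a warp
// covers along the to_len dimension (a real kernel would do the warp-level
// softmax backward here).
template <typename T, int ITERATIONS>
__global__ void ker_attn_softmax_bw_sketch(T *grad, const T *inp, int to_len) {}

template <typename T>
void launch_attn_softmax_bw_sketch(T *grad, const T *inp, int rows, int to_len,
                                   cudaStream_t stream) {
  const int warps_per_block = 4;
  dim3 grid_dim((rows + warps_per_block - 1) / warps_per_block);
  // 32 * 4 = 128 threads per block, well under the hardware limit of 1024
  dim3 block_dim(WARP_SIZE, warps_per_block);

  if (to_len <= 1024) {
    ker_attn_softmax_bw_sketch<T, 32>
        <<<grid_dim, block_dim, 0, stream>>>(grad, inp, to_len);
  } else if (to_len <= 2048) {
    ker_attn_softmax_bw_sketch<T, 64>
        <<<grid_dim, block_dim, 0, stream>>>(grad, inp, to_len);
  } else if (to_len <= 9216) {
    // The kind of extra branch a patch like the one above adds:
    // 288 * 32 = 9216 >= 8836, so a sequence of 8836 fits here.
    ker_attn_softmax_bw_sketch<T, 288>
        <<<grid_dim, block_dim, 0, stream>>>(grad, inp, to_len);
  } else {
    throw std::runtime_error("to_len " + std::to_string(to_len) +
                             " exceeds the supported maximum");
  }
}
```

In this pattern the block shape stays fixed and only the compile-time iteration count grows, so the larger instantiations mainly cost extra registers per thread rather than a bigger launch configuration.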
> Yes, that's the place to modify the length limit, and it can be tested [here](https://github.com/bytedance/lightseq/blob/aabce486f34bec28bfe0efbbda1a183d5a6a37ba/tests/test_ls_kernels.py#L729-L730).

Thanks for pointing that out. I will file a pull request to support longer sequences.
> @Jack47 It's great. Did you test that it works well?

The code has some bugs around block_dims; test_ls_op.py should be used to validate it before use.
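One generic way such block_dims bugs show up (this is a general CUDA sanity check, not lightseq code): if a patched branch scales the block dimensions with `to_len` instead of the per-thread iteration count, the launch configuration can exceed the device limit of 1024 threads per block and the kernel silently fails to run. A small hypothetical helper like the one below catches that before the numerical tests do:

```cuda
// Hypothetical sanity check for a launch configuration; not part of lightseq.
#include <cuda_runtime.h>
#include <cstdio>

bool block_dim_is_valid(dim3 block_dim, int device = 0) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  const int threads = block_dim.x * block_dim.y * block_dim.z;
  const bool ok = threads <= prop.maxThreadsPerBlock &&
                  (int)block_dim.x <= prop.maxThreadsDim[0] &&
                  (int)block_dim.y <= prop.maxThreadsDim[1] &&
                  (int)block_dim.z <= prop.maxThreadsDim[2];
  if (!ok) {
    std::printf("bad block_dim (%u, %u, %u): %d threads, device max %d\n",
                block_dim.x, block_dim.y, block_dim.z, threads,
                prop.maxThreadsPerBlock);
  }
  return ok;
}
```

Pairing a check like this with `cudaGetLastError()` right after the launch, plus the test_ls_op.py comparison, covers both the launch configuration and the numerics.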
> You can wrap part of the model with the `torch_scope` interface; the part inside the torch scope will be trained in fp32, for example in the MoE example:
>
> https://github.com/Tencent/PatrickStar/blob/0731c6ed2065e62d0cd489813b4e162880a5ab51/examples/moe/moe_bert.py#L53-L64
>
> Note, however, that if you only want to set a single layer to fp32, `do_allreduce` here should be set to `True`.

Nice! Does that mean this part is managed by torch itself and PatrickStar doesn't need to be involved?
> @Jack47 @liaojianjin We are currently doing a full refactoring of PatrickStar... so these features may change later. For example, we may end up reusing PyTorch autocast directly instead of implementing our own version of mixed-precision training, in which case the problem raised in this issue of setting layernorm to fp32 would be solved naturally, and there would be no need to re-align precision after migration. So the interface exposed right now may be rather rough, sorry about that...

Got it, 👍
> 1. Can you post the detailed configurations, including PyTorch, CUDA, g++, etc.?

Please see https://github.com/utsaslab/MONeT/blob/master/install.sh#L11:

> conda install pytorch==1.5.1 torchvision==0.6.1 cudatoolkit=10.1 -c pytorch -y
Seems it's already supported in this MR: https://github.com/bytedance/lightseq/pull/299/files