laekov
`fmoe_cuda` should contain the method `assign_pos`, but it should not contain a method named `assign_pos_`.
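For reference, a quick way to check your local build (a sketch; it assumes the `fmoe_cuda` extension has already been compiled and installed from the FastMoE source tree):

```python
# Sanity-check the compiled extension (assumes fmoe_cuda was built from the
# FastMoE source tree, e.g. via `python setup.py install`).
import fmoe_cuda

print(hasattr(fmoe_cuda, "assign_pos"))   # expected: True
print(hasattr(fmoe_cuda, "assign_pos_"))  # expected: False
```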
Good question. There has been no such attempt so far.
Neither the single-GPU version nor the multi-GPU parallel version of FastMoE modifies the kv-cache. In theory it is orthogonal to paged attention, so the two can be used together.
Yes, yes, yes and yes, yes...
We do not include a fine-tuning example in the repository. Please implement it yourself.
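For what it's worth, a minimal, unofficial fine-tuning sketch could look like the following. It assumes a single GPU with a working `fmoe_cuda` build and the default gate; the `FMoETransformerMLP` arguments, the (hypothetical) checkpoint loading, and the loss should all be adapted to your own model and FastMoE version:

```python
# Minimal, unofficial fine-tuning sketch for a single FastMoE layer.
# Assumptions: single GPU, fmoe_cuda built, default gate settings.
import torch
from fmoe import FMoETransformerMLP

d_model = 256
layer = FMoETransformerMLP(num_expert=4, d_model=d_model, d_hidden=1024).cuda()

# Hypothetical: restore previously trained weights before fine-tuning.
# layer.load_state_dict(torch.load("moe_layer.pt"))

optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-5)
for step in range(10):                        # a few fine-tuning steps
    x = torch.randn(8, d_model, device="cuda")
    out = layer(x)                            # MoE forward pass
    loss = out.float().pow(2).mean()          # placeholder loss; use your task loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```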
This is because the gradients are synchronized across the DP group, so they are identical. Meanwhile, the sum of a parameter tensor should be collected from the whole MP group.
The key point is that the experts are different in a DP group of Megatron-LM (and also MP group in previous versions of FastMoE), so we have to reduce them....
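To illustrate the rule (a hypothetical helper, not the code in FastMoE or Megatron-LM), a reduction of a statistic such as the squared gradient norm could be written as below. It assumes expert parameters carry the `dp_comm == "none"` attribute that FastMoE attaches to them, which you should verify against your FastMoE version:

```python
# Hypothetical helper illustrating the reduction rule above; not FastMoE code.
# Assumes torch.distributed is initialized and gradients live on the current
# CUDA device. `expert_group` is the process group across which the experts
# are sharded (the DP group when FastMoE is used with Megatron-LM).
import torch
import torch.distributed as dist

def global_grad_sq_norm(model, expert_group):
    expert_sq = torch.zeros(1, device="cuda")
    shared_sq = torch.zeros(1, device="cuda")
    for p in model.parameters():
        if p.grad is None:
            continue
        sq = p.grad.float().pow(2).sum()
        if getattr(p, "dp_comm", None) == "none":
            expert_sq += sq      # experts differ on every rank of the group
        else:
            shared_sq += sq      # already identical on every rank, read locally
    # Expert contributions must be summed over the group where experts differ.
    dist.all_reduce(expert_sq, group=expert_group)
    return (shared_sq + expert_sq).item()
```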
What is the version of your Megatron-LM? They have been significantly changing the API in their recent versions.
1.1.0 may be too old. We have verified support for 2.2, 2.5 and 3.0.2. If you have to use 1.1.0, you need to modify either Megatron-LM or FastMoE.