laekov

38 comments by laekov

`fmoe_cuda` should contain an `assign_pos` method, but it should not contain an `assign_pos_` method.
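As a quick check, here is a minimal sketch (assuming FastMoE has already been built and installed so that `fmoe_cuda` imports) to inspect which bindings the compiled extension exposes:

```python
# Minimal sketch: inspect which bindings the compiled extension exposes.
# Assumes FastMoE is built and installed so that `fmoe_cuda` imports.
import fmoe_cuda

print(hasattr(fmoe_cuda, "assign_pos"))   # expected to be True
print(hasattr(fmoe_cuda, "assign_pos_"))  # expected to be False
```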

Good question. There has been no such attempt so far.

Neither the single-GPU version nor the multi-GPU parallel version of FastMoE modifies the kv-cache. In theory it is orthogonal to paged attention, so the two can be used together.

We do not include a fine-tuning example in the repository; please implement one yourself.
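For orientation only, a minimal single-GPU sketch of what a fine-tuning loop around FastMoE's `FMoETransformerMLP` layer could look like; the layer sizes, optimizer settings, and dummy data are placeholder assumptions, not an official recipe:

```python
# Minimal single-GPU fine-tuning sketch (not an official FastMoE example).
# FMoETransformerMLP is FastMoE's MoE feed-forward block; all hyperparameters
# and the dummy data below are placeholders.
import torch
from fmoe import FMoETransformerMLP

model = FMoETransformerMLP(num_expert=4, d_model=256, d_hidden=1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    x = torch.randn(8, 128, 256, device="cuda")   # dummy input batch
    target = torch.randn_like(x)                  # dummy regression target
    loss = torch.nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```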

This is because the gradients are synchronized across the DP group, so they are identical. Meanwhile, the sum of a parameter tensor should be collected from the whole MP group.
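As an illustration, a minimal sketch of collecting such a parameter sum, assuming a hypothetical `mp_group` process-group handle created elsewhere:

```python
# Minimal sketch: within a DP group the gradients are already identical
# after the data-parallel all-reduce, so a parameter sum only needs an
# extra reduce over the model-parallel (MP) group. `mp_group` is a
# hypothetical process-group handle created elsewhere with dist.new_group.
import torch
import torch.distributed as dist

def global_param_sum(param: torch.Tensor, mp_group) -> torch.Tensor:
    local_sum = param.detach().sum()
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM, group=mp_group)
    return local_sum
```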

The key point is that the experts are different within a DP group of Megatron-LM (and also within the MP group in previous versions of FastMoE), so we have to reduce them…
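A rough sketch of the idea (not FastMoE's actual implementation), assuming expert parameters can be identified by name and using a hypothetical group handle:

```python
# Rough illustration, not FastMoE's actual code: expert weights differ
# across the group, so a global statistic over the experts has to be
# reduced across that group. The "experts" name filter and the handle
# `group_with_distinct_experts` are assumptions for this sketch.
import torch
import torch.distributed as dist

def expert_param_sum(model, group_with_distinct_experts):
    device = next(model.parameters()).device
    total = torch.zeros((), device=device)
    for name, p in model.named_parameters():
        if "experts" in name:
            total += p.detach().sum()
    dist.all_reduce(total, op=dist.ReduceOp.SUM, group=group_with_distinct_experts)
    return total
```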

What is the version of your Megatron-LM? They have been changing the API significantly in their recent versions.

1.1.0 may be too old. We have verified support for 2.2, 2.5, and 3.0.2. If you have to use 1.1.0, you will need to modify either Megatron-LM or FastMoE.
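If you are unsure which version is installed, a minimal check might look like this; the distribution name is an assumption and may differ depending on how Megatron-LM was installed:

```python
# Minimal sketch: check which Megatron-LM is installed before combining it
# with FastMoE. The distribution name "megatron-lm" is an assumption;
# source checkouts often carry no package metadata at all.
from importlib.metadata import PackageNotFoundError, version

try:
    print(version("megatron-lm"))
except PackageNotFoundError:
    print("Megatron-LM is not installed as a package; check the source tree instead.")
```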