ghostplant

Results 272 comments of ghostplant

Hi, the mod files are generated by a compiler project `autort`, which is an integration of different compilation backends. Even if the mod files seem to share the same format, they may...

The numbers with bsz=1 and MTP=0 would be far below that, which is why MTP with a 100% success rate helps a lot. But I have no idea how this question is related to this topic; we...
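
For a rough sense of the gap, here is a back-of-the-envelope sketch; the proposal length `k` and the acceptance rate are assumptions used only to make the arithmetic concrete, not measured numbers:

```python
# Rough decode-throughput estimate with multi-token prediction (MTP).
# Assumption: each step proposes k extra tokens and a fraction `accept`
# of them is kept; with 100% acceptance every step emits (1 + k) tokens.
def tokens_per_step(k: int, accept: float) -> float:
    return 1.0 + k * accept

baseline = tokens_per_step(k=0, accept=0.0)   # bsz=1, MTP disabled -> 1 token/step
with_mtp = tokens_per_step(k=1, accept=1.0)   # 100%-success MTP    -> 2 tokens/step
print(with_mtp / baseline)                    # roughly 2x more tokens per forward step
```
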

May I ask the reason for removing `system.cache()`? By the way, the patch was intended for a very old Fairseq checkpoint. Since it is impractical to keep patches up-to-date with...

Removing the cache makes it impossible to recall each balance loss generated during the forward pass when calculating the loss. As a result, training may become increasingly imbalanced,...
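
The role of that cache can be pictured with a minimal sketch; the names (`_balance_losses`, `moe_layer_forward`, the toy penalty) are illustrative, not the actual Tutel internals. Each MoE layer records its auxiliary balance loss during the forward pass, and the training loop drains the cache to fold those terms into the total loss:

```python
import torch

# Illustrative stand-in for the cache: each MoE layer appends its
# auxiliary load-balancing loss here during the forward pass.
_balance_losses = []

def moe_layer_forward(x, gate_logits):
    # ... routing / expert computation would happen here ...
    probs = torch.softmax(gate_logits, dim=-1)
    balance_loss = probs.var(dim=-1).mean()   # toy balance penalty
    _balance_losses.append(balance_loss)      # the "system.cache()"-like recall point
    return x

def total_loss(task_loss, aux_weight=0.01):
    # Without the cache, these per-layer losses cannot be recalled here,
    # so the balancing term silently drops out of the objective.
    aux = torch.stack(_balance_losses).sum() if _balance_losses else 0.0
    _balance_losses.clear()
    return task_loss + aux_weight * aux
```
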

Nope. In the `data` type, non-shared parameters are handled in ZeRO-2 style, so their gradients are still unique and independent.
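
A minimal sketch of what "unique and independent in gradients" means in ZeRO-2 style; this is not the actual Tutel code, and it assumes the gradient length is divisible by the world size. Instead of an all-reduce that leaves every rank with an identical full gradient, each rank ends up owning only its own reduced shard:

```python
import torch
import torch.distributed as dist

def zero2_style_grad_sync(grad: torch.Tensor) -> torch.Tensor:
    """ZeRO-2 style gradient sharding sketch: each rank keeps only its
    own reduced shard, so the shards remain unique and independent
    across ranks (unlike a plain data-parallel all-reduce)."""
    world = dist.get_world_size()
    shards = [c.contiguous() for c in grad.chunk(world)]
    my_shard = torch.empty_like(shards[0])
    dist.reduce_scatter(my_shard, shards)   # each rank receives a distinct shard
    return my_shard
```
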

Hi, unless you want to change the training GPU environment, you don't really need to do the conversion. Assume your model is 20GB of shared parameters and 800GB of non-shared...
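
A back-of-the-envelope illustration of why the shared/non-shared split matters; the GPU count below is an assumption used only to make the arithmetic concrete:

```python
# Hedged arithmetic sketch: shared parameters are replicated on every GPU,
# while non-shared (expert) parameters are partitioned across GPUs.
shared_gb, non_shared_gb = 20, 800
num_gpus = 64                      # assumed, only to make the numbers concrete
per_gpu = shared_gb + non_shared_gb / num_gpus
print(per_gpu)                     # 20 + 12.5 = 32.5 GB of parameters per GPU
```
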

Hello, we found that the fairseq_moe instructions are too old **while the official fairseq has also stopped maintenance** and **the dataset link no longer works either**, so we're going to remove this...

Hello, Tutel currently does not support Huawei Ascend because we do not have the hardware model and SDK for it. However, we would be willing to support it if it...

Hello, dp is `parallel_type == 0`, which uses all_gather for ZeRO-2. This type is usually slower, especially when the expert parameters are larger than the activation sizes.
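
The intuition can be put into numbers with a rough sketch; the sizes below are assumptions. With `parallel_type == 0` every step moves the expert parameters via all_gather, whereas expert parallelism moves the routed activations via all_to_all, so the former loses as soon as the parameter volume exceeds the activation volume:

```python
# Rough per-step communication-volume comparison (assumed sizes, in GB).
expert_params_gb = 8.0      # non-shared expert parameters to all_gather
activations_gb   = 0.5      # routed activations to all_to_all per step

dp_traffic = expert_params_gb   # parallel_type == 0: all_gather of expert weights
ep_traffic = activations_gb     # expert parallelism: all_to_all of activations
print('dp is slower' if dp_traffic > ep_traffic else 'dp is fine')
```
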

The `max_tokens` field isn't handled through the REST JSON API yet. Instead, it is currently a static global setting specified by the argument `--max_seq_len` (the version 20250715 has a fine-grain...
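
In other words, a hypothetical request just to illustrate the current behaviour; the endpoint path and field names here are assumptions, not the project's documented API. A `max_tokens` field in the JSON body is ignored, and the generation length is governed by the server-side flag:

```python
import requests

# Hypothetical illustration: the per-request field is ignored today,
# so the effective limit is whatever the server was started with,
# e.g. `--max_seq_len 4096`.
resp = requests.post('http://localhost:8000/v1/completions', json={
    'prompt': 'Hello',
    'max_tokens': 128,   # currently not honoured by the REST JSON API
})
print(resp.json())
```
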