Zheng Cai
> @zigzagcai the SigOpt blog content is no longer hosted. We may choose to remedy this in the future, but if you would like to read a specific blog post...
Is FP8 grouped GEMM supported?
Hi @mgyong, very glad to see that you would like to integrate LinGen as a model option into our framework. Also, I have some experience with SSM/Mamba, so I will follow...
The models currently built into InternEvo (InternLM/InternLM2) do not use tied word embeddings. I also checked, and similar models such as LLaMA do not use tied word embeddings either: https://github.com/meta-llama/llama/issues/138
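For clarity, "tied word embeddings" just means the output head reuses the input embedding matrix instead of keeping a separate weight; a minimal illustration (not InternEvo code, sizes are placeholders):

```python
# Minimal illustration of tied word embeddings (which InternLM/InternLM2 do NOT use):
# the LM head shares its weight tensor with the input embedding.
import torch.nn as nn

vocab_size, hidden_size = 32000, 4096  # placeholder sizes
embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
lm_head.weight = embedding.weight  # tying: both layers share one parameter tensor
```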
> PyTorch FSDP gathers the module params before each forward and backward so that module implementations can just access them like normal. I wonder if your framework could use a...
The basic idea of our ZeRO3 weight-parallel implementation: in `WPFusedDenseFunc` https://github.com/InternLM/InternEvo/blob/feat/refactor-impl/internlm/model/model_ops/modules/linear.py#L171-L315, we all-gather the weights in the forward pass, then all-gather the weights again and reduce-scatter the gradients in the backward pass. And we...
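A minimal sketch of that all-gather / reduce-scatter pattern (not the actual `WPFusedDenseFunc` code; the `process_group` argument and the even dim-0 sharding of `weight_shard` are assumptions for illustration):

```python
# Sketch of a ZeRO3-style weight-parallel linear: the full weight is never stored,
# only a dim-0 shard; it is re-gathered for each forward and backward pass.
import torch
import torch.distributed as dist


class WeightParallelLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight_shard, process_group):
        world_size = dist.get_world_size(process_group)
        # fwd: all-gather the sharded weight so this rank holds the full matrix
        full_weight = torch.empty(
            world_size * weight_shard.shape[0], weight_shard.shape[1],
            dtype=weight_shard.dtype, device=weight_shard.device,
        )
        dist.all_gather_into_tensor(full_weight, weight_shard, group=process_group)
        ctx.save_for_backward(x, weight_shard)
        ctx.process_group = process_group
        return x @ full_weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        x, weight_shard = ctx.saved_tensors
        process_group = ctx.process_group
        world_size = dist.get_world_size(process_group)
        # bwd: the full weight was not kept to save memory, so all-gather it again
        full_weight = torch.empty(
            world_size * weight_shard.shape[0], weight_shard.shape[1],
            dtype=weight_shard.dtype, device=weight_shard.device,
        )
        dist.all_gather_into_tensor(full_weight, weight_shard, group=process_group)
        grad_input = grad_output @ full_weight
        # compute the full weight gradient, then reduce-scatter so each rank
        # keeps only the gradient slice matching its own weight shard
        full_grad_weight = (
            grad_output.reshape(-1, grad_output.shape[-1]).t()
            @ x.reshape(-1, x.shape[-1])
        )
        grad_weight_shard = torch.empty_like(weight_shard)
        dist.reduce_scatter_tensor(grad_weight_shard, full_grad_weight, group=process_group)
        return grad_input, grad_weight_shard, None
```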
> I'm not sure what you mean - if you want to run some Linear layers in fp8 and the rest in higher precision, or you want to run for...
Completed in https://github.com/InternLM/InternEvo/pull/244
1. load Hugging Face datasets in `streaming` mode, i.e., lazily load data samples with no need to download the whole dataset before training
2. on-the-fly tokenization
3. support...
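A rough sketch of what streaming plus on-the-fly tokenization looks like with the `datasets` library (not the PR's actual code; the dataset and tokenizer names below are placeholders):

```python
# Streaming mode returns an IterableDataset: samples are fetched lazily,
# so the full dataset never has to be downloaded before training starts.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)

ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

# tokenize on the fly as samples are iterated, instead of pre-tokenizing to disk
ds = ds.map(lambda ex: tokenizer(ex["text"]), remove_columns=["text"])

for sample in ds.take(2):
    print(len(sample["input_ids"]))
```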
> It depends on the type of communication. For FP8 with delayed scaling: > > * Tensor-parallel communication: all-gather in FP8 (see [`_all_gather_fp8`](https://github.com/NVIDIA/TransformerEngine/blob/a7eeb28bd917a647abf7854fa22239b8ee85c2af/transformer_engine/pytorch/distributed.py#L844)), reduce-scatter in BF16 (see [`reduce_scatter_along_first_dim`](https://github.com/NVIDIA/TransformerEngine/blob/a7eeb28bd917a647abf7854fa22239b8ee85c2af/transformer_engine/pytorch/distributed.py#L821)) > *...
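A conceptual sketch of that communication split (not TransformerEngine's implementation; the helper names and the simple per-tensor `scale` are assumptions for illustration):

```python
# Sketch: all-gather payloads cast to FP8 (e4m3), reduce-scatter kept in BF16,
# since summing gradients in FP8 would lose too much precision/dynamic range.
import torch
import torch.distributed as dist


def all_gather_fp8(x_bf16, scale, group):
    """Quantize with a per-tensor scale to FP8, then all-gather the raw bytes."""
    x_fp8 = (x_bf16 * scale).to(torch.float8_e4m3fn)
    world_size = dist.get_world_size(group)
    out = torch.empty(world_size * x_fp8.shape[0], *x_fp8.shape[1:],
                      dtype=torch.uint8, device=x_fp8.device)
    # communicate raw bytes; FP8 and uint8 are both 1 byte per element
    dist.all_gather_into_tensor(out, x_fp8.view(torch.uint8), group=group)
    return out.view(torch.float8_e4m3fn).to(torch.bfloat16) / scale


def reduce_scatter_bf16(grad_bf16, group):
    """Gradient reduction stays in BF16 to preserve accuracy of the summation."""
    world_size = dist.get_world_size(group)
    out = torch.empty(grad_bf16.shape[0] // world_size, *grad_bf16.shape[1:],
                      dtype=grad_bf16.dtype, device=grad_bf16.device)
    dist.reduce_scatter_tensor(out, grad_bf16, group=group)
    return out
```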
I have the same interest in block-wise FP8.