Zheng Cai

Results: 60 comments by Zheng Cai

> @zigzagcai the SigOpt blog content is no longer hosted. We may choose to remedy this in the future, but if you would like to read a specific blog post...

Hi @mgyong, very glad to see that you would like to integrate LinGen as a model option into our framework. Also, I have some experience with SSM/Mamba, so I will follow...

The models currently built into InternEvo (InternLM/InternLM2) do not use tied word embeddings. I also checked, and similarly LLaMA does not use tied word embeddings either: https://github.com/meta-llama/llama/issues/138
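For context, "tied word embeddings" just means the input embedding matrix and the output `lm_head` projection share one parameter tensor. A minimal PyTorch sketch (the model name and sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal LM illustrating tied word embeddings (hypothetical sizes)."""
    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)    # weight: (vocab, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)  # weight: (vocab, hidden)
        # Tying: the output projection reuses the embedding matrix, so the
        # two layers share a single parameter tensor instead of two.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids)   # (batch, seq, hidden)
        return self.lm_head(h)      # (batch, seq, vocab)

model = TinyLM()
assert model.lm_head.weight is model.embed.weight  # one shared tensor
```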

> PyTorch FSDP gathers the module params before each forward and backward so that module implementations can just access them like normal. I wonder if your framework could use a...

The basic idea of our ZeRO3 weight parallel implementation: in `WPFusedDenseFunc` https://github.com/InternLM/InternEvo/blob/feat/refactor-impl/internlm/model/model_ops/modules/linear.py#L171-L315, we all-gather the weights in the forward pass, then all-gather the weights again and reduce-scatter the gradients in the backward pass. And we...
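The communication pattern described above can be sketched as a custom autograd function. This is an illustrative reconstruction of the idea, not the actual `WPFusedDenseFunc`; it assumes a 2D input and a row-sharded weight:

```python
import torch
import torch.distributed as dist

def _gather_rows(shard, world, group):
    """All-gather row shards into the full (world * rows, cols) weight."""
    full = torch.empty(world * shard.shape[0], shard.shape[1],
                       dtype=shard.dtype, device=shard.device)
    dist.all_gather_into_tensor(full, shard, group=group)
    return full

class ShardedLinearFunc(torch.autograd.Function):
    """ZeRO3-style weight parallelism: each rank stores only a shard of the
    weight; the full weight exists only transiently inside fwd/bwd."""

    @staticmethod
    def forward(ctx, x, weight_shard, group):
        world = dist.get_world_size(group)
        ctx.save_for_backward(x, weight_shard)
        ctx.group, ctx.world = group, world
        # fwd: all-gather the full weight, use it, then let it go out of
        # scope so only the shard stays resident between fwd and bwd.
        return x @ _gather_rows(weight_shard, world, group).t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight_shard = ctx.saved_tensors
        # bwd: all-gather the full weight again for the input gradient...
        full_w = _gather_rows(weight_shard, ctx.world, ctx.group)
        grad_x = grad_out @ full_w
        # ...then reduce-scatter the full weight gradient so each rank
        # keeps only the gradient slice matching its own shard.
        grad_w_full = grad_out.t() @ x                  # (out, in)
        grad_w_shard = torch.empty_like(weight_shard)
        dist.reduce_scatter_tensor(grad_w_shard, grad_w_full, group=ctx.group)
        return grad_x, grad_w_shard, None
```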

> I'm not sure what you mean - if you want to run some Linear layers in fp8 and the rest in higher precision, or you want to run for...

Completed in https://github.com/InternLM/InternEvo/pull/244:
1. load HuggingFace datasets in `streaming` mode, i.e., data samples are loaded lazily and there is no need to download the whole dataset before training
2. on-the-fly tokenization
3. support...
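Points 1 and 2 can be illustrated with the public `datasets` and `transformers` APIs; the dataset and tokenizer names below are placeholders, not necessarily what the PR uses:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Streaming mode: samples are fetched lazily over the network, so the
# whole dataset never has to be downloaded before training starts.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b",
                                          trust_remote_code=True)

# On-the-fly tokenization: map() on a streaming dataset is applied
# per-sample as the iterator is consumed, not as a preprocessing pass.
tokenized = ds.map(lambda ex: tokenizer(ex["text"]))

for sample in tokenized.take(2):
    print(len(sample["input_ids"]))
```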

> It depends on the type of communication. For FP8 with delayed scaling:
>
> * Tensor-parallel communication: all-gather in FP8 (see [`_all_gather_fp8`](https://github.com/NVIDIA/TransformerEngine/blob/a7eeb28bd917a647abf7854fa22239b8ee85c2af/transformer_engine/pytorch/distributed.py#L844)), reduce-scatter in BF16 (see [`reduce_scatter_along_first_dim`](https://github.com/NVIDIA/TransformerEngine/blob/a7eeb28bd917a647abf7854fa22239b8ee85c2af/transformer_engine/pytorch/distributed.py#L821))
> * ...
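The asymmetry the quote describes comes from the fact that an all-gather only moves bytes, while a reduce-scatter sums across ranks and therefore needs a wider accumulation type. A hedged `torch.distributed` sketch of that split, assuming contiguous tensors and `torch.float8_e4m3fn` storage; this is not TransformerEngine's actual code path:

```python
import torch
import torch.distributed as dist

def all_gather_fp8(shard: torch.Tensor, group=None) -> torch.Tensor:
    """All-gather can stay in FP8: no arithmetic happens on the wire, so
    the shards are viewed as raw uint8 bytes for the collective and
    reinterpreted as FP8 afterwards."""
    world = dist.get_world_size(group)
    out = torch.empty(world * shard.shape[0], *shard.shape[1:],
                      dtype=shard.dtype, device=shard.device)
    dist.all_gather_into_tensor(out.view(torch.uint8),
                                shard.view(torch.uint8), group=group)
    return out

def reduce_scatter_bf16(full_grad: torch.Tensor, group=None) -> torch.Tensor:
    """Reduce-scatter runs in BF16: accumulating gradients directly in FP8
    would lose most of the signal to rounding and overflow."""
    world = dist.get_world_size(group)
    grad = full_grad.to(torch.bfloat16)
    shard = torch.empty(grad.shape[0] // world, *grad.shape[1:],
                        dtype=torch.bfloat16, device=grad.device)
    dist.reduce_scatter_tensor(shard, grad, group=group)
    return shard
```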

I have the same interest in block-wise FP8.