Zheng Cai
> @zigzagcai the SigOpt blog content is no longer hosted. We may choose to remedy this in the future, but if you would like to read a specific blog post...
Is FP8 grouped GEMM supported?
Hi @mgyong, very glad to see that you would like to integrate LinGen as a model option into our framework. Also, I have some experience with SSM/Mamba, so I will follow...
The models currently built into InternEvo (InternLM/InternLM2) do not use tied word embeddings. I also checked, and similar models such as LLaMA do not use tied word embeddings either: https://github.com/meta-llama/llama/issues/138
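For clarity, "tied word embeddings" just means the output head reuses the input embedding matrix instead of keeping a separate weight; a minimal illustration (not InternEvo code, sizes are placeholders):

```python
# Minimal illustration of tied word embeddings (which InternLM/InternLM2 do NOT use):
# the LM head shares its weight tensor with the input embedding.
import torch.nn as nn

vocab_size, hidden_size = 32000, 4096  # placeholder sizes
embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
lm_head.weight = embedding.weight  # tying: both layers share one parameter tensor
```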
> PyTorch FSDP gathers the module params before each forward and backward so that module implementations can just access them like normal. I wonder if your framework could use a...
The basic idea of our ZeRO3 weight-parallel implementation: in `WPFusedDenseFunc` https://github.com/InternLM/InternEvo/blob/feat/refactor-impl/internlm/model/model_ops/modules/linear.py#L171-L315, we all-gather the weights in the forward pass, then all-gather the weights again and reduce-scatter the gradients in the backward pass. And we...
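A minimal sketch of that all-gather / reduce-scatter pattern (not the actual `WPFusedDenseFunc` code; the `process_group` argument and the even dim-0 sharding of `weight_shard` are assumptions for illustration):

```python
# Sketch of a ZeRO3-style weight-parallel linear: the full weight is never stored,
# only a dim-0 shard; it is re-gathered for each forward and backward pass.
import torch
import torch.distributed as dist


class WeightParallelLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight_shard, process_group):
        world_size = dist.get_world_size(process_group)
        # fwd: all-gather the sharded weight so this rank holds the full matrix
        full_weight = torch.empty(
            world_size * weight_shard.shape[0], weight_shard.shape[1],
            dtype=weight_shard.dtype, device=weight_shard.device,
        )
        dist.all_gather_into_tensor(full_weight, weight_shard, group=process_group)
        ctx.save_for_backward(x, weight_shard)
        ctx.process_group = process_group
        return x @ full_weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        x, weight_shard = ctx.saved_tensors
        process_group = ctx.process_group
        world_size = dist.get_world_size(process_group)
        # bwd: the full weight was not kept to save memory, so all-gather it again
        full_weight = torch.empty(
            world_size * weight_shard.shape[0], weight_shard.shape[1],
            dtype=weight_shard.dtype, device=weight_shard.device,
        )
        dist.all_gather_into_tensor(full_weight, weight_shard, group=process_group)
        grad_input = grad_output @ full_weight
        # compute the full weight gradient, then reduce-scatter so each rank
        # keeps only the gradient slice matching its own weight shard
        full_grad_weight = (
            grad_output.reshape(-1, grad_output.shape[-1]).t()
            @ x.reshape(-1, x.shape[-1])
        )
        grad_weight_shard = torch.empty_like(weight_shard)
        dist.reduce_scatter_tensor(grad_weight_shard, full_grad_weight, group=process_group)
        return grad_input, grad_weight_shard, None
```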
> I'm not sure what you mean - if you want to run some Linear layers in fp8 and the rest in higher precision, or you want to run for...
Completed in https://github.com/InternLM/InternEvo/pull/244
1. load Hugging Face datasets in `streaming` mode, i.e., lazily load data samples with no need to download the whole dataset before training
2. on-the-fly tokenization
3. support...
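A rough sketch of what streaming plus on-the-fly tokenization looks like with the `datasets` library (not the PR's actual code; the dataset and tokenizer names below are placeholders):

```python
# Streaming mode returns an IterableDataset: samples are fetched lazily,
# so the full dataset never has to be downloaded before training starts.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)

ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

# tokenize on the fly as samples are iterated, instead of pre-tokenizing to disk
ds = ds.map(lambda ex: tokenizer(ex["text"]), remove_columns=["text"])

for sample in ds.take(2):
    print(len(sample["input_ids"]))
```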
> It depends on the type of communication. For FP8 with delayed scaling: > > * Tensor-parallel communication: all-gather in FP8 (see [`_all_gather_fp8`](https://github.com/NVIDIA/TransformerEngine/blob/a7eeb28bd917a647abf7854fa22239b8ee85c2af/transformer_engine/pytorch/distributed.py#L844)), reduce-scatter in BF16 (see [`reduce_scatter_along_first_dim`](https://github.com/NVIDIA/TransformerEngine/blob/a7eeb28bd917a647abf7854fa22239b8ee85c2af/transformer_engine/pytorch/distributed.py#L821)) > *...
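A conceptual sketch of that communication split (not TransformerEngine's implementation; the helper names and the simple per-tensor `scale` are assumptions for illustration):

```python
# Sketch: all-gather payloads cast to FP8 (e4m3), reduce-scatter kept in BF16,
# since summing gradients in FP8 would lose too much precision/dynamic range.
import torch
import torch.distributed as dist


def all_gather_fp8(x_bf16, scale, group):
    """Quantize with a per-tensor scale to FP8, then all-gather the raw bytes."""
    x_fp8 = (x_bf16 * scale).to(torch.float8_e4m3fn)
    world_size = dist.get_world_size(group)
    out = torch.empty(world_size * x_fp8.shape[0], *x_fp8.shape[1:],
                      dtype=torch.uint8, device=x_fp8.device)
    # communicate raw bytes; FP8 and uint8 are both 1 byte per element
    dist.all_gather_into_tensor(out, x_fp8.view(torch.uint8), group=group)
    return out.view(torch.float8_e4m3fn).to(torch.bfloat16) / scale


def reduce_scatter_bf16(grad_bf16, group):
    """Gradient reduction stays in BF16 to preserve accuracy of the summation."""
    world_size = dist.get_world_size(group)
    out = torch.empty(grad_bf16.shape[0] // world_size, *grad_bf16.shape[1:],
                      dtype=grad_bf16.dtype, device=grad_bf16.device)
    dist.reduce_scatter_tensor(out, grad_bf16, group=group)
    return out
```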
I have the same interest in block-wise FP8.