LLaMA-Megatron issues

Have you backed up the Megatron-LM version you use?

1

Hi, I'm trying to use your code to pretrain llama2-7b, but I find that Megatron-LM has been updated recently and some code such as 'indexed_dataset' has been removed or changed in the latest...

EthanLI24

MegatronLM WeChat discussion group

1

Hi, to make discussion easier, could you join this WeChat discussion group? ![image](https://github.com/MoFHeka/LLaMA-Megatron/assets/31175974/1b56dd7e-1c55-4e89-a3dd-07fd306c2d65)

anxiangsir

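Questions about LLAMA pretraining. Thanks!

1. About flash-attention: which version of the flash-attn library did you use for pretraining? I set it to 1.18.0, but it fails with: RuntimeError: Expected is_sm90 || is_sm8x || is_sm75 to be true, but got false. I'm not sure whether this is a flash-attn version problem or whether the V100 simply does not support flash-attn.
2. About the loss when pretraining LLAMA: because flash-attn is unavailable on my server, I pretrained without it. That should only affect speed, not model quality. My question is this: I pretrained a 400M llama from scratch, but once training had basically converged, the training loss stayed at around 3.x, which does not match the 2.x training loss that the Chinchilla paper reports for models of the same size. I also pretrained a 400M GPT-2 model with the official megatron-lm repository, and its converged training loss reached 2.7, which is consistent with the Chinchilla paper. With your repository, however, the converged training loss of the 400M llama is on the high side. Is this normal? Have you verified the correctness of the code? Or is a training loss of 3.2 simply normal for a 400M-parameter LLAMA model? Also, does the 13B LLAMA model pretrained in your repository match the quality of the official LLAMA model? Below are the hyperparameter settings for my 400M LLAMA pretraining: ...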

zhangbin1997
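
The flash-attn error quoted in the issue above lists the architectures the check accepts (sm75, sm8x, sm90). A V100 reports compute capability 7.0, so it would not pass that check regardless of the installed version. The following is a minimal sketch of the same test in Python; the helper name `flash_attn_arch_ok` is hypothetical and not part of flash-attn or this repository.

```python
# Minimal sketch: mirror the condition quoted in the error message
# (is_sm90 || is_sm8x || is_sm75). The helper name is hypothetical.

def flash_attn_arch_ok(major: int, minor: int) -> bool:
    """Return True if compute capability (major, minor) is sm75, sm8x, or sm90."""
    is_sm75 = (major, minor) == (7, 5)
    is_sm8x = major == 8
    is_sm90 = (major, minor) == (9, 0)
    return is_sm75 or is_sm8x or is_sm90


if __name__ == "__main__":
    # Optional: query the local GPU if PyTorch and CUDA are available.
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)  # (major, minor)
        print(capability, flash_attn_arch_ok(*capability))
```

Calling `flash_attn_arch_ok(7, 0)` covers the V100 case without needing a GPU.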

About Megatron-based LLAMA inference code

1

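Hi, I have implemented LLAMA pretraining based on your LLaMA-Megatron repository and the Megatron-LM repository, but neither repository seems to provide Megatron-based inference code for the LLAMA model. Have you already implemented it? Also, the official Megatron-LM repository only contains GPT Evaluation and Bert Evaluation code. If I adapt the official GPT inference code directly into LLAMA inference code, is that likely to introduce many bugs? Thanks!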

zhangbin1997

Error raised at output = torch.matmul(total_input, weight.t())!!!

1

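Hello, as the title says, the error is raised at line 243 of megatron/core/tensor_parallel/layers.py. In my experience this kind of error is usually caused by a dimension mismatch. Since this is my first time using llama and megatron, I would like to ask whether you have run into this problem before. I'm not sure whether it is caused by my using the gpt2 vocab-file and merge-file during data preprocessing, or by something else. I also printed the shapes and devices of total_input and weight.t(), and they all match. Thanks! WARNING:torch.distributed.run: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune...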

zhangbin1997
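
One way to test the vocab-file hypothesis raised in the issue above is to confirm that every token id in the preprocessed data is smaller than the number of rows in the model's embedding table. The sketch below is only an illustration of that check, not a diagnosis of the reported error; the function name and the sample ids are made up.

```python
from typing import Iterable, List


def out_of_range_ids(token_ids: Iterable[int], vocab_size: int) -> List[int]:
    """Return the sorted unique token ids that fall outside [0, vocab_size)."""
    return sorted({t for t in token_ids if t < 0 or t >= vocab_size})


# Illustrative ids only; in practice these would come from the preprocessed dataset.
sample_ids = [15496, 995, 50256, 50300]

# Assumed embedding size for illustration; use the model's actual padded vocab size.
bad = out_of_range_ids(sample_ids, vocab_size=32000)
print(bad)
```

An empty list means the ids fit the embedding table, which would point away from the tokenizer files as the cause.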

With model parallel still OOM on A100-40G

1

Using 4 A100-40G GPUs to load llama-13B, I get the following error: ``` torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ pretrain_llama.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-06-28_17:45:54 host : gpu5.example.com rank : 1 (local_rank: 1) exitcode : 1 (pid:...

LZY-the-boys
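
A rough memory estimate helps explain why the setup in the issue above can run out of memory even with model parallelism. The sketch assumes the commonly cited figure of 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance), assumes the model states are split evenly across the 4 GPUs, and ignores activations and framework overhead. The function name is hypothetical, and the real footprint depends on the training configuration.

```python
# Assumed breakdown per parameter for mixed-precision Adam training:
# 2 bytes fp16 weights + 2 bytes fp16 gradients
# + 12 bytes fp32 master weights, momentum, and variance.
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb_per_gpu(n_params: int, model_parallel_size: int) -> float:
    """Estimate model-state memory per GPU in GB (10^9 bytes), excluding activations."""
    return n_params * BYTES_PER_PARAM / model_parallel_size / 1e9


# 13B parameters sharded evenly over 4 GPUs, compared with 40 GB per A100-40G.
estimate = model_state_gb_per_gpu(13_000_000_000, 4)
print(f"{estimate:.1f} GB per GPU vs. 40 GB available")
```

Under these assumptions the model states alone come to about 52 GB per GPU, which already exceeds 40 GB before any activations are counted.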

Could you please provide some details about the differences between the Megatron-LM tokenizer and the HF tokenizer?

2

1. There are some differences between the megatron-lm tokenizer and the HF tokenizer. ``` python llama/tools/preprocess_data.py \ --input /mnt/workspace/{}.json \ --output-prefix \ --vocab-file **gpt2-vocab.json** \ --dataset-impl mmap \ --tokenizer-type **GPT2BPETokenizer** \ --merge-file...

yeyunhu
