Megatron-LM
Load a pretrained T5 model from HuggingFace or Google
We tested the pre-training phase of T5 with model parallelism and data parallelism. It's a really easy framework to use, with high GPU utilization. But a couple of points took us a long time when loading pre-trained checkpoints from HuggingFace or the official Google release:
a) model structure differences, such as the SentencePiece tokenizer, LayerNorm without bias, attention without bias, etc.
b) manually splitting the pretrained checkpoint for 4-way model parallelism is painful; is there a better way to do this? We just maintain a list of which layers are cut by row or column (a rough sketch of that approach follows below).
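On point b), here is a minimal sketch of the row/column splitting idea, assuming a plain PyTorch state dict; the key suffixes in the two lists are illustrative, not the exact Megatron or HuggingFace parameter names:

```python
import torch

TP_SIZE = 4  # tensor/model-parallel degree

# Illustrative key suffixes, not exact Megatron/HuggingFace names.
# Column-parallel layers (QKV projection, first FFN matrix) are split along
# the output dim, which is dim 0 of a PyTorch [out, in] weight.
COLUMN_PARALLEL = ("attention.query_key_value.weight", "mlp.dense_h_to_4h.weight")
# Row-parallel layers (attention output projection, second FFN matrix) are
# split along the input dim, i.e. dim 1.
ROW_PARALLEL = ("attention.dense.weight", "mlp.dense_4h_to_h.weight")

def shard_state_dict(full_sd, tp_size=TP_SIZE):
    """Split a single checkpoint into one state dict per model-parallel rank."""
    shards = [dict() for _ in range(tp_size)]
    for name, tensor in full_sd.items():
        if name.endswith(COLUMN_PARALLEL):
            pieces = torch.chunk(tensor, tp_size, dim=0)
        elif name.endswith(ROW_PARALLEL):
            pieces = torch.chunk(tensor, tp_size, dim=1)
        else:
            # LayerNorm scales etc. are replicated; T5 has no attention/FFN
            # biases, so there is little else to split here. (Megatron also
            # splits the word embedding along the vocab dim; omitted here.)
            pieces = [tensor] * tp_size
        for rank, piece in enumerate(pieces):
            shards[rank][name] = piece.clone()
    return shards
```

One caveat: Megatron fuses Q/K/V into a single matrix with a per-partition head layout, so a real converter has to reorder the HF q/k/v weights before chunking rather than splitting a naive concatenation.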
I have the same problem, and I would like to know how you solved the position-embedding issue. For T5 models, HuggingFace uses relative position bias, while Megatron uses the classic absolute embedding method.
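For reference, the two checkpoints don't store comparable tensors here at all. A small sketch of what each side contains (the Megatron-side name and shape are assumptions, since they depend on the model config):

```python
import torch
from transformers import T5Model

hf = T5Model.from_pretrained("t5-base")
# HF T5: a relative-position bias table, present only in the first block of
# each stack, with shape [num_buckets, num_heads] (32 x 12 for t5-base).
rel_bias = hf.encoder.block[0].layer[0].SelfAttention.relative_attention_bias.weight
print(rel_bias.shape)

# Megatron's classic scheme (name/shape assumed here): a learned table indexed
# by absolute position, shape [max_seq_len, hidden_size]. Nothing in the HF
# checkpoint maps onto it, so one option is to initialize it fresh and
# fine-tune; another is to port T5's relative-bias logic into Megatron.
position_embeddings = torch.nn.Embedding(512, hf.config.d_model)
```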
I would also like to know how to convert a HuggingFace checkpoint to a Megatron checkpoint. We probably need something like the opposite of this: https://github.com/huggingface/transformers/blob/main/src/transformers/models/bloom/convert_bloom_original_checkpoint_to_pytorch.py If someone from Megatron-LM can give me some hints, I can go ahead and try to write the converter scripts. The minimal things to do, IMO, are:
- Write a converter from huggingface ckpt -> megatron ckpt
- Write a test to make sure at least the forward pass is the same for both models (a rough sketch follows below).
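For the second bullet, a minimal forward-pass check could look like the sketch below; the `megatron_model` argument is a hypothetical stand-in for whatever callable wraps the converted checkpoint and returns logits, and the tolerance depends on precision:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

@torch.no_grad()
def forward_pass_matches(megatron_model, hf_name="t5-small", atol=1e-3):
    """Compare HF and converted-Megatron logits on a single input.

    `megatron_model` is not a Megatron-LM object: it stands in for your own
    wrapper around the converted checkpoint that returns logits of shape
    [batch, seq, vocab] given (input_ids, attention_mask, decoder_input_ids).
    """
    tok = T5Tokenizer.from_pretrained(hf_name)
    hf_model = T5ForConditionalGeneration.from_pretrained(hf_name).eval()

    enc = tok("translate English to German: Hello world", return_tensors="pt")
    # T5 starts decoding from its decoder_start_token_id (the pad token).
    dec_ids = torch.tensor([[hf_model.config.decoder_start_token_id]])

    hf_logits = hf_model(input_ids=enc.input_ids,
                         attention_mask=enc.attention_mask,
                         decoder_input_ids=dec_ids).logits

    meg_logits = megatron_model(enc.input_ids, enc.attention_mask, dec_ids)
    return torch.allclose(hf_logits, meg_logits, atol=atol)
```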
Are there any updates on this question?
And the lm_head layer is also different from HuggingFace's.
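For context, HF's `lm_head` may or may not be tied to the shared embedding depending on the variant, which changes what a converter has to copy. A quick way to check on the HF side (how that weight lands on the Megatron side is left open):

```python
from transformers import T5ForConditionalGeneration

hf = T5ForConditionalGeneration.from_pretrained("t5-base")
# Original T5 (v1.0) ties lm_head to the shared embedding and scales the
# decoder output by d_model**-0.5 before the projection; t5-v1.1 / flan-t5
# keep an untied lm_head whose weight has to be copied explicitly.
tied = hf.lm_head.weight.data_ptr() == hf.shared.weight.data_ptr()
print("lm_head tied to shared embedding:", tied)
print("tie_word_embeddings:", hf.config.tie_word_embeddings)
```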
I have the same question but haven't found any helpful information.