Iz Beltagy
Can you try this for regular BERT and see if you get the same pattern?
As I said, I don't think this is a bug; it is just how the model decided to represent your tokens. As for the similarity measures, maybe normalizing the vector...
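The normalization idea above can be sketched as follows (a minimal, model-agnostic illustration with made-up toy vectors, not BERT-specific):

```python
import numpy as np

def cosine_similarity(u, v):
    # Normalize both vectors so the dot product becomes cosine similarity,
    # removing the effect of embedding magnitude on the comparison.
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.dot(u, v))

# Toy vectors standing in for token embeddings (hypothetical values).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

# After normalization the two are maximally similar (close to 1.0),
# even though a raw dot product would differ by the scale factor.
print(cosine_similarity(a, b))
```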
1. The official package is better if it has gradient accumulation (they have an open PR for it: https://github.com/allenai/allennlp/pull/3051).
2. What do you mean by a regular dependency?
[Here](https://huggingface.co/bigscience/gpt2-350m-en/tree/megatron-deepspeed)'s a megatron-deepspeed checkpoint and [here](https://huggingface.co/bigscience/gpt2-350m-en/tree/main)'s the corresponding HF-transformer checkpoint. We just need to verify that these two are the same.
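One way to sketch that verification (a minimal illustration with dummy state dicts; in practice you would `torch.load()` the converted Meg-DS weights and the HF checkpoint and compare those, and key names may need mapping between the two layouts):

```python
import torch

def state_dicts_match(sd_a, sd_b, atol=1e-6):
    # Two checkpoints "are the same" if they expose identical parameter
    # names and numerically equal tensors (up to a small tolerance).
    if sd_a.keys() != sd_b.keys():
        return False
    return all(torch.allclose(sd_a[k], sd_b[k], atol=atol) for k in sd_a)

# Toy state dicts standing in for the two checkpoints (hypothetical values).
sd1 = {"w": torch.tensor([1.0, 2.0])}
sd2 = {"w": torch.tensor([1.0, 2.0])}
print(state_dicts_match(sd1, sd2))
```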
Yes, to run Meg-DS training. Basically, doing the steps listed in the README here (https://github.com/bigscience-workshop/Megatron-DeepSpeed) for them, so that they only need to run the `pretrain_*` script.
@jaketae can be the first user of the AMI
Dirk's config is this branch https://github.com/allenai/LLM/tree/DirksRun2