
[QUESTION] Sample idx, bin files in public domain for trying out pretrain_gpt.py?

Open sambar1729 opened this issue 7 months ago • 2 comments

Your question

Could sample idx and bin files, as required by pretrain_gpt.py, be made available?

Running tools/preprocess_data.py on some sample data like

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

requires transformer_engine, which takes a long time to build from source on an A100. The pip install also fails:

pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
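For anyone trying to reproduce this, the two sample records can be written to a JSON Lines file with a short script. This is only a minimal sketch: the filename `sample_corpus.jsonl` and the helper `write_jsonl` are made up for illustration and are not part of Megatron-LM.

```python
import json
from pathlib import Path

# The two sample records quoted in this issue.
RECORDS = [
    {"src": "www.nvidia.com", "text": "The quick brown fox",
     "type": "Eng", "id": "0", "title": "First Part"},
    {"src": "The Internet", "text": "jumps over the lazy dog",
     "type": "Eng", "id": "42", "title": "Second Part"},
]


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper, not a Megatron-LM API)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


if __name__ == "__main__":
    # "sample_corpus.jsonl" is an assumed filename chosen for this example.
    out = write_jsonl("sample_corpus.jsonl", RECORDS)
    print(f"wrote {len(RECORDS)} records to {out}")
```

The resulting file is what would be passed to tools/preprocess_data.py through its `--input` argument; the tokenizer and vocabulary arguments depend on the setup and are not shown here.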

This is too much work just to get some training data for running pretrain_gpt.py. Could some sample idx and bin files, as required for pretraining, be provided in a public place?

Thanks.

sambar1729, Jun 26 '24 18:06