Megatron-LM
[QUESTION] Sample idx, bin files in public domain for trying out pretrain_gpt.py?
Your question
Can we have sample idx + bin files as required by pretrain_gpt.py?
Running tools/preprocess_data.py on some sample data like
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
needs transformer_engine, and on an A100 this takes a long time to build from source (installing it with pip via
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
also fails).
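For anyone reproducing this, here is a minimal sketch of how the two sample records above can be written to a one-object-per-line JSON file and read back. It only prepares and sanity-checks the input; it does not run tools/preprocess_data.py or produce the idx/bin files. The file name sample.jsonl and the helper names write_jsonl and read_texts are my own placeholders, and the assumption that the "text" field is the one that gets tokenized should be checked against the script's options.

```python
import json
import os
import tempfile

# The two sample records from the question (copied verbatim).
SAMPLE_RECORDS = [
    {"src": "www.nvidia.com", "text": "The quick brown fox",
     "type": "Eng", "id": "0", "title": "First Part"},
    {"src": "The Internet", "text": "jumps over the lazy dog",
     "type": "Eng", "id": "42", "title": "Second Part"},
]


def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def read_texts(path, key="text"):
    """Read the file back and collect `key` from every non-empty line.

    Assumption: "text" is the field the preprocessing step consumes.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)[key] for line in f if line.strip()]


# Usage: write to a temporary directory and confirm every line parses.
sample_path = write_jsonl(
    SAMPLE_RECORDS, os.path.join(tempfile.mkdtemp(), "sample.jsonl")
)
texts = read_texts(sample_path)
print(f"{len(texts)} records written to {sample_path}")
```

The resulting file is what would then be given to tools/preprocess_data.py as its input, which is the step that fails here.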
This is just too much work to get some training data to run pretrain_gpt.py with.
Can some sample idx, bin files as required by the pretraining be provided in a public place?
Thanks.