
[BUG] Megatron Tokenization script missing / Docs outdated.

Open Pedrexus opened this issue 1 month ago • 0 comments

Hello all, thanks for the nice project.

I installed the docker://nvcr.io/nvidia/nemo:25.09 image, but I cannot follow the docs tutorial at https://docs.nvidia.com/nemo-framework/user-guide/latest/data/pretrain_data.html because /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py is missing from the image.
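For anyone trying to reproduce this, here is a minimal sketch of the check, assuming it is run with Python inside the container. Only the script path comes from the report above; the helper name `missing_paths` is hypothetical and not part of NeMo.

```python
from pathlib import Path

# Path quoted in the report above; everything else here is illustrative.
SCRIPT = "/opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py"


def missing_paths(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Print one line per expected file that is absent from this environment.
    for p in missing_paths([SCRIPT]):
        print(f"missing: {p}")
```

If the script prints nothing, the file is present in that environment and the problem lies elsewhere.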

Are the docs outdated? If so, how should I tokenize a dataset with this image?

Pedrexus • Nov 28 '25 03:11