Tokenizer to be used for generation of data to .npy files
❓ The question
Hi,
I was unable to reopen the previous issue: https://github.com/allenai/OLMo/issues/790, so I am creating a new issue and copying my response below.
Hi Aman,
Thanks for the guidance, I have tried your advice but am still facing difficulties.
First, I tried the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files using the dolma tokens CLI, but it resulted in the following error when OLMo 2 stage 2 training starts:
CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero.
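Since the error appears the moment training starts, I also did a rough sanity check on the generated shards, in case the data loader was seeing no usable tokens. This is only a sketch; the output directory, dtype, and sequence length are placeholders for my setup.

```python
# Rough sanity check on the .npy shards produced by `dolma tokens`.
# Assumptions: shards are raw token-ID streams (which is how OLMo's memmap
# dataset reads them), dtype matches the tokenizer's vocab size (uint32 here),
# and sequence_length matches the training YAML.
import glob
import numpy as np

sequence_length = 4096  # placeholder: value from the training config
for path in sorted(glob.glob("output/*.npy")):
    ids = np.memmap(path, dtype=np.uint32, mode="r")
    print(f"{path}: {len(ids):,} tokens, ~{len(ids) // sequence_length} sequences")
```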
Also, I noted that in the OLMo2-7B-stage2-seed42.yaml script the tokenizer is configured as follows:

tokenizer:
  identifier: tokenizers/allenai_dolma2.json
  truncate_direction: right
I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but that results in the error: OLMoConfigurationError: vocab size mismatch between config and tokenizer.
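For reference, this is how I compared the tokenizer's vocabulary size with what the config expects (a minimal sketch using the Hugging Face tokenizers library; the path is whichever tokenizer JSON the config's identifier points at):

```python
# Minimal sketch: print a tokenizer JSON's vocabulary size so it can be
# compared with the vocab_size / embedding_size values in the training YAML.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json")
print("tokenizer vocab size:", tok.get_vocab_size())
# If this number does not match the model config's vocab_size, OLMo raises the
# "vocab size mismatch between config and tokenizer" error above.
```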
I believe the tokenizer should be consistent across these steps, and the OLMo 2 models seem to use the dolma2 tokenizer. May I get some clarification on this?
I also downloaded the source dataset (already in .npy format) listed in the OLMo2-7B-stage2-seed42.yaml script, and it appears to be tokenized with the dolma2 tokenizer. However, training does not work when I generate the data from my own dataset.
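One rough way I tried to confirm which tokenizer produced a shard is to look at the largest token ID in it: as far as I can tell the dolma2 tokenizer has a vocabulary of roughly 100k entries while gpt-neox-olmo-dolma-v1_5 has roughly 50k, so IDs above ~50k can only have come from dolma2. The shard path and dtype below are placeholders.

```python
# Guess the producing tokenizer from the largest token ID in a shard.
# Assumption: the shard is a raw token-ID stream read the same way OLMo's
# memmap dataset reads it; uint32 is needed for the ~100k dolma2 vocabulary.
import numpy as np

ids = np.memmap("downloaded-shard.npy", dtype=np.uint32, mode="r")
print("max token id:", int(ids.max()))
```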
Hope to get some direction on this issue. Thanks so much!
Hey @WenJett,
- prepare_memmap_dataset.py is deprecated.
- The correct tokenizer to use is tokenizers/allenai_dolma2.json (I was wrong before)
- Can you provide more details about the dataset you're trying to tokenize? I'm not sure why it is producing an empty file (a quick check of the input is sketched below).
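Something like the following would tell you quickly whether the input itself is the problem, e.g. documents with empty text (a rough sketch only; the filename is a placeholder):

```python
# Rough input check: make sure every line of the gzipped JSON-Lines file parses
# and has a non-empty "text" field before running `dolma tokens` on it.
import gzip
import json

total = empty = 0
with gzip.open("data.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        total += 1
        if not doc.get("text", "").strip():
            empty += 1
print(f"{total} documents, {empty} with empty text")
```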
Hi @aman-17,
I have also asked about the same issue on the dolma GitHub: https://github.com/allenai/dolma/issues/225, to which @soldni has kindly responded, but we have not been able to identify the problem.
I have also uploaded the json.gz file I used with the dolma tokens CLI: data.json.gz
From my understanding, the data requires an "id" field and a "text" field. I am not sure if there are any additional requirements or preprocessing steps needed before running the dolma tokens CLI.
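For completeness, this is roughly how I produced the file, with one JSON object per line and at least the "id" and "text" fields (the "source" field is my own addition and may not be required):

```python
# Sketch of how the gzipped JSON-Lines input was produced: one JSON object per
# line with at least "id" and "text". "source" is included speculatively.
import gzip
import json

docs = [
    {"id": "doc-0", "text": "First training document.", "source": "my-dataset"},
    {"id": "doc-1", "text": "Second training document.", "source": "my-dataset"},
]

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```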
Edited to include the error message I received when running stage 2 of the OLMo script.
[2025-01-31 03:26:51] INFO [train:335, rank=0] Checkpoint successfully loaded
[2025-01-31 03:26:51] INFO [train:351, rank=0] Starting training...
[2025-01-31 03:26:51] INFO [olmo.train:967, rank=0] Pre-train system metrics
    System/Peak GPU Memory (MB)=43,795
[2025-01-31 03:26:51] CRITICAL [olmo.util:168, rank=0] Uncaught ZeroDivisionError: division by zero
(traceback panel truncated; it points at /home/q3team/_Q4/OLMo/scripts/train.py:389)
[2025-01-31 03:26:51] CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero
(traceback panel truncated; it points at /home/q3team/_Q4/OLMo/scripts/train.py:389)
[rank0]:[W131 03:26:55.738691064 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[W0131 03:26:56.310000 28522 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 28615 closing signal SIGTERM
[E0131 03:26:56.425000 28522 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 28614) of binary: /usr/local/bin/python3.12
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/train.py FAILED
Failures: <NO_OTHER_FAILURES>
Thanks for your help!
Hey @WenJett, let's go step by step:
- Did you build your dataset using memmap?
- Did you de-tokenize it and check whether you're able to retrieve the correct text? (A rough decoding check is sketched below.)
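A rough way to do that check (a sketch only; the shard path, dtype, and tokenizer file are placeholders for your setup):

```python
# Decode the first few hundred token IDs of a shard and eyeball the text.
# Assumption: the shard is a raw token-ID stream, read the way OLMo's memmap
# dataset reads it; adjust dtype to match the tokenizer's vocabulary size.
import numpy as np
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/allenai_dolma2.json")
ids = np.memmap("output/part-0-00000.npy", dtype=np.uint32, mode="r")
print(tok.decode(ids[:500].tolist()))
```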
Hi, thanks again for the inquiry! We’re currently working on closing out old tickets, so we’re closing this out for now, but if you require a follow-up response, please re-open and we will get back to you!