
Tokenizer to be used for generation of data to .npy files

WenJett opened this issue 11 months ago • 3 comments

❓ The question

Hi,

I was unable to reopen the previous issue (https://github.com/allenai/OLMo/issues/790), so I am opening a new issue and copying my response below.

Hi Aman,

Thanks for the guidance. I have tried your advice but am still facing difficulties.

First, I tried the allenai/gpt-neox-olmo-dolma-v1_5 tokenizer to generate the .npy files with the dolma tokens CLI; however, it resulted in the following error when stage 2 training of OLMo2 starts:

CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero.

I also noted that in the OLMo2-7B-stage2-seed42.yaml script the tokenizer is configured as follows:

    tokenizer:
      identifier: tokenizers/allenai_dolma2.json
      truncate_direction: right

I also tried changing this to allenai_gpt-neox-olmo-dolma-v1_5.json, but that resulted in the error OLMoConfigurationError: vocab size mismatch between config and tokenizer.
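
For reference, a quick way to see that mismatch is to compare the vocabulary size declared in the training config against the tokenizer JSON it points to. This is only a rough sketch, not OLMo code: the config path, the model.vocab_size key, and the tokenizer path are assumptions based on the public OLMo2 YAML configs, so adjust them to your checkout.

```python
# Rough sketch (not part of OLMo): compare the vocab size declared in the
# train config against the tokenizer JSON you point it at. Paths and the
# model.vocab_size key are assumptions -- adjust to your checkout.
import yaml
from tokenizers import Tokenizer

with open("OLMo2-7B-stage2-seed42.yaml") as f:
    cfg = yaml.safe_load(f)

tok = Tokenizer.from_file("tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json")

print("config vocab_size :", cfg["model"]["vocab_size"])
print("tokenizer vocab   :", tok.get_vocab_size())
# The OLMo2 stage-2 configs are sized for the dolma2 tokenizer (~100k entries),
# while gpt-neox-olmo-dolma-v1_5 has ~50k, hence the mismatch error.
```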

I believe the tokenizer should be consistent throughout, and the OLMo2 models seem to use the dolma2 tokenizer. May I get some clarification on this?

I also downloaded the source dataset listed in the OLMo2-7B-stage2-seed42.yaml script (already in .npy format), and it appears to be tokenized with the dolma2 tokenizer. However, it does not work when I generate the data from my own dataset.

Hope to get some direction on this issue. Thanks so much!

WenJett avatar Jan 29 '25 04:01 WenJett

Hey @WenJett,

  1. prepare_memmap_dataset.py is deprecated.
  2. The correct tokenizer to use is tokenizers/allenai_dolma2.json (I was wrong before).
  3. Can you provide more details about the dataset you're trying to tokenize? I'm not sure why it is complaining about an empty file.

aman-17 avatar Jan 30 '25 19:01 aman-17

Hi @aman-17,

I have also asked about the same issue on the dolma GitHub (https://github.com/allenai/dolma/issues/225), to which @soldni has kindly responded, but we have not been able to identify the problem.

I have also uploaded the json.gz file I used with the dolma tokens CLI: data.json.gz

From my understanding, each document requires an "id" field and a "text" field. I am not sure whether there are any additional requirements or steps needed before running the dolma tokens CLI.
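
For reference, a minimal sketch of the input format as I understand it: a gzip-compressed JSONL file with one document per line, each carrying at least an "id" and a "text" field (a "source" field is commonly included in dolma-format documents as well). The file name and contents below are placeholders, not the real dataset.

```python
# Minimal sketch of a dolma-style input file: gzip-compressed JSONL, one
# document per line with at least "id" and "text". The "source" field and
# all values here are placeholders.
import gzip
import json

docs = [
    {"id": "doc-0", "text": "First training document.", "source": "my-dataset"},
    {"id": "doc-1", "text": "Second training document.", "source": "my-dataset"},
]

with gzip.open("data.json.gz", "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")
```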

Edited to include the error message I received when running stage 2 of the OLMo script:

[2025-01-31 03:26:51] INFO [train:335, rank=0] Checkpoint successfully loaded
[2025-01-31 03:26:51] INFO [train:351, rank=0] Starting training...
[2025-01-31 03:26:51] INFO [olmo.train:967, rank=0] Pre-train system metrics
    System/Peak GPU Memory (MB)=43,795
[2025-01-31 03:26:51] CRITICAL [olmo.util:168, rank=0] Uncaught ZeroDivisionError: division by zero

/home/q3team/_Q4/OLMo/scripts/train.py:389 in <module>

    386 │     raise OLMoCliError(f"Usage: {sys.argv[0]} [CONFIG_PATH] [OPTIONS]")
    387 │
    388 │ cfg = TrainConfig.load(yaml_path, [clean_opt(s) for s in args_list])
  ❱ 389 │ main(cfg)
    390 │

/home/q3team/_Q4/OLMo/scripts/train.py:352 in main

    349 │
    350 │     if not cfg.dry_run:
    351 │         log.info("Starting training...")
  ❱ 352 │         trainer.fit()
    353 │         log.info("Training complete")
    354 │     else:
    355 │         log.info("Dry run complete")

/home/q3team/_Q4/OLMo/olmo/train.py:1185 in fit

    1182 │         save_checkpoints: bool = True
    1183 │
    1184 │         with torch_profiler as p:
  ❱ 1185 │             for epoch in range(self.epoch or 0, self.max_epochs):
    1186 │                 for batch in self.train_loader:
    1187 │                     # Bookkeeping.
    1188 │                     # NOTE: To track the global batch size / number of tokens per batch w

/home/q3team/_Q4/OLMo/olmo/train.py:258 in max_epochs

    255 │
    256 │     @property
    257 │     def max_epochs(self) -> int:
  ❱ 258 │         return math.ceil(self.max_steps / self.batches_per_epoch)
    259 │
    260 │     @property
    261 │     def max_steps(self) -> int:

ZeroDivisionError: division by zero

[2025-01-31 03:26:51] CRITICAL [olmo.util:168, rank=1] Uncaught ZeroDivisionError: division by zero

[rank0]:[W131 03:26:55.738691064 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[W0131 03:26:56.310000 28522 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 28615 closing signal SIGTERM
[E0131 03:26:56.425000 28522 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 28614) of binary: /usr/local/bin/python3.12
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/train.py FAILED

Failures: <NO_OTHER_FAILURES>
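
For context, the traceback ends in olmo/train.py's max_epochs property, so batches_per_epoch must be 0, which usually means the data loader found no usable training instances in the generated .npy files. A rough sketch of that arithmetic (assumed, not the exact OLMo code; all numbers are placeholders):

```python
# Rough sketch of the failing arithmetic (assumed, not the exact OLMo code):
# batches_per_epoch is derived from how many training instances the memmap
# dataset exposes. If the generated .npy files contribute no tokens, that
# count is 0 and max_epochs divides by zero. All numbers are placeholders.
import math

total_tokens = 0                # what an empty or unreadable .npy set amounts to
sequence_length = 4096          # placeholder
global_train_batch_size = 1024  # placeholder
max_steps = 10_000              # placeholder

instances = total_tokens // sequence_length
batches_per_epoch = instances // global_train_batch_size

math.ceil(max_steps / batches_per_epoch)  # ZeroDivisionError: division by zero
```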

Thanks for your help!

WenJett avatar Jan 31 '25 01:01 WenJett

Hey @WenJett, let's go step by step:

  1. Did you build your dataset using memmap?
  2. Did you de-tokenize it and check whether you’re able to retrieve the correct text?
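
A minimal sketch of check 2, with an assumed file name and dtype, assuming the generated .npy files are raw token-ID memmaps as OLMo's data loader reads them (uint32 covers the ~100k-entry dolma2 vocabulary; a ~50k vocabulary fits in uint16):

```python
# Minimal sketch of the de-tokenization check (file name and dtype are
# assumptions): memory-map one generated .npy file as raw token IDs and
# decode a slice to see whether the original text comes back.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")

token_ids = np.memmap("output/part-0-00000.npy", dtype=np.uint32, mode="r")
print(f"{len(token_ids)} tokens on disk")          # 0 would explain the error above
print(tokenizer.decode(token_ids[:200].tolist()))  # should read as coherent text
```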

aman-17 avatar Feb 13 '25 21:02 aman-17

Hi, thanks again for the inquiry! We’re currently working on closing out old tickets, so we’re closing this out for now, but if you require a follow-up response, please re-open and we will get back to you!

baileykuehl avatar Jul 01 '25 17:07 baileykuehl