
Distributed Timeout during Dataset Tokenization

Open casper-hansen opened this issue 6 months ago • 12 comments

Please check that this issue hasn't been reported before.

  • [x] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

It shouldn't crash or slow down at this point; tokenization should complete without problems.

Current Behavior

It looks like things go wrong between 48000 -> 48612: only 612 samples were tokenized in that step instead of the usual 1000. This happens over and over, with the counter repeatedly advancing by less than +1000.

Error is triggered:

Tokenizing Prompts (num_proc=64):  92%|█████████▏| 328986/359152 [29:58<12:05, 41.59 examples/s][rank1]:[W516 14:34:56.853490829 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=79, addr=[localhost]:41170, remote=[localhost]:29500) returned 0, likely a timeout
[rank1]:[W516 14:34:56.854597329 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=79, addr=[localhost]:41170, remote=[localhost]:29500) timed out after 1800000ms
[rank1]: Traceback (most recent call last):
[rank1]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank1]:   File "<frozen runpy>", line 88, in _run_code
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/axolotl/cli/train.py", line 124, in <module>
[rank1]:     fire.Fire(do_cli)
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/axolotl/cli/train.py", line 98, in do_cli
[rank1]:     return do_train(parsed_cfg, parsed_cli_args)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/axolotl/cli/train.py", line 52, in do_train
[rank1]:     dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/axolotl/common/datasets.py", line 75, in load_datasets
[rank1]:     train_dataset, eval_dataset, total_num_steps, prompters = prepare_dataset(
[rank1]:                                                               ^^^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/axolotl/utils/data/utils.py", line 39, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/axolotl/utils/data/sft.py", line 69, in prepare_dataset
[rank1]:     with zero_first(is_local_main_process()):
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]:     return next(self.gen)
[rank1]:            ^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/axolotl/utils/distributed.py", line 118, in zero_first
[rank1]:     barrier()
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/axolotl/utils/distributed.py", line 69, in barrier
[rank1]:     dist.barrier()
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/casper/miniconda3/envs/train/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4551, in barrier
[rank1]:     work = group.barrier(opts=opts)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: wait timeout after 1800000ms, keys: /default_pg/0//cuda//0
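
The barrier that fails here uses c10d's default 30-minute timeout (1800000 ms). As a stopgap, when tokenization genuinely needs longer than that, the timeout can be raised where the process group is created. A minimal sketch, assuming the usual torchrun environment variables are set and that you control the init call (axolotl normally initializes the group for you):

from datetime import timedelta

import torch.distributed as dist

# Raise the default 30-minute store/barrier timeout so ranks waiting in
# zero_first() don't give up while rank 0 is still tokenizing.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=4))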

Log of tokenizing:


Tokenizing Prompts (num_proc=64): 0%| | 0/359152 [00:00<?, ? examples/s] 
Tokenizing Prompts (num_proc=64): 0%| | 1000/359152 [00:18<1:53:00, 52.82 examples/s] 
Tokenizing Prompts (num_proc=64): 0%| | 1000/359152 [00:38<1:53:00, 52.82 examples/s] 
Tokenizing Prompts (num_proc=64): 1%| | 2000/359152 [00:41<2:06:23, 47.10 examples/s] 
Tokenizing Prompts (num_proc=64): 1%| | 3000/359152 [00:46<1:22:37, 71.83 examples/s] 
Tokenizing Prompts (num_proc=64): 1%| | 4000/359152 [00:52<1:02:47, 94.27 examples/s] 
Tokenizing Prompts (num_proc=64): 1%|▏ | 5000/359152 [01:00<56:04, 105.27 examples/s] 
Tokenizing Prompts (num_proc=64): 2%|▏ | 6000/359152 [01:05<47:12, 124.67 examples/s] 
Tokenizing Prompts (num_proc=64): 2%|▏ | 7000/359152 [01:05<32:29, 180.67 examples/s] 
Tokenizing Prompts (num_proc=64): 2%|▏ | 7000/359152 [01:18<32:29, 180.67 examples/s] 
Tokenizing Prompts (num_proc=64): 2%|▏ | 8000/359152 [01:19<48:28, 120.72 examples/s] 
Tokenizing Prompts (num_proc=64): 3%|▎ | 9000/359152 [01:21<37:02, 157.57 examples/s] 
Tokenizing Prompts (num_proc=64): 3%|▎ | 10000/359152 [01:23<29:09, 199.55 examples/s] 
Tokenizing Prompts (num_proc=64): 3%|▎ | 11000/359152 [01:28<28:52, 200.99 examples/s] 
Tokenizing Prompts (num_proc=64): 3%|▎ | 12000/359152 [01:31<23:54, 241.99 examples/s] 
Tokenizing Prompts (num_proc=64): 4%|▎ | 13000/359152 [01:33<20:16, 284.55 examples/s] 
Tokenizing Prompts (num_proc=64): 4%|▍ | 14000/359152 [01:34<16:39, 345.16 examples/s] 
Tokenizing Prompts (num_proc=64): 4%|▍ | 15000/359152 [01:40<22:33, 254.23 examples/s] 
Tokenizing Prompts (num_proc=64): 4%|▍ | 16000/359152 [01:47<27:27, 208.30 examples/s] 
Tokenizing Prompts (num_proc=64): 5%|▍ | 17000/359152 [01:48<19:47, 288.13 examples/s] 
Tokenizing Prompts (num_proc=64): 5%|▌ | 18000/359152 [01:53<23:29, 242.05 examples/s] 
Tokenizing Prompts (num_proc=64): 5%|▌ | 19000/359152 [01:55<19:21, 292.78 examples/s] 
Tokenizing Prompts (num_proc=64): 6%|▌ | 20000/359152 [01:56<15:35, 362.34 examples/s] 
Tokenizing Prompts (num_proc=64): 6%|▌ | 21000/359152 [02:04<23:35, 238.86 examples/s] 
Tokenizing Prompts (num_proc=64): 6%|▌ | 22000/359152 [02:09<25:44, 218.26 examples/s] 
Tokenizing Prompts (num_proc=64): 6%|▋ | 23000/359152 [02:09<18:16, 306.48 examples/s] 
Tokenizing Prompts (num_proc=64): 7%|▋ | 24000/359152 [02:12<17:13, 324.15 examples/s] 
Tokenizing Prompts (num_proc=64): 7%|▋ | 25000/359152 [02:13<13:42, 406.14 examples/s] 
Tokenizing Prompts (num_proc=64): 7%|▋ | 26000/359152 [02:19<19:45, 281.10 examples/s] 
Tokenizing Prompts (num_proc=64): 8%|▊ | 27000/359152 [02:22<18:37, 297.29 examples/s] 
Tokenizing Prompts (num_proc=64): 8%|▊ | 28000/359152 [02:22<13:16, 415.80 examples/s] 
Tokenizing Prompts (num_proc=64): 8%|▊ | 29000/359152 [02:33<27:09, 202.61 examples/s] 
Tokenizing Prompts (num_proc=64): 8%|▊ | 30000/359152 [02:34<20:26, 268.45 examples/s] 
Tokenizing Prompts (num_proc=64): 9%|▊ | 31000/359152 [02:38<20:27, 267.38 examples/s] 
Tokenizing Prompts (num_proc=64): 9%|▉ | 32000/359152 [02:43<22:04, 247.02 examples/s] 
Tokenizing Prompts (num_proc=64): 9%|▉ | 34000/359152 [02:43<12:42, 426.24 examples/s] 
Tokenizing Prompts (num_proc=64): 10%|▉ | 35000/359152 [02:44<11:02, 489.44 examples/s] 
Tokenizing Prompts (num_proc=64): 10%|█ | 36000/359152 [02:46<09:57, 540.70 examples/s] 
Tokenizing Prompts (num_proc=64): 10%|█ | 37000/359152 [02:55<21:27, 250.12 examples/s] 
Tokenizing Prompts (num_proc=64): 11%|█ | 38000/359152 [02:56<15:48, 338.73 examples/s] 
Tokenizing Prompts (num_proc=64): 11%|█ | 39000/359152 [03:01<18:50, 283.07 examples/s] 
Tokenizing Prompts (num_proc=64): 11%|█ | 40000/359152 [03:01<14:20, 370.79 examples/s] 
Tokenizing Prompts (num_proc=64): 11%|█▏ | 41000/359152 [03:06<16:54, 313.76 examples/s] 
Tokenizing Prompts (num_proc=64): 12%|█▏ | 42000/359152 [03:07<13:05, 403.98 examples/s] 
Tokenizing Prompts (num_proc=64): 12%|█▏ | 43000/359152 [03:08<11:14, 468.90 examples/s] 
Tokenizing Prompts (num_proc=64): 12%|█▏ | 44000/359152 [03:12<15:00, 350.15 examples/s] 
Tokenizing Prompts (num_proc=64): 13%|█▎ | 45000/359152 [03:19<20:45, 252.16 examples/s] 
Tokenizing Prompts (num_proc=64): 13%|█▎ | 46000/359152 [03:29<29:25, 177.40 examples/s] 
Tokenizing Prompts (num_proc=64): 13%|█▎ | 47000/359152 [03:29<21:30, 241.89 examples/s] 
Tokenizing Prompts (num_proc=64): 13%|█▎ | 48000/359152 [03:32<18:50, 275.25 examples/s] 
Tokenizing Prompts (num_proc=64): 14%|█▎ | 48612/359152 [03:32<15:41, 329.74 examples/s] 
Tokenizing Prompts (num_proc=64): 14%|█▍ | 49612/359152 [03:33<12:24, 415.68 examples/s] 
Tokenizing Prompts (num_proc=64): 14%|█▍ | 50612/359152 [03:33<08:42, 590.38 examples/s] 
Tokenizing Prompts (num_proc=64): 14%|█▍ | 51612/359152 [03:35<07:56, 645.03 examples/s] 
Tokenizing Prompts (num_proc=64): 15%|█▍ | 52612/359152 [03:35<06:31, 782.29 examples/s] 
Tokenizing Prompts (num_proc=64): 15%|█▍ | 53612/359152 [03:37<07:49, 651.09 examples/s] 
Tokenizing Prompts (num_proc=64): 15%|█▌ | 54612/359152 [03:38<06:19, 802.56 examples/s] 
Tokenizing Prompts (num_proc=64): 15%|█▌ | 55612/359152 [03:39<05:26, 930.19 examples/s] 
Tokenizing Prompts (num_proc=64): 16%|█▌ | 56612/359152 [03:41<07:51, 641.77 examples/s] 
Tokenizing Prompts (num_proc=64): 16%|█▌ | 57612/359152 [03:43<07:49, 641.81 examples/s] 
Tokenizing Prompts (num_proc=64): 16%|█▋ | 58612/359152 [03:46<09:28, 528.46 examples/s] 
Tokenizing Prompts (num_proc=64): 17%|█▋ | 59612/359152 [03:46<07:33, 660.57 examples/s] 
Tokenizing Prompts (num_proc=64): 17%|█▋ | 60612/359152 [03:49<08:47, 565.46 examples/s] 
Tokenizing Prompts (num_proc=64): 17%|█▋ | 61612/359152 [03:51<09:36, 515.99 examples/s] 
Tokenizing Prompts (num_proc=64): 17%|█▋ | 62612/359152 [03:53<09:58, 495.73 examples/s] 
Tokenizing Prompts (num_proc=64): 18%|█▊ | 63612/359152 [03:54<08:55, 551.73 examples/s] 
Tokenizing Prompts (num_proc=64): 18%|█▊ | 64224/359152 [03:56<10:08, 485.06 examples/s] 
Tokenizing Prompts (num_proc=64): 18%|█▊ | 65224/359152 [04:02<15:45, 310.77 examples/s] 
Tokenizing Prompts (num_proc=64): 18%|█▊ | 66224/359152 [04:03<12:15, 398.43 examples/s] 
Tokenizing Prompts (num_proc=64): 19%|█▊ | 67224/359152 [04:04<10:31, 462.45 examples/s] 
Tokenizing Prompts (num_proc=64): 19%|█▉ | 68224/359152 [04:06<09:55, 488.50 examples/s] 
Tokenizing Prompts (num_proc=64): 19%|█▉ | 69224/359152 [04:11<14:32, 332.26 examples/s] 
Tokenizing Prompts (num_proc=64): 20%|█▉ | 70224/359152 [04:17<18:49, 255.91 examples/s] 
Tokenizing Prompts (num_proc=64): 20%|█▉ | 71224/359152 [04:25<24:05, 199.16 examples/s] 
Tokenizing Prompts (num_proc=64): 20%|██ | 72224/359152 [04:25<17:09, 278.59 examples/s] 
Tokenizing Prompts (num_proc=64): 20%|██ | 73224/359152 [04:31<20:56, 227.55 examples/s] 
Tokenizing Prompts (num_proc=64): 21%|██ | 74224/359152 [04:32<15:07, 313.99 examples/s] 
Tokenizing Prompts (num_proc=64): 21%|██ | 75224/359152 [04:32<11:35, 408.16 examples/s] 
Tokenizing Prompts (num_proc=64): 21%|██ | 76224/359152 [04:36<12:44, 369.93 examples/s] 
Tokenizing Prompts (num_proc=64): 22%|██▏ | 77224/359152 [04:41<16:07, 291.35 examples/s] 
Tokenizing Prompts (num_proc=64): 22%|██▏ | 78224/359152 [04:43<14:17, 327.43 examples/s] 
Tokenizing Prompts (num_proc=64): 22%|██▏ | 80224/359152 [04:44<09:07, 509.11 examples/s] 
Tokenizing Prompts (num_proc=64): 23%|██▎ | 81224/359152 [04:50<13:12, 350.77 examples/s] 
Tokenizing Prompts (num_proc=64): 23%|██▎ | 82224/359152 [04:52<11:45, 392.79 examples/s] 
Tokenizing Prompts (num_proc=64): 23%|██▎ | 83224/359152 [04:55<13:05, 351.07 examples/s] 
Tokenizing Prompts (num_proc=64): 23%|██▎ | 84224/359152 [04:59<14:39, 312.68 examples/s] 
Tokenizing Prompts (num_proc=64): 24%|██▎ | 85224/359152 [05:01<12:21, 369.30 examples/s] 
Tokenizing Prompts (num_proc=64): 24%|██▍ | 86224/359152 [05:06<15:58, 284.84 examples/s] 
Tokenizing Prompts (num_proc=64): 24%|██▍ | 87224/359152 [05:07<12:09, 372.99 examples/s] 
Tokenizing Prompts (num_proc=64): 25%|██▍ | 88224/359152 [05:09<10:49, 417.08 examples/s] 
Tokenizing Prompts (num_proc=64): 25%|██▍ | 89224/359152 [05:13<13:12, 340.80 examples/s] 
Tokenizing Prompts (num_proc=64): 25%|██▌ | 90224/359152 [05:13<09:52, 453.60 examples/s] 
Tokenizing Prompts (num_proc=64): 25%|██▌ | 91224/359152 [05:19<14:28, 308.40 examples/s] 
Tokenizing Prompts (num_proc=64): 26%|██▌ | 92224/359152 [05:20<11:21, 391.89 examples/s] 
Tokenizing Prompts (num_proc=64): 26%|██▌ | 93224/359152 [05:21<08:31, 519.79 examples/s] 
Tokenizing Prompts (num_proc=64): 26%|██▌ | 93836/359152 [05:21<07:10, 616.80 examples/s] 
Tokenizing Prompts (num_proc=64): 26%|██▋ | 94836/359152 [05:21<05:24, 813.60 examples/s] 
Tokenizing Prompts (num_proc=64): 27%|██▋ | 95836/359152 [05:22<04:29, 978.37 examples/s] 
Tokenizing Prompts (num_proc=64): 27%|██▋ | 96836/359152 [05:26<08:24, 520.27 examples/s] 
Tokenizing Prompts (num_proc=64): 27%|██▋ | 97836/359152 [05:29<10:09, 429.01 examples/s] 
Tokenizing Prompts (num_proc=64): 28%|██▊ | 98836/359152 [05:31<10:02, 431.85 examples/s] 
Tokenizing Prompts (num_proc=64): 28%|██▊ | 99836/359152 [05:33<08:45, 493.80 examples/s] 
Tokenizing Prompts (num_proc=64): 28%|██▊ | 100448/359152 [05:37<12:52, 334.85 examples/s] 
Tokenizing Prompts (num_proc=64): 28%|██▊ | 101448/359152 [05:38<10:04, 426.43 examples/s] 
Tokenizing Prompts (num_proc=64): 29%|██▊ | 102448/359152 [05:38<07:00, 610.22 examples/s] 
Tokenizing Prompts (num_proc=64): 29%|██▉ | 103448/359152 [05:39<06:03, 703.24 examples/s] 
Tokenizing Prompts (num_proc=64): 29%|██▉ | 104448/359152 [05:45<12:47, 331.94 examples/s] 
Tokenizing Prompts (num_proc=64): 29%|██▉ | 105448/359152 [05:52<17:44, 238.28 examples/s] 
Tokenizing Prompts (num_proc=64): 30%|██▉ | 106448/359152 [06:00<22:15, 189.29 examples/s] 
Tokenizing Prompts (num_proc=64): 30%|██▉ | 107448/359152 [06:01<16:27, 254.93 examples/s] 
Tokenizing Prompts (num_proc=64): 30%|███ | 108448/359152 [06:02<12:58, 322.24 examples/s] 
Tokenizing Prompts (num_proc=64): 30%|███ | 109448/359152 [06:06<13:58, 297.83 examples/s] 
Tokenizing Prompts (num_proc=64): 31%|███ | 110448/359152 [06:09<13:07, 315.96 examples/s] 
Tokenizing Prompts (num_proc=64): 31%|███ | 111448/359152 [06:10<10:24, 396.65 examples/s] 
Tokenizing Prompts (num_proc=64): 31%|███ | 112060/359152 [06:10<09:05, 452.80 examples/s] 
Tokenizing Prompts (num_proc=64): 31%|███▏ | 113060/359152 [06:10<06:21, 644.61 examples/s] 
Tokenizing Prompts (num_proc=64): 32%|███▏ | 114060/359152 [06:20<16:55, 241.40 examples/s] 
Tokenizing Prompts (num_proc=64): 32%|███▏ | 114672/359152 [06:22<15:14, 267.31 examples/s] 
Tokenizing Prompts (num_proc=64): 32%|███▏ | 115672/359152 [06:29<20:16, 200.10 examples/s] 
Tokenizing Prompts (num_proc=64): 32%|███▏ | 116672/359152 [06:30<14:32, 277.95 examples/s] 
Tokenizing Prompts (num_proc=64): 33%|███▎ | 117672/359152 [06:32<13:00, 309.47 examples/s] 
Tokenizing Prompts (num_proc=64): 33%|███▎ | 118672/359152 [06:38<15:35, 257.08 examples/s] 
Tokenizing Prompts (num_proc=64): 33%|███▎ | 119672/359152 [06:39<12:46, 312.34 examples/s] 
Tokenizing Prompts (num_proc=64): 34%|███▎ | 120672/359152 [06:42<12:20, 322.25 examples/s] 
Tokenizing Prompts (num_proc=64): 34%|███▍ | 121672/359152 [06:52<20:33, 192.54 examples/s] 
Tokenizing Prompts (num_proc=64): 34%|███▍ | 122672/359152 [06:53<15:41, 251.07 examples/s] 
Tokenizing Prompts (num_proc=64): 34%|███▍ | 123672/359152 [06:55<12:40, 309.46 examples/s] 
Tokenizing Prompts (num_proc=64): 35%|███▍ | 124672/359152 [06:55<09:12, 424.12 examples/s] 
Tokenizing Prompts (num_proc=64): 35%|███▍ | 125672/359152 [06:59<11:16, 345.28 examples/s] 
Tokenizing Prompts (num_proc=64): 35%|███▌ | 126672/359152 [07:01<09:47, 396.03 examples/s] 
Tokenizing Prompts (num_proc=64): 36%|███▌ | 127672/359152 [07:01<07:06, 543.29 examples/s] 
Tokenizing Prompts (num_proc=64): 36%|███▌ | 128672/359152 [07:07<12:17, 312.62 examples/s] 
Tokenizing Prompts (num_proc=64): 36%|███▌ | 129672/359152 [07:09<10:21, 369.33 examples/s] 
Tokenizing Prompts (num_proc=64): 37%|███▋ | 131672/359152 [07:13<08:36, 440.76 examples/s] 
Tokenizing Prompts (num_proc=64): 37%|███▋ | 132284/359152 [07:23<17:44, 213.21 examples/s] 
Tokenizing Prompts (num_proc=64): 37%|███▋ | 133284/359152 [07:24<14:21, 262.15 examples/s] 
Tokenizing Prompts (num_proc=64): 37%|███▋ | 134284/359152 [07:26<12:04, 310.47 examples/s] 
Tokenizing Prompts (num_proc=64): 38%|███▊ | 135284/359152 [07:29<12:10, 306.59 examples/s] 
Tokenizing Prompts (num_proc=64): 38%|███▊ | 136284/359152 [07:31<09:50, 377.28 examples/s] 
Tokenizing Prompts (num_proc=64): 38%|███▊ | 136896/359152 [07:31<08:46, 422.16 examples/s] 
Tokenizing Prompts (num_proc=64): 38%|███▊ | 137896/359152 [07:33<07:57, 463.43 examples/s] 
Tokenizing Prompts (num_proc=64): 39%|███▊ | 138896/359152 [07:35<07:29, 489.76 examples/s] 
Tokenizing Prompts (num_proc=64): 39%|███▉ | 139896/359152 [07:37<07:33, 483.59 examples/s] 
Tokenizing Prompts (num_proc=64): 39%|███▉ | 140896/359152 [07:41<10:05, 360.59 examples/s] 
Tokenizing Prompts (num_proc=64): 39%|███▉ | 141508/359152 [07:42<09:18, 389.45 examples/s] 
Tokenizing Prompts (num_proc=64): 40%|███▉ | 142508/359152 [07:43<06:51, 526.57 examples/s] 
Tokenizing Prompts (num_proc=64): 40%|███▉ | 143508/359152 [07:52<15:02, 238.97 examples/s] 
Tokenizing Prompts (num_proc=64): 40%|████ | 144508/359152 [07:55<13:49, 258.87 examples/s] 
Tokenizing Prompts (num_proc=64): 41%|████ | 145508/359152 [07:57<11:32, 308.73 examples/s] 
Tokenizing Prompts (num_proc=64): 41%|████ | 146508/359152 [08:02<13:42, 258.54 examples/s] 
Tokenizing Prompts (num_proc=64): 41%|████ | 147508/359152 [08:03<10:28, 336.87 examples/s] 
Tokenizing Prompts (num_proc=64): 41%|████▏ | 148508/359152 [08:17<21:24, 163.98 examples/s] 
Tokenizing Prompts (num_proc=64): 42%|████▏ | 149508/359152 [08:17<15:23, 227.00 examples/s] 
Tokenizing Prompts (num_proc=64): 42%|████▏ | 150508/359152 [08:20<13:56, 249.40 examples/s] 
Tokenizing Prompts (num_proc=64): 42%|████▏ | 151508/359152 [08:25<14:17, 242.11 examples/s] 
Tokenizing Prompts (num_proc=64): 42%|████▏ | 152508/359152 [08:33<18:45, 183.58 examples/s] 
Tokenizing Prompts (num_proc=64): 43%|████▎ | 153508/359152 [08:39<19:38, 174.46 examples/s] 
Tokenizing Prompts (num_proc=64): 43%|████▎ | 154508/359152 [08:43<17:01, 200.34 examples/s] 
Tokenizing Prompts (num_proc=64): 43%|████▎ | 155508/359152 [08:52<21:38, 156.86 examples/s] 
Tokenizing Prompts (num_proc=64): 43%|████▎ | 156120/359152 [08:54<19:16, 175.62 examples/s] 
Tokenizing Prompts (num_proc=64): 44%|████▎ | 157120/359152 [08:57<15:42, 214.44 examples/s] 
Tokenizing Prompts (num_proc=64): 44%|████▍ | 158120/359152 [09:01<15:00, 223.26 examples/s] 
Tokenizing Prompts (num_proc=64): 44%|████▍ | 159120/359152 [09:04<13:03, 255.39 examples/s] 
Tokenizing Prompts (num_proc=64): 45%|████▍ | 160120/359152 [09:05<10:07, 327.60 examples/s] 
Tokenizing Prompts (num_proc=64): 45%|████▍ | 161120/359152 [09:05<07:39, 430.73 examples/s] 
Tokenizing Prompts (num_proc=64): 45%|████▌ | 162120/359152 [09:12<12:11, 269.43 examples/s] 
Tokenizing Prompts (num_proc=64): 45%|████▌ | 163120/359152 [09:13<09:25, 346.82 examples/s] 
Tokenizing Prompts (num_proc=64): 46%|████▌ | 164120/359152 [09:15<08:18, 391.61 examples/s] 
Tokenizing Prompts (num_proc=64): 46%|████▌ | 165120/359152 [09:16<06:26, 501.48 examples/s] 
Tokenizing Prompts (num_proc=64): 46%|████▋ | 166120/359152 [09:18<06:52, 468.13 examples/s] 
Tokenizing Prompts (num_proc=64): 47%|████▋ | 167120/359152 [09:20<06:33, 487.53 examples/s] 
Tokenizing Prompts (num_proc=64): 47%|████▋ | 168120/359152 [09:22<06:36, 481.24 examples/s] 
Tokenizing Prompts (num_proc=64): 47%|████▋ | 168732/359152 [09:24<07:02, 450.67 examples/s] 
Tokenizing Prompts (num_proc=64): 47%|████▋ | 169732/359152 [09:30<10:39, 296.00 examples/s] 
Tokenizing Prompts (num_proc=64): 48%|████▊ | 170732/359152 [09:30<07:41, 407.87 examples/s] 
Tokenizing Prompts (num_proc=64): 48%|████▊ | 171732/359152 [09:33<07:40, 407.02 examples/s] 
Tokenizing Prompts (num_proc=64): 48%|████▊ | 172344/359152 [09:33<07:02, 441.78 examples/s] 
Tokenizing Prompts (num_proc=64): 48%|████▊ | 173344/359152 [09:37<08:29, 364.80 examples/s] 
Tokenizing Prompts (num_proc=64): 49%|████▊ | 174344/359152 [09:39<07:52, 391.30 examples/s] 
Tokenizing Prompts (num_proc=64): 49%|████▉ | 175344/359152 [09:48<13:28, 227.21 examples/s] 
Tokenizing Prompts (num_proc=64): 49%|████▉ | 176344/359152 [09:50<11:45, 259.29 examples/s] 
Tokenizing Prompts (num_proc=64): 49%|████▉ | 177344/359152 [09:54<11:25, 265.06 examples/s] 
Tokenizing Prompts (num_proc=64): 50%|████▉ | 178344/359152 [10:05<18:27, 163.28 examples/s] 
Tokenizing Prompts (num_proc=64): 50%|████▉ | 178956/359152 [10:07<15:58, 187.97 examples/s] 
Tokenizing Prompts (num_proc=64): 50%|████▉ | 178956/359152 [10:18<15:58, 187.97 examples/s] 
Tokenizing Prompts (num_proc=64): 50%|████▉ | 179568/359152 [10:19<25:34, 117.03 examples/s] 
Tokenizing Prompts (num_proc=64): 50%|█████ | 180568/359152 [10:19<17:17, 172.07 examples/s] 
Tokenizing Prompts (num_proc=64): 50%|█████ | 181180/359152 [10:23<16:44, 177.26 examples/s] 
Tokenizing Prompts (num_proc=64): 51%|█████ | 182180/359152 [10:25<13:33, 217.58 examples/s] 
Tokenizing Prompts (num_proc=64): 51%|█████ | 183180/359152 [10:30<13:24, 218.83 examples/s] 
Tokenizing Prompts (num_proc=64): 51%|█████▏ | 184180/359152 [10:31<09:47, 297.82 examples/s] 
Tokenizing Prompts (num_proc=64): 52%|█████▏ | 185180/359152 [10:31<06:50, 423.48 examples/s] 
Tokenizing Prompts (num_proc=64): 52%|█████▏ | 186180/359152 [10:31<05:20, 540.22 examples/s] 
Tokenizing Prompts (num_proc=64): 52%|█████▏ | 187180/359152 [10:32<04:33, 628.00 examples/s] 
Tokenizing Prompts (num_proc=64): 52%|█████▏ | 188180/359152 [10:37<07:26, 382.68 examples/s] 
Tokenizing Prompts (num_proc=64): 53%|█████▎ | 189180/359152 [10:48<13:58, 202.74 examples/s] 
Tokenizing Prompts (num_proc=64): 53%|█████▎ | 189792/359152 [10:50<13:36, 207.52 examples/s] 
Tokenizing Prompts (num_proc=64): 53%|█████▎ | 190792/359152 [10:56<14:27, 194.12 examples/s] 
Tokenizing Prompts (num_proc=64): 53%|█████▎ | 191792/359152 [11:02<14:49, 188.24 examples/s] 
Tokenizing Prompts (num_proc=64): 54%|█████▎ | 192792/359152 [11:05<12:59, 213.37 examples/s] 
Tokenizing Prompts (num_proc=64): 54%|█████▍ | 193792/359152 [11:08<11:23, 242.02 examples/s] 
Tokenizing Prompts (num_proc=64): 54%|█████▍ | 194792/359152 [11:17<15:33, 176.00 examples/s] 
Tokenizing Prompts (num_proc=64): 55%|█████▍ | 195792/359152 [11:23<15:16, 178.32 examples/s] 
Tokenizing Prompts (num_proc=64): 55%|█████▍ | 196792/359152 [11:30<16:58, 159.41 examples/s] 
Tokenizing Prompts (num_proc=64): 55%|█████▌ | 197792/359152 [11:33<14:12, 189.34 examples/s] 
Tokenizing Prompts (num_proc=64): 56%|█████▌ | 199792/359152 [11:40<11:41, 227.32 examples/s] 
Tokenizing Prompts (num_proc=64): 56%|█████▌ | 200792/359152 [11:41<09:27, 279.29 examples/s] 
Tokenizing Prompts (num_proc=64): 56%|█████▌ | 201792/359152 [11:58<09:23, 279.29 examples/s] 
Tokenizing Prompts (num_proc=64): 56%|█████▋ | 202792/359152 [12:01<16:00, 162.86 examples/s] 
Tokenizing Prompts (num_proc=64): 57%|█████▋ | 203792/359152 [12:10<17:38, 146.75 examples/s] 
Tokenizing Prompts (num_proc=64): 57%|█████▋ | 205792/359152 [12:13<12:18, 207.77 examples/s] 
Tokenizing Prompts (num_proc=64): 58%|█████▊ | 206792/359152 [12:14<09:54, 256.18 examples/s] 
Tokenizing Prompts (num_proc=64): 58%|█████▊ | 207792/359152 [12:15<07:47, 323.73 examples/s] 
Tokenizing Prompts (num_proc=64): 58%|█████▊ | 208792/359152 [12:22<10:28, 239.11 examples/s] 
Tokenizing Prompts (num_proc=64): 58%|█████▊ | 209792/359152 [12:24<09:18, 267.58 examples/s] 
Tokenizing Prompts (num_proc=64): 59%|█████▊ | 210404/359152 [12:26<08:56, 277.43 examples/s] 
Tokenizing Prompts (num_proc=64): 59%|█████▊ | 210404/359152 [12:38<08:56, 277.43 examples/s] 
Tokenizing Prompts (num_proc=64): 59%|█████▉ | 211404/359152 [12:42<18:16, 134.75 examples/s] 
Tokenizing Prompts (num_proc=64): 59%|█████▉ | 212404/359152 [12:50<18:02, 135.59 examples/s] 
Tokenizing Prompts (num_proc=64): 59%|█████▉ | 213404/359152 [13:00<19:49, 122.55 examples/s] 
Tokenizing Prompts (num_proc=64): 60%|█████▉ | 214404/359152 [13:01<14:57, 161.30 examples/s] 
Tokenizing Prompts (num_proc=64): 60%|█████▉ | 215016/359152 [13:06<15:16, 157.32 examples/s] 
Tokenizing Prompts (num_proc=64): 60%|██████ | 215628/359152 [13:07<13:00, 183.86 examples/s] 
Tokenizing Prompts (num_proc=64): 60%|██████ | 216628/359152 [13:08<09:36, 247.41 examples/s] 
Tokenizing Prompts (num_proc=64): 60%|██████ | 217240/359152 [13:16<14:31, 162.85 examples/s] 
Tokenizing Prompts (num_proc=64): 61%|██████ | 218240/359152 [13:18<10:18, 227.92 examples/s] 
Tokenizing Prompts (num_proc=64): 61%|██████ | 219240/359152 [13:18<06:58, 334.34 examples/s] 
Tokenizing Prompts (num_proc=64): 61%|██████▏ | 220240/359152 [13:24<09:13, 250.76 examples/s] 
Tokenizing Prompts (num_proc=64): 62%|██████▏ | 221240/359152 [13:30<10:45, 213.65 examples/s] 
Tokenizing Prompts (num_proc=64): 62%|██████▏ | 221852/359152 [13:30<08:28, 270.11 examples/s] 
Tokenizing Prompts (num_proc=64): 62%|██████▏ | 222852/359152 [13:33<07:42, 294.57 examples/s] 
Tokenizing Prompts (num_proc=64): 62%|██████▏ | 223464/359152 [13:35<07:47, 289.99 examples/s] 
Tokenizing Prompts (num_proc=64): 62%|██████▏ | 224464/359152 [13:41<09:57, 225.24 examples/s] 
Tokenizing Prompts (num_proc=64): 63%|██████▎ | 225464/359152 [13:42<06:54, 322.63 examples/s] 
Tokenizing Prompts (num_proc=64): 63%|██████▎ | 226464/359152 [13:55<13:44, 160.89 examples/s] 
Tokenizing Prompts (num_proc=64): 63%|██████▎ | 227464/359152 [13:55<09:32, 230.05 examples/s] 
Tokenizing Prompts (num_proc=64): 64%|██████▎ | 228464/359152 [13:59<09:24, 231.36 examples/s] 
Tokenizing Prompts (num_proc=64): 64%|██████▍ | 229464/359152 [14:01<07:49, 275.94 examples/s] 
Tokenizing Prompts (num_proc=64): 64%|██████▍ | 230464/359152 [14:05<07:47, 275.44 examples/s] 
Tokenizing Prompts (num_proc=64): 64%|██████▍ | 231464/359152 [14:09<07:48, 272.31 examples/s] 
Tokenizing Prompts (num_proc=64): 65%|██████▍ | 232464/359152 [14:14<08:53, 237.25 examples/s] 
Tokenizing Prompts (num_proc=64): 65%|██████▌ | 233464/359152 [14:16<07:35, 276.20 examples/s] 
Tokenizing Prompts (num_proc=64): 65%|██████▌ | 234464/359152 [14:23<09:31, 218.11 examples/s] 
Tokenizing Prompts (num_proc=64): 66%|██████▌ | 235464/359152 [14:24<07:13, 285.45 examples/s] 
Tokenizing Prompts (num_proc=64): 66%|██████▌ | 235464/359152 [14:38<07:13, 285.45 examples/s] 
Tokenizing Prompts (num_proc=64): 66%|██████▌ | 236464/359152 [14:38<13:45, 148.67 examples/s] 
Tokenizing Prompts (num_proc=64): 66%|██████▌ | 237464/359152 [14:39<10:09, 199.60 examples/s] 
Tokenizing Prompts (num_proc=64): 66%|██████▋ | 238076/359152 [14:44<10:56, 184.56 examples/s] 
Tokenizing Prompts (num_proc=64): 67%|██████▋ | 239076/359152 [14:47<09:38, 207.52 examples/s] 
Tokenizing Prompts (num_proc=64): 67%|██████▋ | 240076/359152 [14:54<10:58, 180.85 examples/s] 
Tokenizing Prompts (num_proc=64): 67%|██████▋ | 241076/359152 [14:59<10:33, 186.40 examples/s] 
Tokenizing Prompts (num_proc=64): 67%|██████▋ | 242076/359152 [15:00<07:20, 265.71 examples/s] 
Tokenizing Prompts (num_proc=64): 68%|██████▊ | 243076/359152 [15:05<08:29, 228.01 examples/s] 
Tokenizing Prompts (num_proc=64): 68%|██████▊ | 244076/359152 [15:11<08:56, 214.65 examples/s] 
Tokenizing Prompts (num_proc=64): 68%|██████▊ | 245076/359152 [15:13<07:43, 245.87 examples/s] 
Tokenizing Prompts (num_proc=64): 69%|██████▊ | 246076/359152 [15:21<09:45, 193.22 examples/s] 
Tokenizing Prompts (num_proc=64): 69%|██████▉ | 247076/359152 [15:22<07:05, 263.30 examples/s] 
Tokenizing Prompts (num_proc=64): 69%|██████▉ | 248076/359152 [15:29<09:14, 200.16 examples/s] 
Tokenizing Prompts (num_proc=64): 69%|██████▉ | 249076/359152 [15:47<16:04, 114.18 examples/s] 
Tokenizing Prompts (num_proc=64): 70%|██████▉ | 249688/359152 [15:51<15:14, 119.64 examples/s] 
Tokenizing Prompts (num_proc=64): 70%|██████▉ | 250688/359152 [15:56<13:01, 138.87 examples/s] 
Tokenizing Prompts (num_proc=64): 70%|██████▉ | 251300/359152 [16:01<13:18, 135.05 examples/s] 
Tokenizing Prompts (num_proc=64): 70%|███████ | 252300/359152 [16:08<13:07, 135.64 examples/s] 
Tokenizing Prompts (num_proc=64): 71%|███████ | 253300/359152 [16:21<16:05, 109.63 examples/s] 
Tokenizing Prompts (num_proc=64): 71%|███████ | 253912/359152 [16:37<22:41, 77.30 examples/s] 
Tokenizing Prompts (num_proc=64): 71%|███████ | 254912/359152 [16:39<15:44, 110.34 examples/s] 
Tokenizing Prompts (num_proc=64): 71%|███████▏ | 255912/359152 [16:46<14:46, 116.42 examples/s] 
Tokenizing Prompts (num_proc=64): 72%|███████▏ | 256912/359152 [16:50<11:49, 144.11 examples/s] 
Tokenizing Prompts (num_proc=64): 72%|███████▏ | 257912/359152 [16:56<11:14, 149.99 examples/s] 
Tokenizing Prompts (num_proc=64): 72%|███████▏ | 258912/359152 [17:03<11:40, 143.15 examples/s] 
Tokenizing Prompts (num_proc=64): 72%|███████▏ | 259524/359152 [17:05<10:05, 164.44 examples/s] 
Tokenizing Prompts (num_proc=64): 73%|███████▎ | 260524/359152 [17:13<10:48, 152.09 examples/s] 
Tokenizing Prompts (num_proc=64): 73%|███████▎ | 261524/359152 [17:22<11:56, 136.20 examples/s] 
Tokenizing Prompts (num_proc=64): 73%|███████▎ | 262524/359152 [17:25<10:03, 160.13 examples/s] 
Tokenizing Prompts (num_proc=64): 73%|███████▎ | 263524/359152 [17:32<10:11, 156.50 examples/s] 
Tokenizing Prompts (num_proc=64): 74%|███████▎ | 264524/359152 [17:36<08:52, 177.69 examples/s]

Steps to reproduce

  1. you need a dataset of roughly 350k samples averaging ~20k tokens each (see the sketch after this list)
  2. put it into an axolotl config
  3. axolotl train...
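
For step 1, a dummy dataset along those lines could be generated like this. A minimal sketch: the file name, filler text, and exact counts are assumptions, shaped to match the conversation field in the config below:

import json

# Synthetic chat dataset: ~350k samples averaging ~20k "tokens" each
# (filler words as a crude stand-in). Shrink the counts for a quick test;
# at full size this file is tens of GB.
with open("dummy_dataset.jsonl", "w") as f:
    for i in range(350_000):
        filler = " ".join(["lorem"] * 20_000)
        record = {
            "conversation": [
                {"role": "user", "content": f"question {i}"},
                {"role": "assistant", "content": filler},
            ]
        }
        f.write(json.dumps(record) + "\n")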

Config yaml

base_model: mistralai/Mistral-Nemo-Base-2407
tokenizer_type: AutoTokenizer
strict: false

plugins:
  - axolotl.integrations.liger.LigerPlugin

liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: your_dataset
    type: chat_template
    field_messages: conversation
    message_property_mappings:
      role: role
      content: content
    roles:
      system:
        - system
      user:
        - user
      assistant:
        - assistant
    chat_template: chatml

dataset_processes: 64
dataset_prepared_path: last_run_prepared
val_set_size: 0
output_dir: ./outputs/out

sequence_len: 65536
sample_packing: true
sample_packing_sequentially: true
pad_to_sequence_len: true
curriculum_sampling: true

wandb_project: 
wandb_entity: 
wandb_watch:
wandb_name:
wandb_log_model:

sequence_parallel_degree: 1
gradient_accumulation_steps: 8
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: offload_disk
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
evals_per_epoch: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: zero3.json
weight_decay: 0.1
fsdp:
fsdp_config:

special_tokens:
  eos_token: <|im_end|>
  pad_token: <|im_end|>

Possible solution

No response

Which Operating Systems are you using?

  • [x] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.11

axolotl branch-commit

release 0.9.2

Acknowledgements

  • [x] My issue title is concise, descriptive, and in title casing.
  • [x] I have searched the existing issues to make sure this bug has not been reported yet.
  • [x] I am using the latest version of axolotl.
  • [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.

casper-hansen avatar May 16 '25 14:05 casper-hansen

Do you have an example of a public dataset that we can repro this on?

djsaunde avatar May 16 '25 16:05 djsaunde

Do you have an example of a public dataset that we can repro this on?

Unfortunately I don't

casper-hansen avatar May 16 '25 18:05 casper-hansen

Launching preprocessing in distributed mode is the main problem. You can probably create a dummy dataset of 1 million samples with 64k tokens each and try, but I cannot for the life of me avoid the timeout when using with zero_first. Is it possible to remove this in the preprocessing and achieve the same effect in another way?

@retry_on_request_exceptions(max_retries=3, delay=5)
def prepare_dataset(cfg, tokenizer, processor=None, preprocess_iterable=None):
    prompters = []
    if not cfg.pretraining_dataset:
        with zero_first(is_local_main_process()):
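
For context, zero_first is roughly the following pattern: non-main ranks block on a barrier while the main process prepares the data first, and that barrier is exactly where the 30-minute timeout fires. A simplified sketch, not axolotl's verbatim implementation:

from contextlib import contextmanager

import torch.distributed as dist

@contextmanager
def zero_first(is_main: bool):
    # Non-main ranks wait here until the main process finishes --
    # this barrier is the timeout site in the traceback above.
    if not is_main:
        dist.barrier()
    yield
    # The main process releases the waiting ranks once its work is done.
    if is_main:
        dist.barrier()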

casper-hansen avatar May 16 '25 19:05 casper-hansen

Maybe the axolotl preprocess CLI should not launch with accelerate? What do you think @winglian?

casper-hansen avatar May 16 '25 19:05 casper-hansen

In the legacy docs, we used python.

[screenshot: legacy docs showing the python preprocessing command]

But digging into the CLI, we don't use accelerate for the preprocess.

[screenshot: axolotl CLI preprocess code]

winglian avatar May 16 '25 19:05 winglian

I would try with CUDA_VISIBLE_DEVICES="" axolotl preprocess config.yaml

winglian avatar May 16 '25 19:05 winglian

oh wait, are you using axolotl preprocess before axolotl train?

winglian avatar May 16 '25 19:05 winglian

I used axolotl train, triggered the error, then pivoted to axolotl preprocess and hit the same error. I will need to check the commands again, but I'm pretty sure I can use the Python command instead.

casper-hansen avatar May 16 '25 20:05 casper-hansen

This does the trick. Though, I would recommend using something other than with zero_first(is_local_main_process()) in general. This lowers QoL when using axolotl and could be replaced with a simpler FileLock system: https://github.com/casper-hansen/FlashSamplePack/commit/c86cd04cf9e842acf48736ba8286502ff504237a

python -m axolotl.cli.preprocess axolotl_config.yaml
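
For reference, the FileLock approach from the linked commit could look roughly like this. A sketch assuming the filelock package; the function and marker names are illustrative, not axolotl's actual API:

from pathlib import Path

from filelock import FileLock

def prepare_dataset_once(prepared_path: str, prepare_fn):
    marker = Path(prepared_path) / ".prepared"
    with FileLock(str(marker) + ".lock"):
        # The first process to acquire the lock prepares and caches the
        # data; later processes see the marker and just load the cache.
        # No collective barrier is involved, so nothing can hit the
        # c10d timeout.
        if not marker.exists():
            prepare_fn()
            marker.touch()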

casper-hansen avatar May 17 '25 10:05 casper-hansen

@casper-hansen agreed, feel free to make a PR! Or, I'll probably do so later.

djsaunde avatar May 17 '25 15:05 djsaunde

@casper-hansen agreed, feel free to make a PR! Or, I'll probably do so later.

I probably won't be creating the PR, but let's leave this issue open until a solution is in place.

casper-hansen avatar May 17 '25 16:05 casper-hansen

FYI: the current plan is to roll this into my data loading refactor.

djsaunde avatar May 20 '25 13:05 djsaunde