
Pretraining an OLMo model on the SlimPajama dataset

Open aflah02 opened this issue 1 year ago • 16 comments

Hi! I am planning to test pretraining an OLMo 1B model on the SlimPajama dataset. I was trying to follow the TinyLlama tutorial, but one of the steps for preparing the dataset uses the litgpt/data/prepare_slimpajama.py file, which seems to be missing from the repo. Any workarounds for this?

aflah02 avatar Nov 26 '24 12:11 aflah02

CC: @rasbt @Andrei-Aksionov Just putting this on your radar, as this is a continuation of the OLMo PR.

aflah02 avatar Nov 27 '24 14:11 aflah02

Hello @aflah02, good catch! It looks like this file was accidentally deleted in one of the recent PRs: https://github.com/Lightning-AI/litgpt/pull/1821/files#diff-2646bbbf72cb6e84cfc29a226b4446985b6904dc04b6228ef8a69d9fcb4a2951

Could you bring it back in a PR?

Andrei-Aksionov avatar Nov 27 '24 14:11 Andrei-Aksionov

Sure, I'll do that

aflah02 avatar Nov 27 '24 14:11 aflah02

I tried using the code to process the dataset; however, it doesn't seem to work for the train set due to disk-space issues. Is there a way to reduce how much data is moved to / kept in the tmp dir?

Error -

OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk2/example_train_4825.jsonl.zst' -> '/tmp/data/chunk2/example_train_4825.jsonl.zst'
OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk5/example_train_2785.jsonl.zst' -> '/tmp/data/chunk5/example_train_2785.jsonl.zst'
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litdata/processing/data_processor.py", line 167, in _download_data_target
shutil.copyfile(path, local_path)
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 269, in copyfile
_fastcopy_sendfile(fsrc, fdst)
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 158, in _fastcopy_sendfile
raise err from None
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 144, in _fastcopy_sendfile
sent = os.sendfile(outfd, infd, offset, blocksize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk4/example_train_401.jsonl.zst' -> '/tmp/data/chunk4/example_train_401.jsonl.zst'
Progress: 2%|██▎ | 1157/59166 [24:37<20:34:42, 1.28s/it]

aflah02 avatar Nov 29 '24 10:11 aflah02

A simple fix I'm using is to symlink /tmp/data to a directory on my NFS, where I have more storage, and then run the processing again. It seems to be running for now (still in progress).
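Roughly, the workaround amounts to this (a sketch; the NFS scratch path is a placeholder for my setup, and it assumes /tmp/data does not already exist):

```python
import os

# Point /tmp/data at a directory on the NFS mount that has more free space,
# so the temporary copies land on the larger filesystem.
nfs_scratch = "/NS/llm-1/static00/tmp-data"  # placeholder path
os.makedirs(nfs_scratch, exist_ok=True)
os.symlink(nfs_scratch, "/tmp/data", target_is_directory=True)
```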

aflah02 avatar Nov 29 '24 15:11 aflah02

Hey @Andrei-Aksionov @rasbt

I was trying to set up a multi-node run via SLURM and was testing this on 2 nodes with an Ethernet-based interconnect; however, the init fails -

/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.11 /NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_li ...
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
[W1204 13:17:19.038836275 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/16
[W1204 13:18:00.218036449 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 6] Seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/16
[W1204 13:18:00.580590861 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 1] Seed set to 42
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/16
[W1204 13:18:00.633244692 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 7] Seed set to 42
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/16
[W1204 13:18:01.815471680 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 3] Seed set to 42
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/16
[W1204 13:18:01.876939030 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 4] Seed set to 42
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/16
[W1204 13:18:01.918607039 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/16
[rank: 5] Seed set to 42
[W1204 13:18:01.934432434 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 2] Seed set to 42
Traceback (most recent call last):
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
           ^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 149, in setup
    fabric.launch()
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 843, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 928, in _wrap_and_launch
    return launcher.launch(to_run, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/launchers/subprocess_script.py", line 107, in launch
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 932, in _wrap_with_setup
    self._strategy.setup_environment()
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/fsdp.py", line 260, in setup_environment
    self._setup_distributed()
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/fsdp.py", line 671, in _setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 297, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 8/16 clients joined.
[rank4]: Traceback (most recent call last):
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank4]:     sys.exit(main())
[rank4]:              ^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank4]:     CLI(parser_data)
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank4]:     return _run_component(component, init.get(subcommand))
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank4]:     return component(**cfg)
[rank4]:            ^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank4]:     main(
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank4]:     train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank4]:                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank4]:     with fabric.rank_zero_first():
[rank4]:   File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank4]:     return next(self.gen)
[rank4]:            ^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank4]:     with _InfiniteBarrier() as barrier:
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank4]:     self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank4]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank4]:     func_return = func(*args, **kwargs)
[rank4]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank4]:     return _new_group_with_tag(
[rank4]:            ^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank4]:     pg, pg_store = _new_process_group_helper(
[rank4]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank4]:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank4]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: torch.distributed.DistNetworkError: Connection reset by peer
[The same traceback repeats for ranks 6, 2, 7, 1, 5, and 3, each ending in torch.distributed.DistNetworkError: Connection reset by peer.]

I also see this warning -

Warning: Not all GPUs are fully connected via NVLink. Some GPUs are connected via slower interfaces. It is recommended to switch to a different machine with faster GPU connections for optimal multi-GPU training performance.

Here's the config -


# The name of the model to pretrain. Choose from names in ``litgpt.config``. Mutually exclusive with
# ``model_config``. (type: Optional[str], default: null)
model_name: allenai/OLMo-1B-hf

# A ``litgpt.Config`` object to define the model architecture. Mutually exclusive with
# ``model_name``. (type: Optional[Config], default: null)
model_config:

# Directory in which to save checkpoints and logs. If running in a Lightning Studio Job, look for it in
# /teamspace/jobs/<job-name>/share. (type: <class 'Path'>, default: out/pretrain)
out_dir: out/pretrain/slim-olmo-2x8xH100-GBS-192

# The precision to use for pretraining. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-mixed

# Optional path to a checkpoint directory to initialize the model from.
# Useful for continued pretraining. Mutually exclusive with ``resume``. (type: Optional[Path], default: null)
initial_checkpoint_dir:

# Path to a checkpoint directory to resume from in case training was interrupted, or ``True`` to resume
# from the latest checkpoint in ``out_dir``. An error will be raised if no checkpoint is found. Passing
# ``'auto'`` will resume from the latest checkpoint but not error if no checkpoint exists.
# (type: Union[bool, Literal["auto"], Path], default: False)
resume: false

# Data-related arguments. If not provided, the default is ``litgpt.data.TinyLlama``.
data: MicroLlama
# Path - /NS/llm-1/static00/data/slimpajama

# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:

  # Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
  save_interval: 100000

  # Number of iterations between logging calls (type: int, default: 1)
  log_interval: 1

  # Number of samples between optimizer steps across data-parallel ranks (type: int, default: 48)
  # Scale this number according to the number of GPU and memory size per GPU
  # For example, we used 48 for 4 x 24G 4090 
  global_batch_size: 192

  # Number of samples per data-parallel rank (type: int, default: 12)
  # Scale this number according to the memory size per GPU
  # For example, we used 12 for 24G 4090
  micro_batch_size: 12

  # Number of iterations with learning rate warmup active (type: int, default: 2000)
  lr_warmup_steps: 2000

  # Number of epochs to train on (type: Optional[int], default: null)
  epochs:

  # Total number of tokens to train on (type: Optional[int], default: 3000000000000)
  max_tokens: 3000000000000

  # Limits the number of optimizer steps to run. (type: Optional[int], default: null)
  max_steps:

  # Limits the length of samples. Off by default (type: Optional[int], default: null)
  max_seq_length: 2048

  # Whether to tie the embedding weights with the language modeling head weights. (type: Optional[bool], default: False)
  tie_embeddings:

  #   (type: Optional[float], default: 1.0)
  max_norm: 1.0

  #   (type: float, default: 4e-05)
  min_lr: 4.0e-05

# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:

  # Number of optimizer steps between evaluation calls (type: int, default: 1000)
  interval: 1000

  # Number of tokens to generate (type: Optional[int], default: null)
  max_new_tokens:

  # Number of iterations (type: int, default: 100)
  max_iters: 100

  # Whether to evaluate on the validation set at the beginning of the training
  initial_validation: false

# Optimizer-related arguments
optimizer:

  class_path: torch.optim.AdamW
  
  init_args:
    
    #   (type: float, default: 0.001)
    lr: 4e-4
    
    #   (type: float, default: 0.01)
    weight_decay: 0.1
    
    #   (type: tuple, default: (0.9,0.999))
    betas:
      - 0.9
      - 0.95

# How many devices/GPUs to use. Uses all GPUs by default. (type: Union[int, str], default: auto)
devices: auto

# How many nodes to use. (type: int, default: 1)
num_nodes: 2

# Optional path to the tokenizer dir that was used for preprocessing the dataset. Only some data
# module require this. (type: Optional[Path], default: null)
tokenizer_dir: checkpoints/allenai/OLMo-1B-hf

# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: tensorboard)
logger_name: wandb

# The random seed to use for reproducibility. (type: int, default: 42)
seed: 42
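For context, this is how the batch-size knobs above relate under plain data parallelism (generic arithmetic, not a quote of litgpt internals; the world size assumes 2 nodes x 8 GPUs):

```python
# Generic data-parallel arithmetic for the values above; litgpt may compute this
# slightly differently internally, this is just the conceptual relationship.
global_batch_size = 192
micro_batch_size = 12
world_size = 2 * 8  # 2 nodes x 8 GPUs

grad_accum_steps = global_batch_size // (micro_batch_size * world_size)
print(grad_accum_steps)  # 1 -> each rank runs one micro-batch per optimizer step
```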

This is my run command -

sbatch --partition=a100 --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192 --wrap "litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"

The code works when running on a single node.

Any clue what might be going wrong? I am using SLURM, by the way.

aflah02 avatar Dec 04 '24 12:12 aflah02

nvidia-smi on the nodes (before the timeout-based crash):

[nvidia-smi screenshots from both nodes]

So one of the nodes isn't really loading anything onto its GPUs.

aflah02 avatar Dec 04 '24 13:12 aflah02

I just realized that the error message and this tutorial (https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html) imply I should use srun. Running with this now -

sbatch --partition=a100 --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192 --wrap "srun litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"

aflah02 avatar Dec 04 '24 13:12 aflah02

This command works -

sbatch --partition=a100 --nodes=2 --gres=gpu:8 --ntasks-per-node=8 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192 --wrap "srun litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"

But when I look at wandb, I only see logs from one node (even though the loss is aggregated across ranks before backprop, I don't see any device stats for the other node).

aflah02 avatar Dec 04 '24 13:12 aflah02

Hi @Andrei-Aksionov @rasbt, I was trying to figure out the best batch size for pretraining OLMo 1B on A100 machines. I tried a lot of different batch sizes, but everything OOMs except for batch size 12, which is quite surprising: that is the recommended batch size for TinyLlama on a 24 GB 4090, while I am testing on an 80 GB A100. Any ideas what could be going wrong?

Here is my config -


# The name of the model to pretrain. Choose from names in ``litgpt.config``. Mutually exclusive with
# ``model_config``. (type: Optional[str], default: null)
model_name: allenai/OLMo-1B-hf

# A ``litgpt.Config`` object to define the model architecture. Mutually exclusive with
# ``model_name``. (type: Optional[Config], default: null)
model_config:

# Directory in which to save checkpoints and logs. If running in a Lightning Studio Job, look for it in
# /teamspace/jobs/<job-name>/share. (type: <class 'Path'>, default: out/pretrain)
out_dir: out/pretrain/slim-olmo-1x1xA100-GBS-24

# The precision to use for pretraining. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-mixed

# Optional path to a checkpoint directory to initialize the model from.
# Useful for continued pretraining. Mutually exclusive with ``resume``. (type: Optional[Path], default: null)
initial_checkpoint_dir:

# Path to a checkpoint directory to resume from in case training was interrupted, or ``True`` to resume
# from the latest checkpoint in ``out_dir``. An error will be raised if no checkpoint is found. Passing
# ``'auto'`` will resume from the latest checkpoint but not error if no checkpoint exists.
# (type: Union[bool, Literal["auto"], Path], default: False)
resume: false

# Data-related arguments. If not provided, the default is ``litgpt.data.TinyLlama``.
data: MicroLlama
# Path - /NS/llm-1/static00/data/slimpajama

# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:

  # Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
  save_interval: 100000

  # Number of iterations between logging calls (type: int, default: 1)
  log_interval: 1

  # Number of samples between optimizer steps across data-parallel ranks (type: int, default: 48)
  # Scale this number according to the number of GPU and memory size per GPU
  # For example, we used 48 for 4 x 24G 4090 
  global_batch_size: 24

  # Number of samples per data-parallel rank (type: int, default: 12)
  # Scale this number according to the memory size per GPU
  # For example, we used 12 for 24G 4090
  micro_batch_size: 24

  # Number of iterations with learning rate warmup active (type: int, default: 2000)
  lr_warmup_steps: 2000

  # Number of epochs to train on (type: Optional[int], default: null)
  epochs:

  # Total number of tokens to train on (type: Optional[int], default: 3000000000000)
  max_tokens: 3000000000000

  # Limits the number of optimizer steps to run. (type: Optional[int], default: null)
  max_steps:

  # Limits the length of samples. Off by default (type: Optional[int], default: null)
  max_seq_length: 2048

  # Whether to tie the embedding weights with the language modeling head weights. (type: Optional[bool], default: False)
  tie_embeddings:

  #   (type: Optional[float], default: 1.0)
  max_norm: 1.0

  #   (type: float, default: 4e-05)
  min_lr: 4.0e-05

# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:

  # Number of optimizer steps between evaluation calls (type: int, default: 1000)
  interval: 1000

  # Number of tokens to generate (type: Optional[int], default: null)
  max_new_tokens:

  # Number of iterations (type: int, default: 100)
  max_iters: 100

  # Whether to evaluate on the validation set at the beginning of the training
  initial_validation: false

# Optimizer-related arguments
optimizer:

  class_path: torch.optim.AdamW
  
  init_args:
    
    #   (type: float, default: 0.001)
    lr: 4e-4
    
    #   (type: float, default: 0.01)
    weight_decay: 0.1
    
    #   (type: tuple, default: (0.9,0.999))
    betas:
      - 0.9
      - 0.95

# How many devices/GPUs to use. Uses all GPUs by default. (type: Union[int, str], default: auto)
devices: auto

# How many nodes to use. (type: int, default: 1)
num_nodes: 1

# Optional path to the tokenizer dir that was used for preprocessing the dataset. Only some data
# module require this. (type: Optional[Path], default: null)
tokenizer_dir: checkpoints/allenai/OLMo-1B-hf

# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: tensorboard)
logger_name: wandb

# The random seed to use for reproducibility. (type: int, default: 42)
seed: 42

I tried on a single GPU as well as on 8xA100 machines, and I get the same OOMs.

aflah02 avatar Dec 16 '24 13:12 aflah02

I looked at the numbers from the Pythia paper: while training their 1B model, they were able to use a batch size of 16 on a 40 GB A100, but I can't use that for OLMo 1B despite having a 2x larger GPU.
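For a rough sanity check, here is a back-of-envelope estimate of the static memory a ~1.2B-parameter model needs with AdamW under bf16-mixed (fp32 master weights); these are approximations, not measured litgpt numbers:

```python
# Static memory for a ~1.2B-parameter model trained with AdamW in mixed precision:
# fp32 weights + fp32 grads + two fp32 Adam moments, before any activations.
n_params = 1.2e9
bytes_per_param = 4 + 4 + 4 + 4
static_gb = n_params * bytes_per_param / 1e9
print(f"~{static_gb:.0f} GB before activations")  # ~19 GB

# The rest of an 80 GB card is shared by activations, which scale roughly with
# micro_batch_size * max_seq_length, plus CUDA overhead and fragmentation.
```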

aflah02 avatar Dec 16 '24 13:12 aflah02

Here's the wandb GPU usage chart for batch size 16:

[wandb GPU usage chart]

aflah02 avatar Dec 16 '24 13:12 aflah02

Just a guess: try doing some memory profiling. I can imagine that you will find a spike in memory consumption caused by one of the examples.

 # Limits the length of samples. Off by default (type: Optional[int], default: null)
  max_seq_length: 2048

It might be that there are only a couple of samples in the training set with such a length, and because of the spike they cause, you cannot enlarge the batch size.

But it's only a guess :)
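A quick way to check this would be to look at the length distribution of one raw chunk (a sketch; it assumes the SlimPajama jsonl.zst layout with a "text" field, uses character length as a cheap proxy for token count, and the path is a placeholder):

```python
import io
import json
import zstandard as zstd

path = "/NS/llm-1/static00/data/slimpajama-raw/train/chunk1/example_train_0.jsonl.zst"  # placeholder

lengths = []
with open(path, "rb") as fh:
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        lengths.append(len(json.loads(line)["text"]))

lengths.sort()
n = len(lengths)
print("p50:", lengths[n // 2], "p99:", lengths[int(n * 0.99)], "max:", lengths[-1])
```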

Andrei-Aksionov avatar Dec 16 '24 13:12 Andrei-Aksionov

I do plan to, but I think even if the entire batch were this long it should still not OOM: Pythia used the same sequence length on a GPU with half the memory and still worked with larger batch sizes.

aflah02 avatar Dec 16 '24 13:12 aflah02

I looked at the numbers from the Pythia paper: while training their 1B model, they were able to use a batch size of 16 on a 40 GB A100, but I can't use that for OLMo 1B despite having a 2x larger GPU.

To better isolate the problem, could you try to repeat Pythia 1B with a batch size of 40?

Andrei-Aksionov avatar Dec 16 '24 14:12 Andrei-Aksionov

Thanks, I'll do that.

Also, is there a simple way to use the profiler when pretraining, or do I need to modify pretrain.py and add the profiler in manually?
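One low-touch option would be to wrap a handful of iterations of the loop in pretrain.py with torch.profiler (a sketch, where run_one_iteration is a placeholder for one real forward/backward/optimizer step):

```python
from torch.profiler import ProfilerActivity, profile

def run_one_iteration():
    """Placeholder for one forward/backward/optimizer step of the real loop."""
    ...

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    for _ in range(5):  # a few steps are usually enough to catch a memory spike
        run_one_iteration()

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=15))
```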

aflah02 avatar Dec 16 '24 14:12 aflah02