Pretraining an OLMo model on the SlimPajama dataset
Hi!
I am planning to test pretraining the OLMo 1B model on the SlimPajama dataset. I was following the TinyLlama tutorial, but one of the dataset-preparation steps uses the litgpt/data/prepare_slimpajama.py file, which seems to be missing from the repo. Any workarounds for this?
CC: @rasbt @Andrei-Aksionov Just putting this on your radar, as it is a continuation of the OLMo PR.
Hello @aflah02 Good catch! It looks like this file was accidentally deleted in one of the recent PRs: https://github.com/Lightning-AI/litgpt/pull/1821/files#diff-2646bbbf72cb6e84cfc29a226b4446985b6904dc04b6228ef8a69d9fcb4a2951
Could you bring it back in a PR?
Sure, I'll do that
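In the meantime, here is a rough, untested sketch of what I expect the preparation step to look like, so others aren't completely blocked. This is not the original prepare_slimpajama.py; the tokenizer directory, the "text" field name, and the litdata.optimize arguments are assumptions on my part.

```python
# Rough sketch only -- not the original litgpt/data/prepare_slimpajama.py.
# Reads the raw .jsonl.zst shards, tokenizes them, and writes litdata chunks.
import io
import json
from pathlib import Path

import zstandard as zstd
from litdata import optimize
from litgpt.tokenizer import Tokenizer

# Assumed tokenizer directory (the same one used for pretraining below).
tokenizer = Tokenizer(Path("checkpoints/allenai/OLMo-1B-hf"))


def tokenize_shard(filepath: Path):
    """Yield one token tensor per JSON line in a .jsonl.zst shard."""
    with open(filepath, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            text = json.loads(line)["text"]  # SlimPajama stores the document under "text"
            yield tokenizer.encode(text, eos=True)


if __name__ == "__main__":
    shards = sorted(Path("/NS/llm-1/static00/data/slimpajama-raw/train").rglob("*.jsonl.zst"))
    optimize(
        fn=tokenize_shard,
        inputs=shards,
        output_dir="/NS/llm-1/static00/data/slimpajama/train",
        chunk_bytes="200MB",  # arbitrary; tune for your storage
        num_workers=16,
    )
```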
I tried using the code to process the dataset, but it doesn't work for the train split due to disk-space issues. Is there a way to reduce how much data is moved to or kept in the tmp dir?
Error -
OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk2/example_train_4825.jsonl.zst' -> '/tmp/data/chunk2/example_train_4825.jsonl.zst'
OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk5/example_train_2785.jsonl.zst' -> '/tmp/data/chunk5/example_train_2785.jsonl.zst'
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litdata/processing/data_processor.py", line 167, in _download_data_target
shutil.copyfile(path, local_path)
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 269, in copyfile
_fastcopy_sendfile(fsrc, fdst)
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 158, in _fastcopy_sendfile
raise err from None
File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/shutil.py", line 144, in _fastcopy_sendfile
sent = os.sendfile(outfd, infd, offset, blocksize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device: '/NS/llm-1/static00/data/slimpajama-raw/train/chunk4/example_train_401.jsonl.zst' -> '/tmp/data/chunk4/example_train_401.jsonl.zst'
Progress: 2%|██▎ | 1157/59166 [24:37<20:34:42, 1.28s/it]
A simple fix that I'm using is to symlink /tmp/data to a directory on my NFS where I have more storage, and then rerun the script. It seems to be working so far (still in progress).
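Concretely, the workaround is just this (the scratch path below is a placeholder for wherever you have space):

```bash
# Redirect /tmp/data to an NFS directory with more free space before rerunning
# the preparation script. The scratch path is a placeholder for illustration.
NFS_SCRATCH=/NS/llm-1/static00/tmp-data
mkdir -p "$NFS_SCRATCH"
rm -rf /tmp/data              # /tmp/data must not already exist, otherwise ln creates the link inside it
ln -s "$NFS_SCRATCH" /tmp/data
```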
Hey @Andrei-Aksionov @rasbt
I was trying to set up a multi-node run via SLURM and was testing this on 2 nodes with an Ethernet-based interconnect, but the distributed init fails -
/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/plugins/environments/slurm.py:204: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python3.11 /NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_li ...
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/16
[W1204 13:17:19.038836275 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/16
[W1204 13:18:00.218036449 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 6] Seed set to 42
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/16
[W1204 13:18:00.580590861 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 1] Seed set to 42
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/16
[W1204 13:18:00.633244692 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 7] Seed set to 42
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/16
[W1204 13:18:01.815471680 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 3] Seed set to 42
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/16
[W1204 13:18:01.876939030 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 4] Seed set to 42
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/16
[W1204 13:18:01.918607039 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/16
[rank: 5] Seed set to 42
[W1204 13:18:01.934432434 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [::ffff:127.0.0.1]:33641 (errno: 97 - Address family not supported by protocol).
[rank: 2] Seed set to 42
Traceback (most recent call last):
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
sys.exit(main())
^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
CLI(parser_data)
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
return _run_component(component, init.get(subcommand))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
return component(**cfg)
^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 149, in setup
fabric.launch()
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 843, in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 928, in _wrap_and_launch
return launcher.launch(to_run, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/launchers/subprocess_script.py", line 107, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 932, in _wrap_with_setup
self._strategy.setup_environment()
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/fsdp.py", line 260, in setup_environment
self._setup_distributed()
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/strategies/fsdp.py", line 671, in _setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 297, in _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1361, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 258, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 185, in _create_c10d_store
return TCPStore(
^^^^^^^^^
torch.distributed.DistStoreError: Timed out after 1801 seconds waiting for clients. 8/16 clients joined.
[rank4]: Traceback (most recent call last):
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank4]: sys.exit(main())
[rank4]: ^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank4]: CLI(parser_data)
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank4]: return _run_component(component, init.get(subcommand))
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank4]: return component(**cfg)
[rank4]: ^^^^^^^^^^^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank4]: main(
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank4]: train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank4]: with fabric.rank_zero_first():
[rank4]: File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank4]: return next(self.gen)
[rank4]: ^^^^^^^^^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank4]: with _InfiniteBarrier() as barrier:
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank4]: self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank4]: func_return = func(*args, **kwargs)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank4]: return _new_group_with_tag(
[rank4]: ^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank4]: pg, pg_store = _new_process_group_helper(
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank4]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: torch.distributed.DistNetworkError: Connection reset by peer
[rank6]: Traceback (most recent call last):
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank6]: sys.exit(main())
[rank6]: ^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank6]: CLI(parser_data)
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank6]: return _run_component(component, init.get(subcommand))
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank6]: return component(**cfg)
[rank6]: ^^^^^^^^^^^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank6]: main(
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank6]: train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank6]: with fabric.rank_zero_first():
[rank6]: File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank6]: return next(self.gen)
[rank6]: ^^^^^^^^^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank6]: with _InfiniteBarrier() as barrier:
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank6]: self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank6]: func_return = func(*args, **kwargs)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank6]: return _new_group_with_tag(
[rank6]: ^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank6]: pg, pg_store = _new_process_group_helper(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank6]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: torch.distributed.DistNetworkError: Connection reset by peer
[rank2]: Traceback (most recent call last):
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank2]: sys.exit(main())
[rank2]: ^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank2]: CLI(parser_data)
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank2]: return _run_component(component, init.get(subcommand))
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank2]: return component(**cfg)
[rank2]: ^^^^^^^^^^^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank2]: main(
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank2]: train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank2]: with fabric.rank_zero_first():
[rank2]: File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank2]: return next(self.gen)
[rank2]: ^^^^^^^^^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank2]: with _InfiniteBarrier() as barrier:
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank2]: self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank2]: func_return = func(*args, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank2]: return _new_group_with_tag(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank2]: pg, pg_store = _new_process_group_helper(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank2]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.distributed.DistNetworkError: Connection reset by peer
[rank7]: Traceback (most recent call last):
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank7]: sys.exit(main())
[rank7]: ^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank7]: CLI(parser_data)
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank7]: return _run_component(component, init.get(subcommand))
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank7]: return component(**cfg)
[rank7]: ^^^^^^^^^^^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank7]: main(
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank7]: train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank7]: with fabric.rank_zero_first():
[rank7]: File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank7]: return next(self.gen)
[rank7]: ^^^^^^^^^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank7]: with _InfiniteBarrier() as barrier:
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank7]: self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank7]: func_return = func(*args, **kwargs)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank7]: return _new_group_with_tag(
[rank7]: ^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank7]: pg, pg_store = _new_process_group_helper(
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank7]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: torch.distributed.DistNetworkError: Connection reset by peer
[rank1]: Traceback (most recent call last):
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank1]: sys.exit(main())
[rank1]: ^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank1]: CLI(parser_data)
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank1]: return _run_component(component, init.get(subcommand))
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank1]: return component(**cfg)
[rank1]: ^^^^^^^^^^^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank1]: main(
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank1]: train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank1]: with fabric.rank_zero_first():
[rank1]: File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank1]: return next(self.gen)
[rank1]: ^^^^^^^^^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank1]: with _InfiniteBarrier() as barrier:
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank1]: self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank1]: func_return = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank1]: return _new_group_with_tag(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank1]: pg, pg_store = _new_process_group_helper(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank1]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
[rank5]: Traceback (most recent call last):
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank5]: sys.exit(main())
[rank5]: ^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank5]: CLI(parser_data)
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank5]: return _run_component(component, init.get(subcommand))
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank5]: return component(**cfg)
[rank5]: ^^^^^^^^^^^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank5]: main(
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank5]: train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank5]: with fabric.rank_zero_first():
[rank5]: File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank5]: return next(self.gen)
[rank5]: ^^^^^^^^^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank5]: with _InfiniteBarrier() as barrier:
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank5]: self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank5]: func_return = func(*args, **kwargs)
[rank5]: ^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank5]: return _new_group_with_tag(
[rank5]: ^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank5]: pg, pg_store = _new_process_group_helper(
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank5]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: torch.distributed.DistNetworkError: Connection reset by peer
[rank3]: Traceback (most recent call last):
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/bin/litgpt", line 8, in <module>
[rank3]: sys.exit(main())
[rank3]: ^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/__main__.py", line 71, in main
[rank3]: CLI(parser_data)
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 119, in CLI
[rank3]: return _run_component(component, init.get(subcommand))
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank3]: return component(**cfg)
[rank3]: ^^^^^^^^^^^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 155, in setup
[rank3]: main(
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 215, in main
[rank3]: train_dataloader, val_dataloader = get_dataloaders(fabric, data, tokenizer, train, model.max_seq_length)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/litgpt/pretrain.py", line 422, in get_dataloaders
[rank3]: with fabric.rank_zero_first():
[rank3]: File "/NS/llm-1/nobackup/afkhan/anaconda3/lib/python3.11/contextlib.py", line 137, in __enter__
[rank3]: return next(self.gen)
[rank3]: ^^^^^^^^^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/fabric.py", line 635, in rank_zero_first
[rank3]: with _InfiniteBarrier() as barrier:
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/lightning/fabric/utilities/distributed.py", line 425, in __enter__
[rank3]: self.group = torch.distributed.new_group(backend="gloo", timeout=timedelta(days=10000))
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank3]: func_return = func(*args, **kwargs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank3]: return _new_group_with_tag(
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank3]: pg, pg_store = _new_process_group_helper(
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/NS/llm-1/work/afkhan/all_venvs/olmo_pretrain_litgpt_venv/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank3]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.distributed.DistNetworkError: Connection reset by peer
I also see this warning -
Warning: Not all GPUs are fully connected via NVLink. Some GPUs are connected via slower interfaces. It is recommended to switch to a different machine with faster GPU connections for optimal multi-GPU training performance.
Here's the config -
# The name of the model to pretrain. Choose from names in ``litgpt.config``. Mutually exclusive with
# ``model_config``. (type: Optional[str], default: null)
model_name: allenai/OLMo-1B-hf
# A ``litgpt.Config`` object to define the model architecture. Mutually exclusive with
# ``model_name``. (type: Optional[Config], default: null)
model_config:
# Directory in which to save checkpoints and logs. If running in a Lightning Studio Job, look for it in
# /teamspace/jobs/<job-name>/share. (type: <class 'Path'>, default: out/pretrain)
out_dir: out/pretrain/slim-olmo-2x8xH100-GBS-192
# The precision to use for pretraining. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-mixed
# Optional path to a checkpoint directory to initialize the model from.
# Useful for continued pretraining. Mutually exclusive with ``resume``. (type: Optional[Path], default: null)
initial_checkpoint_dir:
# Path to a checkpoint directory to resume from in case training was interrupted, or ``True`` to resume
# from the latest checkpoint in ``out_dir``. An error will be raised if no checkpoint is found. Passing
# ``'auto'`` will resume from the latest checkpoint but not error if no checkpoint exists.
# (type: Union[bool, Literal["auto"], Path], default: False)
resume: false
# Data-related arguments. If not provided, the default is ``litgpt.data.TinyLlama``.
data: MicroLlama
# Path - /NS/llm-1/static00/data/slimpajama
# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:
# Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
save_interval: 100000
# Number of iterations between logging calls (type: int, default: 1)
log_interval: 1
# Number of samples between optimizer steps across data-parallel ranks (type: int, default: 48)
# Scale this number according to the number of GPU and memory size per GPU
# For example, we used 48 for 4 x 24G 4090
global_batch_size: 192
# Number of samples per data-parallel rank (type: int, default: 12)
# Scale this number according to the memory size per GPU
# For example, we used 12 for 24G 4090
micro_batch_size: 12
# Number of iterations with learning rate warmup active (type: int, default: 2000)
lr_warmup_steps: 2000
# Number of epochs to train on (type: Optional[int], default: null)
epochs:
# Total number of tokens to train on (type: Optional[int], default: 3000000000000)
max_tokens: 3000000000000
# Limits the number of optimizer steps to run. (type: Optional[int], default: null)
max_steps:
# Limits the length of samples. Off by default (type: Optional[int], default: null)
max_seq_length: 2048
# Whether to tie the embedding weights with the language modeling head weights. (type: Optional[bool], default: False)
tie_embeddings:
# (type: Optional[float], default: 1.0)
max_norm: 1.0
# (type: float, default: 4e-05)
min_lr: 4.0e-05
# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:
# Number of optimizer steps between evaluation calls (type: int, default: 1000)
interval: 1000
# Number of tokens to generate (type: Optional[int], default: null)
max_new_tokens:
# Number of iterations (type: int, default: 100)
max_iters: 100
# Whether to evaluate on the validation set at the beginning of the training
initial_validation: false
# Optimizer-related arguments
optimizer:
class_path: torch.optim.AdamW
init_args:
# (type: float, default: 0.001)
lr: 4e-4
# (type: float, default: 0.01)
weight_decay: 0.1
# (type: tuple, default: (0.9,0.999))
betas:
- 0.9
- 0.95
# How many devices/GPUs to use. Uses all GPUs by default. (type: Union[int, str], default: auto)
devices: auto
# How many nodes to use. (type: int, default: 1)
num_nodes: 2
# Optional path to the tokenizer dir that was used for preprocessing the dataset. Only some data
# module require this. (type: Optional[Path], default: null)
tokenizer_dir: checkpoints/allenai/OLMo-1B-hf
# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: tensorboard)
logger_name: wandb
# The random seed to use for reproducibility. (type: int, default: 42)
seed: 42
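For reference, here is how I read the batch-size arithmetic in this config (my own back-of-the-envelope numbers, assuming the global batch is split evenly across all 16 data-parallel ranks):

```python
# Back-of-the-envelope check of the batch-size settings above; not litgpt output.
global_batch_size = 192
micro_batch_size = 12
world_size = 2 * 8                                      # num_nodes * devices
batch_per_rank = global_batch_size // world_size        # 12 samples per rank per optimizer step
grad_accum_steps = batch_per_rank // micro_batch_size   # 1 -> no gradient accumulation
tokens_per_step = global_batch_size * 2048              # 393,216 tokens per optimizer step
print(batch_per_rank, grad_accum_steps, tokens_per_step)
```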
This is my run command -
sbatch --partition=a100 --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xH100-GBS-192 --wrap "litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"
The code works when running on a single node.
Any clue what might be going wrong? I am using SLURM, by the way.
nvidia-smi on the nodes (before the timeout-based crash) -
So one of the nodes doesn't really load anything
I just realized that the error message and this tutorial (https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html) imply I should use srun. Running with that now -
sbatch --partition=a100 --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192 --wrap "srun litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"
This command works -
sbatch --partition=a100 --nodes=2 --gres=gpu:8 --ntasks-per-node=8 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192 --wrap "srun litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/"
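For anyone copying this, here is the same thing as a submission script (partition, paths, and resource numbers are specific to my cluster):

```bash
#!/bin/bash
#SBATCH --partition=a100
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8   # one task per GPU so SLURM and Fabric agree on the world size
#SBATCH --mem=244G
#SBATCH --time=8-00:00
#SBATCH --job-name=litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192

srun litgpt pretrain \
  --config config_hub/pretrain/slimolmo.yaml \
  --data.data_path /NS/llm-1/static00/data/
```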
But when I look at wandb I only see logs for one node (even though the loss is aggregated prior to backprop, I don't see any device stats for the other node).
Hi @Andrei-Aksionov @rasbt I was trying to figure out the best batch size for pretraining OLMo 1B on A100 machines. I tried a lot of different batch sizes, but everything OOMs except for batch size 12, which is quite surprising since that is the recommended batch size for TinyLlama on a 24 GB 4090, while I'm testing on an 80 GB A100. Any ideas what could be going wrong?
Here is my config -
# The name of the model to pretrain. Choose from names in ``litgpt.config``. Mutually exclusive with
# ``model_config``. (type: Optional[str], default: null)
model_name: allenai/OLMo-1B-hf
# A ``litgpt.Config`` object to define the model architecture. Mutually exclusive with
# ``model_name``. (type: Optional[Config], default: null)
model_config:
# Directory in which to save checkpoints and logs. If running in a Lightning Studio Job, look for it in
# /teamspace/jobs/<job-name>/share. (type: <class 'Path'>, default: out/pretrain)
out_dir: out/pretrain/slim-olmo-1x1xA100-GBS-24
# The precision to use for pretraining. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
precision: bf16-mixed
# Optional path to a checkpoint directory to initialize the model from.
# Useful for continued pretraining. Mutually exclusive with ``resume``. (type: Optional[Path], default: null)
initial_checkpoint_dir:
# Path to a checkpoint directory to resume from in case training was interrupted, or ``True`` to resume
# from the latest checkpoint in ``out_dir``. An error will be raised if no checkpoint is found. Passing
# ``'auto'`` will resume from the latest checkpoint but not error if no checkpoint exists.
# (type: Union[bool, Literal["auto"], Path], default: False)
resume: false
# Data-related arguments. If not provided, the default is ``litgpt.data.TinyLlama``.
data: MicroLlama
# Path - /NS/llm-1/static00/data/slimpajama
# Training-related arguments. See ``litgpt.args.TrainArgs`` for details
train:
# Number of optimizer steps between saving checkpoints (type: Optional[int], default: 1000)
save_interval: 100000
# Number of iterations between logging calls (type: int, default: 1)
log_interval: 1
# Number of samples between optimizer steps across data-parallel ranks (type: int, default: 48)
# Scale this number according to the number of GPU and memory size per GPU
# For example, we used 48 for 4 x 24G 4090
global_batch_size: 24
# Number of samples per data-parallel rank (type: int, default: 12)
# Scale this number according to the memory size per GPU
# For example, we used 12 for 24G 4090
micro_batch_size: 24
# Number of iterations with learning rate warmup active (type: int, default: 2000)
lr_warmup_steps: 2000
# Number of epochs to train on (type: Optional[int], default: null)
epochs:
# Total number of tokens to train on (type: Optional[int], default: 3000000000000)
max_tokens: 3000000000000
# Limits the number of optimizer steps to run. (type: Optional[int], default: null)
max_steps:
# Limits the length of samples. Off by default (type: Optional[int], default: null)
max_seq_length: 2048
# Whether to tie the embedding weights with the language modeling head weights. (type: Optional[bool], default: False)
tie_embeddings:
# (type: Optional[float], default: 1.0)
max_norm: 1.0
# (type: float, default: 4e-05)
min_lr: 4.0e-05
# Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
eval:
# Number of optimizer steps between evaluation calls (type: int, default: 1000)
interval: 1000
# Number of tokens to generate (type: Optional[int], default: null)
max_new_tokens:
# Number of iterations (type: int, default: 100)
max_iters: 100
# Whether to evaluate on the validation set at the beginning of the training
initial_validation: false
# Optimizer-related arguments
optimizer:
class_path: torch.optim.AdamW
init_args:
# (type: float, default: 0.001)
lr: 4e-4
# (type: float, default: 0.01)
weight_decay: 0.1
# (type: tuple, default: (0.9,0.999))
betas:
- 0.9
- 0.95
# How many devices/GPUs to use. Uses all GPUs by default. (type: Union[int, str], default: auto)
devices: auto
# How many nodes to use. (type: int, default: 1)
num_nodes: 1
# Optional path to the tokenizer dir that was used for preprocessing the dataset. Only some data
# module require this. (type: Optional[Path], default: null)
tokenizer_dir: checkpoints/allenai/OLMo-1B-hf
# The name of the logger to send metrics to. (type: Literal['wandb', 'tensorboard', 'csv'], default: tensorboard)
logger_name: wandb
# The random seed to use for reproducibility. (type: int, default: 42)
seed: 42
I tried on a single GPU as well as on 8xA100 machines and I get the same OOMs.
I looked at the numbers from the Pythia paper: while training their 1B model they were able to use a batch size of 16 on a 40 GB A100, but I can't use that for OLMo 1B despite having a 2x larger GPU.
Here's the WANDB GPU Usage Chart for Batch Size 16 -
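For context, here is my rough memory estimate for why I expected more headroom (my own arithmetic, assuming ~1.2B parameters and that bf16-mixed keeps fp32 weights, gradients, and AdamW states):

```python
# Back-of-the-envelope estimate, not a measurement.
params = 1.2e9                       # assumed parameter count for OLMo 1B
weights = params * 4                 # fp32 master weights
grads = params * 4                   # fp32 gradients
adamw_states = params * 8            # exp_avg + exp_avg_sq in fp32
static_gb = (weights + grads + adamw_states) / 1024**3
print(f"~{static_gb:.0f} GB before activations")   # roughly 18 GB of the 80 GB

# The rest is activations, which scale with micro_batch_size * max_seq_length,
# so that is where a spike would have to come from.
```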
Just a guess: try doing some memory profiling. I can imagine that you will find a spike in memory consumption caused by one of the examples.
# Limits the length of samples. Off by default (type: Optional[int], default: null)
max_seq_length: 2048
It might be that there are only a couple of samples in the training set with that length, and because of them and the spike they cause, you cannot enlarge the batch size.
But it's only a guess :)
I do plan to, but I think even if the entire batch were that long it should still not OOM: Pythia used the same sequence length on a GPU with half the memory and still worked with larger batch sizes.
I looked at the numbers from the Pythia paper: while training their 1B model they were able to use a batch size of 16 on a 40 GB A100, but I can't use that for OLMo 1B despite having a 2x larger GPU.
To better isolate the problem, could you try to repeat Pythia 1B with a batch size of 40?
Thanks, I'll do that.
Also, is there a simple way to use the profiler when pretraining, or do I need to modify pretrain.py and add the profiler manually?
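In case it helps, this is the kind of snippet I'd expect to paste into pretrain.py around the training loop (the placement is my guess, and _record_memory_history/_dump_snapshot are private PyTorch APIs, available in 2.1+):

```python
import torch

# Before the training loop: start recording allocation history.
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run a handful of training iterations ...

# After a few steps: dump a snapshot and stop recording.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)

# The snapshot can be inspected at https://pytorch.org/memory_viz;
# the peak can also be printed directly:
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```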