Training fails with an error `WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202100 closing signal SIGTERM`
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I first ran `python -m axolotl.cli.preprocess examples/llama-2/ver2.0.yml` because I have a lot of data (total_num_tokens: 10394324568).
It ran successfully and the data was saved in the last_run_prepared folder.
After that, I ran `accelerate launch -m axolotl.cli.train examples/llama-2/ver2.0.yml` to train.
Current behaviour
But the training hangs here for about 10 minutes,
(axo) root@notebook-deployment-25-5b4fb57786-p6qqr:~/fileviewer/LLM/axolotl# accelerate launch -m axolotl.cli.train examples/llama-2/ver2.0.yml
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `8`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/jovyan/.local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-02-07 10:17:21,364] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/jovyan/.local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/jovyan/.local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/jovyan/.local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/jovyan/.local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/jovyan/.local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/jovyan/.local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/jovyan/.local/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-02-07 10:17:23,822] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-07 10:17:23,831] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-07 10:17:23,836] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-07 10:17:23,840] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-07 10:17:24,091] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-07 10:17:24,095] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-07 10:17:24,149] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-07 10:17:28,326] [INFO] [axolotl.normalize_config:150] [PID:202106] [RANK:6] GPU memory usage baseline: 0.000GB (+1.723GB misc)
[2024-02-07 10:17:29,878] [INFO] [axolotl.normalize_config:150] [PID:202100] [RANK:0] GPU memory usage baseline: 0.000GB (+3.121GB misc)
[2024-02-07 10:17:30,110] [INFO] [axolotl.normalize_config:150] [PID:202105] [RANK:5] GPU memory usage baseline: 0.000GB (+1.723GB misc)
[2024-02-07 10:17:30,226] [INFO] [axolotl.normalize_config:150] [PID:202107] [RANK:7] GPU memory usage baseline: 0.000GB (+1.863GB misc)
[2024-02-07 10:17:30,260] [INFO] [axolotl.normalize_config:150] [PID:202104] [RANK:4] GPU memory usage baseline: 0.000GB (+1.723GB misc)
[2024-02-07 10:17:30,635] [INFO] [axolotl.normalize_config:150] [PID:202103] [RANK:3] GPU memory usage baseline: 0.000GB (+1.723GB misc)
[2024-02-07 10:17:30,645] [INFO] [axolotl.normalize_config:150] [PID:202101] [RANK:1] GPU memory usage baseline: 0.000GB (+2.285GB misc)
[2024-02-07 10:17:30,694] [INFO] [axolotl.normalize_config:150] [PID:202102] [RANK:2] GPU memory usage baseline: 0.000GB (+1.723GB misc)
dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[2024-02-07 10:17:32,656] [DEBUG] [axolotl.load_tokenizer:210] [PID:202100] [RANK:0] EOS: 57290 / </eot>
[2024-02-07 10:17:32,656] [DEBUG] [axolotl.load_tokenizer:211] [PID:202100] [RANK:0] BOS: 1 / <s>
[2024-02-07 10:17:32,656] [DEBUG] [axolotl.load_tokenizer:212] [PID:202100] [RANK:0] PAD: 2 / </s>
[2024-02-07 10:17:32,656] [DEBUG] [axolotl.load_tokenizer:213] [PID:202100] [RANK:0] UNK: 0 / <unk>
[2024-02-07 10:17:32,656] [INFO] [axolotl.load_tokenizer:218] [PID:202100] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-02-07 10:17:32,670] [DEBUG] [axolotl.load_tokenizer:210] [PID:202106] [RANK:6] EOS: 57290 / </eot>
[2024-02-07 10:17:32,670] [DEBUG] [axolotl.load_tokenizer:211] [PID:202106] [RANK:6] BOS: 1 / <s>
[2024-02-07 10:17:32,670] [DEBUG] [axolotl.load_tokenizer:212] [PID:202106] [RANK:6] PAD: 2 / </s>
[2024-02-07 10:17:32,670] [DEBUG] [axolotl.load_tokenizer:213] [PID:202106] [RANK:6] UNK: 0 / <unk>
[2024-02-07 10:17:32,670] [INFO] [axolotl.load_tokenizer:218] [PID:202106] [RANK:6] No Chat template selected. Consider adding a chat template for easier inference.
[2024-02-07 10:17:32,676] [DEBUG] [axolotl.load_tokenizer:210] [PID:202101] [RANK:1] EOS: 57290 / </eot>
[2024-02-07 10:17:32,676] [DEBUG] [axolotl.load_tokenizer:211] [PID:202101] [RANK:1] BOS: 1 / <s>
[2024-02-07 10:17:32,676] [DEBUG] [axolotl.load_tokenizer:212] [PID:202101] [RANK:1] PAD: 2 / </s>
[2024-02-07 10:17:32,676] [DEBUG] [axolotl.load_tokenizer:213] [PID:202101] [RANK:1] UNK: 0 / <unk>
[2024-02-07 10:17:32,676] [INFO] [axolotl.load_tokenizer:218] [PID:202101] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-02-07 10:17:32,679] [DEBUG] [axolotl.load_tokenizer:210] [PID:202107] [RANK:7] EOS: 57290 / </eot>
[2024-02-07 10:17:32,679] [DEBUG] [axolotl.load_tokenizer:211] [PID:202107] [RANK:7] BOS: 1 / <s>
[2024-02-07 10:17:32,679] [DEBUG] [axolotl.load_tokenizer:212] [PID:202107] [RANK:7] PAD: 2 / </s>
[2024-02-07 10:17:32,679] [DEBUG] [axolotl.load_tokenizer:213] [PID:202107] [RANK:7] UNK: 0 / <unk>
[2024-02-07 10:17:32,679] [INFO] [axolotl.load_tokenizer:218] [PID:202107] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference.
[2024-02-07 10:17:32,682] [DEBUG] [axolotl.load_tokenizer:210] [PID:202105] [RANK:5] EOS: 57290 / </eot>
[2024-02-07 10:17:32,682] [DEBUG] [axolotl.load_tokenizer:211] [PID:202105] [RANK:5] BOS: 1 / <s>
[2024-02-07 10:17:32,682] [DEBUG] [axolotl.load_tokenizer:212] [PID:202105] [RANK:5] PAD: 2 / </s>
[2024-02-07 10:17:32,682] [DEBUG] [axolotl.load_tokenizer:213] [PID:202105] [RANK:5] UNK: 0 / <unk>
[2024-02-07 10:17:32,682] [INFO] [axolotl.load_tokenizer:218] [PID:202105] [RANK:5] No Chat template selected. Consider adding a chat template for easier inference.
[2024-02-07 10:17:32,689] [DEBUG] [axolotl.load_tokenizer:210] [PID:202104] [RANK:4] EOS: 57290 / </eot>
[2024-02-07 10:17:32,689] [DEBUG] [axolotl.load_tokenizer:211] [PID:202104] [RANK:4] BOS: 1 / <s>
[2024-02-07 10:17:32,689] [DEBUG] [axolotl.load_tokenizer:212] [PID:202104] [RANK:4] PAD: 2 / </s>
[2024-02-07 10:17:32,689] [DEBUG] [axolotl.load_tokenizer:213] [PID:202104] [RANK:4] UNK: 0 / <unk>
[2024-02-07 10:17:32,689] [INFO] [axolotl.load_tokenizer:218] [PID:202104] [RANK:4] No Chat template selected. Consider adding a chat template for easier inference.
[2024-02-07 10:17:32,692] [DEBUG] [axolotl.load_tokenizer:210] [PID:202102] [RANK:2] EOS: 57290 / </eot>
[2024-02-07 10:17:32,692] [DEBUG] [axolotl.load_tokenizer:211] [PID:202102] [RANK:2] BOS: 1 / <s>
[2024-02-07 10:17:32,692] [DEBUG] [axolotl.load_tokenizer:212] [PID:202102] [RANK:2] PAD: 2 / </s>
[2024-02-07 10:17:32,692] [DEBUG] [axolotl.load_tokenizer:213] [PID:202102] [RANK:2] UNK: 0 / <unk>
[2024-02-07 10:17:32,692] [INFO] [axolotl.load_tokenizer:218] [PID:202102] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference.
[2024-02-07 10:17:32,747] [INFO] [axolotl.load_tokenized_prepared_datasets:156] [PID:202100] [RANK:0] Loading prepared dataset from disk at data/last_run_prepared/ver1.5/5a0d42bdf37a5f63628d102b050b290a...
[2024-02-07 10:17:33,049] [DEBUG] [axolotl.load_tokenizer:210] [PID:202103] [RANK:3] EOS: 57290 / </eot>
[2024-02-07 10:17:33,049] [DEBUG] [axolotl.load_tokenizer:211] [PID:202103] [RANK:3] BOS: 1 / <s>
[2024-02-07 10:17:33,049] [DEBUG] [axolotl.load_tokenizer:212] [PID:202103] [RANK:3] PAD: 2 / </s>
[2024-02-07 10:17:33,049] [DEBUG] [axolotl.load_tokenizer:213] [PID:202103] [RANK:3] UNK: 0 / <unk>
[2024-02-07 10:17:33,049] [INFO] [axolotl.load_tokenizer:218] [PID:202103] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference.
[2024-02-07 10:17:36,276] [INFO] [axolotl.load_tokenized_prepared_datasets:158] [PID:202100] [RANK:0] Prepared dataset loaded from disk...
[2024-02-07 10:18:08,342] [INFO] [axolotl.load_tokenized_prepared_datasets:156] [PID:202101] [RANK:1] Loading prepared dataset from disk at data/last_run_prepared/ver1.5/5a0d42bdf37a5f63628d102b050b290a...
[2024-02-07 10:18:08,342] [INFO] [axolotl.load_tokenized_prepared_datasets:156] [PID:202102] [RANK:2] Loading prepared dataset from disk at data/last_run_prepared/ver1.5/5a0d42bdf37a5f63628d102b050b290a...
[2024-02-07 10:18:08,343] [INFO] [axolotl.load_tokenized_prepared_datasets:156] [PID:202103] [RANK:3] Loading prepared dataset from disk at data/last_run_prepared/ver1.5/5a0d42bdf37a5f63628d102b050b290a...
[2024-02-07 10:18:08,343] [INFO] [axolotl.load_tokenized_prepared_datasets:156] [PID:202106] [RANK:6] Loading prepared dataset from disk at data/last_run_prepared/ver1.5/5a0d42bdf37a5f63628d102b050b290a...
[2024-02-07 10:18:08,343] [INFO] [axolotl.load_tokenized_prepared_datasets:156] [PID:202105] [RANK:5] Loading prepared dataset from disk at data/last_run_prepared/ver1.5/5a0d42bdf37a5f63628d102b050b290a...
[2024-02-07 10:18:08,343] [INFO] [axolotl.load_tokenized_prepared_datasets:156] [PID:202107] [RANK:7] Loading prepared dataset from disk at data/last_run_prepared/ver1.5/5a0d42bdf37a5f63628d102b050b290a...
[2024-02-07 10:18:08,343] [INFO] [axolotl.load_tokenized_prepared_datasets:156] [PID:202104] [RANK:4] Loading prepared dataset from disk at data/last_run_prepared/ver1.5/5a0d42bdf37a5f63628d102b050b290a...
[2024-02-07 10:18:11,243] [INFO] [axolotl.load_tokenized_prepared_datasets:158] [PID:202101] [RANK:1] Prepared dataset loaded from disk...
[2024-02-07 10:18:11,297] [INFO] [axolotl.load_tokenized_prepared_datasets:158] [PID:202102] [RANK:2] Prepared dataset loaded from disk...
[2024-02-07 10:18:11,311] [INFO] [axolotl.load_tokenized_prepared_datasets:158] [PID:202103] [RANK:3] Prepared dataset loaded from disk...
[2024-02-07 10:18:12,061] [INFO] [axolotl.load_tokenized_prepared_datasets:158] [PID:202106] [RANK:6] Prepared dataset loaded from disk...
[2024-02-07 10:18:12,128] [INFO] [axolotl.load_tokenized_prepared_datasets:158] [PID:202104] [RANK:4] Prepared dataset loaded from disk...
[2024-02-07 10:18:12,296] [INFO] [axolotl.load_tokenized_prepared_datasets:158] [PID:202105] [RANK:5] Prepared dataset loaded from disk...
[2024-02-07 10:18:12,302] [INFO] [axolotl.load_tokenized_prepared_datasets:158] [PID:202107] [RANK:7] Prepared dataset loaded from disk...
[2024-02-07 10:18:18,753] [DEBUG] [axolotl.log:61] [PID:202100] [RANK:0] max_input_len: 2048
Filter (num_proc=92): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1093118/1093118 [00:18<00:00, 58856.73 examples/s]
[2024-02-07 10:18:52,990] [DEBUG] [axolotl.log:61] [PID:202100] [RANK:0] total_num_tokens: 104953362
[2024-02-07 10:19:03,674] [DEBUG] [axolotl.log:61] [PID:202100] [RANK:0] `total_supervised_tokens: 104953362`
[2024-02-07 10:19:10,842] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:202100] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 13119170
[2024-02-07 10:19:10,842] [DEBUG] [axolotl.log:61] [PID:202100] [RANK:0] data_loader_len: 50727
[2024-02-07 10:19:21,549] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:202105] [RANK:5] packing_efficiency_estimate: 1.0 total_num_tokens per device: 13119170
[2024-02-07 10:19:22,906] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:202103] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 13119170
[2024-02-07 10:19:23,052] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:202101] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 13119170
[2024-02-07 10:19:23,523] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:202102] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 13119170
[2024-02-07 10:19:23,592] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:202107] [RANK:7] packing_efficiency_estimate: 1.0 total_num_tokens per device: 13119170
[2024-02-07 10:19:25,488] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:202104] [RANK:4] packing_efficiency_estimate: 1.0 total_num_tokens per device: 13119170
[2024-02-07 10:19:27,145] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:202106] [RANK:6] packing_efficiency_estimate: 1.0 total_num_tokens per device: 13119170
[2024-02-07 10:19:29,415] [INFO] [axolotl.log:61] [PID:202100] [RANK:0] sample_packing_eff_est across ranks: [0.8551673293113708, 0.8553243279457092, 0.856024444103241, 0.8551245331764221, 0.8540699481964111, 0.8542835116386414, 0.8545684218406677, 0.8550674915313721]
[2024-02-07 10:19:29,416] [DEBUG] [axolotl.log:61] [PID:202100] [RANK:0] sample_packing_eff_est: None
[2024-02-07 10:19:29,416] [DEBUG] [axolotl.log:61] [PID:202100] [RANK:0] total_num_steps: 6340
[2024-02-07 10:21:38,297] [DEBUG] [axolotl.log:61] [PID:202100] [RANK:0] total_num_tokens: 10394324568
After about 10 minutes, it crashes with these messages:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202100 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202101 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202102 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202103 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202104 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202105 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 202106 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 202107) of binary: /opt/conda/envs/axo/bin/python
Traceback (most recent call last):
File "/opt/conda/envs/axo/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/jovyan/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/jovyan/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1014, in launch_command
multi_gpu_launcher(args)
File "/home/jovyan/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
distrib_run.run(args)
File "/home/jovyan/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/jovyan/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jovyan/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
axolotl.cli.train FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-07_10:38:21
host : notebook-deployment-25-5b4fb57786-p6qqr
rank : 7 (local_rank: 7)
exitcode : -9 (pid: 202107)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 202107
=======================================================
I thought it was due to something like a time limit, so I modified the `is_distributed` function in `distributed.py` as below, but it did not help.
# imports needed for this change (some of them may already be present in distributed.py)
from datetime import timedelta

import torch.distributed as dist
from accelerate import Accelerator, InitProcessGroupKwargs

def is_distributed():
    """
    Check if distributed training is initialized.
    """
    global accelerate  # pylint: disable=global-statement
    # raise the process-group timeout (default 30 minutes) to 15 hours
    ipg_handler = InitProcessGroupKwargs(
        timeout=timedelta(seconds=54000)
    )
    if not accelerate:
        accelerate = Accelerator(
            kwargs_handlers=[ipg_handler],
        )
    return dist.is_available() and dist.is_initialized()
I also tried `ddp_timeout: 99999`, but that did not work either.
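For reference, my (unverified) understanding is that `ddp_timeout` from the YAML is just the standard transformers `TrainingArguments.ddp_timeout`, i.e. the timeout handed to `torch.distributed.init_process_group`. A minimal sketch of what the value means, assuming axolotl forwards it unchanged (which I have not checked):

# Sketch only: what ddp_timeout: 99999 amounts to if it is forwarded to
# transformers' TrainingArguments unchanged (assumption, not verified).
from datetime import timedelta

from transformers import TrainingArguments

args = TrainingArguments(output_dir="./results/ver2.0", ddp_timeout=99999)
print(timedelta(seconds=args.ddp_timeout))  # 1 day, 3:46:39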
Steps to reproduce
I just used the ver2.0.yml below, ran preprocess, and then trained.
Config yaml
seed: 42
ddp_timeout: 99999
base_model: ./konure_hollow
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
is_llama_derived_model: true
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
## a whole bunch of datasets that have been preprocessed with axolotl preprocess
dataset_prepared_path: ./data/last_run_prepared/ver1.5
val_set_size: 0.01
output_dir: ./results/ver2.0
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true
resize_token_embeddings_to_32x: true
adapter:
lora_model_dir:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:
gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_hf
lr_scheduler: cosine
learning_rate: 0.00001
train_on_inputs: true
group_by_length: false
bf16: true
fp16: false
tf32: false
gradient_checkpointing: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
warmup_ratio: 0.15
evals_per_epoch: 20
eval_table_size:
saves_per_epoch: 25
save_total_limit: 2
debug:
deepspeed: deepspeed/zero2_ver1.0.json # multi-gpu only
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</eot>"
unk_token: "<unk>"
unfrozen_parameters:
- lm_head.*
- model.embed_tokens.*
- model.layers.4.*
- model.layers.9.*
- model.layers.14.*
- model.layers.19.*
- model.layers.24.*
- model.layers.29.*
- model.layers.34.*
- model.layers.39.*
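(Side note on the `unfrozen_parameters` list above: my understanding is that these entries are matched as regular expressions against the model's parameter names. A rough stand-in for that matching, purely for illustration and not axolotl's actual code:)

# Illustration only (not axolotl's implementation): how patterns like
# "model.layers.4.*" would select parameters to keep trainable.
import re

patterns = ["lm_head.*", "model.embed_tokens.*", "model.layers.4.*"]

def is_unfrozen(param_name: str) -> bool:
    return any(re.match(p, param_name) for p in patterns)

print(is_unfrozen("model.layers.4.self_attn.q_proj.weight"))  # True
print(is_unfrozen("model.layers.5.self_attn.q_proj.weight"))  # False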
Possible solution
I think it has something to do with a time limit, but I don't know how to fix it.
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
what's your hardware specifications? did you run out of ram/vram?
I have A100-40GB x 8 for VRAM, and RAM is as below:
total used free shared buff/cache available
Mem: 885Gi 81Gi 524Gi 9.4Gi 279Gi 787Gi
Swap: 0B 0B 0B
I monitored both throughout, and neither ran out.
Btw, is `CUDA_VISIBLE_DEVICES=""` necessary when running `python -m axolotl.cli.preprocess examples/llama-2/ver2.0.yml`? I don't think I set it when I preprocessed.
Shouldn't be any issue.
May I ask which model size you're running? It wasn't that clear from the yaml.
It is just the Llama-2-7b-hf model with extra columns in the embedding and lm_head.
I don't know why, but after preprocessing without CUDA and then training, it works well!
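(To be explicit about what "preprocessing without CUDA" means here: the GPUs are hidden before anything initializes CUDA, i.e. prefixing the command with `CUDA_VISIBLE_DEVICES=""`. A tiny sketch of the same idea from Python, for illustration only:)

# Illustration only: hiding GPUs so preprocessing runs CPU-only (same effect
# as prefixing the command with CUDA_VISIBLE_DEVICES="").
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # must be set before CUDA is initialized

import torch  # noqa: E402  (imported after the env var on purpose)

print(torch.cuda.is_available())  # False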
I get a similar error with
`accelerate launch -m axolotl.cli.train llama_lora.yml --deepspeed deepspeed_configs/zero1.json`
with the same config as in the examples. I just additionally added:
lora_modules_to_save:
- embed_tokens
- lm_head
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"
tokens: # these are delimiters
- "<|im_start|>"
- "<|im_end|>"
It works in two other cases:
- if I remove DeepSpeed
- if I change LoRA to QLoRA
The error occurs after an epoch is complete.
In your case, it's usually out of system RAM when it's gathering the weights from the various GPUs.
@winglian yeah, the exit code is -9, which probably relates to a system RAM OOM issue, but why would that happen even though I had 800GB of free RAM?
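For anyone hitting the same thing: one way to confirm it really was the kernel OOM killer (and not something else sending SIGKILL) is to look at the kernel log right after the crash. Rough sketch, assuming you have permission to read dmesg:

# Rough diagnostic sketch (not part of axolotl): grep the kernel ring buffer
# for OOM-killer activity after a crash with exitcode -9 / SIGKILL.
import subprocess

out = subprocess.run(["dmesg", "--ctime"], capture_output=True, text=True).stdout
hits = [line for line in out.splitlines()
        if "out of memory" in line.lower() or "oom" in line.lower()]
print("\n".join(hits) if hits else "no OOM-killer entries found")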
Sorry for the necro, but how do you solve this issue if renting compute?
I have the same problem.
The thing I noticed is that this only happens after I resume training from a checkpoint, never during the first run (although I can see how this could also happen during a normal run), and it happens while a checkpoint is being saved (when the model is transferred from the GPUs to system memory). The problem is that we run out of system RAM and the OS kills the process to save itself (otherwise it would crash). This is normal OS behavior, but the question is why it happens.
If I start training, I can train with no problem (although, again, this is my case; I can see how others might hit this even at this stage). In the image below, you can see the system RAM usage. The "spikes" are when a checkpoint is being saved (it's set to every 100 steps because of this issue, so we do not lose too much training when it happens), and the thing to notice is that it sometimes uses more RAM for several steps before dropping down again:
And I can train for as many steps as I like. But once I stop the training and restart it from the last checkpoint:
For some reason it uses more RAM from the start and during the whole training, and, on top of this, it still has these moments when it consumes extra RAM, up to the point where the memory usage climbs again and the system runs out of RAM.
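(For anyone who wants to reproduce these measurements: a minimal callback along these lines, assuming the stock transformers `TrainerCallback` API and `psutil`, is enough to log host RAM around each step and save. This is just a sketch, not the exact tooling behind the charts above.)

# Sketch only: log host RAM usage per step and around checkpoint saves.
import psutil
from transformers import TrainerCallback


class HostRamLogger(TrainerCallback):
    def _log(self, tag, state):
        used_gib = psutil.virtual_memory().used / 2**30
        print(f"[step {state.global_step}] {tag}: host RAM used = {used_gib:.1f} GiB")

    def on_step_end(self, args, state, control, **kwargs):
        self._log("step_end", state)

    def on_save(self, args, state, control, **kwargs):
        self._log("after_save", state)

# usage (hypothetical): trainer.add_callback(HostRamLogger())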
It seems like something in the system memory is not being cleaned up properly. The charts suggest that possibly:
- before resuming - it "sometimes" does not remove the model from the memory after saving a checkpoint
- after resuming - it possibly additionally does not remove the loaded model (it seems to create the main model from the modeling code even though it then loads the model from the checkpoint, and possibly does not free up that memory?).
I'm using multi-GPU training with DeepSpeed ZeRO3 (I'm not using any CPU offload) and training part of the model in this case.