[BUG] Error when running DeepSpeed in Dolly training: exits with return code = -9
Describe the bug
I started a training run with a DeepSpeed config on the EleutherAI/pythia-2.8b model and it failed with return code = -9. After splitting and preprocessing the dataset, I get [ERROR] [launch.py:434:sigkill_handler].
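From what I have read, return code -9 means the launcher's subprocess was killed with SIGKILL, which on Linux most often comes from the kernel OOM killer when the host runs out of memory. A minimal check on the driver node would be something like the sketch below (assuming a Linux host where the kernel log is readable; the exact wording of OOM entries varies by kernel version):

# Sketch: scan the kernel log for OOM-killer activity
# (assumes a Linux host where `dmesg` is readable by this user).
import subprocess

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "oom-kill" in line.lower() or "out of memory" in line.lower():
        print(line)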
Log output
I used train_dolly.py from the Databricks dolly repo to launch the application.
Expected behavior
A completed training run. I do not understand the error or what the [INFO] messages in the output are telling me.
DeepSpeed config
{
  "bf16": {
    "enabled": "auto"
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
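One thing I noticed while re-reading the config (my own observation, not verified): "zero_optimization.stage" is 2, so the stage3_* keys above should simply be ignored by DeepSpeed. If the -9 turns out to be GPU memory pressure, one variant to try is actually moving to ZeRO stage 3 and offloading parameters as well as the optimizer to CPU, trading speed for memory. A sketch of that section (standard DeepSpeed keys, but untested on this setup):

"zero_optimization": {
  "stage": 3,
  "offload_optimizer": { "device": "cpu", "pin_memory": true },
  "offload_param": { "device": "cpu", "pin_memory": true },
  "overlap_comm": true,
  "contiguous_gradients": true,
  "reduce_bucket_size": "auto",
  "stage3_prefetch_bucket_size": "auto",
  "stage3_param_persistence_threshold": "auto",
  "stage3_gather_16bit_weights_on_model_save": true
}

Note that CPU offload shifts pressure to host RAM, so if host memory (not GPU memory) is the constrained resource, this could make an OOM kill more likely rather than less.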
Screenshots

System info (please complete the following information):
- 1 GPU with 56 GB memory
- Databricks Runtime 11.3
- Python 3.9
Additional context
Please let me know what the issue is and what the [INFO] messages in the screenshots indicate.
[2023-05-05 13:33:53,831] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 754
[2023-05-05 13:33:53,866] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'run_clm.py', '--local_rank=0', '--deepspeed', 'ds_config_stage3.json', '--model_name_or_path', 'decapoda-research/llama-7b-hf', '--train_file', 'train.csv', '--validation_file', 'validation.csv', '--do_train', '--do_eval', '--fp16', '--overwrite_cache', '--evaluation_strategy=steps', '--output_dir', 'finetuned', '--num_train_epochs', '1', '--eval_steps', '20', '--gradient_accumulation_steps', '128', '--per_device_train_batch_size', '1', '--use_fast_tokenizer', 'False', '--learning_rate', '5e-06', '--warmup_steps', '10', '--save_total_limit', '1', '--save_steps', '20', '--save_strategy', 'steps', '--load_best_model_at_end=True', '--block_size=176', '--report_to=wandb'] exits with return code = -9
root@e60d0a05d3ea:/workspace/finetuning_repo#

I am facing similar issues when trying to start training on LLMs; any help in this regard would be appreciated.
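Since offload_optimizer pushes optimizer state to host RAM, I suspect host memory rather than GPU memory in my case. A quick way to check is to log host memory in a separate process while training runs and see whether RAM exhausts right before the SIGKILL. A sketch (assumes `psutil` is installed, e.g. pip install psutil; not part of the dolly repo):

# Sketch: periodically log host RAM usage alongside the training run.
import time
import psutil

while True:
    mem = psutil.virtual_memory()
    used_gib = (mem.total - mem.available) / 2**30
    total_gib = mem.total / 2**30
    print(f"host RAM used: {mem.percent:.1f}% ({used_gib:.1f} of {total_gib:.1f} GiB)")
    time.sleep(5)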