Llama-Chinese

RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

Open hellangleZ opened this issue 1 year ago • 2 comments

I've captured the entire training run below; I don't know why it fails with this error.

(/aml2/ds2) root@A100:/aml2/Llama2-Chinese-main/train/pretrain# export NCCL_IB_DISABLE=1;export NCCL_SOCKET_IFNAME=eth0; NCCL_DEBUG=INFO;TORCH_CPP_LOG_LEVEL=DEBUG; export NCCL_DEBUG_SUBSYS=ALL;export TORCH_DISTRIBUTED_DEBUG=INFO; export export NCCL_P2P_DISABLE=1; ./pretrain.sh
[2023-11-13 02:45:14,800] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-13 02:45:15,831] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-11-13 02:45:15,831] [INFO] [runner.py:570:main] cmd = /aml2/ds2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain_clm.py --model_name_or_path /aml2/llama2 --train_files ../../data/train_sft.csv --validation_files ../../data/dev_sft.csv --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --do_train --output_dir /data1/pretrain --evaluation_strategy steps --use_fast_tokenizer false --max_eval_samples 500 --learning_rate 3e-5 --gradient_accumulation_steps 4 --num_train_epochs 1 --warmup_steps 10000 --logging_dir /data1/pretrain/logs --logging_strategy steps --logging_steps 2 --save_strategy steps --preprocessing_num_workers 1 --save_steps 500 --eval_steps 500 --save_total_limit 2000 --seed 42 --disable_tqdm false --ddp_find_unused_parameters false --block_size 4096 --overwrite_output_dir --report_to tensorboard --run_name /data1/pretrain --bf16 --bf16_full_eval --deepspeed ./ds_config_zero3.json --ignore_data_skip true --ddp_timeout 18000000
[2023-11-13 02:45:17,543] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-13 02:45:18,306] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
[2023-11-13 02:45:18,306] [INFO] [launch.py:138:main] 0 NCCL_P2P_LEVEL=NVL
[2023-11-13 02:45:18,306] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=1
[2023-11-13 02:45:18,306] [INFO] [launch.py:138:main] 0 NCCL_DEBUG_SUBSYS=ALL
[2023-11-13 02:45:18,306] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eth0
[2023-11-13 02:45:18,306] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-11-13 02:45:18,306] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-11-13 02:45:18,306] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-11-13 02:45:18,307] [INFO] [launch.py:163:main] dist_world_size=4
[2023-11-13 02:45:18,307] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-11-13 02:45:20,413] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-13 02:45:20,455] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-13 02:45:20,489] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-13 02:45:20,502] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-13 02:45:23,355] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-13 02:45:23,429] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-13 02:45:23,429] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-11-13 02:45:23,441] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-13 02:45:23,531] [INFO] [comm.py:637:init_distributed] cdb=None
11/13/2023 02:45:24 - WARNING -
main - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 11/13/2023 02:45:24 - INFO - main - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=True, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=18000000, debug=[], deepspeed=./ds_config_zero3.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=500, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=True, include_inputs_for_metrics=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=3e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=/data1/pretrain/logs, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=2, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=1.0, optim=adamw_torch, optim_args=None, output_dir=/data1/pretrain, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=/data1/pretrain, save_on_each_node=False, save_safetensors=True, save_steps=500, save_strategy=steps, save_total_limit=2000, seed=42, skip_memory_metrics=True, split_batches=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=10000, weight_decay=0.0, ) ['../../data/train_sft.csv'] 训练文件总个数 1 /aml2/ds2/lib/python3.10/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( 11/13/2023 02:45:24 - WARNING - main - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: False ['../../data/train_sft.csv'] 训练文件总个数 1 /aml2/ds2/lib/python3.10/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. 
warnings.warn( 11/13/2023 02:45:24 - WARNING - main - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: False ['../../data/train_sft.csv'] 训练文件总个数 1 /aml2/ds2/lib/python3.10/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( 11/13/2023 02:45:25 - WARNING - main - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: False ['../../data/train_sft.csv'] 训练文件总个数 1 /aml2/ds2/lib/python3.10/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( Using custom data configuration default-e93cda2aba79e721 11/13/2023 02:45:25 - INFO - datasets.builder - Using custom data configuration default-e93cda2aba79e721 Loading Dataset Infos from /aml2/ds2/lib/python3.10/site-packages/datasets/packaged_modules/csv 11/13/2023 02:45:25 - INFO - datasets.info - Loading Dataset Infos from /aml2/ds2/lib/python3.10/site-packages/datasets/packaged_modules/csv Overwrite dataset info from restored data version if exists. 11/13/2023 02:45:25 - INFO - datasets.builder - Overwrite dataset info from restored data version if exists. Loading Dataset info from /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d 11/13/2023 02:45:25 - INFO - datasets.info - Loading Dataset info from /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d Found cached dataset csv (/data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d) 11/13/2023 02:45:25 - INFO - datasets.builder - Found cached dataset csv (/data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d) Loading Dataset info from /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d 11/13/2023 02:45:25 - INFO - datasets.info - Loading Dataset info from /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d [INFO|configuration_utils.py:715] 2023-11-13 02:45:25,842 >> loading configuration file /aml2/llama2/config.json [INFO|configuration_utils.py:777] 2023-11-13 02:45:25,843 >> Model config LlamaConfig { "_name_or_path": "/aml2/llama2", "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "bos_token_id": 1, "eos_token_id": 2, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 4096, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 32, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 10000.0, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.35.0", "use_cache": true, "vocab_size": 32000 }

0 start load tokenizer [INFO|tokenization_utils_base.py:2020] 2023-11-13 02:45:25,844 >> loading file tokenizer.model [INFO|tokenization_utils_base.py:2020] 2023-11-13 02:45:25,844 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2020] 2023-11-13 02:45:25,844 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2020] 2023-11-13 02:45:25,844 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2020] 2023-11-13 02:45:25,844 >> loading file tokenizer.json 3 start load tokenizer 0 end load tokenizer 0 start load model [INFO|modeling_utils.py:3118] 2023-11-13 02:45:25,919 >> loading weights file /aml2/llama2/pytorch_model.bin.index.json [INFO|modeling_utils.py:3227] 2023-11-13 02:45:25,919 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model [INFO|configuration_utils.py:791] 2023-11-13 02:45:25,922 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2 }

3 end load tokenizer 3 start load model 2 start load tokenizer 2 end load tokenizer 2 start load model 1 start load tokenizer 1 end load tokenizer 1 start load model [2023-11-13 02:45:27,707] [INFO] [partition_parameters.py:347:exit] finished initializing model - num_params = 291, num_elems = 6.74B Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.05s/it] 2 end load model Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.06s/it] 3 end load model Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.07s/it] 1 end load model ['text'] ['text'] ['text'] Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00, 3.75s/it] [INFO|modeling_utils.py:3950] 2023-11-13 02:45:35,243 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:3958] 2023-11-13 02:45:35,243 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /aml2/llama2. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training. [INFO|configuration_utils.py:749] 2023-11-13 02:45:35,246 >> loading configuration file /aml2/llama2/generation_config.json [INFO|configuration_utils.py:791] 2023-11-13 02:45:35,247 >> Generate config GenerationConfig { "bos_token_id": 1, "do_sample": true, "eos_token_id": 2, "max_length": 4096, "pad_token_id": 0, "temperature": 0.6, "top_p": 0.9 }

0 end load model [INFO|modeling_utils.py:1648] 2023-11-13 02:45:35,331 >> You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc ['text'] Running tokenizer on dataset: 0%| | 0/9861 [00:00<?, ? examples/s]Caching processed dataset at /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-8d11445ba8b581be.arrow 11/13/2023 02:45:35 - INFO - datasets.arrow_dataset - Caching processed dataset at /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-8d11445ba8b581be.arrow Running tokenizer on dataset: 100%|█████████████████████████████████████████████████████████████████| 9861/9861 [00:06<00:00, 1626.91 examples/s] Running tokenizer on dataset: 0%| | 0/200 [00:00<?, ? examples/s]Caching processed dataset at /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-2192c637811980c7.arrow 11/13/2023 02:45:41 - INFO - datasets.arrow_dataset - Caching processed dataset at /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-2192c637811980c7.arrow Running tokenizer on dataset: 100%|███████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 1584.68 examples/s] Running tokenizer on dataset: 30%|███████████████████▊ | 3000/9861 [00:01<00:04, 1710.12 examples/s]11/13/2023 02:45:43 - INFO - main - group texts input examples length9861 after_group size794 Caching processed dataset at /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-9364bce2ad8aff56.arrow 11/13/2023 02:45:43 - INFO - datasets.arrow_dataset - Caching processed dataset at /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-9364bce2ad8aff56.arrow Grouping texts in chunks of 4096: 100%|█████████████████████████████████████████████████████████████| 9861/9861 [00:02<00:00, 3822.62 examples/s] Grouping texts in chunks of 4096: 0%| | 0/200 [00:00<?, ? 
examples/s]11/13/2023 02:45:44 - INFO - main - group texts input examples length200 after_group size16 Caching processed dataset at /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-5bfab54fa6d473dd.arrow 11/13/2023 02:45:44 - INFO - datasets.arrow_dataset - Caching processed dataset at /data1/pretrain/dataset_cache/csv/default-e93cda2aba79e721/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d/cache-5bfab54fa6d473dd.arrow Grouping texts in chunks of 4096: 100%|███████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 3697.18 examples/s] Running tokenizer on dataset: 100%|█████████████████████████████████████████████████████████████████| 9861/9861 [00:05<00:00, 1713.59 examples/s] Running tokenizer on dataset: 100%|█████████████████████████████████████████████████████████████████| 9861/9861 [00:05<00:00, 1698.38 examples/s] Running tokenizer on dataset: 100%|█████████████████████████████████████████████████████████████████| 9861/9861 [00:05<00:00, 1687.24 examples/s] Running tokenizer on dataset: 100%|███████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 1588.47 examples/s] Running tokenizer on dataset: 100%|███████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 1598.50 examples/s] Running tokenizer on dataset: 100%|███████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 1590.89 examples/s] 0 start select train_dataset 0 end select train_dataset 0 start select eval_dataset Grouping texts in chunks of 4096: 0%| | 0/9861 [00:00<?, ? examples/s]0 end select eval_dataset 0 start load metric 0 end load metric 0 Initialize our Trainer 11/13/2023 02:45:47 - WARNING - accelerate.utils.other - Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. [INFO|trainer.py:593] 2023-11-13 02:45:47,565 >> Using auto half precision backend 0 start train [2023-11-13 02:45:47,700] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.12.2, git-hash=unknown, git-branch=unknown [2023-11-13 02:45:47,739] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.0937197208404541 seconds /aml2/ds2/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) 
self._dummy_overflow_buf = get_accelerator().IntTensor([0]) [2023-11-13 02:45:48,102] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2023-11-13 02:45:48,103] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2023-11-13 02:45:48,112] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2023-11-13 02:45:48,112] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'> [2023-11-13 02:45:48,112] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False [2023-11-13 02:45:48,112] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer [2023-11-13 02:45:48,212] [INFO] [utils.py:802:see_memory_usage] Stage 3 initialize beginning [2023-11-13 02:45:48,213] [INFO] [utils.py:803:see_memory_usage] MA 3.57 GB Max_MA 4.26 GB CA 7.36 GB Max_CA 16 GB [2023-11-13 02:45:48,213] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.54 GB, percent = 2.5% [2023-11-13 02:45:48,214] [INFO] [stage3.py:126:init] Reduce bucket size 16777216 [2023-11-13 02:45:48,214] [INFO] [stage3.py:127:init] Prefetch bucket size 15099494 [2023-11-13 02:45:48,303] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2023-11-13 02:45:48,303] [INFO] [utils.py:803:see_memory_usage] MA 3.57 GB Max_MA 3.57 GB CA 7.36 GB Max_CA 7 GB [2023-11-13 02:45:48,303] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.57 GB, percent = 2.5% Parameter Offload: Total persistent parameters: 266240 in 65 params [2023-11-13 02:45:48,410] [INFO] [utils.py:802:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2023-11-13 02:45:48,410] [INFO] [utils.py:803:see_memory_usage] MA 3.21 GB Max_MA 3.63 GB CA 7.36 GB Max_CA 7 GB [2023-11-13 02:45:48,411] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.69 GB, percent = 2.5% [2023-11-13 02:45:48,500] [INFO] [utils.py:802:see_memory_usage] Before creating fp16 partitions [2023-11-13 02:45:48,501] [INFO] [utils.py:803:see_memory_usage] MA 3.21 GB Max_MA 3.21 GB CA 7.36 GB Max_CA 7 GB [2023-11-13 02:45:48,501] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.69 GB, percent = 2.5% Grouping texts in chunks of 4096: 100%|█████████████████████████████████████████████████████████████| 9861/9861 [00:02<00:00, 3983.14 examples/s] Grouping texts in chunks of 4096: 100%|█████████████████████████████████████████████████████████████| 9861/9861 [00:02<00:00, 3950.63 examples/s] Grouping texts in chunks of 4096: 100%|█████████████████████████████████████████████████████████████| 9861/9861 [00:02<00:00, 3942.50 examples/s] Grouping texts in chunks of 4096: 100%|███████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 3831.04 examples/s] 2 start select train_dataset 2 end select train_dataset 2 start select eval_dataset 2 end select eval_dataset 2 start load metric 2 end load metric 2 Initialize our Trainer Grouping texts in chunks of 4096: 100%|███████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 3846.77 examples/s] 3 start select train_dataset 3 end select train_dataset 3 start select eval_dataset 3 end select eval_dataset 3 start load metric Grouping texts in chunks of 4096: 
100%|███████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 3871.23 examples/s] 1 start select train_dataset 1 end select train_dataset 1 start select eval_dataset 1 end select eval_dataset 1 start load metric 3 end load metric 3 Initialize our Trainer 1 end load metric 1 Initialize our Trainer 2 start train 3 start train 1 start train Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.09486579895019531 seconds /aml2/ds2/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) self._dummy_overflow_buf = get_accelerator().IntTensor([0]) Loading extension module fused_adam... Time to load fused_adam op: 0.10223221778869629 seconds /aml2/ds2/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) self._dummy_overflow_buf = get_accelerator().IntTensor([0]) Loading extension module fused_adam... Time to load fused_adam op: 0.10130715370178223 seconds /aml2/ds2/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py:96: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.) 
self._dummy_overflow_buf = get_accelerator().IntTensor([0]) [2023-11-13 02:45:51,775] [INFO] [utils.py:802:see_memory_usage] After creating fp16 partitions: 2 [2023-11-13 02:45:51,776] [INFO] [utils.py:803:see_memory_usage] MA 3.2 GB Max_MA 3.21 GB CA 3.2 GB Max_CA 7 GB [2023-11-13 02:45:51,776] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.16 GB, percent = 2.4% [2023-11-13 02:45:51,866] [INFO] [utils.py:802:see_memory_usage] Before creating fp32 partitions [2023-11-13 02:45:51,866] [INFO] [utils.py:803:see_memory_usage] MA 3.2 GB Max_MA 3.2 GB CA 3.2 GB Max_CA 3 GB [2023-11-13 02:45:51,867] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.16 GB, percent = 2.4% [2023-11-13 02:45:51,964] [INFO] [utils.py:802:see_memory_usage] After creating fp32 partitions [2023-11-13 02:45:51,965] [INFO] [utils.py:803:see_memory_usage] MA 9.48 GB Max_MA 10.75 GB CA 11.35 GB Max_CA 11 GB [2023-11-13 02:45:51,965] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.16 GB, percent = 2.4% [2023-11-13 02:45:52,061] [INFO] [utils.py:802:see_memory_usage] Before initializing optimizer states [2023-11-13 02:45:52,062] [INFO] [utils.py:803:see_memory_usage] MA 9.48 GB Max_MA 9.48 GB CA 11.35 GB Max_CA 11 GB [2023-11-13 02:45:52,062] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.16 GB, percent = 2.4% [2023-11-13 02:45:52,172] [INFO] [utils.py:802:see_memory_usage] After initializing optimizer states [2023-11-13 02:45:52,173] [INFO] [utils.py:803:see_memory_usage] MA 22.03 GB Max_MA 25.76 GB CA 27.65 GB Max_CA 28 GB [2023-11-13 02:45:52,173] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.16 GB, percent = 2.4% [2023-11-13 02:45:52,174] [INFO] [stage3.py:460:_setup_for_real_optimizer] optimizer state initialized [2023-11-13 02:45:52,376] [WARNING] [lr_schedules.py:751:init] total_num_steps 49 is less than warmup_num_steps 10000 [2023-11-13 02:45:52,376] [WARNING] [lr_schedules.py:751:init] total_num_steps 49 is less than warmup_num_steps 10000 [2023-11-13 02:45:52,376] [WARNING] [lr_schedules.py:751:init] total_num_steps 49 is less than warmup_num_steps 10000 [2023-11-13 02:45:52,480] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer [2023-11-13 02:45:52,481] [INFO] [utils.py:803:see_memory_usage] MA 25.2 GB Max_MA 25.69 GB CA 44.01 GB Max_CA 44 GB [2023-11-13 02:45:52,481] [INFO] [utils.py:810:see_memory_usage] CPU Virtual Memory: used = 21.16 GB, percent = 2.4% [2023-11-13 02:45:52,481] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw [2023-11-13 02:45:52,481] [WARNING] [lr_schedules.py:751:init] total_num_steps 49 is less than warmup_num_steps 10000 [2023-11-13 02:45:52,481] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR [2023-11-13 02:45:52,481] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7fade9f06980> [2023-11-13 02:45:52,481] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.9, 0.999]] [2023-11-13 02:45:52,482] [INFO] [config.py:972:print] DeepSpeedEngine configuration: [2023-11-13 02:45:52,482] [INFO] [config.py:976:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-11-13 02:45:52,482] [INFO] [config.py:976:print] 
aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-11-13 02:45:52,482] [INFO] [config.py:976:print] amp_enabled .................. False [2023-11-13 02:45:52,482] [INFO] [config.py:976:print] amp_params ................... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] bfloat16_enabled ............. True [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] checkpoint_parallel_write_pipeline False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] checkpoint_tag_validation_enabled True [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] checkpoint_tag_validation_fail False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7faed0ec6b90> [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] communication_data_type ...... None [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] curriculum_enabled_legacy .... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] curriculum_params_legacy ..... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] data_efficiency_enabled ...... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] dataloader_drop_last ......... 
False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] disable_allgather ............ False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] dump_state ................... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] dynamic_loss_scale_args ...... None [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] eigenvalue_enabled ........... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] eigenvalue_gas_boundary_resolution 1 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] eigenvalue_layer_num ......... 0 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] eigenvalue_max_iter .......... 100 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] eigenvalue_stability ......... 1e-06 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] eigenvalue_tol ............... 0.01 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] eigenvalue_verbose ........... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] elasticity_enabled ........... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] fp16_auto_cast ............... None [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] fp16_enabled ................. False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] fp16_master_weights_and_gradients False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] global_rank .................. 0 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] grad_accum_dtype ............. None [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] gradient_accumulation_steps .. 4 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] gradient_clipping ............ 1.0 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] gradient_predivide_factor .... 1.0 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] initial_dynamic_scale ........ 1 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] load_universal_checkpoint .... False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] loss_scale ................... 1.0 [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] memory_breakdown ............. False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] mics_hierarchial_params_gather False [2023-11-13 02:45:52,483] [INFO] [config.py:976:print] mics_shard_size .............. -1 [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] optimizer_legacy_fusion ...... 
False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] optimizer_name ............... adamw [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] optimizer_params ............. {'lr': 3e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0} [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] pld_enabled .................. False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] pld_params ................... False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] prescale_gradients ........... False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] scheduler_name ............... WarmupDecayLR [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] scheduler_params ............. {'last_batch_iteration': -1, 'total_num_steps': 49, 'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 10000} [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] seq_parallel_communication_data_type torch.float32 [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] sparse_attention ............. None [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] sparse_gradients_enabled ..... False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] steps_per_print .............. inf [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] train_batch_size ............. 16 [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] train_micro_batch_size_per_gpu 1 [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] use_node_local_storage ....... False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] wall_clock_breakdown ......... False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] weight_quantization_config ... None [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] world_size ................... 4 [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] zero_allow_untested_optimizer False [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] zero_enabled ................. True [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] zero_force_ds_cpu_optimizer .. True [2023-11-13 02:45:52,484] [INFO] [config.py:976:print] zero_optimization_stage ...... 
3
[2023-11-13 02:45:52,484] [INFO] [config.py:962:print_user_config] json = { "fp16": { "enabled": false, "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1, "fp16_opt_level": "O2" }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 3e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.0 } }, "scheduler": { "type": "WarmupDecayLR", "params": { "last_batch_iteration": -1, "total_num_steps": 49, "warmup_min_lr": 0, "warmup_max_lr": 3e-05, "warmup_num_steps": 1.000000e+04 } }, "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1.000000e+09, "reduce_bucket_size": 1.677722e+07, "stage3_prefetch_bucket_size": 1.509949e+07, "stage3_param_persistence_threshold": 4.096000e+04, "stage3_max_live_parameters": 1.000000e+09, "stage3_max_reuse_distance": 1.000000e+09, "gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": 4, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 16, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false }
[INFO|trainer.py:1723] 2023-11-13 02:45:52,484 >> ***** Running training *****
[INFO|trainer.py:1724] 2023-11-13 02:45:52,484 >> Num examples = 794
[INFO|trainer.py:1725] 2023-11-13 02:45:52,484 >> Num Epochs = 1
[INFO|trainer.py:1726] 2023-11-13 02:45:52,485 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1729] 2023-11-13 02:45:52,485 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1730] 2023-11-13 02:45:52,485 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1731] 2023-11-13 02:45:52,485 >> Total optimization steps = 49
[INFO|trainer.py:1732] 2023-11-13 02:45:52,486 >> Number of trainable parameters = 6,738,415,616
0%| | 0/49 [00:00<?, ?it/s]
[The same traceback was raised on all four ranks, interleaved in the original output; it is shown once below.]
Traceback (most recent call last):
  File "/aml2/Llama2-Chinese-main/train/pretrain/pretrain_clm.py", line 612, in <module>
    main()
  File "/aml2/Llama2-Chinese-main/train/pretrain/pretrain_clm.py", line 573, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/aml2/ds2/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/aml2/ds2/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/aml2/ds2/lib/python3.10/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/aml2/ds2/lib/python3.10/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1814, in forward
    loss = self.module(*inputs, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1034, in forward
    outputs = self.model(
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
    layer_outputs = decoder_layer(
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/aml2/ds2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 406, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
  File "/aml2/ds2/lib/python3.10/site-packages/torch/nn/functional.py", line 1858, in softmax
    ret = input.softmax(dim, dtype=dtype)
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.
0%| | 0/49 [00:01<?, ?it/s]
[2023-11-13 02:45:56,352] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 14347
[2023-11-13 02:45:56,364] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 14348
[2023-11-13 02:45:56,364] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 14349
[2023-11-13 02:45:56,371] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 14350
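
The crash is raised from a plain softmax call inside LlamaAttention (modeling_llama.py line 406 in the trace above), so the op can be tried on its own, outside Trainer and DeepSpeed. Below is a minimal sketch of such a check, with arbitrary small shapes but the same bf16-scores, float32-softmax, cast-back pattern; it assumes the same /aml2/ds2 environment and is not part of this repo:

    # Standalone check of the op that crashed: run a GPU softmax on every
    # visible device. If this also dies with the driver_api.cpp assert, the
    # problem is in the CUDA driver/runtime install, not in the training code.
    import torch
    import torch.nn.functional as F

    for i in range(torch.cuda.device_count()):
        dev = f"cuda:{i}"
        # Much smaller than the real (1, 32, 4096, 4096) attention matrix, but
        # the same dtype pattern used in modeling_llama.py.
        attn_weights = torch.randn(1, 32, 256, 256, device=dev, dtype=torch.bfloat16)
        out = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(attn_weights.dtype)
        torch.cuda.synchronize(i)
        print(dev, "softmax ok,", out.dtype)

If even this short script fails with the same driver_api.cpp assert, the CUDA driver or toolkit install is the likely culprit rather than anything in pretrain_clm.py.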

hellangleZ · Nov 13 '23 03:11

See, nobody has answered you.

yinhongtao16 · Dec 10 '23 06:12

> See, nobody has answered you.

It was a CUDA issue. I changed to CUDA 11.3 and that fixed the problem :)
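
For anyone else who lands here, a quick way to see which CUDA runtime the installed PyTorch wheel targets and which GPUs it can actually reach (plain torch calls only, nothing specific to this repo); compare the output with what nvidia-smi reports for the driver:

    # Print the CUDA runtime this PyTorch build was compiled against and the
    # devices it can see. If these disagree with what nvidia-smi shows, the
    # driver/runtime pairing is suspect.
    import torch

    print("torch:", torch.__version__)
    print("built with CUDA:", torch.version.cuda)
    print("cuda available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))

If the CUDA version printed here does not match the driver/toolkit actually installed on the box, aligning them, as in the fix above, is the first thing to try.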

hellangleZ · Dec 10 '23 06:12