Doc2EDAG
I trained the model the way you describe, but I ran into the problems below. Can you help me?
(base) tantra@server121:~/workspace/lc/Doc2EDAG/Doc2EDAG-master$ CUDA_VISIBLE_DEVICES=0,1,2,3 ./train_multi.sh 4 --task_name TASK1 --gradient_accumulation_steps 16
/home/tantra/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from os.environ('LOCAL_RANK')
instead.
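(For reference, the migration that the two warnings above ask for just means the worker script reads its local rank from the environment rather than from a --local_rank argument. A minimal sketch of that pattern, assuming the entry script currently accepts the --local_rank flag that torch.distributed.launch injects:)

    # Sketch only: read LOCAL_RANK from the environment, as the warning suggests,
    # while still accepting the legacy --local_rank flag from torch.distributed.launch.
    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)  # legacy launcher flag
    args, _ = parser.parse_known_args()

    # torch.distributed.run / torchrun exports LOCAL_RANK for every worker process.
    local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))
    print(f"using local rank {local_rank}")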
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : run_dee_task.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 4
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/tantra/anaconda3/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_0/3/error.json
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for 4 nodes.
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Rank 1: Completed store-based barrier for 4 nodes.
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Rank 2: Completed store-based barrier for 4 nodes.
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Rank 3: Completed store-based barrier for 4 nodes.
2021-08-13 10:45:52 - INFO - DEETask - Rank 0 World Size 4 Rank 0, Local Rank 0, Device Num 4, Device 0
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
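(The four [W ProcessGroupNCCL] warnings above mean dist.barrier() runs before each rank has been bound to a specific GPU. The fix the message itself suggests, as a sketch and assuming LOCAL_RANK is exported by the launcher as it is here:)

    # Sketch of the fix suggested by the ProcessGroupNCCL warning: bind each rank
    # to its own GPU before any collective, and pass device_ids to barrier().
    import os
    import torch
    import torch.distributed as dist

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)             # removes the "best-guess GPU" ambiguity

    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")   # env:// rendezvous set up by the launcher

    dist.barrier(device_ids=[local_rank])         # explicit device, as the warning asks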
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 ====================Check Setting Validity====================
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 Setting: {
"data_dir": "./Data",
"model_dir": "./Exps/TASK1/Model",
"output_dir": "./Exps/TASK1/Output",
"bert_model": "bert-base-chinese",
"train_file_name": "train.json",
"dev_file_name": "dev.json",
"test_file_name": "test.json",
"max_seq_len": 128,
"train_batch_size": 64,
"eval_batch_size": 2,
"learning_rate": 0.0001,
"num_train_epochs": 100,
"warmup_proportion": 0.1,
"no_cuda": false,
"local_rank": 0,
"seed": 99,
"gradient_accumulation_steps": 16,
"optimize_on_cpu": false,
"fp16": false,
"loss_scale": 128,
"cpt_file_name": "Doc2EDAG",
"summary_dir_name": "/tmp/Summary",
"max_sent_len": 128,
"max_sent_num": 64,
"use_bert": false,
"only_master_logging": true,
"resume_latest_cpt": true,
"model_type": "Doc2EDAG",
"rearrange_sent": false,
"use_crf_layer": true,
"min_teacher_prob": 0.1,
"schedule_epoch_start": 10,
"schedule_epoch_length": 10,
"loss_lambda": 0.05,
"loss_gamma": 1.0,
"add_greedy_dec": true,
"use_token_role": true,
"seq_reduce_type": "MaxPooling",
"hidden_size": 768,
"dropout": 0.1,
"ff_size": 1024,
"num_tf_layers": 4,
"use_path_mem": true,
"use_scheduled_sampling": true,
"use_doc_enc": true,
"neg_field_loss_scaling": 3.0
}
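(For context on the batch settings above, this is the arithmetic I assume the trainer uses; splitting train_batch_size across gradient_accumulation_steps per process is the usual pytorch_pretrained_bert convention, but I have not verified that this repo does exactly that:)

    # Assumed batch-size arithmetic for the settings logged above (not verified
    # against the Doc2EDAG code; the per-process split follows the common
    # pytorch_pretrained_bert convention).
    train_batch_size = 64
    gradient_accumulation_steps = 16
    world_size = 4  # one worker per GPU in this run

    per_forward_batch = train_batch_size // gradient_accumulation_steps
    per_step_per_rank = per_forward_batch * gradient_accumulation_steps
    per_step_global = per_step_per_rank * world_size  # if the data loader shards by rank

    print(per_forward_batch, per_step_per_rank, per_step_global)  # 4 64 256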
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 ====================Init Device====================
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 device cuda:0 n_gpu 1 distributed training True
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 ====================Reset Random Seed to 99====================
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 Initializing DEETask
2021-08-13 10:45:57 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt from cache at /home/tantra/.pytorch_pretrained_bert/8a0c070123c1f794c42a29c6904beb7c1b8715741e235bee04aca2c7636fc83f.9b42061518a39ca00b8b52059fd2bede8daa613f8a8671500e518a8c29de8c00
Traceback (most recent call last):
File "run_dee_task.py", line 61, in
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_1/3/error.json
2021-08-13 10:46:03 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
2021-08-13 10:46:03 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
2021-08-13 10:46:03 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
2021-08-13 10:46:03 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
2021-08-13 10:46:13 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 2, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
2021-08-13 10:46:13 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 3, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
2021-08-13 10:46:13 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
2021-08-13 10:46:13 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
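(Note that the real traceback from attempt_0 is cut off above. The elastic agent then restarts the workers as attempt_1, and because the new workers register against the same store_based_barrier_key:1, the barrier now reports worker_count=8 for world_size=4 and just waits out the 30-minute timeout. To recover the original error, one option is to dump the per-worker reply files the agent announced earlier; a sketch, using the /tmp path printed in this run, which changes on every launch, and the files only exist if a worker actually recorded its error:)

    # Sketch: print the torchelastic per-worker reply files to recover the full
    # traceback that the console output cut off. The directory is the one logged
    # in this run; it gets a new random suffix on every launch.
    import glob
    import json

    for path in sorted(glob.glob("/tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_*/*/error.json")):
        print("====", path)
        with open(path) as f:
            print(json.dumps(json.load(f), indent=2))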