Doc2EDAG

I tried to train the model the way you did, but ran into the problems below. Can you help me?

Open KDLc-design opened this issue 3 years ago • 0 comments

I just trained the model the same way you did, but ran into some problems. Can you help me?

(base) tantra@server121:~/workspace/lc/Doc2EDAG/Doc2EDAG-master$ CUDA_VISIBLE_DEVICES=0,1,2,3 ./train_multi.sh 4 --task_name TASK1 --gradient_accumulation_steps 16
/home/tantra/anaconda3/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases. Please read local_rank from os.environ('LOCAL_RANK') instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : run_dee_task.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}
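Not directly related to the crash, but if I read the --use_env warning correctly, the newer launcher exports LOCAL_RANK as an environment variable instead of passing a --local_rank argument, so the training script would pick it up roughly like this (just my sketch, not the repo's run_dee_task.py):

```python
import os

# Sketch only: torch.distributed.run sets LOCAL_RANK for every worker process,
# so the script can read it from the environment instead of parsing --local_rank.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
print(f"this worker's local rank is {local_rank}")
```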

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/tantra/anaconda3/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_0/3/error.json
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Rank 0: Completed store-based barrier for 4 nodes.
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Rank 1: Completed store-based barrier for 4 nodes.
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Rank 2: Completed store-based barrier for 4 nodes.
2021-08-13 10:45:52 - INFO - torch.distributed.distributed_c10d - Rank 3: Completed store-based barrier for 4 nodes.
2021-08-13 10:45:52 - INFO - DEETask - Rank 0 World Size 4 Rank 0, Local Rank 0, Device Num 4, Device 0
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device.
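The ProcessGroupNCCL lines look like warnings rather than errors; as far as I can tell they are asking each rank to pin its GPU and to pass device_ids to barrier() so NCCL does not have to guess. Something like the sketch below is what I think they mean (my guess under a torch.distributed.run launch, not the repo's actual code):

```python
import os
import torch
import torch.distributed as dist

# Sketch only: pin this process to its GPU before any collective call, then
# tell barrier() explicitly which device to use so NCCL skips the best-guess.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")  # MASTER_ADDR/PORT come from the launcher
dist.barrier(device_ids=[local_rank])
dist.destroy_process_group()
```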
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 ====================Check Setting Validity====================
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 Setting: {
  "data_dir": "./Data",
  "model_dir": "./Exps/TASK1/Model",
  "output_dir": "./Exps/TASK1/Output",
  "bert_model": "bert-base-chinese",
  "train_file_name": "train.json",
  "dev_file_name": "dev.json",
  "test_file_name": "test.json",
  "max_seq_len": 128,
  "train_batch_size": 64,
  "eval_batch_size": 2,
  "learning_rate": 0.0001,
  "num_train_epochs": 100,
  "warmup_proportion": 0.1,
  "no_cuda": false,
  "local_rank": 0,
  "seed": 99,
  "gradient_accumulation_steps": 16,
  "optimize_on_cpu": false,
  "fp16": false,
  "loss_scale": 128,
  "cpt_file_name": "Doc2EDAG",
  "summary_dir_name": "/tmp/Summary",
  "max_sent_len": 128,
  "max_sent_num": 64,
  "use_bert": false,
  "only_master_logging": true,
  "resume_latest_cpt": true,
  "model_type": "Doc2EDAG",
  "rearrange_sent": false,
  "use_crf_layer": true,
  "min_teacher_prob": 0.1,
  "schedule_epoch_start": 10,
  "schedule_epoch_length": 10,
  "loss_lambda": 0.05,
  "loss_gamma": 1.0,
  "add_greedy_dec": true,
  "use_token_role": true,
  "seq_reduce_type": "MaxPooling",
  "hidden_size": 768,
  "dropout": 0.1,
  "ff_size": 1024,
  "num_tf_layers": 4,
  "use_path_mem": true,
  "use_scheduled_sampling": true,
  "use_doc_enc": true,
  "neg_field_loss_scaling": 3.0
}
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 ====================Init Device====================
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 device cuda:0 n_gpu 1 distributed training True
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 ====================Reset Random Seed to 99====================
2021-08-13 10:45:56 - INFO - DEETask - Rank 0 Initializing DEETask
2021-08-13 10:45:57 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt from cache at /home/tantra/.pytorch_pretrained_bert/8a0c070123c1f794c42a29c6904beb7c1b8715741e235bee04aca2c7636fc83f.9b42061518a39ca00b8b52059fd2bede8daa613f8a8671500e518a8c29de8c00
Traceback (most recent call last):
  File "run_dee_task.py", line 61, in <module>
    dee_task = DEETask(dee_setting, load_train=not in_argv.skip_train)
  File "/home/tantra/workspace/lc/Doc2EDAG/Doc2EDAG-master/dee/dee_task.py", line 83, in __init__
    self.tokenizer = BERTChineseCharacterTokenizer.from_pretrained(self.setting.bert_model)
  File "/home/tantra/anaconda3/lib/python3.8/site-packages/pytorch_pretrained_bert/tokenization.py", line 197, in from_pretrained
    tokenizer = cls(resolved_vocab_file, *inputs, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'max_len'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 138242) of binary: /home/tantra/anaconda3/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_kh8gjpdz/none_12bm0xei/attempt_1/3/error.json
2021-08-13 10:46:03 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 2
2021-08-13 10:46:03 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 3
2021-08-13 10:46:03 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
2021-08-13 10:46:03 - INFO - torch.distributed.distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
2021-08-13 10:46:13 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 2, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
2021-08-13 10:46:13 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 3, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
2021-08-13 10:46:13 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
2021-08-13 10:46:13 - INFO - torch.distributed.distributed_c10d - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:30:00)
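The real failure seems to be the TypeError at the end of the first attempt: from_pretrained() forwards a max_len keyword that the tokenizer's __init__ does not accept, which I suspect is a version mismatch between my installed pytorch_pretrained_bert and the one the repo expects. As a workaround I am considering something like the sketch below (illustrative class name, not the repo's actual code), though pinning pytorch_pretrained_bert to the version you used is probably the proper fix:

```python
from pytorch_pretrained_bert.tokenization import BertTokenizer

class PatchedChineseCharacterTokenizer(BertTokenizer):
    """Sketch of a tolerant subclass; assumes the parent rejects `max_len`."""

    def __init__(self, vocab_file, do_lower_case=True, **kwargs):
        # from_pretrained() may pass max_len on some package versions;
        # drop unknown keywords instead of crashing with a TypeError.
        kwargs.pop("max_len", None)
        super().__init__(vocab_file, do_lower_case=do_lower_case)

    def char_tokenize(self, text):
        # Simple character-level tokenization, which is what a Chinese
        # character tokenizer is supposed to do here.
        return list(text.strip())
```

Which torch and pytorch_pretrained_bert versions should I be using to reproduce your results?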

KDLc-design · Aug 13 '21 02:08