RocketQA
How to run single-machine multi-GPU training on RTX-series GPUs
NCCL is installed.

I also tried the approaches from https://github.com/PaddlePaddle/Paddle/issues/28757, https://github.com/PaddlePaddle/Paddle/issues/29172, and https://github.com/PaddlePaddle/Paddle/issues/36608, but none of them seems to work.
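For reference, the job is launched on two cards roughly as sketched below. This is a hedged reconstruction based on the workerlog.N files and the two trainer endpoints (127.0.0.1:54640 and 127.0.0.1:43155) in the logs; the exact script path and flags are assumptions.

```bash
# Hedged reconstruction of the two-card launch; not the verbatim command.
# Older Paddle releases use --selected_gpus instead of --gpus.
python -m paddle.distributed.launch --gpus "0,1" \
    train_ce.py --use_cuda true ...
```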
workerlog.1 log:
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1651162139 (unix time) try "date -d @1651162139" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0xbd88) received by PID 48520 (TID 0x7f30cdeaf700) from PID 48520 ***]
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0428 16:13:33.103967 2391 init.cc:85] Before Parse: argc is 2, Init commandline: dummy --tryfromenv=check_nan_inf,fast_check_nan_inf,benchmark,eager_delete_scope,fraction_of_cpu_memory_to_use,initial_cpu_memory_in_mb,init_allocated_mem,paddle_num_threads,dist_threadpool_size,eager_delete_tensor_gb,fast_eager_deletion_mode,memory_fraction_of_eager_deletion,allocator_strategy,reader_queue_speed_test_mode,print_sub_graph_dir,pe_profile_fname,inner_op_parallelism,enable_parallel_graph,fuse_parameter_groups_size,multiple_of_cupti_buffer_size,fuse_parameter_memory_size,tracer_profile_fname,dygraph_debug,use_system_allocator,enable_unused_var_check,free_idle_chunk,free_when_no_cache_hit,call_stack_level,sort_sum_gradient,max_inplace_grad_add,use_pinned_memory,cpu_deterministic,use_mkldnn,tracer_mkldnn_ops_on,tracer_mkldnn_ops_off,fraction_of_gpu_memory_to_use,initial_gpu_memory_in_mb,reallocate_gpu_memory_in_mb,cudnn_deterministic,enable_cublas_tensor_op_math,conv_workspace_size_limit,cudnn_exhaustive_search,selected_gpus,sync_nccl_allreduce,cudnn_batchnorm_spatial_persistent,gpu_allocator_retry_time,local_exe_sub_scope_limit,gpu_memory_limit_mb
I0428 16:13:33.104110 2391 init.cc:92] After Parse: argc is 1
2022-04-28 16:13:33,503 - INFO - ----------- Configuration Arguments -----------
[INFO] 2022-04-28 16:13:33,503 [ args.py: 68]: ----------- Configuration Arguments -----------
2022-04-28 16:13:33,503 - INFO - batch_size: 8
[INFO] 2022-04-28 16:13:33,503 [ args.py: 70]: batch_size: 8
2022-04-28 16:13:33,503 - INFO - checkpoints: output
[INFO] 2022-04-28 16:13:33,503 [ args.py: 70]: checkpoints: output
2022-04-28 16:13:33,504 - INFO - chunk_scheme: IOB
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: chunk_scheme: IOB
2022-04-28 16:13:33,504 - INFO - decr_every_n_nan_or_inf: 2
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: decr_every_n_nan_or_inf: 2
2022-04-28 16:13:33,504 - INFO - decr_ratio: 0.8
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: decr_ratio: 0.8
2022-04-28 16:13:33,504 - INFO - dev_set: None
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: dev_set: None
2022-04-28 16:13:33,504 - INFO - diagnostic: None
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: diagnostic: None
2022-04-28 16:13:33,504 - INFO - diagnostic_save: None
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: diagnostic_save: None
2022-04-28 16:13:33,504 - INFO - do_lower_case: True
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: do_lower_case: True
2022-04-28 16:13:33,504 - INFO - do_test: False
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: do_test: False
2022-04-28 16:13:33,504 - INFO - do_train: True
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: do_train: True
2022-04-28 16:13:33,504 - INFO - do_val: False
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: do_val: False
2022-04-28 16:13:33,504 - INFO - doc_stride: 128
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: doc_stride: 128
2022-04-28 16:13:33,505 - INFO - epoch: 3
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: epoch: 3
2022-04-28 16:13:33,505 - INFO - ernie_config_path: pretrained-models/ernie_base_1.0_CN/ernie_config.json
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: ernie_config_path: pretrained-models/ernie_base_1.0_CN/ernie_config.json
2022-04-28 16:13:33,505 - INFO - for_cn: True
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: for_cn: True
2022-04-28 16:13:33,505 - INFO - in_tokens: False
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: in_tokens: False
2022-04-28 16:13:33,505 - INFO - incr_every_n_steps: 100
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: incr_every_n_steps: 100
2022-04-28 16:13:33,505 - INFO - incr_ratio: 2.0
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: incr_ratio: 2.0
2022-04-28 16:13:33,505 - INFO - init_checkpoint: None
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: init_checkpoint: None
2022-04-28 16:13:33,505 - INFO - init_pretraining_params: pretrained-models/ernie_base_1.0_CN/params
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: init_pretraining_params: pretrained-models/ernie_base_1.0_CN/params
2022-04-28 16:13:33,505 - INFO - is_classify: True
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: is_classify: True
2022-04-28 16:13:33,505 - INFO - is_distributed: False
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: is_distributed: False
2022-04-28 16:13:33,505 - INFO - is_regression: False
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: is_regression: False
2022-04-28 16:13:33,506 - INFO - label_map_config: None
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: label_map_config: None
2022-04-28 16:13:33,506 - INFO - learning_rate: 1e-05
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: learning_rate: 1e-05
2022-04-28 16:13:33,506 - INFO - lr_scheduler: linear_warmup_decay
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: lr_scheduler: linear_warmup_decay
2022-04-28 16:13:33,506 - INFO - max_answer_length: 100
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: max_answer_length: 100
2022-04-28 16:13:33,506 - INFO - max_query_length: 64
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: max_query_length: 64
2022-04-28 16:13:33,506 - INFO - max_seq_len: 384
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: max_seq_len: 384
2022-04-28 16:13:33,506 - INFO - metric: simple_accuracy
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: metric: simple_accuracy
2022-04-28 16:13:33,506 - INFO - metrics: True
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: metrics: True
2022-04-28 16:13:33,506 - INFO - n_best_size: 20
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: n_best_size: 20
2022-04-28 16:13:33,506 - INFO - num_iteration_per_drop_scope: 1
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: num_iteration_per_drop_scope: 1
2022-04-28 16:13:33,506 - INFO - num_labels: 2
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: num_labels: 2
2022-04-28 16:13:33,507 - INFO - output_file_name: None
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: output_file_name: None
2022-04-28 16:13:33,507 - INFO - output_item: 3
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: output_item: 3
2022-04-28 16:13:33,507 - INFO - p_max_seq_len: 256
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: p_max_seq_len: 256
2022-04-28 16:13:33,507 - INFO - predict_batch_size: None
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: predict_batch_size: None
2022-04-28 16:13:33,507 - INFO - q_max_seq_len: 32
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: q_max_seq_len: 32
2022-04-28 16:13:33,507 - INFO - random_seed: 1
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: random_seed: 1
2022-04-28 16:13:33,507 - INFO - save_steps: 104247
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: save_steps: 104247
2022-04-28 16:13:33,507 - INFO - shuffle: True
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: shuffle: True
2022-04-28 16:13:33,507 - INFO - skip_steps: 10
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: skip_steps: 10
2022-04-28 16:13:33,507 - INFO - task_id: 0
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: task_id: 0
2022-04-28 16:13:33,507 - INFO - test_data_cnt: 1110000
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: test_data_cnt: 1110000
2022-04-28 16:13:33,508 - INFO - test_save: ./checkpoints/test_result
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: test_save: ./checkpoints/test_result
2022-04-28 16:13:33,508 - INFO - test_set: None
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: test_set: None
2022-04-28 16:13:33,508 - INFO - tokenizer: FullTokenizer
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: tokenizer: FullTokenizer
2022-04-28 16:13:33,508 - INFO - train_data_size: 1111968
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: train_data_size: 1111968
2022-04-28 16:13:33,508 - INFO - train_set: dureader-retrieval-baseline-dataset/train/cross.train.tsv
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: train_set: dureader-retrieval-baseline-dataset/train/cross.train.tsv
2022-04-28 16:13:33,508 - INFO - use_cross_batch: False
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_cross_batch: False
2022-04-28 16:13:33,508 - INFO - use_cuda: True
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_cuda: True
2022-04-28 16:13:33,508 - INFO - use_dynamic_loss_scaling: True
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_dynamic_loss_scaling: True
2022-04-28 16:13:33,508 - INFO - use_fast_executor: False
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_fast_executor: False
2022-04-28 16:13:33,508 - INFO - use_lamb: False
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_lamb: False
2022-04-28 16:13:33,508 - INFO - use_mix_precision: False
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_mix_precision: False
2022-04-28 16:13:33,509 - INFO - use_multi_gpu_test: False
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: use_multi_gpu_test: False
2022-04-28 16:13:33,509 - INFO - use_recompute: False
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: use_recompute: False
2022-04-28 16:13:33,509 - INFO - validation_steps: 104247
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: validation_steps: 104247
2022-04-28 16:13:33,509 - INFO - verbose: True
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: verbose: True
2022-04-28 16:13:33,509 - INFO - vocab_path: pretrained-models/ernie_base_1.0_CN/vocab.txt
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: vocab_path: pretrained-models/ernie_base_1.0_CN/vocab.txt
2022-04-28 16:13:33,509 - INFO - warmup_proportion: 0.0
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: warmup_proportion: 0.0
2022-04-28 16:13:33,509 - INFO - weight_decay: 0.01
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: weight_decay: 0.01
2022-04-28 16:13:33,509 - INFO - ------------------------------------------------
[INFO] 2022-04-28 16:13:33,509 [ args.py: 71]: ------------------------------------------------
2022-04-28 16:13:33,509 - INFO - attention_probs_dropout_prob: 0.1
[INFO] 2022-04-28 16:13:33,509 [ ernie.py: 51]: attention_probs_dropout_prob: 0.1
2022-04-28 16:13:33,510 - INFO - hidden_act: relu
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: hidden_act: relu
2022-04-28 16:13:33,510 - INFO - hidden_dropout_prob: 0.1
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: hidden_dropout_prob: 0.1
2022-04-28 16:13:33,510 - INFO - hidden_size: 768
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: hidden_size: 768
2022-04-28 16:13:33,510 - INFO - initializer_range: 0.02
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: initializer_range: 0.02
2022-04-28 16:13:33,510 - INFO - max_position_embeddings: 513
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: max_position_embeddings: 513
2022-04-28 16:13:33,510 - INFO - num_attention_heads: 12
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: num_attention_heads: 12
2022-04-28 16:13:33,510 - INFO - num_hidden_layers: 12
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: num_hidden_layers: 12
2022-04-28 16:13:33,510 - INFO - type_vocab_size: 2
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: type_vocab_size: 2
2022-04-28 16:13:33,510 - INFO - vocab_size: 18000
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: vocab_size: 18000
2022-04-28 16:13:33,510 - INFO - ------------------------------------------------
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 52]: ------------------------------------------------
2022-04-28 16:13:42,372 - INFO - apply sharding 1/2
[INFO] 2022-04-28 16:13:42,372 [reader_ce.py: 251]: apply sharding 1/2
2022-04-28 16:13:42,372 - INFO - Device count: 2
[INFO] 2022-04-28 16:13:42,372 [ train_ce.py: 116]: Device count: 2
2022-04-28 16:13:42,372 - INFO - Num train examples: 1111968
[INFO] 2022-04-28 16:13:42,372 [ train_ce.py: 117]: Num train examples: 1111968
2022-04-28 16:13:42,372 - INFO - Max train steps: 208494
[INFO] 2022-04-28 16:13:42,372 [ train_ce.py: 118]: Max train steps: 208494
2022-04-28 16:13:42,373 - INFO - Num warmup steps: 0
[INFO] 2022-04-28 16:13:42,373 [ train_ce.py: 119]: Num warmup steps: 0
2022-04-28 16:13:42,374 - WARNING - paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/io.py:721: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
'paddle.fluid.layers.py_reader() may be deprecated in the near future. '
[WARNING] 2022-04-28 16:13:42,374 [ io.py: 721]: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
I0428 16:13:42.374289 2391 reader_py.cc:385] init_lod_tensor_blocking_queue
2022-04-28 16:13:46,214 - WARNING - set use_hierarchical_allreduce=False since you only have 1 node.
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/ernie.py:128
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/ernie.py:129
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/transformer_encoder.py:118
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/transformer_encoder.py:207
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/clip.py:631: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
warnings.warn("Caution! 'set_gradient_clip' is not recommended "
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/incubate/fleet/collective/__init__.py:394: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
"set use_hierarchical_allreduce=False since you only have 1 node."
[WARNING] 2022-04-28 16:13:46,214 [ __init__.py: 394]: set use_hierarchical_allreduce=False since you only have 1 node.
2022-04-28 16:13:46,509 - INFO - Theoretical memory usage in training: 31304.218 - 32794.895 MB
API is deprecated since 2.0.0 Please use FleetAPI instead.
WIKI: https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler
[INFO] 2022-04-28 16:13:46,509 [ train_ce.py: 174]: Theoretical memory usage in training: 31304.218 - 32794.895 MB
W0428 16:13:46.585139 2391 device_context.cc:362] Please NOTE: device: 1, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0428 16:13:46.589557 2391 device_context.cc:372] device: 1, cuDNN Version: 7.6.
I0428 16:13:51.732564 2391 gen_nccl_id_op.cc:92] trainer_id:1, use_hierarchical_allreduce:0, nccl_comm_num:1, inter_nranks:0, inter_trainer_id:-1, exter_trainer_id:-1, trainers:127.0.0.1:54640,127.0.0.1:43155,
I0428 16:13:51.732654 2391 gen_nccl_id_op_helper.cc:176] Server listening on: 127.0.0.1:43155 successful.
2022-04-28 16:13:58,030 - INFO - Load pretraining parameters from pretrained-models/ernie_base_1.0_CN/params.
[INFO] 2022-04-28 16:13:58,030 [ init.py: 74]: Load pretraining parameters from pretrained-models/ernie_base_1.0_CN/params.
I0428 16:13:58.467259 2391 parallel_executor.cc:662] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0428 16:13:58.467324 2391 parallel_executor.cc:270] not find NCCLCommunicator in scope, so recreate it!
I0428 16:13:58.467339 2391 parallel_executor.cc:137] nccl comm num:1, nranks:2, num_trainers:2, trainer_id:1
I0428 16:13:58.473305 2391 nccl_helper.h:133] init nccl rank:1, nranks:2, gpu_id:1, dev_id:1
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): (External) Nccl error, unhandled cuda error, detail: Resource temporarily unavailable
Please try one of the following solutions:
1. export NCCL_SHM_DISABLE=1;
2. export NCCL_P2P_LEVEL=SYS;
3. Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g.
(at /paddle/paddle/fluid/platform/nccl_helper.h:72)
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<std::string, std::allocator<std::string > > const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)
1 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope*, paddle::framework::details::BuildStrategy*)
2 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope*, paddle::framework::details::BuildStrategy const&)
3 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<ncclUniqueId*, std::allocator<ncclUniqueId*> > const&, unsigned long, unsigned long)
4 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, ncclUniqueId*, unsigned long, unsigned long)
5 paddle::framework::SignalHandle(char const*, int)
6 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1651162438 (unix time) try "date -d @1651162438" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x957) received by PID 2391 (TID 0x7ff2c863f700) from PID 2391 ***]
workerlog.0 log:
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1651162139 (unix time) try "date -d @1651162139" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0xbd83) received by PID 48515 (TID 0x7f3976375700) from PID 48515 ***]
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0428 16:13:33.089520 2386 init.cc:85] Before Parse: argc is 2, Init commandline: dummy --tryfromenv=check_nan_inf,fast_check_nan_inf,benchmark,eager_delete_scope,fraction_of_cpu_memory_to_use,initial_cpu_memory_in_mb,init_allocated_mem,paddle_num_threads,dist_threadpool_size,eager_delete_tensor_gb,fast_eager_deletion_mode,memory_fraction_of_eager_deletion,allocator_strategy,reader_queue_speed_test_mode,print_sub_graph_dir,pe_profile_fname,inner_op_parallelism,enable_parallel_graph,fuse_parameter_groups_size,multiple_of_cupti_buffer_size,fuse_parameter_memory_size,tracer_profile_fname,dygraph_debug,use_system_allocator,enable_unused_var_check,free_idle_chunk,free_when_no_cache_hit,call_stack_level,sort_sum_gradient,max_inplace_grad_add,use_pinned_memory,cpu_deterministic,use_mkldnn,tracer_mkldnn_ops_on,tracer_mkldnn_ops_off,fraction_of_gpu_memory_to_use,initial_gpu_memory_in_mb,reallocate_gpu_memory_in_mb,cudnn_deterministic,enable_cublas_tensor_op_math,conv_workspace_size_limit,cudnn_exhaustive_search,selected_gpus,sync_nccl_allreduce,cudnn_batchnorm_spatial_persistent,gpu_allocator_retry_time,local_exe_sub_scope_limit,gpu_memory_limit_mb
I0428 16:13:33.089696 2386 init.cc:92] After Parse: argc is 1
2022-04-28 16:13:33,511 - INFO - ----------- Configuration Arguments -----------
[INFO] 2022-04-28 16:13:33,511 [ args.py: 68]: ----------- Configuration Arguments -----------
2022-04-28 16:13:33,511 - INFO - batch_size: 8
[INFO] 2022-04-28 16:13:33,511 [ args.py: 70]: batch_size: 8
2022-04-28 16:13:33,511 - INFO - checkpoints: output
[INFO] 2022-04-28 16:13:33,511 [ args.py: 70]: checkpoints: output
2022-04-28 16:13:33,511 - INFO - chunk_scheme: IOB
[INFO] 2022-04-28 16:13:33,511 [ args.py: 70]: chunk_scheme: IOB
2022-04-28 16:13:33,511 - INFO - decr_every_n_nan_or_inf: 2
[INFO] 2022-04-28 16:13:33,511 [ args.py: 70]: decr_every_n_nan_or_inf: 2
2022-04-28 16:13:33,512 - INFO - decr_ratio: 0.8
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: decr_ratio: 0.8
2022-04-28 16:13:33,512 - INFO - dev_set: None
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: dev_set: None
2022-04-28 16:13:33,512 - INFO - diagnostic: None
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: diagnostic: None
2022-04-28 16:13:33,512 - INFO - diagnostic_save: None
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: diagnostic_save: None
2022-04-28 16:13:33,512 - INFO - do_lower_case: True
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: do_lower_case: True
2022-04-28 16:13:33,512 - INFO - do_test: False
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: do_test: False
2022-04-28 16:13:33,512 - INFO - do_train: True
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: do_train: True
2022-04-28 16:13:33,512 - INFO - do_val: False
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: do_val: False
2022-04-28 16:13:33,512 - INFO - doc_stride: 128
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: doc_stride: 128
2022-04-28 16:13:33,512 - INFO - epoch: 3
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: epoch: 3
2022-04-28 16:13:33,512 - INFO - ernie_config_path: pretrained-models/ernie_base_1.0_CN/ernie_config.json
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: ernie_config_path: pretrained-models/ernie_base_1.0_CN/ernie_config.json
2022-04-28 16:13:33,513 - INFO - for_cn: True
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: for_cn: True
2022-04-28 16:13:33,513 - INFO - in_tokens: False
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: in_tokens: False
2022-04-28 16:13:33,513 - INFO - incr_every_n_steps: 100
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: incr_every_n_steps: 100
2022-04-28 16:13:33,513 - INFO - incr_ratio: 2.0
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: incr_ratio: 2.0
2022-04-28 16:13:33,513 - INFO - init_checkpoint: None
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: init_checkpoint: None
2022-04-28 16:13:33,513 - INFO - init_pretraining_params: pretrained-models/ernie_base_1.0_CN/params
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: init_pretraining_params: pretrained-models/ernie_base_1.0_CN/params
2022-04-28 16:13:33,513 - INFO - is_classify: True
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: is_classify: True
2022-04-28 16:13:33,513 - INFO - is_distributed: False
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: is_distributed: False
2022-04-28 16:13:33,513 - INFO - is_regression: False
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: is_regression: False
2022-04-28 16:13:33,513 - INFO - label_map_config: None
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: label_map_config: None
2022-04-28 16:13:33,513 - INFO - learning_rate: 1e-05
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: learning_rate: 1e-05
2022-04-28 16:13:33,514 - INFO - lr_scheduler: linear_warmup_decay
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: lr_scheduler: linear_warmup_decay
2022-04-28 16:13:33,514 - INFO - max_answer_length: 100
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: max_answer_length: 100
2022-04-28 16:13:33,514 - INFO - max_query_length: 64
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: max_query_length: 64
2022-04-28 16:13:33,514 - INFO - max_seq_len: 384
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: max_seq_len: 384
2022-04-28 16:13:33,514 - INFO - metric: simple_accuracy
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: metric: simple_accuracy
2022-04-28 16:13:33,514 - INFO - metrics: True
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: metrics: True
2022-04-28 16:13:33,514 - INFO - n_best_size: 20
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: n_best_size: 20
2022-04-28 16:13:33,514 - INFO - num_iteration_per_drop_scope: 1
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: num_iteration_per_drop_scope: 1
2022-04-28 16:13:33,514 - INFO - num_labels: 2
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: num_labels: 2
2022-04-28 16:13:33,514 - INFO - output_file_name: None
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: output_file_name: None
2022-04-28 16:13:33,514 - INFO - output_item: 3
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: output_item: 3
2022-04-28 16:13:33,515 - INFO - p_max_seq_len: 256
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: p_max_seq_len: 256
2022-04-28 16:13:33,515 - INFO - predict_batch_size: None
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: predict_batch_size: None
2022-04-28 16:13:33,515 - INFO - q_max_seq_len: 32
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: q_max_seq_len: 32
2022-04-28 16:13:33,515 - INFO - random_seed: 1
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: random_seed: 1
2022-04-28 16:13:33,515 - INFO - save_steps: 104247
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: save_steps: 104247
2022-04-28 16:13:33,515 - INFO - shuffle: True
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: shuffle: True
2022-04-28 16:13:33,515 - INFO - skip_steps: 10
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: skip_steps: 10
2022-04-28 16:13:33,515 - INFO - task_id: 0
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: task_id: 0
2022-04-28 16:13:33,515 - INFO - test_data_cnt: 1110000
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: test_data_cnt: 1110000
2022-04-28 16:13:33,515 - INFO - test_save: ./checkpoints/test_result
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: test_save: ./checkpoints/test_result
2022-04-28 16:13:33,515 - INFO - test_set: None
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: test_set: None
2022-04-28 16:13:33,516 - INFO - tokenizer: FullTokenizer
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: tokenizer: FullTokenizer
2022-04-28 16:13:33,516 - INFO - train_data_size: 1111968
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: train_data_size: 1111968
2022-04-28 16:13:33,516 - INFO - train_set: dureader-retrieval-baseline-dataset/train/cross.train.tsv
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: train_set: dureader-retrieval-baseline-dataset/train/cross.train.tsv
2022-04-28 16:13:33,516 - INFO - use_cross_batch: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_cross_batch: False
2022-04-28 16:13:33,516 - INFO - use_cuda: True
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_cuda: True
2022-04-28 16:13:33,516 - INFO - use_dynamic_loss_scaling: True
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_dynamic_loss_scaling: True
2022-04-28 16:13:33,516 - INFO - use_fast_executor: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_fast_executor: False
2022-04-28 16:13:33,516 - INFO - use_lamb: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_lamb: False
2022-04-28 16:13:33,516 - INFO - use_mix_precision: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_mix_precision: False
2022-04-28 16:13:33,516 - INFO - use_multi_gpu_test: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_multi_gpu_test: False
2022-04-28 16:13:33,517 - INFO - use_recompute: False
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: use_recompute: False
2022-04-28 16:13:33,517 - INFO - validation_steps: 104247
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: validation_steps: 104247
2022-04-28 16:13:33,517 - INFO - verbose: True
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: verbose: True
2022-04-28 16:13:33,517 - INFO - vocab_path: pretrained-models/ernie_base_1.0_CN/vocab.txt
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: vocab_path: pretrained-models/ernie_base_1.0_CN/vocab.txt
2022-04-28 16:13:33,517 - INFO - warmup_proportion: 0.0
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: warmup_proportion: 0.0
2022-04-28 16:13:33,517 - INFO - weight_decay: 0.01
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: weight_decay: 0.01
2022-04-28 16:13:33,517 - INFO - ------------------------------------------------
[INFO] 2022-04-28 16:13:33,517 [ args.py: 71]: ------------------------------------------------
2022-04-28 16:13:33,517 - INFO - attention_probs_dropout_prob: 0.1
[INFO] 2022-04-28 16:13:33,517 [ ernie.py: 51]: attention_probs_dropout_prob: 0.1
2022-04-28 16:13:33,517 - INFO - hidden_act: relu
[INFO] 2022-04-28 16:13:33,517 [ ernie.py: 51]: hidden_act: relu
2022-04-28 16:13:33,518 - INFO - hidden_dropout_prob: 0.1
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: hidden_dropout_prob: 0.1
2022-04-28 16:13:33,518 - INFO - hidden_size: 768
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: hidden_size: 768
2022-04-28 16:13:33,518 - INFO - initializer_range: 0.02
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: initializer_range: 0.02
2022-04-28 16:13:33,518 - INFO - max_position_embeddings: 513
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: max_position_embeddings: 513
2022-04-28 16:13:33,518 - INFO - num_attention_heads: 12
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: num_attention_heads: 12
2022-04-28 16:13:33,518 - INFO - num_hidden_layers: 12
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: num_hidden_layers: 12
2022-04-28 16:13:33,518 - INFO - type_vocab_size: 2
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: type_vocab_size: 2
2022-04-28 16:13:33,518 - INFO - vocab_size: 18000
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: vocab_size: 18000
2022-04-28 16:13:33,518 - INFO - ------------------------------------------------
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 52]: ------------------------------------------------
2022-04-28 16:13:42,480 - INFO - apply sharding 0/2
[INFO] 2022-04-28 16:13:42,480 [reader_ce.py: 251]: apply sharding 0/2
2022-04-28 16:13:42,480 - INFO - Device count: 2
[INFO] 2022-04-28 16:13:42,480 [ train_ce.py: 116]: Device count: 2
2022-04-28 16:13:42,480 - INFO - Num train examples: 1111968
[INFO] 2022-04-28 16:13:42,480 [ train_ce.py: 117]: Num train examples: 1111968
2022-04-28 16:13:42,481 - INFO - Max train steps: 208494
[INFO] 2022-04-28 16:13:42,481 [ train_ce.py: 118]: Max train steps: 208494
2022-04-28 16:13:42,481 - INFO - Num warmup steps: 0
[INFO] 2022-04-28 16:13:42,481 [ train_ce.py: 119]: Num warmup steps: 0
2022-04-28 16:13:42,481 - WARNING - paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/io.py:721: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
'paddle.fluid.layers.py_reader() may be deprecated in the near future. '
[WARNING] 2022-04-28 16:13:42,481 [ io.py: 721]: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
I0428 16:13:42.482144 2386 reader_py.cc:385] init_lod_tensor_blocking_queue
2022-04-28 16:13:46,320 - WARNING - set use_hierarchical_allreduce=False since you only have 1 node.
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/ernie.py:128
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/ernie.py:129
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/transformer_encoder.py:118
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/transformer_encoder.py:207
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/clip.py:631: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
warnings.warn("Caution! 'set_gradient_clip' is not recommended "
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/incubate/fleet/collective/__init__.py:394: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
"set use_hierarchical_allreduce=False since you only have 1 node."
[WARNING] 2022-04-28 16:13:46,320 [ __init__.py: 394]: set use_hierarchical_allreduce=False since you only have 1 node.
API is deprecated since 2.0.0 Please use FleetAPI instead.
WIKI: https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:43155']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:43155']
2022-04-28 16:13:52,735 - INFO - Theoretical memory usage in training: 31304.218 - 32794.895 MB
[INFO] 2022-04-28 16:13:52,735 [ train_ce.py: 174]: Theoretical memory usage in training: 31304.218 - 32794.895 MB
W0428 16:13:52.836230 2386 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0428 16:13:52.841756 2386 device_context.cc:372] device: 0, cuDNN Version: 7.6.
I0428 16:13:57.318223 2386 gen_nccl_id_op.cc:92] trainer_id:0, use_hierarchical_allreduce:0, nccl_comm_num:1, inter_nranks:0, inter_trainer_id:-1, exter_trainer_id:-1, trainers:127.0.0.1:54640,127.0.0.1:43155,
2022-04-28 16:13:58,097 - INFO - Load pretraining parameters from pretrained-models/ernie_base_1.0_CN/params.
[INFO] 2022-04-28 16:13:58,097 [ init.py: 74]: Load pretraining parameters from pretrained-models/ernie_base_1.0_CN/params.
I0428 16:13:58.409628 2386 parallel_executor.cc:662] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0428 16:13:58.409691 2386 parallel_executor.cc:270] not find NCCLCommunicator in scope, so recreate it!
I0428 16:13:58.409705 2386 parallel_executor.cc:137] nccl comm num:1, nranks:2, num_trainers:2, trainer_id:0
I0428 16:13:58.414196 2386 nccl_helper.h:133] init nccl rank:0, nranks:2, gpu_id:0, dev_id:0
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): (External) Nccl error, unhandled cuda error, detail: Resource temporarily unavailable
Please try one of the following solutions:
1. export NCCL_SHM_DISABLE=1;
2. export NCCL_P2P_LEVEL=SYS;
3. Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g.
(at /paddle/paddle/fluid/platform/nccl_helper.h:72)
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<std::string, std::allocator<std::string > > const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)
1 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope*, paddle::framework::details::BuildStrategy*)
2 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope*, paddle::framework::details::BuildStrategy const&)
3 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<ncclUniqueId*, std::allocator<ncclUniqueId*> > const&, unsigned long, unsigned long)
4 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, ncclUniqueId*, unsigned long, unsigned long)
5 paddle::framework::SignalHandle(char const*, int)
6 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1651162438 (unix time) try "date -d @1651162438" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x952) received by PID 2386 (TID 0x7f5594df8700) from PID 2386 ***]
Have you tried these three solutions? One way to apply them is sketched after the list below.
- export NCCL_SHM_DISABLE=1;
- export NCCL_P2P_LEVEL=SYS;
- Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g. (at /paddle/paddle/fluid/platform/nccl_helper.h:72)
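If not, here is a minimal sketch of trying them in order before relaunching, assuming the same two-card `paddle.distributed.launch` setup as in the logs (the launch command itself is an assumption; substitute your own):

```bash
# 1) Disable NCCL's shared-memory transport; this often helps when
#    /dev/shm is small, e.g. inside a container.
export NCCL_SHM_DISABLE=1

# 2) If the abort persists, route peer-to-peer traffic through the
#    system; RTX cards without NVLink frequently lack direct GPU P2P.
export NCCL_P2P_LEVEL=SYS

# Optional: verbose NCCL logging to see which transport is failing.
export NCCL_DEBUG=INFO

# 3) If the job runs in Docker, also restart the container with a larger
#    shared-memory segment, e.g.: docker run --shm-size=2g ...
python -m paddle.distributed.launch --gpus "0,1" train_ce.py ...
```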