RocketQA
How to run single-machine multi-GPU training on RTX-series GPUs
NCCL is installed.

I also tried the approaches from https://github.com/PaddlePaddle/Paddle/issues/28757, https://github.com/PaddlePaddle/Paddle/issues/29172, and https://github.com/PaddlePaddle/Paddle/issues/36608, but none of them seems to work.
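For reference, the job is launched on two cards roughly as sketched below. This is a hedged reconstruction based on the workerlog.N files and the two trainer endpoints (127.0.0.1:54640 and 127.0.0.1:43155) in the logs; the exact script path and flags are assumptions.

```bash
# Hedged reconstruction of the two-card launch; not the verbatim command.
# Older Paddle releases use --selected_gpus instead of --gpus.
python -m paddle.distributed.launch --gpus "0,1" \
    train_ce.py --use_cuda true ...
```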
workerlog.1 log:
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1651162139 (unix time) try "date -d @1651162139" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0xbd88) received by PID 48520 (TID 0x7f30cdeaf700) from PID 48520 ***]
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0428 16:13:33.103967 2391 init.cc:85] Before Parse: argc is 2, Init commandline: dummy --tryfromenv=check_nan_inf,fast_check_nan_inf,benchmark,eager_delete_scope,fraction_of_cpu_memory_to_use,initial_cpu_memory_in_mb,init_allocated_mem,paddle_num_threads,dist_threadpool_size,eager_delete_tensor_gb,fast_eager_deletion_mode,memory_fraction_of_eager_deletion,allocator_strategy,reader_queue_speed_test_mode,print_sub_graph_dir,pe_profile_fname,inner_op_parallelism,enable_parallel_graph,fuse_parameter_groups_size,multiple_of_cupti_buffer_size,fuse_parameter_memory_size,tracer_profile_fname,dygraph_debug,use_system_allocator,enable_unused_var_check,free_idle_chunk,free_when_no_cache_hit,call_stack_level,sort_sum_gradient,max_inplace_grad_add,use_pinned_memory,cpu_deterministic,use_mkldnn,tracer_mkldnn_ops_on,tracer_mkldnn_ops_off,fraction_of_gpu_memory_to_use,initial_gpu_memory_in_mb,reallocate_gpu_memory_in_mb,cudnn_deterministic,enable_cublas_tensor_op_math,conv_workspace_size_limit,cudnn_exhaustive_search,selected_gpus,sync_nccl_allreduce,cudnn_batchnorm_spatial_persistent,gpu_allocator_retry_time,local_exe_sub_scope_limit,gpu_memory_limit_mb
I0428 16:13:33.104110 2391 init.cc:92] After Parse: argc is 1
2022-04-28 16:13:33,503 - INFO - ----------- Configuration Arguments -----------
[INFO] 2022-04-28 16:13:33,503 [ args.py: 68]: ----------- Configuration Arguments -----------
2022-04-28 16:13:33,503 - INFO - batch_size: 8
[INFO] 2022-04-28 16:13:33,503 [ args.py: 70]: batch_size: 8
2022-04-28 16:13:33,503 - INFO - checkpoints: output
[INFO] 2022-04-28 16:13:33,503 [ args.py: 70]: checkpoints: output
2022-04-28 16:13:33,504 - INFO - chunk_scheme: IOB
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: chunk_scheme: IOB
2022-04-28 16:13:33,504 - INFO - decr_every_n_nan_or_inf: 2
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: decr_every_n_nan_or_inf: 2
2022-04-28 16:13:33,504 - INFO - decr_ratio: 0.8
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: decr_ratio: 0.8
2022-04-28 16:13:33,504 - INFO - dev_set: None
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: dev_set: None
2022-04-28 16:13:33,504 - INFO - diagnostic: None
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: diagnostic: None
2022-04-28 16:13:33,504 - INFO - diagnostic_save: None
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: diagnostic_save: None
2022-04-28 16:13:33,504 - INFO - do_lower_case: True
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: do_lower_case: True
2022-04-28 16:13:33,504 - INFO - do_test: False
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: do_test: False
2022-04-28 16:13:33,504 - INFO - do_train: True
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: do_train: True
2022-04-28 16:13:33,504 - INFO - do_val: False
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: do_val: False
2022-04-28 16:13:33,504 - INFO - doc_stride: 128
[INFO] 2022-04-28 16:13:33,504 [ args.py: 70]: doc_stride: 128
2022-04-28 16:13:33,505 - INFO - epoch: 3
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: epoch: 3
2022-04-28 16:13:33,505 - INFO - ernie_config_path: pretrained-models/ernie_base_1.0_CN/ernie_config.json
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: ernie_config_path: pretrained-models/ernie_base_1.0_CN/ernie_config.json
2022-04-28 16:13:33,505 - INFO - for_cn: True
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: for_cn: True
2022-04-28 16:13:33,505 - INFO - in_tokens: False
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: in_tokens: False
2022-04-28 16:13:33,505 - INFO - incr_every_n_steps: 100
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: incr_every_n_steps: 100
2022-04-28 16:13:33,505 - INFO - incr_ratio: 2.0
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: incr_ratio: 2.0
2022-04-28 16:13:33,505 - INFO - init_checkpoint: None
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: init_checkpoint: None
2022-04-28 16:13:33,505 - INFO - init_pretraining_params: pretrained-models/ernie_base_1.0_CN/params
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: init_pretraining_params: pretrained-models/ernie_base_1.0_CN/params
2022-04-28 16:13:33,505 - INFO - is_classify: True
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: is_classify: True
2022-04-28 16:13:33,505 - INFO - is_distributed: False
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: is_distributed: False
2022-04-28 16:13:33,505 - INFO - is_regression: False
[INFO] 2022-04-28 16:13:33,505 [ args.py: 70]: is_regression: False
2022-04-28 16:13:33,506 - INFO - label_map_config: None
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: label_map_config: None
2022-04-28 16:13:33,506 - INFO - learning_rate: 1e-05
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: learning_rate: 1e-05
2022-04-28 16:13:33,506 - INFO - lr_scheduler: linear_warmup_decay
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: lr_scheduler: linear_warmup_decay
2022-04-28 16:13:33,506 - INFO - max_answer_length: 100
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: max_answer_length: 100
2022-04-28 16:13:33,506 - INFO - max_query_length: 64
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: max_query_length: 64
2022-04-28 16:13:33,506 - INFO - max_seq_len: 384
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: max_seq_len: 384
2022-04-28 16:13:33,506 - INFO - metric: simple_accuracy
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: metric: simple_accuracy
2022-04-28 16:13:33,506 - INFO - metrics: True
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: metrics: True
2022-04-28 16:13:33,506 - INFO - n_best_size: 20
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: n_best_size: 20
2022-04-28 16:13:33,506 - INFO - num_iteration_per_drop_scope: 1
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: num_iteration_per_drop_scope: 1
2022-04-28 16:13:33,506 - INFO - num_labels: 2
[INFO] 2022-04-28 16:13:33,506 [ args.py: 70]: num_labels: 2
2022-04-28 16:13:33,507 - INFO - output_file_name: None
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: output_file_name: None
2022-04-28 16:13:33,507 - INFO - output_item: 3
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: output_item: 3
2022-04-28 16:13:33,507 - INFO - p_max_seq_len: 256
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: p_max_seq_len: 256
2022-04-28 16:13:33,507 - INFO - predict_batch_size: None
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: predict_batch_size: None
2022-04-28 16:13:33,507 - INFO - q_max_seq_len: 32
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: q_max_seq_len: 32
2022-04-28 16:13:33,507 - INFO - random_seed: 1
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: random_seed: 1
2022-04-28 16:13:33,507 - INFO - save_steps: 104247
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: save_steps: 104247
2022-04-28 16:13:33,507 - INFO - shuffle: True
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: shuffle: True
2022-04-28 16:13:33,507 - INFO - skip_steps: 10
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: skip_steps: 10
2022-04-28 16:13:33,507 - INFO - task_id: 0
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: task_id: 0
2022-04-28 16:13:33,507 - INFO - test_data_cnt: 1110000
[INFO] 2022-04-28 16:13:33,507 [ args.py: 70]: test_data_cnt: 1110000
2022-04-28 16:13:33,508 - INFO - test_save: ./checkpoints/test_result
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: test_save: ./checkpoints/test_result
2022-04-28 16:13:33,508 - INFO - test_set: None
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: test_set: None
2022-04-28 16:13:33,508 - INFO - tokenizer: FullTokenizer
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: tokenizer: FullTokenizer
2022-04-28 16:13:33,508 - INFO - train_data_size: 1111968
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: train_data_size: 1111968
2022-04-28 16:13:33,508 - INFO - train_set: dureader-retrieval-baseline-dataset/train/cross.train.tsv
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: train_set: dureader-retrieval-baseline-dataset/train/cross.train.tsv
2022-04-28 16:13:33,508 - INFO - use_cross_batch: False
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_cross_batch: False
2022-04-28 16:13:33,508 - INFO - use_cuda: True
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_cuda: True
2022-04-28 16:13:33,508 - INFO - use_dynamic_loss_scaling: True
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_dynamic_loss_scaling: True
2022-04-28 16:13:33,508 - INFO - use_fast_executor: False
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_fast_executor: False
2022-04-28 16:13:33,508 - INFO - use_lamb: False
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_lamb: False
2022-04-28 16:13:33,508 - INFO - use_mix_precision: False
[INFO] 2022-04-28 16:13:33,508 [ args.py: 70]: use_mix_precision: False
2022-04-28 16:13:33,509 - INFO - use_multi_gpu_test: False
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: use_multi_gpu_test: False
2022-04-28 16:13:33,509 - INFO - use_recompute: False
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: use_recompute: False
2022-04-28 16:13:33,509 - INFO - validation_steps: 104247
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: validation_steps: 104247
2022-04-28 16:13:33,509 - INFO - verbose: True
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: verbose: True
2022-04-28 16:13:33,509 - INFO - vocab_path: pretrained-models/ernie_base_1.0_CN/vocab.txt
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: vocab_path: pretrained-models/ernie_base_1.0_CN/vocab.txt
2022-04-28 16:13:33,509 - INFO - warmup_proportion: 0.0
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: warmup_proportion: 0.0
2022-04-28 16:13:33,509 - INFO - weight_decay: 0.01
[INFO] 2022-04-28 16:13:33,509 [ args.py: 70]: weight_decay: 0.01
2022-04-28 16:13:33,509 - INFO - ------------------------------------------------
[INFO] 2022-04-28 16:13:33,509 [ args.py: 71]: ------------------------------------------------
2022-04-28 16:13:33,509 - INFO - attention_probs_dropout_prob: 0.1
[INFO] 2022-04-28 16:13:33,509 [ ernie.py: 51]: attention_probs_dropout_prob: 0.1
2022-04-28 16:13:33,510 - INFO - hidden_act: relu
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: hidden_act: relu
2022-04-28 16:13:33,510 - INFO - hidden_dropout_prob: 0.1
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: hidden_dropout_prob: 0.1
2022-04-28 16:13:33,510 - INFO - hidden_size: 768
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: hidden_size: 768
2022-04-28 16:13:33,510 - INFO - initializer_range: 0.02
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: initializer_range: 0.02
2022-04-28 16:13:33,510 - INFO - max_position_embeddings: 513
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: max_position_embeddings: 513
2022-04-28 16:13:33,510 - INFO - num_attention_heads: 12
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: num_attention_heads: 12
2022-04-28 16:13:33,510 - INFO - num_hidden_layers: 12
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: num_hidden_layers: 12
2022-04-28 16:13:33,510 - INFO - type_vocab_size: 2
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: type_vocab_size: 2
2022-04-28 16:13:33,510 - INFO - vocab_size: 18000
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 51]: vocab_size: 18000
2022-04-28 16:13:33,510 - INFO - ------------------------------------------------
[INFO] 2022-04-28 16:13:33,510 [ ernie.py: 52]: ------------------------------------------------
2022-04-28 16:13:42,372 - INFO - apply sharding 1/2
[INFO] 2022-04-28 16:13:42,372 [reader_ce.py: 251]: apply sharding 1/2
2022-04-28 16:13:42,372 - INFO - Device count: 2
[INFO] 2022-04-28 16:13:42,372 [ train_ce.py: 116]: Device count: 2
2022-04-28 16:13:42,372 - INFO - Num train examples: 1111968
[INFO] 2022-04-28 16:13:42,372 [ train_ce.py: 117]: Num train examples: 1111968
2022-04-28 16:13:42,372 - INFO - Max train steps: 208494
[INFO] 2022-04-28 16:13:42,372 [ train_ce.py: 118]: Max train steps: 208494
2022-04-28 16:13:42,373 - INFO - Num warmup steps: 0
[INFO] 2022-04-28 16:13:42,373 [ train_ce.py: 119]: Num warmup steps: 0
2022-04-28 16:13:42,374 - WARNING - paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/io.py:721: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
'paddle.fluid.layers.py_reader() may be deprecated in the near future. '
[WARNING] 2022-04-28 16:13:42,374 [ io.py: 721]: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
I0428 16:13:42.374289 2391 reader_py.cc:385] init_lod_tensor_blocking_queue
2022-04-28 16:13:46,214 - WARNING - set use_hierarchical_allreduce=False since you only have 1 node.
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/ernie.py:128
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/ernie.py:129
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/transformer_encoder.py:118
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/transformer_encoder.py:207
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/clip.py:631: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
warnings.warn("Caution! 'set_gradient_clip' is not recommended "
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/incubate/fleet/collective/__init__.py:394: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
"set use_hierarchical_allreduce=False since you only have 1 node."
[WARNING] 2022-04-28 16:13:46,214 [ __init__.py: 394]: set use_hierarchical_allreduce=False since you only have 1 node.
2022-04-28 16:13:46,509 - INFO - Theoretical memory usage in training: 31304.218 - 32794.895 MB
API is deprecated since 2.0.0 Please use FleetAPI instead.
WIKI: https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler
[INFO] 2022-04-28 16:13:46,509 [ train_ce.py: 174]: Theoretical memory usage in training: 31304.218 - 32794.895 MB
W0428 16:13:46.585139 2391 device_context.cc:362] Please NOTE: device: 1, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0428 16:13:46.589557 2391 device_context.cc:372] device: 1, cuDNN Version: 7.6.
I0428 16:13:51.732564 2391 gen_nccl_id_op.cc:92] trainer_id:1, use_hierarchical_allreduce:0, nccl_comm_num:1, inter_nranks:0, inter_trainer_id:-1, exter_trainer_id:-1, trainers:127.0.0.1:54640,127.0.0.1:43155,
I0428 16:13:51.732654 2391 gen_nccl_id_op_helper.cc:176] Server listening on: 127.0.0.1:43155 successful.
2022-04-28 16:13:58,030 - INFO - Load pretraining parameters from pretrained-models/ernie_base_1.0_CN/params.
[INFO] 2022-04-28 16:13:58,030 [ init.py: 74]: Load pretraining parameters from pretrained-models/ernie_base_1.0_CN/params.
I0428 16:13:58.467259 2391 parallel_executor.cc:662] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0428 16:13:58.467324 2391 parallel_executor.cc:270] not find NCCLCommunicator in scope, so recreate it!
I0428 16:13:58.467339 2391 parallel_executor.cc:137] nccl comm num:1, nranks:2, num_trainers:2, trainer_id:1
I0428 16:13:58.473305 2391 nccl_helper.h:133] init nccl rank:1, nranks:2, gpu_id:1, dev_id:1
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): (External) Nccl error, unhandled cuda error, detail: Resource temporarily unavailable
Please try one of the following solutions:
1. export NCCL_SHM_DISABLE=1;
2. export NCCL_P2P_LEVEL=SYS;
3. Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g.
(at /paddle/paddle/fluid/platform/nccl_helper.h:72)
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<std::string, std::allocator<std::string > > const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)
1 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope*, paddle::framework::details::BuildStrategy*)
2 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope*, paddle::framework::details::BuildStrategy const&)
3 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<ncclUniqueId*, std::allocator<ncclUniqueId*> > const&, unsigned long, unsigned long)
4 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, ncclUniqueId*, unsigned long, unsigned long)
5 paddle::framework::SignalHandle(char const*, int)
6 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1651162438 (unix time) try "date -d @1651162438" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x957) received by PID 2391 (TID 0x7ff2c863f700) from PID 2391 ***]
workerlog.0 log:
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1651162139 (unix time) try "date -d @1651162139" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0xbd83) received by PID 48515 (TID 0x7f3976375700) from PID 48515 ***]
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0428 16:13:33.089520 2386 init.cc:85] Before Parse: argc is 2, Init commandline: dummy --tryfromenv=check_nan_inf,fast_check_nan_inf,benchmark,eager_delete_scope,fraction_of_cpu_memory_to_use,initial_cpu_memory_in_mb,init_allocated_mem,paddle_num_threads,dist_threadpool_size,eager_delete_tensor_gb,fast_eager_deletion_mode,memory_fraction_of_eager_deletion,allocator_strategy,reader_queue_speed_test_mode,print_sub_graph_dir,pe_profile_fname,inner_op_parallelism,enable_parallel_graph,fuse_parameter_groups_size,multiple_of_cupti_buffer_size,fuse_parameter_memory_size,tracer_profile_fname,dygraph_debug,use_system_allocator,enable_unused_var_check,free_idle_chunk,free_when_no_cache_hit,call_stack_level,sort_sum_gradient,max_inplace_grad_add,use_pinned_memory,cpu_deterministic,use_mkldnn,tracer_mkldnn_ops_on,tracer_mkldnn_ops_off,fraction_of_gpu_memory_to_use,initial_gpu_memory_in_mb,reallocate_gpu_memory_in_mb,cudnn_deterministic,enable_cublas_tensor_op_math,conv_workspace_size_limit,cudnn_exhaustive_search,selected_gpus,sync_nccl_allreduce,cudnn_batchnorm_spatial_persistent,gpu_allocator_retry_time,local_exe_sub_scope_limit,gpu_memory_limit_mb
I0428 16:13:33.089696 2386 init.cc:92] After Parse: argc is 1
2022-04-28 16:13:33,511 - INFO - ----------- Configuration Arguments -----------
[INFO] 2022-04-28 16:13:33,511 [ args.py: 68]: ----------- Configuration Arguments -----------
2022-04-28 16:13:33,511 - INFO - batch_size: 8
[INFO] 2022-04-28 16:13:33,511 [ args.py: 70]: batch_size: 8
2022-04-28 16:13:33,511 - INFO - checkpoints: output
[INFO] 2022-04-28 16:13:33,511 [ args.py: 70]: checkpoints: output
2022-04-28 16:13:33,511 - INFO - chunk_scheme: IOB
[INFO] 2022-04-28 16:13:33,511 [ args.py: 70]: chunk_scheme: IOB
2022-04-28 16:13:33,511 - INFO - decr_every_n_nan_or_inf: 2
[INFO] 2022-04-28 16:13:33,511 [ args.py: 70]: decr_every_n_nan_or_inf: 2
2022-04-28 16:13:33,512 - INFO - decr_ratio: 0.8
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: decr_ratio: 0.8
2022-04-28 16:13:33,512 - INFO - dev_set: None
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: dev_set: None
2022-04-28 16:13:33,512 - INFO - diagnostic: None
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: diagnostic: None
2022-04-28 16:13:33,512 - INFO - diagnostic_save: None
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: diagnostic_save: None
2022-04-28 16:13:33,512 - INFO - do_lower_case: True
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: do_lower_case: True
2022-04-28 16:13:33,512 - INFO - do_test: False
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: do_test: False
2022-04-28 16:13:33,512 - INFO - do_train: True
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: do_train: True
2022-04-28 16:13:33,512 - INFO - do_val: False
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: do_val: False
2022-04-28 16:13:33,512 - INFO - doc_stride: 128
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: doc_stride: 128
2022-04-28 16:13:33,512 - INFO - epoch: 3
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: epoch: 3
2022-04-28 16:13:33,512 - INFO - ernie_config_path: pretrained-models/ernie_base_1.0_CN/ernie_config.json
[INFO] 2022-04-28 16:13:33,512 [ args.py: 70]: ernie_config_path: pretrained-models/ernie_base_1.0_CN/ernie_config.json
2022-04-28 16:13:33,513 - INFO - for_cn: True
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: for_cn: True
2022-04-28 16:13:33,513 - INFO - in_tokens: False
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: in_tokens: False
2022-04-28 16:13:33,513 - INFO - incr_every_n_steps: 100
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: incr_every_n_steps: 100
2022-04-28 16:13:33,513 - INFO - incr_ratio: 2.0
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: incr_ratio: 2.0
2022-04-28 16:13:33,513 - INFO - init_checkpoint: None
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: init_checkpoint: None
2022-04-28 16:13:33,513 - INFO - init_pretraining_params: pretrained-models/ernie_base_1.0_CN/params
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: init_pretraining_params: pretrained-models/ernie_base_1.0_CN/params
2022-04-28 16:13:33,513 - INFO - is_classify: True
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: is_classify: True
2022-04-28 16:13:33,513 - INFO - is_distributed: False
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: is_distributed: False
2022-04-28 16:13:33,513 - INFO - is_regression: False
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: is_regression: False
2022-04-28 16:13:33,513 - INFO - label_map_config: None
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: label_map_config: None
2022-04-28 16:13:33,513 - INFO - learning_rate: 1e-05
[INFO] 2022-04-28 16:13:33,513 [ args.py: 70]: learning_rate: 1e-05
2022-04-28 16:13:33,514 - INFO - lr_scheduler: linear_warmup_decay
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: lr_scheduler: linear_warmup_decay
2022-04-28 16:13:33,514 - INFO - max_answer_length: 100
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: max_answer_length: 100
2022-04-28 16:13:33,514 - INFO - max_query_length: 64
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: max_query_length: 64
2022-04-28 16:13:33,514 - INFO - max_seq_len: 384
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: max_seq_len: 384
2022-04-28 16:13:33,514 - INFO - metric: simple_accuracy
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: metric: simple_accuracy
2022-04-28 16:13:33,514 - INFO - metrics: True
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: metrics: True
2022-04-28 16:13:33,514 - INFO - n_best_size: 20
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: n_best_size: 20
2022-04-28 16:13:33,514 - INFO - num_iteration_per_drop_scope: 1
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: num_iteration_per_drop_scope: 1
2022-04-28 16:13:33,514 - INFO - num_labels: 2
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: num_labels: 2
2022-04-28 16:13:33,514 - INFO - output_file_name: None
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: output_file_name: None
2022-04-28 16:13:33,514 - INFO - output_item: 3
[INFO] 2022-04-28 16:13:33,514 [ args.py: 70]: output_item: 3
2022-04-28 16:13:33,515 - INFO - p_max_seq_len: 256
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: p_max_seq_len: 256
2022-04-28 16:13:33,515 - INFO - predict_batch_size: None
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: predict_batch_size: None
2022-04-28 16:13:33,515 - INFO - q_max_seq_len: 32
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: q_max_seq_len: 32
2022-04-28 16:13:33,515 - INFO - random_seed: 1
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: random_seed: 1
2022-04-28 16:13:33,515 - INFO - save_steps: 104247
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: save_steps: 104247
2022-04-28 16:13:33,515 - INFO - shuffle: True
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: shuffle: True
2022-04-28 16:13:33,515 - INFO - skip_steps: 10
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: skip_steps: 10
2022-04-28 16:13:33,515 - INFO - task_id: 0
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: task_id: 0
2022-04-28 16:13:33,515 - INFO - test_data_cnt: 1110000
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: test_data_cnt: 1110000
2022-04-28 16:13:33,515 - INFO - test_save: ./checkpoints/test_result
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: test_save: ./checkpoints/test_result
2022-04-28 16:13:33,515 - INFO - test_set: None
[INFO] 2022-04-28 16:13:33,515 [ args.py: 70]: test_set: None
2022-04-28 16:13:33,516 - INFO - tokenizer: FullTokenizer
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: tokenizer: FullTokenizer
2022-04-28 16:13:33,516 - INFO - train_data_size: 1111968
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: train_data_size: 1111968
2022-04-28 16:13:33,516 - INFO - train_set: dureader-retrieval-baseline-dataset/train/cross.train.tsv
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: train_set: dureader-retrieval-baseline-dataset/train/cross.train.tsv
2022-04-28 16:13:33,516 - INFO - use_cross_batch: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_cross_batch: False
2022-04-28 16:13:33,516 - INFO - use_cuda: True
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_cuda: True
2022-04-28 16:13:33,516 - INFO - use_dynamic_loss_scaling: True
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_dynamic_loss_scaling: True
2022-04-28 16:13:33,516 - INFO - use_fast_executor: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_fast_executor: False
2022-04-28 16:13:33,516 - INFO - use_lamb: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_lamb: False
2022-04-28 16:13:33,516 - INFO - use_mix_precision: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_mix_precision: False
2022-04-28 16:13:33,516 - INFO - use_multi_gpu_test: False
[INFO] 2022-04-28 16:13:33,516 [ args.py: 70]: use_multi_gpu_test: False
2022-04-28 16:13:33,517 - INFO - use_recompute: False
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: use_recompute: False
2022-04-28 16:13:33,517 - INFO - validation_steps: 104247
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: validation_steps: 104247
2022-04-28 16:13:33,517 - INFO - verbose: True
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: verbose: True
2022-04-28 16:13:33,517 - INFO - vocab_path: pretrained-models/ernie_base_1.0_CN/vocab.txt
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: vocab_path: pretrained-models/ernie_base_1.0_CN/vocab.txt
2022-04-28 16:13:33,517 - INFO - warmup_proportion: 0.0
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: warmup_proportion: 0.0
2022-04-28 16:13:33,517 - INFO - weight_decay: 0.01
[INFO] 2022-04-28 16:13:33,517 [ args.py: 70]: weight_decay: 0.01
2022-04-28 16:13:33,517 - INFO - ------------------------------------------------
[INFO] 2022-04-28 16:13:33,517 [ args.py: 71]: ------------------------------------------------
2022-04-28 16:13:33,517 - INFO - attention_probs_dropout_prob: 0.1
[INFO] 2022-04-28 16:13:33,517 [ ernie.py: 51]: attention_probs_dropout_prob: 0.1
2022-04-28 16:13:33,517 - INFO - hidden_act: relu
[INFO] 2022-04-28 16:13:33,517 [ ernie.py: 51]: hidden_act: relu
2022-04-28 16:13:33,518 - INFO - hidden_dropout_prob: 0.1
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: hidden_dropout_prob: 0.1
2022-04-28 16:13:33,518 - INFO - hidden_size: 768
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: hidden_size: 768
2022-04-28 16:13:33,518 - INFO - initializer_range: 0.02
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: initializer_range: 0.02
2022-04-28 16:13:33,518 - INFO - max_position_embeddings: 513
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: max_position_embeddings: 513
2022-04-28 16:13:33,518 - INFO - num_attention_heads: 12
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: num_attention_heads: 12
2022-04-28 16:13:33,518 - INFO - num_hidden_layers: 12
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: num_hidden_layers: 12
2022-04-28 16:13:33,518 - INFO - type_vocab_size: 2
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: type_vocab_size: 2
2022-04-28 16:13:33,518 - INFO - vocab_size: 18000
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 51]: vocab_size: 18000
2022-04-28 16:13:33,518 - INFO - ------------------------------------------------
[INFO] 2022-04-28 16:13:33,518 [ ernie.py: 52]: ------------------------------------------------
2022-04-28 16:13:42,480 - INFO - apply sharding 0/2
[INFO] 2022-04-28 16:13:42,480 [reader_ce.py: 251]: apply sharding 0/2
2022-04-28 16:13:42,480 - INFO - Device count: 2
[INFO] 2022-04-28 16:13:42,480 [ train_ce.py: 116]: Device count: 2
2022-04-28 16:13:42,480 - INFO - Num train examples: 1111968
[INFO] 2022-04-28 16:13:42,480 [ train_ce.py: 117]: Num train examples: 1111968
2022-04-28 16:13:42,481 - INFO - Max train steps: 208494
[INFO] 2022-04-28 16:13:42,481 [ train_ce.py: 118]: Max train steps: 208494
2022-04-28 16:13:42,481 - INFO - Num warmup steps: 0
[INFO] 2022-04-28 16:13:42,481 [ train_ce.py: 119]: Num warmup steps: 0
2022-04-28 16:13:42,481 - WARNING - paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/io.py:721: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
'paddle.fluid.layers.py_reader() may be deprecated in the near future. '
[WARNING] 2022-04-28 16:13:42,481 [ io.py: 721]: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
I0428 16:13:42.482144 2386 reader_py.cc:385] init_lod_tensor_blocking_queue
2022-04-28 16:13:46,320 - WARNING - set use_hierarchical_allreduce=False since you only have 1 node.
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/ernie.py:128
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/ernie.py:129
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/transformer_encoder.py:118
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/layers/math_op_patch.py:298: UserWarning: /work/src/model/transformer_encoder.py:207
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/clip.py:631: UserWarning: Caution! 'set_gradient_clip' is not recommended and may be deprecated in future! We recommend a new strategy: set 'grad_clip' when initializing the 'optimizer'. This method can reduce the mistakes, please refer to documention of 'optimizer'.
warnings.warn("Caution! 'set_gradient_clip' is not recommended "
/usr/local/python3.5.1/lib/python3.5/site-packages/paddle/fluid/incubate/fleet/collective/__init__.py:394: DeprecationWarning: The 'warn' function is deprecated, use 'warning' instead
"set use_hierarchical_allreduce=False since you only have 1 node."
[WARNING] 2022-04-28 16:13:46,320 [ __init__.py: 394]: set use_hierarchical_allreduce=False since you only have 1 node.
API is deprecated since 2.0.0 Please use FleetAPI instead.
WIKI: https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:43155']
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.0.1:43155']
2022-04-28 16:13:52,735 - INFO - Theoretical memory usage in training: 31304.218 - 32794.895 MB
[INFO] 2022-04-28 16:13:52,735 [ train_ce.py: 174]: Theoretical memory usage in training: 31304.218 - 32794.895 MB
W0428 16:13:52.836230 2386 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 10.1, Runtime API Version: 10.1
W0428 16:13:52.841756 2386 device_context.cc:372] device: 0, cuDNN Version: 7.6.
I0428 16:13:57.318223 2386 gen_nccl_id_op.cc:92] trainer_id:0, use_hierarchical_allreduce:0, nccl_comm_num:1, inter_nranks:0, inter_trainer_id:-1, exter_trainer_id:-1, trainers:127.0.0.1:54640,127.0.0.1:43155,
2022-04-28 16:13:58,097 - INFO - Load pretraining parameters from pretrained-models/ernie_base_1.0_CN/params.
[INFO] 2022-04-28 16:13:58,097 [ init.py: 74]: Load pretraining parameters from pretrained-models/ernie_base_1.0_CN/params.
I0428 16:13:58.409628 2386 parallel_executor.cc:662] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0428 16:13:58.409691 2386 parallel_executor.cc:270] not find NCCLCommunicator in scope, so recreate it!
I0428 16:13:58.409705 2386 parallel_executor.cc:137] nccl comm num:1, nranks:2, num_trainers:2, trainer_id:0
I0428 16:13:58.414196 2386 nccl_helper.h:133] init nccl rank:0, nranks:2, gpu_id:0, dev_id:0
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): (External) Nccl error, unhandled cuda error, detail: Resource temporarily unavailable
Please try one of the following solutions:
1. export NCCL_SHM_DISABLE=1;
2. export NCCL_P2P_LEVEL=SYS;
3. Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g.
(at /paddle/paddle/fluid/platform/nccl_helper.h:72)
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<std::string, std::allocator<std::string > > const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)
1 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope*, paddle::framework::details::BuildStrategy*)
2 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope*, paddle::framework::details::BuildStrategy const&)
3 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, std::vector<ncclUniqueId*, std::allocator<ncclUniqueId*> > const&, unsigned long, unsigned long)
4 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<paddle::platform::Place, std::allocator<paddle::platform::Place> > const&, ncclUniqueId*, unsigned long, unsigned long)
5 paddle::framework::SignalHandle(char const*, int)
6 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()
----------------------
Error Message Summary:
----------------------
FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1651162438 (unix time) try "date -d @1651162438" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x952) received by PID 2386 (TID 0x7f5594df8700) from PID 2386 ***]
Have you tried these three solutions? One way to apply them is sketched after the list below.
- export NCCL_SHM_DISABLE=1;
- export NCCL_P2P_LEVEL=SYS;
- Increase shared memory by setting the -shm-size option when starting docker container, e.g., setting -shm-size=2g. (at /paddle/paddle/fluid/platform/nccl_helper.h:72)
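If not, here is a minimal sketch of trying them in order before relaunching, assuming the same two-card `paddle.distributed.launch` setup as in the logs (the launch command itself is an assumption; substitute your own):

```bash
# 1) Disable NCCL's shared-memory transport; this often helps when
#    /dev/shm is small, e.g. inside a container.
export NCCL_SHM_DISABLE=1

# 2) If the abort persists, route peer-to-peer traffic through the
#    system; RTX cards without NVLink frequently lack direct GPU P2P.
export NCCL_P2P_LEVEL=SYS

# Optional: verbose NCCL logging to see which transport is failing.
export NCCL_DEBUG=INFO

# 3) If the job runs in Docker, also restart the container with a larger
#    shared-memory segment, e.g.: docker run --shm-size=2g ...
python -m paddle.distributed.launch --gpus "0,1" train_ce.py ...
```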