
Multi-processing error when dataset is large and using distributed node embeddings

Open CongWeilin opened this issue 2 years ago • 5 comments

Hello, does anyone have an idea why this ChildFailedError is raised with num_trainers=1, num_samplers=16? The issue only occurs when working with a large graph and using distributed node embeddings; when switching to a smaller subsampled dataset, the error does not occur. It doesn't seem to be a memory problem: checking the available memory with free -h shows there is still plenty available.
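For reference, the run goes through GraphStorm's distributed launcher, roughly like this. This is a sketch reconstructed from the logs below; the script, config, and partition paths are taken from them, but the exact command line is an assumption:

# Sketch of the launch command implied by the settings above; file names and
# paths come from the logs below, the rest is assumed.
python3 -m graphstorm.run.launch \
    --workspace /home/ubuntu/workspace/step_by_step_dev \
    --part-config /home/ubuntu/7days_fullgraph_constructed_without_node_feats/Cramer.json \
    --ip-config /tmp/ip_list_single_machine.txt \
    --num-trainers 1 \
    --num-samplers 16 \
    --num-servers 1 \
    main_ssl.py --cf /tmp/graphstorm_train_script_ssl_config_gpu1.yaml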

This is all the error message I have:

  warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 3300512) of binary: /home/ubuntu/pytorch-1-12-0/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/pytorch-1-12-0/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/pytorch-1-12-0/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/pytorch-1-12-0/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/pytorch-1-12-0/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/ubuntu/pytorch-1-12-0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/pytorch-1-12-0/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=========================================================
main_ssl.py FAILED
---------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-06_05:08:41
  host      : ip-172-31-11-219.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 3300512)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3300512

Using the latest graphstorm==0.2 with pytorch==1.12.0.

CongWeilin avatar Jul 06 '23 05:07 CongWeilin
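For context: exitcode -9 means the worker was terminated with SIGKILL, and the most common sender is the kernel OOM killer, which can fire on a short allocation spike even when free -h looked healthy moments earlier. A minimal check on the failing host (assumes Linux with sudo; the grep patterns are illustrative):

# Look for OOM-killer activity around the failure time; requires root on most hosts.
sudo dmesg -T | grep -i -E 'oom|killed process'
# On systemd-based hosts the kernel log can also be read with:
sudo journalctl -k | grep -i -E 'oom|killed process'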

When using num_samplers=0 (no multi-processing in sampling), another error is raised:

bash: line 1: 3309141 Bus error               (core dumped) /home/ubuntu/pytorch-1-12-0/bin/python3 main_ssl.py --cf /tmp/graphstorm_train_script_ssl_config_gpu1.yaml --ip-config /tmp/ip_list_single_machine.txt --part-config /home/ubuntu/7days_fullgraph_constructed_without_node_feats/Cramer.json --verbose False
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 127.0.0.1 'cd /home/ubuntu/workspace/step_by_step_dev; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=1 DGL_CONF_PATH=/home/ubuntu/7days_fullgraph_constructed_without_node_feats/Cramer.json DGL_IP_CONFIG=/tmp/ip_list_single_machine.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc  DGL_SERVER_ID=0; /home/ubuntu/pytorch-1-12-0/bin/python3 main_ssl.py --cf /tmp/graphstorm_train_script_ssl_config_gpu1.yaml --ip-config /tmp/ip_list_single_machine.txt --part-config /home/ubuntu/7days_fullgraph_constructed_without_node_feats/Cramer.json --verbose False)'' returned non-zero exit status 135.
^C2023-07-06 14:12:36,683 INFO Stop launcher
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 127.0.0.1 'cd /home/ubuntu/workspace/step_by_step_dev; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=1 DGL_CONF_PATH=/home/ubuntu/7days_fullgraph_constructed_without_node_feats/Cramer.json DGL_IP_CONFIG=/tmp/ip_list_single_machine.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=96 DGL_GROUP_ID=0 ; /home/ubuntu/pytorch-1-12-0/bin/python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 --master_addr=127.0.0.1 --master_port=1234 main_ssl.py --cf /tmp/graphstorm_train_script_ssl_config_gpu1.yaml --ip-config /tmp/ip_list_single_machine.txt --part-config /home/ubuntu/7days_fullgraph_constructed_without_node_feats/Cramer.json --verbose False)'' died with <Signals.SIGINT: 2>.
kill process 3309196 on 127.0.0.1:22
kill process 3309197 on 127.0.0.1:22
kill process 3309198 on 127.0.0.1:22
kill process 3309268 on 127.0.0.1:22
kill process 3309198 on 127.0.0.1:22
kill process 3309268 on 127.0.0.1:22
kill process 3309268 on 127.0.0.1:22
cleanup process exits

CongWeilin avatar Jul 06 '23 14:07 CongWeilin
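For context: exit status 135 corresponds to signal 7 (SIGBUS, since 135 = 128 + 7). In DGL's distributed mode the server process maps the graph partition into shared memory, so a full /dev/shm is a common trigger for SIGBUS that free -h does not surface. A quick check, with an illustrative remount size:

# Shared-memory usage is reported separately from ordinary free memory.
df -h /dev/shm
# If /dev/shm is too small to hold the partition, it can be enlarged in place
# (200G is an illustrative size, not a recommendation):
sudo mount -o remount,size=200G /dev/shm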

can you try num_samplers=0?

zheng-da avatar Jul 06 '23 21:07 zheng-da

can you try num_samplers=0?

https://github.com/awslabs/graphstorm/issues/314#issuecomment-1623781826 — it gives another bus error without any other information.

CongWeilin avatar Jul 06 '23 21:07 CongWeilin

is it an out-of-memory error?

zheng-da avatar Jul 06 '23 21:07 zheng-da

is it an out-of-memory error?

I was using free -h to check the memory before the error appeared; it seemed there was still a large amount of free memory (around 180G).

CongWeilin avatar Jul 06 '23 21:07 CongWeilin
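For context: a single free -h reading taken before the failure can miss a transient spike, so sampling memory and shared-memory usage once per second for the whole run gives a clearer picture. A minimal sketch (mem_log.txt is an illustrative output file):

# Log overall and shared-memory usage every second for the duration of the run,
# so a spike just before the kill is captured in mem_log.txt.
while true; do date; free -h; df -h /dev/shm; echo; sleep 1; done | tee mem_log.txt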