
Failed to run alpa installation test

Open · gaow0007 opened this issue on Jun 17, 2023 · 2 comments

Please describe the bug
Running python3 -m alpa.test_install fails: test_2_pipeline_parallel raises NCCL_ERROR_UNHANDLED_CUDA_ERROR while initializing the cross-mesh p2p NCCL communicators.

Please describe the expected behavior
Both tests in alpa.test_install should pass without errors.

System information and environment

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu 18.04
  • Python version: 3.9 (note: the traceback paths below point at a python3.8 environment)
  • CUDA version: 11.1
  • NCCL version: 2.8.4
  • cupy version: 11.1
  • GPU model and memory: RTX 2080, 11264 MiB
  • Alpa version: 0.2.3
  • TensorFlow version:
  • JAX version: 0.3.22
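
Before digging into the trace, it can help to confirm that the versions Alpa actually imports match the ones listed above, especially given the python3.8 paths in the traceback. A minimal check, not part of the original report, assuming the packages import cleanly and expose the usual version attributes:

```python
# Print the versions visible inside the active environment
# (a diagnostic sketch, not part of the original report).
import alpa
import cupy
import jax
from cupy.cuda import nccl

print("alpa:", alpa.__version__)
print("jax:", jax.__version__)
print("cupy:", cupy.__version__)
print("nccl:", nccl.get_version())
```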

To Reproduce
Steps to reproduce the behavior:

  1. Run python3 -m alpa.test_install
  2. See error
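
Since the failure surfaces inside cupy's NCCL bindings, a useful first step is to exercise those bindings directly, outside Alpa and Ray. The following diagnostic sketch is not part of the original report and assumes cupy was built with NCCL support; if it also fails, the problem lies in the CUDA/NCCL/cupy installation rather than in Alpa:

```python
# Create a single-rank NCCL communicator straight through cupy.
from cupy.cuda import nccl

print("NCCL version reported by cupy:", nccl.get_version())
uid = nccl.get_unique_id()               # unique id shared by the group
comm = nccl.NcclCommunicator(1, uid, 0)  # world_size=1, rank=0
print("single-rank NCCL communicator created OK")
```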

Screenshots
If applicable, add screenshots to help explain your problem.

2023-06-17 22:59:20,085	INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 155.69.142.146:6379...
2023-06-17 22:59:20,120	INFO worker.py:1528 -- Connected to Ray cluster.
(raylet) [2023-06-17 22:59:27,687 E 25332 25478] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-06-17_22-09-42_273283_25013 is over 95% full, available space: 21533958144; capacity: 730542596096. Object creation will fail if spilling is required.
E
Exception ignored in: <function PipeshardDriverExecutable.__del__ at 0x7fe295cbc940>
Traceback (most recent call last):
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 434, in __del__
    mesh.delete_remote_executable(self.exec_uuid)
AttributeError: 'PipeshardDriverExecutable' object has no attribute 'exec_uuid'
2023-06-17 22:59:29,665	ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.init_p2p_communicator() (pid=16323, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fbdcf679430>)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
    g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
    self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
    comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
    comm = NcclCommunicator(world_size, nccl_unique_id, rank)
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 65, in <module>
    runner.run(suite())
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/runner.py", line 176, in run
    test(result)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/suite.py", line 122, in run
    test(result)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 736, in __call__
    return self.run(*args, **kwds)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 676, in run
    self._callTestMethod(testMethod)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
    method()
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
    actual_output = p_train_step(state, batch)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 121, in __call__
    self._decode_args_and_get_executable(*args))
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 191, in _decode_args_and_get_executable
    executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/jax/linear_util.py", line 309, in memoized_fun
    ans = call(fun, *args)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 223, in _compile_parallel_executable
    return method.compile_executable(fun, in_tree, out_tree_thunk,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/parallel_method.py", line 240, in compile_executable
    return compile_pipeshard_executable(
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 118, in compile_pipeshard_executable
    executable = PipeshardDriverExecutable(
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 105, in __init__
    task.create_resharding_communicators()
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 292, in create_resharding_communicators
    ray.get(task_dones)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=16322, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f9ab240f460>)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
    g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
    self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
    comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
    comm = NcclCommunicator(world_size, nccl_unique_id, rank)
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
    actual_output = p_train_step(state, batch)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 118, in compile_pipeshard_executable
    executable = PipeshardDriverExecutable(
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 292, in create_resharding_communicators
    ray.get(task_dones)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=16322, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f9ab240f460>)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
    g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
    self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
    comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
  File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
    comm = NcclCommunicator(world_size, nccl_unique_id, rank)
  File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
  File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error

----------------------------------------------------------------------
Ran 2 tests in 20.923s

FAILED (errors=1)
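
Two distinct errors appear in the log: the root NcclError raised from init_p2p_communicator, and a secondary "Exception ignored in __del__" ending in an AttributeError. The latter happens because PipeshardDriverExecutable.__init__ raised before self.exec_uuid was assigned, so the destructor runs on a partially constructed object. A minimal sketch of that failure mode using a hypothetical class (not Alpa's actual code):

```python
# If __init__ raises partway through, __del__ still runs on the
# partially constructed object when it is garbage collected.
class Executable:
    def __init__(self):
        raise RuntimeError("init failed before exec_uuid was set")
        self.exec_uuid = 42  # never reached

    def __del__(self):
        # Guarding with getattr avoids the secondary AttributeError.
        exec_uuid = getattr(self, "exec_uuid", None)
        if exec_uuid is not None:
            print("cleaning up", exec_uuid)

try:
    Executable()
except RuntimeError as e:
    print("caught:", e)
```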

Code snippet to reproduce the problem

No extra code is needed; the failure comes from python3 -m alpa.test_install itself.

Additional information
Add any other context about the problem here or include any logs that would be helpful to diagnose the problem.
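
As additional context: NCCL_ERROR_UNHANDLED_CUDA_ERROR usually wraps an underlying CUDA failure (driver/toolkit mismatch, missing peer access, out of memory), and NCCL's own logging will print what that underlying error was. A hedged suggestion, using the standard NCCL_DEBUG environment variables, set before Alpa initializes:

```python
# Enable NCCL's own logging before Alpa/Ray create any communicators
# (a diagnostic sketch, not from the original report).
import os
os.environ["NCCL_DEBUG"] = "INFO"             # standard NCCL env var
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,P2P"  # focus on init and p2p paths

import alpa
alpa.init(cluster="ray")  # assumes the same running Ray cluster as above
```

Note that with a pre-started Ray cluster, the variables may need to be exported before ray start so that the worker processes inherit them.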

gaow0007 · Jun 17 '23, 15:06