Failed to run the Alpa installation test (NCCL_ERROR_UNHANDLED_CUDA_ERROR)
Please describe the bug
Running python3 -m alpa.test_install fails. The first test passes, but test_2_pipeline_parallel errors out with cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR while MeshHostWorker.init_p2p_communicator is creating the cross-mesh NCCL p2p communicator (full log below).
Please describe the expected behavior
Both installation tests should pass.
System information and environment
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu 18.04
- Python version: 3.8 (the site-packages paths in the traceback show a python3.8 conda environment)
- CUDA version: 11.1
- NCCL version: 2.8.4
- cupy version: 11.1
- GPU model and memory: RTX 2080, 11264 MiB
- Alpa version: 0.2.3
- TensorFlow version:
- JAX version: 0.3.22
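For reference, here is a minimal sanity check of the cupy/NCCL pairing on this machine. This is hypothetical diagnostic code, not part of alpa, and it only exercises a one-rank communicator, not the cross-GPU p2p path that fails below:

import cupy
from cupy.cuda import nccl

# Confirm the CUDA runtime and NCCL versions that cupy is using.
print("CUDA runtime version:", cupy.cuda.runtime.runtimeGetVersion())
print("NCCL version:", nccl.get_version())

# world_size=1, rank=0: exercises ncclCommInitRank without any peer traffic.
uid = nccl.get_unique_id()
comm = nccl.NcclCommunicator(1, uid, 0)
print("one-rank NCCL communicator created OK")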
To Reproduce
Steps to reproduce the behavior:
- python3 -m alpa.test_install
- See error
Screenshots
Full console output of the failing run:
2023-06-17 22:59:20,085 INFO worker.py:1342 -- Connecting to existing Ray cluster at address: 155.69.142.146:6379...
2023-06-17 22:59:20,120 INFO worker.py:1528 -- Connected to Ray cluster.
(raylet) [2023-06-17 22:59:27,687 E 25332 25478] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-06-17_22-09-42_273283_25013 is over 95% full, available space: 21533958144; capacity: 730542596096. Object creation will fail if spilling is required.
E
Exception ignored in: <function PipeshardDriverExecutable.__del__ at 0x7fe295cbc940>
Traceback (most recent call last):
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 434, in __del__
2023-06-17 22:59:29,665 ERROR worker.py:400 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::MeshHostWorker.init_p2p_communicator() (pid=16323, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7fbdcf679430>)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
mesh.delete_remote_executable(self.exec_uuid)
AttributeError: 'PipeshardDriverExecutable' object has no attribute 'exec_uuid'
======================================================================
ERROR: test_2_pipeline_parallel (__main__.InstallationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 65, in <module>
runner.run(suite())
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/runner.py", line 176, in run
test(result)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/suite.py", line 84, in __call__
return self.run(*args, **kwds)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/suite.py", line 122, in run
test(result)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 736, in __call__
return self.run(*args, **kwds)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 676, in run
self._callTestMethod(testMethod)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/unittest/case.py", line 633, in _callTestMethod
method()
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
actual_output = p_train_step(state, batch)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 121, in __call__
self._decode_args_and_get_executable(*args))
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 191, in _decode_args_and_get_executable
executable = _compile_parallel_executable(f, in_tree, out_tree_hashable,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/jax/linear_util.py", line 309, in memoized_fun
ans = call(fun, *args)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/api.py", line 223, in _compile_parallel_executable
return method.compile_executable(fun, in_tree, out_tree_thunk,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/parallel_method.py", line 240, in compile_executable
return compile_pipeshard_executable(
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 118, in compile_pipeshard_executable
executable = PipeshardDriverExecutable(
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/pipeshard_executable.py", line 105, in __init__
task.create_resharding_communicators()
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 292, in create_resharding_communicators
ray.get(task_dones)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
raise value.as_instanceof_cause()
jax._src.traceback_util.UnfilteredStackTrace: ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=16322, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f9ab240f460>)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/test_install.py", line 49, in test_2_pipeline_parallel
actual_output = p_train_step(state, batch)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/compile_executable.py", line 118, in compile_pipeshard_executable
executable = PipeshardDriverExecutable(
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/pipeline_parallel/cross_mesh_resharding.py", line 292, in create_resharding_communicators
ray.get(task_dones)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(NcclError): ray::MeshHostWorker.init_p2p_communicator() (pid=16322, ip=155.69.142.146, repr=<alpa.device_mesh.MeshHostWorker object at 0x7f9ab240f460>)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/device_mesh.py", line 391, in init_p2p_communicator
g.create_p2p_communicator(my_gpu_idx, peer_rank, peer_gpu_idx, nccl_uid)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 662, in create_p2p_communicator
self._get_nccl_p2p_communicator(comm_key, my_gpu_idx, peer_rank,
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_collective_group.py", line 532, in _get_nccl_p2p_communicator
comm = nccl_util.create_nccl_communicator(2, nccl_uid, my_p2p_rank)
File "/home/gaowei/miniconda3/envs/alpa/lib/python3.8/site-packages/alpa/collective/collective_group/nccl_util.py", line 115, in create_nccl_communicator
comm = NcclCommunicator(world_size, nccl_unique_id, rank)
File "cupy_backends/cuda/libs/nccl.pyx", line 283, in cupy_backends.cuda.libs.nccl.NcclCommunicator.__init__
File "cupy_backends/cuda/libs/nccl.pyx", line 129, in cupy_backends.cuda.libs.nccl.check_status
cupy_backends.cuda.libs.nccl.NcclError: NCCL_ERROR_UNHANDLED_CUDA_ERROR: unhandled cuda error
----------------------------------------------------------------------
Ran 2 tests in 20.923s
FAILED (errors=1)
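The exception is raised while create_nccl_communicator(2, nccl_uid, my_p2p_rank) builds a two-rank p2p communicator, so one obvious thing to check is whether the GPUs can reach each other at the CUDA level. A minimal sketch, assuming at least two visible GPUs (the enumeration code is mine, not alpa's); lack of peer access alone should not be fatal to NCCL, but a CUDA-level failure here is a plausible source of NCCL_ERROR_UNHANDLED_CUDA_ERROR:

from cupy.cuda import runtime

# Enumerate devices and query CUDA peer access in both directions.
n = runtime.getDeviceCount()
print(f"{n} visible GPU(s)")
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = runtime.deviceCanAccessPeer(i, j)
        print(f"GPU {i} -> GPU {j}: peer access = {bool(ok)}")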
Code snippet to reproduce the problem
None beyond the installation test itself; the failure comes from python3 -m alpa.test_install.
Additional information
The raylet also warns that the Ray session directory under /tmp/ray is over 95% full; that may be unrelated, but I am noting it in case object spilling interacts with the failure.
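One follow-up still worth trying is rerunning with NCCL's own logging enabled so the underlying CUDA error gets printed. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, but they have to be set in the environment the Ray workers inherit (i.e., before ray start on each node), not just in the driver shell. A sketch for the single-node case where the driver launches everything itself:

import os
import runpy

# Standard NCCL env vars; they must be exported before the processes that
# actually call NCCL are spawned. With a pre-existing Ray cluster, set them
# before `ray start` on every node instead.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,P2P"

# Re-run the installation test in-process, equivalent to
# `python3 -m alpa.test_install`.
runpy.run_module("alpa.test_install", run_name="__main__", alter_sys=True)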