ray_lightning
GPU memory not cleaned properly when using multiple workers in DataLoader
Hi all!
When I try to use multiple workers in the DataLoader by specifying num_workers, some of the processes stay alive after the run and occupy GPU memory.
For my tests I am using this script and the following setup:
python3.8.10
pytorch=1.9.0=py3.8_cuda11.1_cudnn8.0.5_0
pytorch-lightning==1.4.2
ray==1.6.0
ray_lightning==0.1.1
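For context, here is a rough sketch of the kind of setup I mean (this is not the actual script linked above; the model and dataset are trivial stand-ins, the relevant parts are num_workers on the DataLoader and num_cpus_per_worker on the RayPlugin):

```python
# Rough sketch only - the model/dataset are placeholders for the linked script.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray_lightning import RayPlugin


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 2))
train_loader = DataLoader(dataset, batch_size=32, num_workers=4)  # DataLoader workers

plugin = RayPlugin(
    num_workers=1,          # Ray training workers (placeholder value)
    num_cpus_per_worker=4,  # increased together with the DataLoader num_workers
    use_gpu=True,
)

trainer = pl.Trainer(max_epochs=1, plugins=[plugin])
trainer.fit(ToyModel(), train_loader)
```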
When setting e.g. num_workers=4 in the DataLoader, I also increase num_cpus_per_worker in the RayPlugin to 4 (I also tried a number larger than the num_workers specified in the DataLoader). I was not able to make this issue fully reproducible, but in 3 out of 4 runs the GPU memory is not cleaned and several ray::RayExecutor.execute() processes are left sleeping.
# output of ps aux | grep ray
markus.+ 21130 1.2 0.1 210983644 2969908 pts/8 S 10:36 0:00 ray::RayExecutor.execute()
markus.+ 21131 1.4 0.1 211112136 2963912 pts/8 S 10:36 0:01 ray::RayExecutor.execute()
markus.+ 21275 1.4 0.1 210982456 2967680 pts/8 S 10:36 0:01 ray::RayExecutor.execute()
I have managed to get rid of the problem for now by setting persistent_workers=True in the DataLoader, but these processes should be cleaned up in any case.
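For clarity, the workaround is only the extra flag on the DataLoader (same placeholder dataset as in the sketch above):

```python
# Workaround: keep the DataLoader worker processes alive between epochs so
# they are reused instead of respawned (and, in my case, left behind).
train_loader = DataLoader(dataset, batch_size=32, num_workers=4,
                          persistent_workers=True)
```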
Quick update on this:
Even though I thought persistent_workers=True cleaned up the processes properly, I found that something very weird happens: the BAR1 memory usage is not released. In the end, I was not even able to put a very simple Tensor on the GPU.
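For reference, the memory report below is what the NVIDIA memory query (e.g. nvidia-smi -q -d MEMORY) shows for the affected GPU: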
GPU 00000000:1B:00.0
FB Memory Usage
Total : 15109 MiB
Used : 3 MiB
Free : 15106 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 255 MiB
Free : 1 MiB
>>> import torch
>>> torch.tensor([1], device=0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
>>>
After I found this, I also noticed that the same thing happens when I kill the ray::RayExecutor.execute() processes that are not cleaned up properly without persistent_workers=True.
I tried deleting the DataLoaders and calling GC manually to see if this helps, but without success.
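What I tried was roughly the following (sketch; the loader name is illustrative):

```python
# Manual cleanup attempt - dropping the DataLoader reference and forcing a
# garbage collection pass did not release the GPU memory.
import gc

del train_loader
gc.collect()
```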
I also noticed that when an error occurs, everything is cleaned up.
However, I think I can narrow down where the problem is coming from. If the session is killed here before the training has finished, post_dispatch from the RayPlugin is not called. It does get called when I use the "on_test_end" hook without implementing trainer.test(). It also gets called with "on_train_end", but this leads to:
ray::RayExecutor.execute() (pid=42260, ip=10.50.0.19, repr=<ray_lightning.ray_ddp.RayExecutor object at 0x7f903d26c7f0>)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 54, in execute
return fn(*args, **kwargs)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 298, in execute_remote
super(RayPlugin, self).new_process(
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 201, in new_process
results = trainer.run_stage()
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 118, in advance
_, (batch, is_last) = next(dataloader_iter)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/profiler/base.py", line 104, in profile_iterable
value = next(iterator)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 629, in prefetch_iterator
for val in it:
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 546, in __next__
return self.request_next_batch(self.loader_iters)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 574, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next_fn)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 96, in apply_to_collection
return function(data, *args, **kwargs)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/pytorch_lightning/trainer/supporters.py", line 561, in next_fn
batch = next(iterator)
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
success, data = self._try_get_data()
File "/iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 49241) exited unexpectedly
(pid=42260) terminate called after throwing an instance of 'c10::CUDAError'
(pid=42260) what(): CUDA error: initialization error
(pid=42260) CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
(pid=42260) For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(pid=42260) Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1623448278899/work/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
(pid=42260) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f617b972a22 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libc10.so)
(pid=42260) frame #1: <unknown function> + 0x10ebe (0x7f617bbd4ebe in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
(pid=42260) frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f617bbd6167 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
(pid=42260) frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f617b95c5a4 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libc10.so)
(pid=42260) frame #4: <unknown function> + 0xa249ba (0x7f616b2ff9ba in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
(pid=42260) frame #5: )( + 0x6857e (0x55f02952c57e in ray::RayExecutor.execute)
(pid=42260) frame #6: )( + 0x1670f2 (0x55f02962b0f2 in ray::RayExecutor.execute)
(pid=42260) frame #7: )( + 0x1e9455 (0x55f0296ad455 in ray::RayExecutor.execute)
(pid=42260) frame #8: )(_PyObject_GC_New + 0xc6 (0x55f0296378d6 in ray::RayExecutor.execute)
(pid=42260) frame #9: )(PyWeakref_NewRef + 0x57 (0x55f02965e847 in ray::RayExecutor.execute)
(pid=42260) frame #10: )(PyType_Ready + 0x827 (0x55f02962c577 in ray::RayExecutor.execute)
(pid=42260) frame #11: )( + 0x1b5da2 (0x55f029679da2 in ray::RayExecutor.execute)
(pid=42260) frame #12: )( + 0x1a2d55 (0x55f029666d55 in ray::RayExecutor.execute)
(pid=42260) frame #13: )(_PyObject_MakeTpCall + 0x158 (0x55f02962fda8 in ray::RayExecutor.execute)
(pid=42260) frame #14: )(_PyObject_FastCallDict + 0xa1 (0x55f029647211 in ray::RayExecutor.execute)
(pid=42260) frame #15: )( + 0x1a5dc2 (0x55f029669dc2 in ray::RayExecutor.execute)
(pid=42260) frame #16: )( + 0x16f753 (0x55f029633753 in ray::RayExecutor.execute)
(pid=42260) frame #17: )( + 0x145e14 (0x55f029609e14 in ray::RayExecutor.execute)
(pid=42260) frame #18: )(_PyEval_EvalCodeWithName + 0x952 (0x55f02965c132 in ray::RayExecutor.execute)
(pid=42260) frame #19: )(_PyFunction_Vectorcall + 0x1ff (0x55f02965c85f in ray::RayExecutor.execute)
(pid=42260) frame #20: )( + 0x1a3549 (0x55f029667549 in ray::RayExecutor.execute)
(pid=42260) frame #21: )( + 0x88daf (0x55f02954cdaf in ray::RayExecutor.execute)
(pid=42260) frame #22: )(_PyObject_CallFunction_SizeT + 0x99 (0x55f029630499 in ray::RayExecutor.execute)
(pid=42260) frame #23: <unknown function> + 0x6d7fb (0x7f9051c9c7fb in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so)
(pid=42260) frame #24: <unknown function> + 0x71927 (0x7f9051ca0927 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so)
(pid=42260) frame #25: <unknown function> + 0x58eb5 (0x7f9051c87eb5 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so)
(pid=42260) frame #26: <unknown function> + 0x599f3 (0x7f9051c889f3 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so)
(pid=42260) frame #27: <unknown function> + 0x719ea (0x7f9051ca09ea in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so)
(pid=42260) frame #28: <unknown function> + 0x71f19 (0x7f9051ca0f19 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so)
(pid=42260) frame #29: <unknown function> + 0x10483b (0x7f9051d3383b in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-x86_64-linux-gnu.so)
(pid=42260) frame #30: )(PyCFunction_Call + 0x54 (0x55f02966e834 in ray::RayExecutor.execute)
(pid=42260) frame #31: )(_PyObject_MakeTpCall + 0x158 (0x55f02962fda8 in ray::RayExecutor.execute)
(pid=42260) frame #32: )(_PyEval_EvalFrameDefault + 0x1bf6 (0x55f02968bea6 in ray::RayExecutor.execute)
(pid=42260) frame #33: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #34: )( + 0x146149 (0x55f02960a149 in ray::RayExecutor.execute)
(pid=42260) frame #35: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #36: )(_PyObject_FastCallDict + 0x56 (0x55f0296471c6 in ray::RayExecutor.execute)
(pid=42260) frame #37: )(_PyObject_Call_Prepend + 0x67 (0x55f0296b4a27 in ray::RayExecutor.execute)
(pid=42260) frame #38: )( + 0x1f0aa8 (0x55f0296b4aa8 in ray::RayExecutor.execute)
(pid=42260) frame #39: )(_PyObject_MakeTpCall + 0x158 (0x55f02962fda8 in ray::RayExecutor.execute)
(pid=42260) frame #40: )(_PyEval_EvalFrameDefault + 0x3c3 (0x55f02968a673 in ray::RayExecutor.execute)
(pid=42260) frame #41: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #42: )(_PyObject_FastCallDict + 0x56 (0x55f0296471c6 in ray::RayExecutor.execute)
(pid=42260) frame #43: )(_PyObject_Call_Prepend + 0x67 (0x55f0296b4a27 in ray::RayExecutor.execute)
(pid=42260) frame #44: )( + 0x1f0aa8 (0x55f0296b4aa8 in ray::RayExecutor.execute)
(pid=42260) frame #45: )(_PyObject_MakeTpCall + 0x158 (0x55f02962fda8 in ray::RayExecutor.execute)
(pid=42260) frame #46: )(_PyEval_EvalFrameDefault + 0x4448 (0x55f02968e6f8 in ray::RayExecutor.execute)
(pid=42260) frame #47: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #48: )( + 0x889fd (0x55f02954c9fd in ray::RayExecutor.execute)
(pid=42260) frame #49: )(_PyObject_FastCall_Prepend + 0x63 (0x55f029551ee3 in ray::RayExecutor.execute)
(pid=42260) frame #50: )( + 0x182bb4 (0x55f029646bb4 in ray::RayExecutor.execute)
(pid=42260) frame #51: )(PyObject_GetItem + 0x50 (0x55f0296679c0 in ray::RayExecutor.execute)
(pid=42260) frame #52: )(_PyEval_EvalFrameDefault + 0xf95 (0x55f02968b245 in ray::RayExecutor.execute)
(pid=42260) frame #53: )(_PyEval_EvalCodeWithName + 0x952 (0x55f02965c132 in ray::RayExecutor.execute)
(pid=42260) frame #54: )(_PyFunction_Vectorcall + 0x19b (0x55f02965c7fb in ray::RayExecutor.execute)
(pid=42260) frame #55: )( + 0x145e14 (0x55f029609e14 in ray::RayExecutor.execute)
(pid=42260) frame #56: )(_PyEval_EvalCodeWithName + 0x952 (0x55f02965c132 in ray::RayExecutor.execute)
(pid=42260) frame #57: )(_PyFunction_Vectorcall + 0x19b (0x55f02965c7fb in ray::RayExecutor.execute)
(pid=42260) frame #58: )( + 0x146254 (0x55f02960a254 in ray::RayExecutor.execute)
(pid=42260) frame #59: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #60: )(PyVectorcall_Call + 0x6e (0x55f029630ede in ray::RayExecutor.execute)
(pid=42260) frame #61: )(_PyEval_EvalFrameDefault + 0x4e70 (0x55f02968f120 in ray::RayExecutor.execute)
(pid=42260) frame #62: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #63: )( + 0x146254 (0x55f02960a254 in ray::RayExecutor.execute)
(pid=42260)
(pid=42260) *** SIGABRT received at time=1633531373 on cpu 30 ***
(pid=42260) PC: @ 0x7f90551e5ce1 (unknown) raise
(pid=42260) @ 0x7f9055383140 (unknown) (unknown)
(pid=42260) @ 0x74696e69203a726f (unknown) (unknown)
(pid=42265) ##################################### KILL THEM
(pid=42260) terminate called after throwing an instance of 'c10::CUDAError'
(pid=42260) what(): CUDA error: initialization error
(pid=42260) CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
(pid=42260) For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(pid=42260) Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1623448278899/work/c10/cuda/CUDACachingAllocator.cpp:1089 (most recent call first):
(pid=42260) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f617b972a22 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libc10.so)
(pid=42260) frame #1: <unknown function> + 0x10ebe (0x7f617bbd4ebe in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
(pid=42260) frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f617bbd6167 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
(pid=42260) frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f617b95c5a4 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libc10.so)
(pid=42260) frame #4: <unknown function> + 0xa249ba (0x7f616b2ff9ba in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
(pid=42260) frame #5: )( + 0x6857e (0x55f02952c57e in ray::RayExecutor.execute)
(pid=42260) frame #6: )( + 0x1670f2 (0x55f02962b0f2 in ray::RayExecutor.execute)
(pid=42260) frame #7: )( + 0x1e9455 (0x55f0296ad455 in ray::RayExecutor.execute)
(pid=42260) frame #8: )(PyType_GenericAlloc + 0x203 (0x55f02962bd33 in ray::RayExecutor.execute)
(pid=42260) frame #9: <unknown function> + 0xa24870 (0x7f616b2ff870 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
(pid=42260) frame #10: THPVariable_Wrap(at::Tensor) + 0x6a (0x7f616b2ffb6a in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
(pid=42260) frame #11: <unknown function> + 0x654f35 (0x7f616af2ff35 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
(pid=42260) frame #12: <unknown function> + 0x5fba77 (0x7f616aed6a77 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
(pid=42260) frame #13: <unknown function> + 0x5fc3f6 (0x7f616aed73f6 in /iarai/home/markus.spanring/.conda/envs/torch/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
(pid=42260) frame #14: )( + 0x17f385 (0x55f029643385 in ray::RayExecutor.execute)
(pid=42260) frame #15: )( + 0x935f9 (0x55f0295575f9 in ray::RayExecutor.execute)
(pid=42260) frame #16: )(PyObject_RichCompare + 0x78 (0x55f029632b68 in ray::RayExecutor.execute)
(pid=42260) frame #17: )(_PyEval_EvalFrameDefault + 0x4303 (0x55f02968e5b3 in ray::RayExecutor.execute)
(pid=42260) frame #18: )(_PyEval_EvalCodeWithName + 0x1e9 (0x55f02965b9c9 in ray::RayExecutor.execute)
(pid=42260) frame #19: )(_PyFunction_Vectorcall + 0x1ff (0x55f02965c85f in ray::RayExecutor.execute)
(pid=42260) frame #20: )( + 0x146149 (0x55f02960a149 in ray::RayExecutor.execute)
(pid=42260) frame #21: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #22: )( + 0x1a3549 (0x55f029667549 in ray::RayExecutor.execute)
(pid=42260) frame #23: )(PyVectorcall_Call + 0x6e (0x55f029630ede in ray::RayExecutor.execute)
(pid=42260) frame #24: )(_PyEval_EvalFrameDefault + 0x4e70 (0x55f02968f120 in ray::RayExecutor.execute)
(pid=42260) frame #25: )(_PyEval_EvalCodeWithName + 0x1e9 (0x55f02965b9c9 in ray::RayExecutor.execute)
(pid=42260) frame #26: )(_PyFunction_Vectorcall + 0x19b (0x55f02965c7fb in ray::RayExecutor.execute)
(pid=42260) frame #27: )(_PyObject_FastCallDict + 0x56 (0x55f0296471c6 in ray::RayExecutor.execute)
(pid=42260) frame #28: )(_PyObject_Call_Prepend + 0x67 (0x55f0296b4a27 in ray::RayExecutor.execute)
(pid=42260) frame #29: )( + 0x1f0aa8 (0x55f0296b4aa8 in ray::RayExecutor.execute)
(pid=42260) frame #30: )(_PyObject_MakeTpCall + 0x158 (0x55f02962fda8 in ray::RayExecutor.execute)
(pid=42260) frame #31: )(_PyEval_EvalFrameDefault + 0x3c3 (0x55f02968a673 in ray::RayExecutor.execute)
(pid=42260) frame #32: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #33: )(_PyObject_FastCallDict + 0x56 (0x55f0296471c6 in ray::RayExecutor.execute)
(pid=42260) frame #34: )(_PyObject_Call_Prepend + 0x67 (0x55f0296b4a27 in ray::RayExecutor.execute)
(pid=42260) frame #35: )( + 0x1f0aa8 (0x55f0296b4aa8 in ray::RayExecutor.execute)
(pid=42260) frame #36: )(_PyObject_MakeTpCall + 0x158 (0x55f02962fda8 in ray::RayExecutor.execute)
(pid=42260) frame #37: )(_PyEval_EvalFrameDefault + 0x4448 (0x55f02968e6f8 in ray::RayExecutor.execute)
(pid=42260) frame #38: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #39: )( + 0x889fd (0x55f02954c9fd in ray::RayExecutor.execute)
(pid=42260) frame #40: )(_PyObject_FastCall_Prepend + 0x63 (0x55f029551ee3 in ray::RayExecutor.execute)
(pid=42260) frame #41: )( + 0x182bb4 (0x55f029646bb4 in ray::RayExecutor.execute)
(pid=42260) frame #42: )(PyObject_GetItem + 0x50 (0x55f0296679c0 in ray::RayExecutor.execute)
(pid=42260) frame #43: )(_PyEval_EvalFrameDefault + 0xf95 (0x55f02968b245 in ray::RayExecutor.execute)
(pid=42260) frame #44: )(_PyEval_EvalCodeWithName + 0x952 (0x55f02965c132 in ray::RayExecutor.execute)
(pid=42260) frame #45: )(_PyFunction_Vectorcall + 0x19b (0x55f02965c7fb in ray::RayExecutor.execute)
(pid=42260) frame #46: )( + 0x145e14 (0x55f029609e14 in ray::RayExecutor.execute)
(pid=42260) frame #47: )(_PyEval_EvalCodeWithName + 0x952 (0x55f02965c132 in ray::RayExecutor.execute)
(pid=42260) frame #48: )(_PyFunction_Vectorcall + 0x19b (0x55f02965c7fb in ray::RayExecutor.execute)
(pid=42260) frame #49: )( + 0x146254 (0x55f02960a254 in ray::RayExecutor.execute)
(pid=42260) frame #50: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #51: )(PyVectorcall_Call + 0x6e (0x55f029630ede in ray::RayExecutor.execute)
(pid=42260) frame #52: )(_PyEval_EvalFrameDefault + 0x4e70 (0x55f02968f120 in ray::RayExecutor.execute)
(pid=42260) frame #53: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #54: )( + 0x146254 (0x55f02960a254 in ray::RayExecutor.execute)
(pid=42260) frame #55: )(_PyEval_EvalCodeWithName + 0x886 (0x55f02965c066 in ray::RayExecutor.execute)
(pid=42260) frame #56: )( + 0x1a34c6 (0x55f0296674c6 in ray::RayExecutor.execute)
(pid=42260) frame #57: )( + 0x145cc7 (0x55f029609cc7 in ray::RayExecutor.execute)
(pid=42260) frame #58: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #59: )( + 0x146254 (0x55f02960a254 in ray::RayExecutor.execute)
(pid=42260) frame #60: )(_PyFunction_Vectorcall + 0x108 (0x55f02965c768 in ray::RayExecutor.execute)
(pid=42260) frame #61: )(_PyObject_FastCallDict + 0x56 (0x55f0296471c6 in ray::RayExecutor.execute)
(pid=42260) frame #62: )( + 0x184fef (0x55f029648fef in ray::RayExecutor.execute)
(pid=42260) frame #63: )( + 0x1a2dea (0x55f029666dea in ray::RayExecutor.execute)
(pid=42260)
(pid=42260) *** SIGABRT received at time=1633531373 on cpu 55 ***
(pid=42260) PC: @ 0x7f90551e5ce1 (unknown) raise
(pid=42260) @ 0x7f9055383140 (unknown) (unknown)
(pid=42260) @ 0x74696e69203a726f (unknown) (unknown)
When I use the "on_validation_end" hook, I get the zombie processes.
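For anyone trying to reproduce this: a purely diagnostic sketch (not part of my actual script) that just logs which of these hooks still fire, so it is easier to see where the teardown stops:

```python
# Diagnostic only - print a marker from each hook discussed above.
import pytorch_lightning as pl


class HookLogger(pl.Callback):
    def on_train_end(self, trainer, pl_module):
        print("on_train_end reached")

    def on_validation_end(self, trainer, pl_module):
        print("on_validation_end reached")

    def on_test_end(self, trainer, pl_module):
        print("on_test_end reached")


# e.g. pl.Trainer(plugins=[plugin], callbacks=[HookLogger()])
```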
@MarkusSpanring by any chance, did you get this issue resolved?
@scv119 not yet. FYI, I was able to boil it down to the PyTorch DataLoader. I have opened an issue already but there is no comment/fix yet.
Thanks, I also encountered this issue. Hope it will be fixed soon.
I get this issue as well. Currently I work around it with ray stop followed by ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --metrics-export-port=8080, but this is harder when you have multiple nodes.