
[BUG] mars storage get/fetch/delete hang

Open chaokunyang opened this issue 3 years ago • 0 comments

Describe the bug A clear and concise description of what the bug is.

To Reproduce To help us reproduce this bug, please provide the information below:

  1. Your Python version
  2. The version of Mars you use
  3. Versions of crucial packages, such as numpy, scipy and pandas
  4. Full stack of the error.
  • fetch hang:
(pid=61609) 2022-03-09 13:24:28,105     WARNING debug.py:73 -- Process message SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('fetch_batch', 0, ('sRQxBOzl6jhjJjM9c4tRLpn9', ['fbfbd2c5cef47dab064f7816044325e5'], None, 'numa-0', None, 'raise'), {}), protocol=0, message_id=b'\x0b\x1f{\x96\x8c\x9dl\xe3\x9a\xaa\x8e?\xecr\xe2\xc2=\xfd^\xc9\xa1\xe34\xbfg\xf3\xee\xd3\xc88\x86T', message_trace=[MessageTraceItem(uid=b'SubtaskExecutionActor', address='ray://ray-cluster-1646803369/1/0', method='internal_run_subtask')], profiling_context=None) of channel <mars.oscar.backends.communication.dummy.DummyChannel object at 0x7f89ac3aa6d0> timeout.(timeout for 90.0045 seconds).

  • delete hang:
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:18:2022-03-09 13:23:08,497     WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 0, ('sRQxBOzl6jhjJjM9c4tRLpn9', 'ed694754869a9bd34e3c78c4cefa2174'), {'error': 'ignore'}), protocol=0, message_id=b'\x15z\xf3\xa8QE\xc8_e\x1a\xca\x16\xf5XnI\x95\x84\xf6%=\x80\xb9\xb18\x05x\xfe=\xb9\x94\xee', message_trace=[], profiling_context=None)(timeout for 10.0011 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:36:2022-03-09 13:23:28,499     WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 0, ('sRQxBOzl6jhjJjM9c4tRLpn9', 'ed694754869a9bd34e3c78c4cefa2174'), {'error': 'ignore'}), protocol=0, message_id=b'\x15z\xf3\xa8QE\xc8_e\x1a\xca\x16\xf5XnI\x95\x84\xf6%=\x80\xb9\xb18\x05x\xfe=\xb9\x94\xee', message_trace=[], profiling_context=None)(timeout for 30.0025 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:45:2022-03-09 13:23:58,501     WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 0, ('sRQxBOzl6jhjJjM9c4tRLpn9', 'ed694754869a9bd34e3c78c4cefa2174'), {'error': 'ignore'}), protocol=0, message_id=b'\x15z\xf3\xa8QE\xc8_e\x1a\xca\x16\xf5XnI\x95\x84\xf6%=\x80\xb9\xb18\x05x\xfe=\xb9\x94\xee', message_trace=[], profiling_context=None)(timeout for 60.0048 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:65:2022-03-09 13:24:38,504     WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 0, ('sRQxBOzl6jhjJjM9c4tRLpn9', 'ed694754869a9bd34e3c78c4cefa2174'), {'error': 'ignore'}), protocol=0, message_id=b'\x15z\xf3\xa8QE\xc8_e\x1a\xca\x16\xf5XnI\x95\x84\xf6%=\x80\xb9\xb18\x05x\xfe=\xb9\x94\xee', message_trace=[], profiling_context=None)(timeout for 100.0076 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:75:2022-03-09 13:25:10,187     WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 1, ([('sRQxBOzl6jhjJjM9c4tRLpn9', '83f7ba3864bb7182a983228cc7737441'), ('sRQxBOzl6jhjJjM9c4tRLpn9', 'e46c44dcc779cd6580aac01d589cd0a8')], [{'error': 'ignore'}, {'error': 'ignore'}]), {}), protocol=0, message_id=b'\x04\xe2yF\xa4\x11\xf0\xc8\x9a\x95\x00\xbf\x99\x04\n\xb3z\x98\xdf\x97\x81\xdafM\xec\x90m\xfa\xfc\x80\xd5\xec', message_trace=[], profiling_context=None)(timeout for 10.0019 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:89:2022-03-09 13:25:30,190     WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 1, ([('sRQxBOzl6jhjJjM9c4tRLpn9', '83f7ba3864bb7182a983228cc7737441'), ('sRQxBOzl6jhjJjM9c4tRLpn9', 'e46c44dcc779cd6580aac01d589cd0a8')], [{'error': 'ignore'}, {'error': 'ignore'}]), {}), protocol=0, message_id=b'\x04\xe2yF\xa4\x11\xf0\xc8\x9a\x95\x00\xbf\x99\x04\n\xb3z\x98\xdf\x97\x81\xdafM\xec\x90m\xfa\xfc\x80\xd5\xec', message_trace=[], profiling_context=None)(timeout for 30.0053 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:102:2022-03-09 13:26:00,191    WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 1, ([('sRQxBOzl6jhjJjM9c4tRLpn9', '83f7ba3864bb7182a983228cc7737441'), ('sRQxBOzl6jhjJjM9c4tRLpn9', 'e46c44dcc779cd6580aac01d589cd0a8')], [{'error': 'ignore'}, {'error': 'ignore'}]), {}), protocol=0, message_id=b'\x04\xe2yF\xa4\x11\xf0\xc8\x9a\x95\x00\xbf\x99\x04\n\xb3z\x98\xdf\x97\x81\xdafM\xec\x90m\xfa\xfc\x80\xd5\xec', message_trace=[], profiling_context=None)(timeout for 60.0061 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:118:2022-03-09 13:26:40,195    WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 1, ([('sRQxBOzl6jhjJjM9c4tRLpn9', '83f7ba3864bb7182a983228cc7737441'), ('sRQxBOzl6jhjJjM9c4tRLpn9', 'e46c44dcc779cd6580aac01d589cd0a8')], [{'error': 'ignore'}, {'error': 'ignore'}]), {}), protocol=0, message_id=b'\x04\xe2yF\xa4\x11\xf0\xc8\x9a\x95\x00\xbf\x99\x04\n\xb3z\x98\xdf\x97\x81\xdafM\xec\x90m\xfa\xfc\x80\xd5\xec', message_trace=[], profiling_context=None)(timeout for 100.0098 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:129:2022-03-09 13:27:30,219    WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 1, ([('sRQxBOzl6jhjJjM9c4tRLpn9', '83f7ba3864bb7182a983228cc7737441'), ('sRQxBOzl6jhjJjM9c4tRLpn9', 'e46c44dcc779cd6580aac01d589cd0a8')], [{'error': 'ignore'}, {'error': 'ignore'}]), {}), protocol=0, message_id=b'\x04\xe2yF\xa4\x11\xf0\xc8\x9a\x95\x00\xbf\x99\x04\n\xb3z\x98\xdf\x97\x81\xdafM\xec\x90m\xfa\xfc\x80\xd5\xec', message_trace=[], profiling_context=None)(timeout for 150.0337 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:150:2022-03-09 13:28:30,232    WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 1, ([('sRQxBOzl6jhjJjM9c4tRLpn9', '83f7ba3864bb7182a983228cc7737441'), ('sRQxBOzl6jhjJjM9c4tRLpn9', 'e46c44dcc779cd6580aac01d589cd0a8')], [{'error': 'ignore'}, {'error': 'ignore'}]), {}), protocol=0, message_id=b'\x04\xe2yF\xa4\x11\xf0\xc8\x9a\x95\x00\xbf\x99\x04\n\xb3z\x98\xdf\x97\x81\xdafM\xec\x90m\xfa\xfc\x80\xd5\xec', message_trace=[], profiling_context=None)(timeout for 210.0471 seconds).
/tmp/ray/session_latest/logs//worker-1aee3c0e48a647058afbb07d3b7f8189f5a3e1cac61e10b9f6f62dcb-01000000-61607.err:164:2022-03-09 13:29:40,236    WARNING debug.py:73 -- Client sent message is SendMessage(actor_ref=ActorRef(uid=b'storage_handler_numa-0', address='ray://ray-cluster-1646803369/1/0'), content=('delete', 1, ([('sRQxBOzl6jhjJjM9c4tRLpn9', '83f7ba3864bb7182a983228cc7737441'), ('sRQxBOzl6jhjJjM9c4tRLpn9', 'e46c44dcc779cd6580aac01d589cd0a8')], [{'error': 'ignore'}, {'error': 'ignore'}]), {}), protocol=0, message_id=b'\x04\xe2yF\xa4\x11\xf0\xc8\x9a\x95\x00\xbf\x99\x04\n\xb3z\x98\xdf\x97\x81\xdafM\xec\x90m\xfa\xfc\x80\xd5\xec', message_trace=[], profiling_context=None)(timeout for 280.0512 seconds).
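The warnings above repeat the same message_id with a growing timeout, so it can help to collapse them into one entry per pending call. The following is a minimal sketch for doing that, not part of Mars: the function name summarize_timeouts and the regular expression are assumptions based only on the warning format shown in these logs.

```python
import re
from collections import defaultdict

# Assumed format, taken from the warnings above:
#   ... content=('<method>', ...), protocol=0, message_id=b'...', message_trace=...
#   ... (timeout for <seconds> seconds).
WARNING_RE = re.compile(
    r"content=\('(?P<method>\w+)',.*?"
    r"message_id=(?P<msg_id>b'.*?'), message_trace=.*?"
    r"\(timeout for (?P<seconds>[\d.]+) seconds\)"
)


def summarize_timeouts(lines):
    """Map (method, message_id) to the longest wait reported for that call.

    Lines that do not look like a timeout warning are skipped.
    """
    longest = defaultdict(float)
    for line in lines:
        match = WARNING_RE.search(line)
        if match is None:
            continue
        key = (match["method"], match["msg_id"])
        longest[key] = max(longest[key], float(match["seconds"]))
    return dict(longest)
```

Passing an open log file (for example one of the worker .err files) to summarize_timeouts yields one entry per call that is still waiting, which makes it easier to see that the same fetch_batch and delete requests never return.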

  5. Reproduce:
pytest -v -s --log-level=DEBUG --timeout=1500 -W ignore::PendingDeprecationWarning --cov-config=setup.cfg --cov-report= --cov=mars --durations=0 -m ray mars/dataframe/contrib/raydataset/tests/test_raydataset.py::test_convert_to_ray_dataset

chaokunyang • Mar 09 '22 05:03