BUG: matmul faces OOM issue
Describe the bug
The matmul workload runs out of memory (OOM) on Xorbits. Dask can run the same matrix size without failing.
To Reproduce
git clone git@github.com:xorbitsai/benchmarks.git
cd array/xorbits
Run the matmul workload. You can specify an Xorbits cluster endpoint via the ${address} parameter.
python workloads.py --endpoint ${address} \
--workloads matmul \
--size xl
Expected behavior
Xorbits should be able to run matmul on large matrices.
When the matrix is large, say 100_000 * 100_000, using more nodes may help. I am now using 10 nodes, each with 512 GB of memory and 256 GB of /dev/shm. I do get the actual calculation results, but I also get the following error. It seems that the actors cannot shut down properly.
2023-09-25 17:11:07,579 xorbits._mars.services.web.core 17333 ERROR ActorNotExist when handling request with LifecycleWebAPIHandler.decref_tileables
Traceback (most recent call last):
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/web/core.py", line 69, in wrapped
res = await func(self, *args, **kwargs)
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/lifecycle/api/web.py", line 39, in decref_tileables
await oscar_api.decref_tileables(tileable_keys, counts=counts)
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/lifecycle/api/oscar.py", line 108, in decref_tileables
return await self._lifecycle_tracker_ref.decref_tileables(tileable_keys)
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/api.py", line 306, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 550, in xoscar.core._BaseActor.__on_receive__
return await self._handle_actor_result(result)
File "xoscar/core.pyx", line 422, in _handle_actor_result
task_result = await coros[0]
File "xoscar/core.pyx", line 465, in xoscar.core._BaseActor._run_actor_async_generator
async with self._lock:
File "xoscar/core.pyx", line 466, in xoscar.core._BaseActor._run_actor_async_generator
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 471, in xoscar.core._BaseActor._run_actor_async_generator
res = await gen.athrow(*res)
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/lifecycle/supervisor/tracker.py", line 255, in decref_tileables
yield asyncio.gather(*coros)
File "xoscar/core.pyx", line 476, in xoscar.core._BaseActor._run_actor_async_generator
res = await self._handle_actor_result(res)
File "xoscar/core.pyx", line 396, in _handle_actor_result
result = await result
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/lifecycle/supervisor/tracker.py", line 174, in _remove_chunks
await self._meta_api.del_chunk_meta.batch(*delete_metas)
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xoscar/batch.py", line 151, in _async_batch
return await self.batch_func(args_list, kwargs_list)
File "/fs/fast/u20200002/envs/ds/lib/python3.10/site-packages/xorbits/_mars/services/meta/api/oscar.py", line 204, in batch_del_chunk_meta
del_chunk_metas.append(self._meta_store.del_meta.delay(*args, **kwargs))
File "xoscar/core.pyx", line 259, in xoscar.core.LocalActorRef.__getattr__
raise ActorNotExist(f"Actor {self.uid} does not exist") from None
xoscar.errors.ActorNotExist: [address=cpu64c-3:39783, pid=17469] Actor b'rorsdfNHuVApDnQuJePe8bJR_meta' does not exist
2023-09-25 17:11:07,586 tornado.access 17333 ERROR 500 POST /api/session/rorsdfNHuVApDnQuJePe8bJR/lifecycle?action=decref_tileables (192.168.0.77) 867.54ms
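For a sense of scale, here is a rough estimate of the memory a 100_000 * 100_000 matmul needs. This is only a sketch: it assumes dense float64 operands (the dtype is not stated above), counts just the two inputs and the result, and ignores intermediate chunks, copies, and serialization overhead. The helper name `matmul_footprint_gib` is made up for illustration and is not part of Xorbits or the benchmark repo.

```python
def matmul_footprint_gib(n: int, itemsize: int = 8) -> dict:
    """Estimate memory for C = A @ B with square n x n dense matrices.

    itemsize=8 assumes float64 elements (an assumption, not from the issue).
    """
    gib = 2**30
    per_matrix_bytes = n * n * itemsize
    return {
        # One dense n x n matrix.
        "per_matrix_gib": per_matrix_bytes / gib,
        # Two inputs plus one output, no intermediates.
        "total_gib": 3 * per_matrix_bytes / gib,
    }


if __name__ == "__main__":
    est = matmul_footprint_gib(100_000)
    print(f"per matrix: {est['per_matrix_gib']:.1f} GiB")
    print(f"A, B and C: {est['total_gib']:.1f} GiB")
```

Under these assumptions each matrix is about 74.5 GiB, so the inputs and output together are roughly 223.5 GiB before any intermediate results. That is within the aggregate memory of 10 nodes, but it leaves less headroom on any single node than the raw 512 GB figure suggests.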