[BUG] Mars client times out after cancelling a subtask in a notebook
Describe the bug
When the following code is run in a notebook cell, cancelled midway, and then re-executed, Mars throws a timeout error:
urldf = df.groupby(["id"])["trd_longitude","trd_latitude","id"].apply(lambda x: x.sum()).reset_index().execute()
print(urldf.head(2).execute())
Timeout error stack:
/root/miniconda3/lib/python3.7/site-packages/mars/dataframe/groupby/getitem.py:48: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
indexed = groupby.op.build_mock_groupby()[self.selection]
2022-04-13 20:38:17,152 - mars.deploy.oscar.session - INFO - Time consuming to generate a tileable graph is 0.004700660705566406s with address http://11.72.5.50:56741, session id xofsti2ZB62dPMp5X8w5BHEx
2022-04-13 20:38:21,303 - mars.services.web.core - WARNING - Request http://11.72.5.50:56741/api/session/xofsti2ZB62dPMp5X8w5BHEx/task/3bsProIFVZFc3VZTqYqToBy1 timeout, requests params is {'params': {'action': 'progress'}}, ex is 'Timeout during request'. sleep 20 seconds and retry 2rd times again.
---------------------------------------------------------------------------
TimeoutError Traceback (most recent call last)
/tmp/ipykernel_88934/2699780381.py in <module>
18 #lambda x: pd.Series([0, 0], index=['pred', 'scores'])
19
---> 20 main()
/tmp/ipykernel_88934/2699780381.py in main()
2
3
----> 4 urldf = df.groupby(["id"])["trd_longitude","trd_latitude","id"].apply(lambda x: x.sum()).reset_index().execute()
5 print(urldf.head(2).execute())
6
~/miniconda3/lib/python3.7/site-packages/mars/core/entity/tileables.py in execute(self, session, **kw)
462
463 def execute(self, session=None, **kw):
--> 464 result = self.data.execute(session=session, **kw)
465 if isinstance(result, TILEABLE_TYPE):
466 return self
~/miniconda3/lib/python3.7/site-packages/mars/core/entity/executable.py in execute(self, session, **kw)
136
137 session = _get_session(self, session)
--> 138 return execute(self, session=session, **kw)
139
140 def _check_session(self, session: SessionType, action: str):
~/miniconda3/lib/python3.7/site-packages/mars/deploy/oscar/session.py in execute(tileable, session, wait, new_session_kwargs, show_progress, progress_update_interval, *tileables, **kwargs)
1865 show_progress=show_progress,
1866 progress_update_interval=progress_update_interval,
-> 1867 **kwargs,
1868 )
1869
~/miniconda3/lib/python3.7/site-packages/mars/deploy/oscar/session.py in execute(self, tileable, show_progress, warn_duplicated_execution, *tileables, **kwargs)
1654 try:
1655 execution_info: ExecutionInfo = fut.result(
-> 1656 timeout=self._isolated_session.timeout
1657 )
1658 except KeyboardInterrupt: # pragma: no cover
~/miniconda3/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
433 raise CancelledError()
434 elif self._state == FINISHED:
--> 435 return self.__get_result()
436 else:
437 raise TimeoutError()
~/miniconda3/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
~/miniconda3/lib/python3.7/site-packages/mars/deploy/oscar/session.py in _execute(session, wait, show_progress, progress_update_interval, cancelled, *tileables, **kwargs)
1811 **kwargs,
1812 ):
-> 1813 execution_info = await session.execute(*tileables, **kwargs)
1814
1815 def _attach_session(future: asyncio.Future):
~/miniconda3/lib/python3.7/site-packages/mars/deploy/oscar/session.py in execute(self, *tileables, **kwargs)
997 task_name=task_name,
998 fuse_enabled=fuse_enabled,
--> 999 extra_config=extra_config,
1000 )
1001
~/miniconda3/lib/python3.7/site-packages/mars/services/task/api/web.py in submit_tileable_graph(self, graph, task_name, fuse_enabled, extra_config)
229 method="POST",
230 headers={"Content-Type": "application/octet-stream"},
--> 231 data=body,
232 )
233 return res.body.decode().strip()
~/miniconda3/lib/python3.7/site-packages/mars/services/web/core.py in _request_url(self, method, path, wrap_timeout_exception, **kwargs)
240 except HTTPTimeoutError as ex:
241 if wrap_timeout_exception:
--> 242 raise TimeoutError(str(ex)) from None
243 else:
244 raise ex
TimeoutError: Timeout during request
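Reading the stack above, the error does not come from `fut.result(timeout=...)` expiring: the future is already finished and re-raises a stored exception, which `_request_url` created by wrapping an HTTP timeout in the builtin `TimeoutError` (with `from None`, so the original HTTP exception is hidden). The following is a minimal standard-library sketch of that propagation path, not Mars code. `HTTPTimeoutError`, `request_url`, and `run_in_future` are hypothetical stand-ins for the names seen in the traceback.

```python
import concurrent.futures


class HTTPTimeoutError(Exception):
    """Hypothetical stand-in for the HTTP client's timeout exception."""


def request_url(wrap_timeout_exception=True):
    # Simulate the web client: the HTTP request times out and the error is
    # optionally re-raised as the builtin TimeoutError with the chain suppressed.
    try:
        raise HTTPTimeoutError("Timeout during request")
    except HTTPTimeoutError as ex:
        if wrap_timeout_exception:
            raise TimeoutError(str(ex)) from None
        raise


def run_in_future():
    # Simulate the session layer: the request runs elsewhere and any
    # exception it raises is stored on a Future instead of being raised here.
    fut = concurrent.futures.Future()
    try:
        fut.set_result(request_url())
    except Exception as ex:
        fut.set_exception(ex)
    return fut


fut = run_in_future()
try:
    # The future is already finished, so result() does not wait for the
    # timeout; it immediately re-raises the stored exception.
    fut.result(timeout=5)
except TimeoutError as ex:
    message = str(ex)
    print(f"TimeoutError: {message}")
```

If this reading is right, the request that timed out is the task submission itself (`submit_tileable_graph`), which suggests the web endpoint stopped responding after the cancel rather than the client-side wait being too short.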
The Mars cluster is a 100 * 8-core cluster, and I'm using the Mars web API to connect to it.
To Reproduce
To help us reproduce this bug, please provide the information below:
- Your Python version: 3.7.9
- The version of Mars you use: master
- Versions of crucial packages, such as numpy, scipy and pandas
- Full stack of the error.
- Minimized code to reproduce the error.
Expected behavior
A clear and concise description of what you expected to happen.