[BUG] Mars client times out after cancelling a subtask in a notebook

Open chaokunyang opened this issue 3 years ago • 0 comments

Describe the bug

When the following code is run in a notebook cell, cancelled partway through, and then executed again, Mars throws a timeout error:

urldf = df.groupby(["id"])["trd_longitude","trd_latitude","id"].apply(lambda x: x.sum()).reset_index().execute()
print(urldf.head(2).execute())

Stack trace of the timeout error:

/root/miniconda3/lib/python3.7/site-packages/mars/dataframe/groupby/getitem.py:48: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  indexed = groupby.op.build_mock_groupby()[self.selection]
2022-04-13 20:38:17,152 - mars.deploy.oscar.session - INFO - Time consuming to generate a tileable graph is 0.004700660705566406s with address http://11.72.5.50:56741, session id xofsti2ZB62dPMp5X8w5BHEx
2022-04-13 20:38:21,303 - mars.services.web.core - WARNING - Request http://11.72.5.50:56741/api/session/xofsti2ZB62dPMp5X8w5BHEx/task/3bsProIFVZFc3VZTqYqToBy1 timeout, requests params is {'params': {'action': 'progress'}}, ex is 'Timeout during request'. sleep 20 seconds and retry 2rd times again.
---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
/tmp/ipykernel_88934/2699780381.py in <module>
     18 #lambda x: pd.Series([0, 0], index=['pred', 'scores'])
     19 
---> 20 main()

/tmp/ipykernel_88934/2699780381.py in main()
      2 
      3 
----> 4     urldf = df.groupby(["id"])["trd_longitude","trd_latitude","id"].apply(lambda x: x.sum()).reset_index().execute()
      5     print(urldf.head(2).execute())
      6 

~/miniconda3/lib/python3.7/site-packages/mars/core/entity/tileables.py in execute(self, session, **kw)
    462 
    463     def execute(self, session=None, **kw):
--> 464         result = self.data.execute(session=session, **kw)
    465         if isinstance(result, TILEABLE_TYPE):
    466             return self

~/miniconda3/lib/python3.7/site-packages/mars/core/entity/executable.py in execute(self, session, **kw)
    136 
    137         session = _get_session(self, session)
--> 138         return execute(self, session=session, **kw)
    139 
    140     def _check_session(self, session: SessionType, action: str):

~/miniconda3/lib/python3.7/site-packages/mars/deploy/oscar/session.py in execute(tileable, session, wait, new_session_kwargs, show_progress, progress_update_interval, *tileables, **kwargs)
   1865         show_progress=show_progress,
   1866         progress_update_interval=progress_update_interval,
-> 1867         **kwargs,
   1868     )
   1869 

~/miniconda3/lib/python3.7/site-packages/mars/deploy/oscar/session.py in execute(self, tileable, show_progress, warn_duplicated_execution, *tileables, **kwargs)
   1654         try:
   1655             execution_info: ExecutionInfo = fut.result(
-> 1656                 timeout=self._isolated_session.timeout
   1657             )
   1658         except KeyboardInterrupt:  # pragma: no cover

~/miniconda3/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

~/miniconda3/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/miniconda3/lib/python3.7/site-packages/mars/deploy/oscar/session.py in _execute(session, wait, show_progress, progress_update_interval, cancelled, *tileables, **kwargs)
   1811     **kwargs,
   1812 ):
-> 1813     execution_info = await session.execute(*tileables, **kwargs)
   1814 
   1815     def _attach_session(future: asyncio.Future):

~/miniconda3/lib/python3.7/site-packages/mars/deploy/oscar/session.py in execute(self, *tileables, **kwargs)
    997             task_name=task_name,
    998             fuse_enabled=fuse_enabled,
--> 999             extra_config=extra_config,
   1000         )
   1001 

~/miniconda3/lib/python3.7/site-packages/mars/services/task/api/web.py in submit_tileable_graph(self, graph, task_name, fuse_enabled, extra_config)
    229             method="POST",
    230             headers={"Content-Type": "application/octet-stream"},
--> 231             data=body,
    232         )
    233         return res.body.decode().strip()

~/miniconda3/lib/python3.7/site-packages/mars/services/web/core.py in _request_url(self, method, path, wrap_timeout_exception, **kwargs)
    240         except HTTPTimeoutError as ex:
    241             if wrap_timeout_exception:
--> 242                 raise TimeoutError(str(ex)) from None
    243             else:
    244                 raise ex

TimeoutError: Timeout during request
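The trace above shows the error being raised inside the `_execute` coroutine and then re-raised in the notebook thread by `fut.result(...)`, which is why frames from `concurrent/futures/_base.py` appear in the middle. The following is a minimal standard-library sketch of that propagation pattern, not Mars code: `submit_graph`, `run_and_wait`, and `describe_failure` are hypothetical names, and running the coroutine on a background event loop via `asyncio.run_coroutine_threadsafe` is an assumption made for illustration, since the trace does not show how Mars schedules it.

```python
import asyncio
import threading


async def submit_graph():
    # Hypothetical stand-in for the web request that timed out in the report.
    raise TimeoutError("Timeout during request")


def run_and_wait(timeout=5.0):
    # Assumption: the coroutine runs on an event loop in a background thread.
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    try:
        # Returns a concurrent.futures.Future tied to the coroutine.
        fut = asyncio.run_coroutine_threadsafe(submit_graph(), loop)
        # result() re-raises the coroutine's exception in the calling thread.
        return fut.result(timeout=timeout)
    finally:
        # Stop the loop and clean up regardless of the outcome.
        loop.call_soon_threadsafe(loop.stop)
        thread.join()
        loop.close()


def describe_failure():
    # Catch the propagated error and return its message for inspection.
    try:
        run_and_wait()
    except TimeoutError as exc:
        return str(exc)
    return None
```

Calling `describe_failure()` returns the message of the exception raised inside the coroutine, so the caller sees the request timeout itself rather than a timeout from waiting on the future.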

The Mars cluster is a 100 × 8-core cluster, and I'm using the Mars web API to connect to it.

To Reproduce

To help us reproduce this bug, please provide the information below:

  1. Your Python version: 3.7.9
  2. The version of Mars you use: master
  3. Versions of crucial packages, such as numpy, scipy and pandas
  4. Full stack of the error.
  5. Minimized code to reproduce the error.

Expected behavior

A clear and concise description of what you expected to happen.

chaokunyang, Apr 13 '22 12:04