[SPO] harden error handling for single-document issues
Problem Description
The SPO connector fails the whole sync job if a single request errors.
We should:
- [ ] make sure that this specific error does not occur
- [ ] make the whole connector more flexible and resilient to these types of errors
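As a rough illustration of the second item, here is a minimal sketch of isolating a failure to the list that caused it, so one bad request does not abort the whole sync. Everything in it is hypothetical: `resilient_list_items`, `fake_fetch_items`, and the local `BadRequestError` are stand-ins written for this example, not the connector's actual code.

```python
import asyncio
import logging
from typing import AsyncIterator, Callable, Iterable, List

logger = logging.getLogger("spo_sketch")


class BadRequestError(Exception):
    """Stand-in for the connector's error type (redefined for this sketch)."""


async def resilient_list_items(
    list_ids: Iterable[str],
    fetch_items: Callable[[str], AsyncIterator[dict]],
    skipped: List[str],
) -> AsyncIterator[dict]:
    """Yield items from every list, skipping any list whose request fails."""
    for list_id in list_ids:
        try:
            async for item in fetch_items(list_id):
                yield item
        except BadRequestError as exc:
            # Record the failure and move on instead of failing the sync.
            # Items already yielded from this list before the error are kept.
            logger.warning("Skipping list %s: %r", list_id, exc)
            skipped.append(list_id)


async def fake_fetch_items(list_id: str) -> AsyncIterator[dict]:
    """Hypothetical fetcher: the list named 'bad' always fails with a 400."""
    if list_id == "bad":
        raise BadRequestError("400 Bad Request")
    for n in range(2):
        yield {"list": list_id, "id": n}


async def collect(list_ids: Iterable[str]):
    """Drain the resilient generator and return (docs, skipped list ids)."""
    skipped: List[str] = []
    docs = [
        doc
        async for doc in resilient_list_items(list_ids, fake_fetch_items, skipped)
    ]
    return docs, skipped


if __name__ == "__main__":
    docs, skipped = asyncio.run(collect(["a", "bad", "b"]))
    print(len(docs), "docs fetched; skipped lists:", skipped)
```

Collecting the skipped ids rather than only logging them would let the job report a partial sync at the end, which may be preferable to silently dropping content.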
Example of another error we should be able to retry or move past:
Received 400 response from https://graph.microsoft.com/v1.0/sites/<site_id>/lists/<list_id>/items?$select=createdDateTime,id,lastModifiedDateTime,weburl,createdBy,lastModifiedBy,contentType&$expand=fields($select=Title,Link,Attachments,LinkTitle,LinkFilename,Description,Conversation)
Full stack trace:
[FMWK][12:44:01][WARNING] [Sync Job id: DjoWlIoB26kkwmCbnNr-, connector id: DToVlIoB26kkwmCb0dpW, index name: search-retail] Received 400 response from https://graph.microsoft.com/v1.0/sites/<site_id>/lists/<list_id>/items?$select=createdDateTime,id,lastModifiedDateTime,weburl,createdBy,lastModifiedBy,contentType&$expand=fields($select=Title,Link,Attachments,LinkTitle,LinkFilename,Description,Conversation)
[FMWK][12:44:01][CRITICAL] [Sync Job id: DjoWlIoB26kkwmCbnNr-, connector id: DToVlIoB26kkwmCb0dpW, index name: search-retail] The document fetcher failed
Traceback (most recent call last):
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 402, in _get
    async with self._http_session.get(
  File "/path/to/connectors-python/lib/python3.10/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
  File "/path/to/connectors-python/lib/python3.10/site-packages/aiohttp/client.py", line 643, in _request
    resp.raise_for_status()
  File "/path/to/connectors-python/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 400, message='Bad Request', url=URL('https://graph.microsoft.com/v1.0/sites/<site_id>/lists/<list_id>/items?$select=createdDateTime,id,lastModifiedDateTime,weburl,createdBy,lastModifiedBy,contentType&$expand=fields($select=Title,Link,Attachments,LinkTitle,LinkFilename,Description,Conversation)')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/path/to/connectors-python/connectors/es/sink.py", line 387, in get_docs
    async for count, doc in aenumerate(generator):
  File "/path/to/connectors-python/connectors/utils.py", line 689, in aenumerate
    async for elem in asequence:
  File "/path/to/connectors-python/connectors/logger.py", line 134, in __anext__
    return await self.gen.__anext__()
  File "/path/to/connectors-python/connectors/es/sink.py", line 360, in _decorate_with_metrics_span
    async for doc in generator:
  File "/path/to/connectors-python/connectors/sync_job_runner.py", line 310, in prepare_docs
    async for doc, lazy_download, operation in self.generator():
  File "/path/to/connectors-python/connectors/sync_job_runner.py", line 342, in generator
    async for doc, lazy_download in self.data_provider.get_docs(
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 1547, in get_docs
    async for list_item, download_func in self.site_list_items(
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 1777, in site_list_items
    async for list_item in self.client.site_list_items(site_id, site_list_id):
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 745, in site_list_items
    async for page in self._graph_api_client.scroll(
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 347, in scroll
    graph_data = await self._get_json(scroll_url)
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 371, in _get_json
    async with self._get(absolute_url) as resp:
  File "/Users/gustavollermalylarrain/miniforge3/envs/connector/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 295, in wrapped
    async for item in func(*args, **kwargs):
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 413, in _get
    await self._handle_client_response_error(absolute_url, e)
  File "/path/to/connectors-python/connectors/sources/sharepoint_online.py", line 443, in _handle_client_response_error
    raise BadRequestError from e
connectors.sources.sharepoint_online.BadRequestError
Why should we not consider this a real 400? Because it's really not. SPO lies.
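If the 400 really is spurious, one option is to treat it like a transient error and retry a few times before giving up. The sketch below is only an illustration under that assumption: `get_with_retries` and `make_flaky_fetch` are hypothetical helpers, the local `BadRequestError` is a stand-in for the connector's class, and the delays are arbitrary.

```python
import asyncio


class BadRequestError(Exception):
    """Stand-in for the connector's BadRequestError (redefined for this sketch)."""


async def get_with_retries(fetch, url, retries=3, base_delay=0.01):
    """Call fetch(url), retrying on BadRequestError with exponential backoff.

    `retries` is the total number of attempts and is assumed to be >= 1.
    """
    for attempt in range(1, retries + 1):
        try:
            return await fetch(url)
        except BadRequestError:
            if attempt == retries:
                # Out of attempts: re-raise so the caller can skip or fail.
                raise
            # Wait base_delay, then 2x, 4x, ... between attempts.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_fetch(failures):
    """Hypothetical fetcher that fails `failures` times, then succeeds."""
    calls = {"count": 0}

    async def fetch(url):
        calls["count"] += 1
        if calls["count"] <= failures:
            raise BadRequestError(url)
        return {"url": url, "attempts": calls["count"]}

    return fetch


if __name__ == "__main__":
    fetch = make_flaky_fetch(failures=2)
    result = asyncio.run(get_with_retries(fetch, "https://example.invalid/items"))
    print("succeeded after", result["attempts"], "attempts")
```

Re-raising after the last attempt keeps the decision to skip the document or fail the sync with the caller, so retrying and moving past the error stay separate concerns.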
There's a draft PR here: https://github.com/elastic/connectors-python/pull/1584, but it's nowhere near ready, and there are other priorities right now. I'm going to unassign myself and remove it from the current sprint until this can be prioritized.
The urgent piece has been fixed.
Another single-document failure issue: https://github.com/elastic/enterprise-search-team/issues/7044