
[BUG] Sharepoint ingestion fails with remote end closed connection without response

Open mawandm opened this issue 1 year ago • 0 comments

Nesis version

0.1.0

Describe the bug

During a long-running Sharepoint ingestion process, the API service logs show the following error:

[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:36.695 [WARNING ] nesis.api.core.document_loaders.sharepoint - Error when getting and ingesting file Stock Market Wizards (Jack D. Schwager) (z-lib.org).pdf - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Generating embeddings:   0%|          | 0/14 [00:00<?, ?it/s]Killed
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:37.469 [ERROR   ] nesis.api.core.document_loaders.sharepoint - Error fetching and updating documents - Error: (None, None, "401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files")
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] Traceback (most recent call last):
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_request.py", line 38, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     response.raise_for_status()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     raise HTTPError(http_error_msg, response=self)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] During handling of the above exception, another exception occurred:
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] Traceback (most recent call last):
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/nesis/api/core/document_loaders/sharepoint.py", line 117, in _sync_sharepoint_documents
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     _process_folder_files(
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/nesis/api/core/document_loaders/sharepoint.py", line 168, in _process_folder_files
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     _files = folder.get_files(False).execute_query()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_object.py", line 52, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     self.context.execute_query()
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_runtime_context.py", line 183, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     self.pending_request().execute_query(qry)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]   File "/app/.venv/lib/python3.11/site-packages/office365/runtime/client_request.py", line 42, in execute_query
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis]     raise ClientRequestException(*e.args, response=e.response)
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] office365.runtime.client_request_exception.ClientRequestException: (None, None, "401 Client Error: Unauthorized for url: https://site.sharepoint.com/sites/nesis-test/_api/Web/GetFolderById('d5bc341a-8557-4c67-8c40-1cb0e085def9')?$select=Files&$expand=Files")
[resource-1584063407-nesis-api-6c6f84957f-gflsf nesis] 2024-05-04 01:44:37.503 [INFO    ] apscheduler.executors.default - Job "ingest_datasource (trigger: date[2024-05-04 00:19:28 UTC], next run at: 2024-05-04 00:19:28 UTC)" executed successfully


To reproduce

  1. Create a Sharepoint datasource
  2. Add multiple large documents to the Sharepoint site
  3. Run the ingestion. After a while, the API service logs show a 401 Client Error: Unauthorized for url...

Expected behavior

The ingestion should run to completion without interruption. It seems the Sharepoint client authentication needs to be refreshed during a long-running ingestion.
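
A minimal sketch of one way such a refresh could work, assuming the loader can rebuild its Sharepoint client on demand. The names `RefreshingClient`, `make_client`, and `is_unauthorized` are hypothetical and are not part of Nesis or the office365 library. The 401 check reads `response.status_code` from the exception when it is present and otherwise falls back to the message text seen in the log above.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def is_unauthorized(exc: Exception) -> bool:
    """Return True if the exception looks like an HTTP 401 (assumed heuristic)."""
    response = getattr(exc, "response", None)
    status = getattr(response, "status_code", None)
    return status == 401 or "401 Client Error" in str(exc)


class RefreshingClient:
    """Holds a client and rebuilds it when an operation fails with a 401."""

    def __init__(self, make_client: Callable[[], object], max_refreshes: int = 1):
        # make_client is a hypothetical factory that authenticates from
        # scratch and returns a fresh client object.
        self._make_client = make_client
        self._max_refreshes = max_refreshes
        self._client = make_client()

    def call(self, operation: Callable[[object], T]) -> T:
        """Run operation(client); on a 401, rebuild the client and retry."""
        refreshes = 0
        while True:
            try:
                return operation(self._client)
            except Exception as exc:
                # Re-raise anything that is not a 401, or a 401 that
                # persists after the allowed number of refreshes.
                if not is_unauthorized(exc) or refreshes >= self._max_refreshes:
                    raise
                self._client = self._make_client()
                refreshes += 1
```

With this shape, each per-folder or per-file step would be passed to `call` as a function of the client, so that objects derived from an expired client (such as a folder handle) are looked up again after a refresh instead of being reused.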

Screenshots

No response

Additional context

No response

mawandm, May 04 '24 01:05