[ISSUE] Passing a `BinaryIO` to the Workspace `upload` results in an empty notebook if the API call needs to be retried

Open jdavidheiser opened this issue 1 year ago • 0 comments

Description For months, we have been hitting intermittent issues when uploading Databricks notebooks to a Workspace using the SDK. Occasionally a notebook would be uploaded empty; a job would then run and succeed without doing any work, because the notebook was empty. We believe we have traced this to an issue in the SDK: the `workspace.upload` method accepts a `BinaryIO` input, which is a streaming file-like interface. Reading such a stream advances its position, so once it has been read to the end, a second read returns empty bytes unless the stream is rewound first. This means that if the API call fails for any reason, the retry reads from an already-consumed stream and the result is an empty notebook.
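The read-once behavior described above can be seen without the SDK at all. The following is a minimal sketch using only the standard library; the variable names are illustrative and not taken from the SDK.

```python
import io

stream = io.BytesIO(b"test")

first = stream.read()   # consumes the whole stream
second = stream.read()  # position is now at the end, so nothing is left

stream.seek(0)          # rewind to the start
third = stream.read()   # the full content is available again

print(first, second, third)
```

Any code path that reads the same stream object a second time without rewinding it will see the empty result shown by `second`.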

Reproduction Run this in a fresh REPL session:

import databricks.sdk
from databricks.sdk.service.workspace import Language
import io
import logging
logging.basicConfig(level=logging.DEBUG)
w = databricks.sdk.WorkspaceClient(profile='your-profile-here')

Now, turn off network access so that the connection times out and has to retry, then run:

w.workspace.upload("/path/to/file", io.BytesIO(b'test'), language=Language.PYTHON, overwrite=True)

After one or two retries on the failed network connection, which look like the following, re-enable network access.

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): <workspace>.cloud.databricks.com:443
DEBUG:databricks.sdk.retries:Retrying: cannot connect (sleeping ~1s)

The call will now complete, but the uploaded file will be blank.

Expected behavior The file should not be blank.
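One way to get this behavior is to rewind a seekable stream before every attempt. The sketch below illustrates the idea only; it is not the SDK's retry implementation. `call_with_rewind` and `flaky_send` are hypothetical names, and `flaky_send` is a stand-in that simulates a network failure on the first attempt instead of calling Databricks.

```python
import io
from typing import BinaryIO, Callable, List, TypeVar

T = TypeVar("T")


def call_with_rewind(send: Callable[[BinaryIO], T], stream: BinaryIO, attempts: int = 3) -> T:
    """Hypothetical helper: call send(stream), rewinding the stream before each attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    start = stream.tell()  # remember where the payload begins
    for attempt in range(attempts):
        stream.seek(start)  # undo any reads made by a previous failed attempt
        try:
            return send(stream)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error


# Stand-in for a flaky upload (assumption, not an SDK call):
# it reads the body, then fails on the first call only.
calls: List[bytes] = []


def flaky_send(body: BinaryIO) -> bytes:
    data = body.read()
    calls.append(data)
    if len(calls) == 1:
        raise ConnectionError("simulated network failure")
    return data


result = call_with_rewind(flaky_send, io.BytesIO(b"test"))
print(result, calls)
```

This only works for seekable streams. For a stream that cannot be rewound, reading it fully into a `bytes` object once and reusing that object on every attempt would achieve the same effect, at the cost of holding the payload in memory.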

Is it a regression? It has been broken since at least 0.13. I tested it and it fails in 0.20 and 0.30.

Other Information

  • OS: macOS
  • Version: 0.13, 0.20, 0.30

Additional context This caused major data quality issues over a period of several months.

jdavidheiser, Oct 21 '24 15:10