filesystem_spec
429 Client Error: Too Many Requests for url azuredatabricks.net/api/2.0/dbfs/mkdirs
I'm using fsspec.implementations.dbfs.DatabricksFileSystem with pyarrow to write a parquet dataset to DBFS, but when I use this filesystem with pyarrow's write_to_dataset method I get the following error:
429 Client Error: Too Many Requests for url: https://adb-<databricks_instance>.azuredatabricks.net/api/2.0/dbfs/mkdirs
The code used to write to DBFS is the following:
""" import pyarrow as pa import pyarrow.dataset as ds import pyarrow.parquet as pq from fsspec.implementations.dbfs import DatabricksFileSystem
base_path = "/FileStore/write" test_df = pd.read_csv("../data/diabetes/csv/nopart/diabetes.csv")
filesystem = DatabricksFileSystem( instance=<databricks_instance>, token=<databricks_token> )
pq.write_to_dataset(arr_table, filesystem=filesystem, compression='none', existing_data_behavior='error', partition_cols=["Pregnancies"], root_path=f"{base_path}/parquet/part", use_threads= False) """
And the full stack trace is:
"""
Traceback (most recent call last):
File "
I found a thread on StackOverflow about handling the 429 error that could be a potential fix for this:
https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python
""" Receiving a status 429 is not an error, it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP, you should simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a "Retry-after" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and to sleep your process for that many seconds.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3 """
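To make that concrete, here is a minimal sketch of that approach using a plain requests call; the URL and request body below are placeholders for illustration, not the actual DBFS API call:
"""
import time
import requests

url = "https://example.com/api/resource"  # placeholder endpoint, not the real DBFS URL
payload = {"path": "/FileStore/write"}    # placeholder body

response = requests.post(url, json=payload)
if response.status_code == 429:
    # The server is asking us to slow down: honour Retry-After (default to 1 second) and retry once.
    wait = int(response.headers.get("Retry-After", 1))
    time.sleep(wait)
    response = requests.post(url, json=payload)
response.raise_for_status()
"""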
fsspec.implementations.dbfs.DatabricksFileSystem._send_to_api does not currently handle retries. Retries are used in other implementations, with exponential backoff schemes, to cope with this kind of "should work but not right now" message. If someone were to apply it to this FS, it would be appreciated.
@martindurant Could you point me to one of the implementations that handles the retries, so I can take a look and see if I'm able to implement a fix?
Here's a complete, complex version that can almost be applied to this case as-is: https://github.com/fsspec/gcsfs/blob/main/gcsfs/retry.py#L118
It would actually be very reasonable to have a retry decorator in this repo, that can be applied to a number of "call this remote thing" methods.
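As a rough sketch of what such a decorator could look like (the retry_request name, retry count and backoff constants are illustrative, and it assumes the wrapped method raises requests.HTTPError on a 429):
"""
import time
import functools
import requests

def retry_request(max_retries=5, base_delay=1.0):
    # Generic retry decorator for "call this remote thing" methods.
    # On a 429 it sleeps, preferring the Retry-After header when present,
    # otherwise falling back to exponential backoff, and then retries.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except requests.HTTPError as e:
                    status = e.response.status_code if e.response is not None else None
                    if status != 429 or attempt == max_retries - 1:
                        raise
                    retry_after = e.response.headers.get("Retry-After")
                    delay = float(retry_after) if retry_after else base_delay * 2 ** attempt
                    time.sleep(delay)
        return wrapper
    return decorator
"""
If something like this were applied to a method such as _send_to_api, the existing callers of that method would not need to change.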
@martindurant I have created a decorator to manage the 429 HTTP errors and have tested it for reading and writing parquet files with the pyarrow library. Still, for some unknown reason it doesn't work when reading from a partitioned root directory (using the hive partitioning scheme). Do you think it would be wise to merge the current fix into the repo at least, and later try to find the issue with reading from a partitioned directory structure?
I am happy to look at your fix and see if I can suggest some generalisation of it.
Should I create a pull request, or do you prefer any other method?
PR is best, yes
PR created