filesystem_spec
429 Client Error: Too Many Requests for url azuredatabricks.net/api/2.0/dbfs/mkdirs
I'm using fsspec.implementations.dbfs.DatabricksFileSystem with pyarrow to write a parquet dataset to DBFS, but when I use this filesystem with pyarrow's write_to_dataset method I get the following error:
429 Client Error: Too Many Requests for url: https://adb-<databricks_instance>.azuredatabricks.net/api/2.0/dbfs/mkdirs
The code used to write to DBFS is the following:
""" import pyarrow as pa import pyarrow.dataset as ds import pyarrow.parquet as pq from fsspec.implementations.dbfs import DatabricksFileSystem
base_path = "/FileStore/write" test_df = pd.read_csv("../data/diabetes/csv/nopart/diabetes.csv")
filesystem = DatabricksFileSystem( instance=<databricks_instance>, token=<databricks_token> )
pq.write_to_dataset(arr_table, filesystem=filesystem, compression='none', existing_data_behavior='error', partition_cols=["Pregnancies"], root_path=f"{base_path}/parquet/part", use_threads= False) """
And the full stack trace is:
"""
Traceback (most recent call last):
File "
I found a thread on StackOverflow about handling the 429 error that could be a potential fix for this:
https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python
""" Receiving a status 429 is not an error, it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP, you should simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a "Retry-after" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and to sleep your process for that many seconds.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3 """
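To make that concrete, here is a minimal sketch of that approach using a plain requests call; the URL and request body below are placeholders for illustration, not the actual DBFS API call:
"""
import time
import requests

url = "https://example.com/api/resource"  # placeholder endpoint, not the real DBFS URL
payload = {"path": "/FileStore/write"}    # placeholder body

response = requests.post(url, json=payload)
if response.status_code == 429:
    # The server is asking us to slow down: honour Retry-After (default to 1 second) and retry once.
    wait = int(response.headers.get("Retry-After", 1))
    time.sleep(wait)
    response = requests.post(url, json=payload)
response.raise_for_status()
"""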
fsspec.implementations.dbfs.DatabricksFileSystem._send_to_api does not currently handle retries. Retries are used in other implementations, with exponential backoff schemes, to cope with this kind of "should work but not right now" message. If someone were to apply it to this FS, it would be appreciated.
@martindurant Could you point me to one of the implementations that handles the retries, so I can take a look and see if I'm able to implement a fix?
Here's a complete, complex version that can almost be applied to this case as-is: https://github.com/fsspec/gcsfs/blob/main/gcsfs/retry.py#L118
It would actually be very reasonable to have a retry decorator in this repo, that can be applied to a number of "call this remote thing" methods.
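As a rough sketch of what such a decorator could look like (the retry_request name, retry count and backoff constants are illustrative, and it assumes the wrapped method raises requests.HTTPError on a 429):
"""
import time
import functools
import requests

def retry_request(max_retries=5, base_delay=1.0):
    # Generic retry decorator for "call this remote thing" methods.
    # On a 429 it sleeps, preferring the Retry-After header when present,
    # otherwise falling back to exponential backoff, and then retries.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except requests.HTTPError as e:
                    status = e.response.status_code if e.response is not None else None
                    if status != 429 or attempt == max_retries - 1:
                        raise
                    retry_after = e.response.headers.get("Retry-After")
                    delay = float(retry_after) if retry_after else base_delay * 2 ** attempt
                    time.sleep(delay)
        return wrapper
    return decorator
"""
If something like this were applied to a method such as _send_to_api, the existing callers of that method would not need to change.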
@martindurant I have created a decorator to manage the 429 HTTP errors and have tested it for reading and writing parquet files with the pyarrow library. Still, for some unknown reason it doesn't work when reading from a partitioned root directory (using the hive partitioning scheme). Do you think it would be wise to merge the current fix into the repo at least, and later try to find the issue with reading from a partitioned directory structure?
I am happy to look at your fix and see if I can suggest some generalisation of it.
Should I create a pull request, or do you prefer any other method?
PR is best, yes
PR created