[FEATURE] support larger file uploads in dbx sync
Expected Behavior
When uploading a local project to Databricks (running: dbx sync dbfs --source=.), if a file exceeds some specific size, I would expect a warning and for that file to be ignored (at least behind a flag option). It would also be nice to be able to increase the maximum allowed size.
Current Behavior
The following error is raised and the process stops: [dbx][2022-07-22 16:41:09.057] HTTP 400: {"error_code":"MAX_BLOCK_SIZE_EXCEEDED","message":"The 'contents' data cannot exceed 1048576 bytes. Found: 1570331 bytes. You might want to use streaming upload instead."}
Steps to Reproduce (for bugs)
Run the following command in a folder containing a file bigger than 1048576 bytes: dbx sync dbfs --source=.
Context
Your Environment
- dbx version used: 0.6.8
- Databricks Runtime version: 10.4
Hi @jbpolle, as a quick fix, have you tried using exclude-pattern or exclude dirs for this specific file?
@renardeinside Thanks for the reply. This is actually what I did and it worked, but it could be annoying on other projects.
I would expect a warning and that this file would be ignored
Hey @jbpolle,
this approach contradicts the Zen of Python, since "Explicit is better than implicit".
An approach based on a warning essentially silences the error for the end user, which might lead to unexpected behaviour.
Therefore I'm closing this ticket; I don't see a valid change that would improve the situation.
@renardeinside why do we need such a small max_size, and why can't it be configurable? It is quite annoying to have to manually exclude files that are too big.
Hi @jbpolle,
the decision on max_size is not made by the dbx team; it is rather a limitation of the platform.
What we can do is work around the limitation by using the DBFS APIs. I'll re-open the ticket accordingly.
@matthayes could you please take a look into this?
We can overcome the max_size limitation with the following approach: instead of a direct upload that hits the max_size limit, use the same client to perform Create + while { AddBlock(1MB) } + Close.
Relevant APIs are here:
Not sure how this fits into the async nature of the code, but hopefully it could be implemented.
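For illustration, here is a minimal Python sketch of that Create + AddBlock + Close loop against the DBFS REST API (/api/2.0/dbfs/create, /api/2.0/dbfs/add-block, /api/2.0/dbfs/close). It assumes DATABRICKS_HOST and DATABRICKS_TOKEN environment variables; the function name upload_large_file and the file paths are made up for this example, and this is not how dbx itself is structured.

```python
import base64
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]  # e.g. https://<workspace>.cloud.databricks.com
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
CHUNK = 1024 * 1024  # 1 MB, the documented per-block limit


def _post(endpoint: str, payload: dict) -> dict:
    resp = requests.post(f"{HOST}/api/2.0/dbfs/{endpoint}", headers=HEADERS, json=payload)
    resp.raise_for_status()
    return resp.json()


def upload_large_file(local_path: str, dbfs_path: str) -> None:
    """Stream a file of arbitrary size to DBFS in 1 MB blocks."""
    handle = _post("create", {"path": dbfs_path, "overwrite": True})["handle"]
    try:
        with open(local_path, "rb") as f:
            while block := f.read(CHUNK):
                _post("add-block", {"handle": handle,
                                    "data": base64.b64encode(block).decode()})
    finally:
        _post("close", {"handle": handle})


upload_large_file("dist/my_project.pex", "/tmp/my_project.pex")
```

Since each add-block call carries at most 1 MB, an arbitrarily large file never hits MAX_BLOCK_SIZE_EXCEEDED; the trade-off is one HTTP round trip per megabyte.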
Three thoughts on this:
- Instead of giving a warning by default, how about a flag that controls the error behavior (e.g., raise (default), warn)? This way, the user must be explicit about the behavior. The warning could be printed in the summary of operations after the command executes.
- API limitations have been mentioned. These large files can be transferred using the databricks-cli (e.g. databricks fs cp). Could dbx be extended to wrap this functionality when the file sizes exceed the capacity of the currently used API?
- When dbx sync fails due to large file size, the error includes "You might want to use streaming upload instead". I'm not familiar with the streaming upload API, but is it possible that dbx could fall back to streaming when the default method fails? (A sketch of this fallback idea follows below.)
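On the third point, a hedged sketch of what such a fallback could look like, assuming the same DBFS REST endpoints and environment variables as the sketch above. put_with_streaming_fallback is a hypothetical helper, not an existing dbx or databricks-cli function.

```python
import base64
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
CHUNK = 1024 * 1024  # 1 MB


def _post(endpoint: str, payload: dict) -> dict:
    resp = requests.post(f"{HOST}/api/2.0/dbfs/{endpoint}", headers=HEADERS, json=payload)
    resp.raise_for_status()
    return resp.json()


def put_with_streaming_fallback(local_path: str, dbfs_path: str) -> None:
    """Try the one-shot /dbfs/put first; on MAX_BLOCK_SIZE_EXCEEDED, retry
    with the streaming create/add-block/close sequence."""
    with open(local_path, "rb") as f:  # note: reads the whole file into memory
        data = f.read()
    try:
        _post("put", {"path": dbfs_path,
                      "contents": base64.b64encode(data).decode(),
                      "overwrite": True})
    except requests.HTTPError as err:
        if err.response.json().get("error_code") != "MAX_BLOCK_SIZE_EXCEEDED":
            raise  # some other failure: surface it to the user
        handle = _post("create", {"path": dbfs_path, "overwrite": True})["handle"]
        for i in range(0, len(data), CHUNK):
            _post("add-block", {"handle": handle,
                                "data": base64.b64encode(data[i:i + CHUNK]).decode()})
        _post("close", {"handle": handle})
```

This keeps the fast single-request path for small files and only pays the per-block overhead when the platform limit is actually hit.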
Just curious, is this on the roadmap? I was working with pex files today (which can easily be several hundred MB) with dbx sync, and it timed out, presumably due to the file size.
So I'm wondering whether a better way of working with larger files is coming.