
[FEATURE] support larger file uploads in dbx sync

Open · jbpolle opened this issue 3 years ago • 8 comments

Expected Behavior

When uploading a local project to Databricks (running dbx sync dbfs --source=.), if a file is bigger than some specific size, I would expect a warning and for that file to be ignored (or at least a flag option for that behaviour). It would also be nice to be able to increase the maximum allowed size.

Current Behavior

The following error is raised and the process stops: [dbx][2022-07-22 16:41:09.057] HTTP 400: {"error_code":"MAX_BLOCK_SIZE_EXCEEDED","message":"The 'contents' data cannot exceed 1048576 bytes. Found: 1570331 bytes. You might want to use streaming upload instead."}

Steps to Reproduce (for bugs)

Run the following command in a folder containing a file bigger than 1048576 bytes: dbx sync dbfs --source=.

Context

Your Environment

  • dbx version used: 0.6.8
  • Databricks Runtime version: 10.4

jbpolle · Jul 22 '22 21:07

Hi @jbpolle, as a quick fix, have you tried using exclude-pattern or exclude-dirs for this specific file?

renardeinside · Jul 27 '22 09:07

@renardeinside Thanks for the reply. This is actually what I did, and it worked, but it could be annoying in other projects.

jbpolle · Jul 27 '22 13:07

I would expect a warning and that this file would be ignored

Hey @jbpolle ,

This approach contradicts the Zen of Python, since "explicit is better than implicit". An approach with a warning essentially silences the error for the end user, which might lead to unexpected behaviour. Therefore I'm closing this ticket; I don't see a valid change that would improve the situation.

renardeinside · Jul 28 '22 15:07

@renardeinside Why do we need to have such a small max_size, and why can't it be configurable? It is quite annoying to have to manually exclude files that are too big.

jbpolle · Jul 28 '22 16:07

Hi @jbpolle, the max_size is not a decision made by the dbx team; it is a limitation of the platform. What we can do is work around the limitation by using the DBFS APIs. I'll re-open the ticket accordingly.

renardeinside · Jul 28 '22 17:07

@matthayes could you please take a look into this?

We can overcome the max_size limitation with the following approach: use the same client to perform Create, then AddBlock in a loop with 1 MB blocks, then Close, instead of a direct upload that hits the max_size limitation.

Relevant APIs are the DBFS Create, AddBlock and Close endpoints.

Not sure how this fits into the async nature of the code, but hopefully it could be implemented.
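
A minimal sketch of that chunked upload, assuming plain requests calls against the DBFS REST endpoints (the helper name upload_large_file and the host/token plumbing are illustrative, not the actual dbx client code):

```python
import base64

import requests

CHUNK_SIZE = 1024 * 1024  # the DBFS API accepts at most 1 MB of data per block


def upload_large_file(host: str, token: str, local_path: str, dbfs_path: str) -> None:
    """Upload a file of arbitrary size via DBFS Create + AddBlock + Close."""
    headers = {"Authorization": f"Bearer {token}"}

    # Create: open a streaming handle for the target DBFS path.
    resp = requests.post(
        f"{host}/api/2.0/dbfs/create",
        headers=headers,
        json={"path": dbfs_path, "overwrite": True},
    )
    resp.raise_for_status()
    handle = resp.json()["handle"]

    try:
        with open(local_path, "rb") as f:
            # AddBlock: append base64-encoded chunks of at most 1 MB each.
            while chunk := f.read(CHUNK_SIZE):
                requests.post(
                    f"{host}/api/2.0/dbfs/add-block",
                    headers=headers,
                    json={"handle": handle, "data": base64.b64encode(chunk).decode()},
                ).raise_for_status()
    finally:
        # Close: finalize the stream so the file becomes visible on DBFS.
        requests.post(
            f"{host}/api/2.0/dbfs/close", headers=headers, json={"handle": handle}
        ).raise_for_status()
```

As far as I can tell, this is essentially what the databricks-cli does for dbfs cp: read 1 MB of raw bytes per AddBlock call, base64-encode each block, and close the handle at the end.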

renardeinside · Jul 28 '22 17:07

Three thoughts on this:

  1. Instead of giving a warning by default, how about a flag that controls the error behavior (e.g., raise (default) or warn)? This way, the user must be explicit about the behavior. The warning could be printed in the summary of operations after the command executes.
  2. API limitations have been mentioned. These large files can be transferred using the databricks-cli (e.g. databricks fs cp). Could dbx be extended to wrap this functionality when file sizes exceed the capacity of the currently used API?
  3. When dbx sync fails due to a large file, the error includes "You might want to use streaming upload instead". I'm not familiar with the streaming upload API, but could dbx fall back to streaming when the default method fails? (A sketch of such a fallback follows this list.)
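
On point 3, a fallback could be as simple as catching MAX_BLOCK_SIZE_EXCEEDED from the single-shot upload and retrying with the chunked path. A sketch under the same assumptions as the earlier comment; direct_put stands in for whatever dbx currently does, and upload_large_file is the hypothetical chunked helper sketched above:

```python
import base64

import requests


def direct_put(host: str, token: str, local_path: str, dbfs_path: str) -> None:
    """Single-shot upload via DBFS Put; fails with MAX_BLOCK_SIZE_EXCEEDED past 1 MB."""
    with open(local_path, "rb") as f:
        contents = base64.b64encode(f.read()).decode()
    resp = requests.post(
        f"{host}/api/2.0/dbfs/put",
        headers={"Authorization": f"Bearer {token}"},
        json={"path": dbfs_path, "contents": contents, "overwrite": True},
    )
    resp.raise_for_status()


def sync_file(host: str, token: str, local_path: str, dbfs_path: str) -> None:
    """Try the direct put first; fall back to chunked streaming on size errors."""
    try:
        direct_put(host, token, local_path, dbfs_path)
    except requests.HTTPError as e:
        code = e.response.json().get("error_code") if e.response is not None else None
        if code == "MAX_BLOCK_SIZE_EXCEEDED":
            # Fall back to the Create + AddBlock + Close uploader sketched earlier.
            upload_large_file(host, token, local_path, dbfs_path)
        else:
            raise
```

How to surface the fallback (silently, as a warning, or behind the flag from point 1) is a separate question from whether it works mechanically.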

lukeSmth · Nov 20 '22 21:11

Just curious whether this is on the roadmap? I was working with pex files today (which can easily be several hundred MB) with dbx sync, and it timed out, presumably due to the file size.

So I'm wondering whether a better way to work with larger files is coming.

ericfeunekes · Dec 08 '22 21:12