switch to "pull" model for uploads and working directory

Open coyotemarin opened this issue 6 years ago • 2 comments

Currently, mrjob's code does a fair amount of work to ensure that everything that needs to appear in a job's working directory or be accessible by Hadoop/Spark gets named and uploaded. Things don't usually slip through the cracks, but that's mostly thanks to extensive unit tests, and the code is hard to maintain.

We should switch to a declarative model, something like:

  1. decide what the job's steps should be
  2. pick final names for un-named files etc.
  3. upload files
  4. submit the job's steps

It's a bit more complex than that; for example, to build a setup wrapper script, you need to know what the other files in the working directory are named, and then name the setup script itself.

Really, though, unless the user explicitly specifies a name for a file, mrjob can pick any name it likes. So maybe the steps are more like:

  1. nail down the names of any files specified by the user
  2. build the job's steps, picking names for other files on-the-fly
  3. upload files
  4. submit the job's steps
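
As a rough illustration of steps 1 and 2, here is a minimal sketch of a name registry that pins user-specified names first and then hands out unused names on demand. `WorkingDirNames`, `pin()`, `name_for()`, and the example paths are all hypothetical names made up for this sketch; none of them are part of mrjob's actual API.

```python
import os


class WorkingDirNames:
    """Hypothetical helper (not mrjob's API): maps local paths to the
    names they will have in the job's working directory."""

    def __init__(self):
        self._path_to_name = {}
        self._taken = set()

    def pin(self, path, name):
        """Step 1: reserve a name the user explicitly asked for."""
        if path in self._path_to_name:
            raise ValueError('path already named: %s' % path)
        if name in self._taken:
            raise ValueError('name already taken: %s' % name)
        self._path_to_name[path] = name
        self._taken.add(name)

    def name_for(self, path):
        """Step 2: return the name for path, picking an unused one
        on the fly if the path has not been named yet."""
        if path in self._path_to_name:
            return self._path_to_name[path]
        base = os.path.basename(path) or 'file'
        stem, ext = os.path.splitext(base)
        name, n = base, 1
        while name in self._taken:
            # Add a numeric suffix until the name no longer collides.
            name = '%s-%d%s' % (stem, n, ext)
            n += 1
        self._path_to_name[path] = name
        self._taken.add(name)
        return name

    def uploads(self):
        """Step 3 input: everything that was named, as path -> name."""
        return dict(self._path_to_name)


names = WorkingDirNames()
names.pin('/home/me/data.db', 'data.db')           # user-specified name
other = names.name_for('/tmp/other/data.db')       # collides, gets a suffix
wrapper = names.name_for('/tmp/setup-wrapper.sh')  # named last, on demand
```

Under this sketch, the setup wrapper script can be named after everything it refers to, and `uploads()` only ever contains files that something actually asked for, which is the "pull" part of the idea.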

coyotemarin avatar Dec 19 '18 20:12 coyotemarin

See https://github.com/Yelp/mrjob/issues/1376#issuecomment-448051630 for an example of why we might want to re-work the working dir/upload model.

coyotemarin avatar Dec 19 '18 21:12 coyotemarin

Did #1922 without switching to a pull model, instead de-coupling management of the working directory from uploading.

coyotemarin avatar Mar 22 '19 21:03 coyotemarin