mrjob switch to "pull" model for uploads and working directory

Open coyotemarin opened this issue 6 years ago • 2 comments

Currently, mrjob's code does a fair amount of work to ensure that everything that needs to appear in a job's working directory or be accessible to Hadoop/Spark gets named and uploaded. Things don't usually slip through the cracks, but that's mostly thanks to extensive unit tests, and the code is hard to maintain.

We should switch to a declarative model, something like:

decide what the job's steps should be
pick final names for un-named files etc.
upload files
submit the job's steps

It's a bit more complex than that; for example, to build a setup wrapper script, you need to know what the other files in the working directory are named, and then name the setup script itself.

Really, though, unless the user explicitly specifies a name for a file, mrjob can pick any name it likes. So maybe the steps are more like:

nail down the names of any file specified by the user
build the job's steps, picking names for other files on-the-fly
upload files
submit the job's steps
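
To make the ordering concrete, here is a rough sketch of the first two steps. `NamePlanner` and its methods are hypothetical names made up for illustration, not part of mrjob's API, and the collision scheme (`setup-1.sh`) is just one possible choice. The idea is that user-specified names get pinned first, and anything else gets a free name the moment a step asks for it; by the time steps are built, the planner knows exactly what needs uploading.

```python
import os.path


class NamePlanner:
    """Hypothetical helper (not mrjob's API): tracks names in the working dir."""

    def __init__(self):
        self._path_to_name = {}  # source path -> name in the working directory
        self._taken = set()      # names that have already been claimed

    def reserve(self, path, name):
        """Pin a name the user explicitly asked for."""
        if name in self._taken and self._path_to_name.get(path) != name:
            raise ValueError('name already taken: %s' % name)
        self._path_to_name[path] = name
        self._taken.add(name)

    def name_for(self, path):
        """Return the path's name, picking a free one on the fly if needed."""
        if path in self._path_to_name:
            return self._path_to_name[path]
        base = os.path.basename(path) or 'file'
        stem, ext = os.path.splitext(base)
        name, n = base, 1
        while name in self._taken:
            # assumed collision scheme: append a counter before the extension
            name = '%s-%d%s' % (stem, n, ext)
            n += 1
        self._path_to_name[path] = name
        self._taken.add(name)
        return name

    def files_to_upload(self):
        """Everything that was named, i.e. everything a step will refer to."""
        return dict(self._path_to_name)


planner = NamePlanner()

# nail down names specified by the user
planner.reserve('/home/me/data.db', 'lookup.db')

# while building steps, other files get names as they come up
first = planner.name_for('/tmp/a/setup.sh')
second = planner.name_for('/tmp/b/setup.sh')

# the setup wrapper is named last, once the other names are known
wrapper = planner.name_for('/tmp/generated/setup-wrapper.sh')

to_upload = planner.files_to_upload()
```

Naming the wrapper script last covers the case above where it has to refer to the other files in the working directory by name.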

Dec 19 '18 20:12 coyotemarin

See https://github.com/Yelp/mrjob/issues/1376#issuecomment-448051630 for an example of why we might want to re-work the working dir/upload model.

Dec 19 '18 21:12 coyotemarin

Did #1922 without switching to a pull model, instead de-coupling management of the working directory from uploading.

Mar 22 '19 21:03 coyotemarin
