spotify-tensorflow icon indicating copy to clipboard operation
spotify-tensorflow copied to clipboard

#192: (WIP: Needs Work) Replace PythonDataflowTask With BeamDataflowJobTask

Open rclough opened this issue 5 years ago • 2 comments

Removes PythonDataflowTask and related files/tests to replace it with the luigi.contrib BeamDataflowJobTask

rclough avatar Jun 18 '19 20:06 rclough

Codecov Report

Merging #199 into master will decrease coverage by 1.19%. The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #199     +/-   ##
=========================================
- Coverage   89.11%   87.91%   -1.2%     
=========================================
  Files          12       11      -1     
  Lines         744      596    -148     
=========================================
- Hits          663      524    -139     
+ Misses         81       72      -9
Impacted Files Coverage Δ
spotify_tensorflow/luigi/tfx_task.py 66.66% <80%> (-27.78%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8e807b9...7bb29e7. Read the comment docs.

codecov[bot] avatar Jun 18 '19 20:06 codecov[bot]

There is still work to be done on this ticket. Documenting some discussions on how to proceed:

Switching to BeamDataflowJobTask is just a bit more involved than what I'd initially realized, with the refactoring that was done for Dataflow input arguments.

The most pertinent point is that now there is a class called DataflowParamKeys that needs to be passed to the BeamDataflowJobTask on initialization. Historically, beam tasks have simply initialized the object and directly accessed the variables in the class to set parameters.

There are, generally, 2 approaches that can be taken here:

  1. Pass the responsibility of passing in the DataflowParamKeys to the end user, and use the new BeamDataflowJobTask more or less as a drop in replacement with a few extra TF specific parameters. This would be a pretty big API break that would require a decent amount of work from end users to upgrade to
  2. Keep the current API, and find a way to convert it into DataflowParamKeys or otherwise work around the requirement. The former would require some magic (ie not directly inheriting the BeamDataflowJobTask, but rather initializing it on run when all the parameters have been set the old way). The latter seems like the easiest approach: pass an empty DataflowParamKeys on init, and allow the user to edit the params directly. If theres a refactored name, you can use a setter for that value to make sure its mapped to the new name.

rclough avatar Jun 21 '19 19:06 rclough