spotify-tensorflow
spotify-tensorflow copied to clipboard
#192: (WIP: Needs Work) Replace PythonDataflowTask With BeamDataflowJobTask
Removes PythonDataflowTask
and related files/tests to replace it with the luigi.contrib BeamDataflowJobTask
Codecov Report
Merging #199 into master will decrease coverage by
1.19%
. The diff coverage is80%
.
@@ Coverage Diff @@
## master #199 +/- ##
=========================================
- Coverage 89.11% 87.91% -1.2%
=========================================
Files 12 11 -1
Lines 744 596 -148
=========================================
- Hits 663 524 -139
+ Misses 81 72 -9
Impacted Files | Coverage Δ | |
---|---|---|
spotify_tensorflow/luigi/tfx_task.py | 66.66% <80%> (-27.78%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 8e807b9...7bb29e7. Read the comment docs.
There is still work to be done on this ticket. Documenting some discussions on how to proceed:
Switching to BeamDataflowJobTask
is just a bit more involved than what I'd initially realized, with the refactoring that was done for Dataflow input arguments.
The most pertinent point is that now there is a class called DataflowParamKeys
that needs to be passed to the BeamDataflowJobTask
on initialization. Historically, beam tasks have simply initialized the object and directly accessed the variables in the class to set parameters.
There are, generally, 2 approaches that can be taken here:
- Pass the responsibility of passing in the
DataflowParamKeys
to the end user, and use the newBeamDataflowJobTask
more or less as a drop in replacement with a few extra TF specific parameters. This would be a pretty big API break that would require a decent amount of work from end users to upgrade to - Keep the current API, and find a way to convert it into
DataflowParamKeys
or otherwise work around the requirement. The former would require some magic (ie not directly inheriting theBeamDataflowJobTask
, but rather initializing it on run when all the parameters have been set the old way). The latter seems like the easiest approach: pass an emptyDataflowParamKeys
on init, and allow the user to edit the params directly. If theres a refactored name, you can use a setter for that value to make sure its mapped to the new name.