marklogic-data-hub
[FEATURE] Small QOL changes to RunFlowTask to make extending the class slightly easier
What gap would you like Data Hub Framework to address?
We have a custom class that currently stands in for RunFlowTask: we had to copy its code out and rework it as another extension of HubTask to fit our needs. Ideally we'd replace that with an extends call, and there are two small changes we'd need to make that possible. I'm fairly certain both can be built in a non-breaking manner, so I figured I'd pose them and see if either is agreeable.
Describe the solution you'd like
First: add an options class variable on RunFlowTask, and introduce a new if case at the beginning of this block that pulls from it. This seems to follow the existing paradigm where all the other variables can be set externally, in properties files, or via the command line, but can be overridden by attaching them to the class. Putting this case first and defaulting the variable to null shouldn't break the current setup, if I understand right.
The use case is where multiple RunFlowTasks are chained together within the same gradle call, and the developer wants to use different options for each RunFlowTask. We have a bit of an unorthodox setup, but I have a feeling there are valid situations where calling multiple RunFlowTasks would be recommended or required. A workaround we used with RunFlowTask as it is now was to re-set the options property between executions, but that feels a bit weird.
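To make the first proposal concrete, here is a minimal sketch of the precedence being asked for. It is a simplified model and not the actual RunFlowTask source: the class name, the `resolveOptions` method, and the `projectOptions` parameter (standing in for whatever the task resolves today from properties files or the command line) are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model only; none of these names come from the Data Hub codebase.
public class OptionsPrecedenceSketch {

    // Hypothetical stand-in for the proposed class variable.
    // null means "nothing was attached to this task instance".
    private Map<String, Object> options = null;

    public void setOptions(Map<String, Object> options) {
        this.options = options;
    }

    // projectOptions stands in for the value the task resolves today
    // (properties file, command line, and so on).
    public Map<String, Object> resolveOptions(Map<String, Object> projectOptions) {
        if (options != null) {
            // New first case: a value attached to the task instance wins.
            return options;
        }
        // With the variable left at null, existing behavior is unchanged.
        return projectOptions;
    }

    public static void main(String[] args) {
        Map<String, Object> fromProject = new LinkedHashMap<>();
        fromProject.put("source", "source1");

        // Task with no override falls through to the project-level value.
        OptionsPrecedenceSketch unchanged = new OptionsPrecedenceSketch();
        System.out.println(unchanged.resolveOptions(fromProject));

        // Task with its own options ignores the project-level value.
        Map<String, Object> perTask = new LinkedHashMap<>();
        perTask.put("source", "source2");
        OptionsPrecedenceSketch overridden = new OptionsPrecedenceSketch();
        overridden.setOptions(perTask);
        System.out.println(overridden.resolveOptions(fromProject));
    }
}
```

With this shape, two chained task instances can each carry their own options without re-setting a shared property between executions.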
Second: add a printOutput class variable on RunFlowTask, so that these printouts can be prevented. Defaulting it to true shouldn't break the current setup, if I understand right.
The use case is that when a significant number of the documents processed during a step encounter an error, the stepOutput block becomes rather large and unwieldy. Asking for that to be solved is a can of worms unto itself, but we've built a post-processor in our new HubTask extension that de-dupes all the error printouts. The reason we can't use RunFlowTask for this is that the large block gets printed out regardless, and we don't want to duplicate that.
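A rough sketch of the second proposal, again as a simplified model and not the real task: a `printOutput` flag that defaults to true, plus one possible way a subclass might collapse repeated error messages. The `report` and `dedupe` methods and the count-per-message approach are assumptions for illustration; our actual post-processor is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model only; none of these names come from the Data Hub codebase.
public class PrintOutputSketch {

    // Hypothetical flag mirroring the proposed class variable.
    // true keeps the current "always print" behavior.
    private boolean printOutput = true;

    public void setPrintOutput(boolean printOutput) {
        this.printOutput = printOutput;
    }

    // Returns the lines that would be printed, so the effect is easy to inspect.
    public List<String> report(List<String> stepOutput) {
        if (!printOutput) {
            // Printing suppressed: a subclass can emit its own summary instead.
            return new ArrayList<>();
        }
        return new ArrayList<>(stepOutput);
    }

    // One possible post-processor: collapse identical messages into counts,
    // preserving the order in which each message first appeared.
    public static Map<String, Integer> dedupe(List<String> stepOutput) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : stepOutput) {
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> stepOutput = Arrays.asList("error A", "error A", "error B");

        PrintOutputSketch task = new PrintOutputSketch();
        task.setPrintOutput(false);

        // Nothing from the raw block is printed; only the summary is.
        for (String line : task.report(stepOutput)) {
            System.out.println(line);
        }
        dedupe(stepOutput).forEach((msg, n) -> System.out.println(n + "x " + msg));
    }
}
```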
Again, posing these because both should be relatively low-impact. Our current solution by extending HubTask seems to be stable and functional, so this isn't high priority by any means. Let me know if any more information is needed!
Thanks, this is really good feedback. Couple immediate questions:
- When you chain multiple executions of RunFlowTask together, are you running the same flow with different sets of steps, and/or are you running different flows?
- What is launching Gradle? Is it invoked via a shell script that is e.g. executed by cron or some other job scheduler? And is there benefit to only calling Gradle once, as opposed to multiple calls to Gradle, with each call running a single instance of RunFlowTask?
When you chain multiple executions of RunFlowTask together, are you running the same flow with different sets of steps,
Maybe like 5% of the cases, but there are a few of these.
and/or are you running different flows?
This is far more frequent. The "unorthodox setup" is that almost all of our RunFlowTask calls are one-step-at-a-time, I think because our options values are dynamic. The complementary 95% to the above 5% are all one-step flows. (I don't know if this is actually weird, but it predates me being here; it may be switchable, but it would be a lot of work to redo when it's functional now.)
So yup, both.
What is launching Gradle? Is it invoked via a shell script that is e.g. executed by cron or some other job scheduler? And is there benefit to only calling Gradle once, as opposed to multiple calls to Gradle, with each call running a single instance of RunFlowTask?
Command-line calls (during development) and through NiFi (on QA/Prod). We'd have to queue all of the different RunFlowTasks together in a particular order at some point, and it's just easier to do it in Gradle using dependsOn/mustRunAfter. This streamlines the work and also means developers are reliably running the same command (and order) during development that NiFi does on our servers.
One new feature for 5.5 is that we'll have support for step-specific options in the runtime options you pass in. Sounds like that might help simplify things?
If "runtime options" means the options we pass in to Gradle, I still think that's not going to be enough for our setup. I'm gonna try to base-case this:
- We have a processData Gradle task that runs ingestData and harmonizeData for a combination of "source" and "instance."
- ingestData is a flow with destination collection /raw/${source}/${instance}
- harmonizeData is a flow with source collection /raw/${source}/${instance} and /prepared/${source}/${instance}
We need to be able to run processData -Psource=source1 -Pinstance=detailInstance, as well as processData -Psource=source2 -Pinstance=headerInstance. This means we'd either need to define separate flows for all four combinations (source1/detail, source1/header etc.) and then define collections/sourceQuery in the options eight times, or we could just define one task, one ingest flow step, one harmonization flow step, and dynamically populate the options in Gradle.
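As a rough illustration of the "dynamically populate the options" alternative, the sketch below builds per-run options from the two -P values. The helper names are hypothetical, the `cts.collectionQuery(...)` string is only one plausible form for a sourceQuery value, and treating /prepared/${source}/${instance} as the harmonization output collection is an assumption on top of the description above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helpers only; none of these names come from the Data Hub codebase.
public class DynamicOptionsSketch {

    public static String rawCollection(String source, String instance) {
        return "/raw/" + source + "/" + instance;
    }

    public static String preparedCollection(String source, String instance) {
        return "/prepared/" + source + "/" + instance;
    }

    // Options for the single ingest step: documents land in the raw collection.
    public static Map<String, Object> ingestOptions(String source, String instance) {
        Map<String, Object> options = new LinkedHashMap<>();
        options.put("collections", Collections.singletonList(rawCollection(source, instance)));
        return options;
    }

    // Options for the single harmonization step: read from raw and,
    // by assumption, write to the prepared collection.
    public static Map<String, Object> harmonizeOptions(String source, String instance) {
        Map<String, Object> options = new LinkedHashMap<>();
        options.put("sourceQuery",
                "cts.collectionQuery('" + rawCollection(source, instance) + "')");
        options.put("collections",
                Collections.singletonList(preparedCollection(source, instance)));
        return options;
    }

    public static void main(String[] args) {
        // Equivalent of: processData -Psource=source1 -Pinstance=detailInstance
        System.out.println(ingestOptions("source1", "detailInstance"));
        System.out.println(harmonizeOptions("source1", "detailInstance"));
    }
}
```

The same two helpers cover every source/instance combination, which is what lets one task and two flow steps replace the four separate flow definitions.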
Does that make sense, and did I understand what you were describing correctly?