samza
SAMZA-2465: Task inputs information lost when `RegExTopicGenerator` is enabled or `task.inputs` is specified explicitly in a non-legacy application
Symptom
If the user's non-legacy application enables RegExTopicGenerator or specifies task.inputs explicitly, and also specifies input streams in its application descriptor, the user would expect the application to consume messages from both the input streams in the descriptor and the Kafka topics that match the specified regex patterns.
However, in the current logic the input information from the application descriptor appears to be overridden by the information from RegExTopicGenerator or task.inputs in the config file, which means the user's application can only consume from the matched Kafka topics or the inputs specified in task.inputs.
Cause
The task inputs generated from the application descriptor are overridden by the JobNodeConfigurationGenerator.mergeConfig function.
Changes
Merge the generated inputs and the original inputs before calling JobNodeConfigurationGenerator.mergeConfig.
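To illustrate the difference between the two behaviors, here is a minimal sketch. It is not the actual JobNodeConfigurationGenerator code: plain maps stand in for Samza config objects, and the class and method names (`TaskInputsMergeSketch`, `overrideMerge`, `mergeInputs`, `mergeWithInputs`) as well as the stream names are hypothetical. It only assumes that `task.inputs` holds a comma-separated list of streams.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: plain maps stand in for Samza config objects.
final class TaskInputsMergeSketch {
  static final String TASK_INPUTS = "task.inputs";

  // Plain override: for any key present in both maps, the original
  // (user-supplied) value replaces the generated one.
  static Map<String, String> overrideMerge(Map<String, String> generated,
                                           Map<String, String> original) {
    Map<String, String> result = new HashMap<>(generated);
    result.putAll(original);
    return result;
  }

  // Hypothetical helper: unions two comma-separated stream lists,
  // dropping duplicates and blanks while keeping first-seen order.
  static String mergeInputs(String generated, String original) {
    Set<String> merged = new LinkedHashSet<>();
    for (String list : Arrays.asList(generated, original)) {
      if (list == null) {
        continue;
      }
      for (String stream : list.split(",")) {
        String trimmed = stream.trim();
        if (!trimmed.isEmpty()) {
          merged.add(trimmed);
        }
      }
    }
    return String.join(",", merged);
  }

  // Override everything except task.inputs, which is unioned instead.
  static Map<String, String> mergeWithInputs(Map<String, String> generated,
                                             Map<String, String> original) {
    Map<String, String> result = overrideMerge(generated, original);
    if (generated.containsKey(TASK_INPUTS) || original.containsKey(TASK_INPUTS)) {
      result.put(TASK_INPUTS,
          mergeInputs(generated.get(TASK_INPUTS), original.get(TASK_INPUTS)));
    }
    return result;
  }

  public static void main(String[] args) {
    // Inputs generated from the application descriptor (made-up names).
    Map<String, String> generated = new HashMap<>();
    generated.put(TASK_INPUTS, "kafka.page-views");

    // Inputs from the config file, e.g. after regex expansion (made-up names).
    Map<String, String> original = new HashMap<>();
    original.put(TASK_INPUTS, "kafka.clicks-a,kafka.clicks-b");

    System.out.println(overrideMerge(generated, original).get(TASK_INPUTS));
    System.out.println(mergeWithInputs(generated, original).get(TASK_INPUTS));
  }
}
```

With the plain override, the descriptor input `kafka.page-views` is dropped; with the union, all three streams remain in `task.inputs`.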
Tests
- [x] All unit tests and integration tests pass
API Changes
None
Upgrade Instructions
None
Usage Instructions
None
@bkonold can you take a look at this PR?
It is my understanding that we override generated configs to allow for job deployment to be reconfigured without building a new binary of the job. This is useful, for example, when managing issues in production and a job needs to be quickly reconfigured; the job's configuration can be modified and the job redeployed with the same binary vs needing to touch the job's app descriptor and building a new version of the binary.
For that reason it seems contradictory to merge values for the same key between original and generated config...
More generally, this raises the question of how we treat precedence between original config, rewritten config, and generated configs (from the app descriptor). @kw2542 Since you are working on the deployment flow, can you comment on what the touch points are in the system currently for rewriting configs? Do we have a clear picture of what this precedence is now?
".... input information from the application descriptor will be overrided by the information from RegExTopicGenerator or task.inputs...." Isn't this the desired behavior? If a user provides conflicting values for inputs in the two places, we resolve the conflict in favor of the one in the config. I'm not sure why this is problematic.
+1; this PR negates https://github.com/apache/samza/pull/1065
To clarify the issue here:
If the user's non-legacy application enables RegExTopicGenerator and also specifies input streams in its application descriptor, the user would expect the application to consume messages from both the specified input streams and the Kafka topics that match the specified regex pattern. However, in the current logic the input information from the application descriptor appears to be overridden by the information from RegExTopicGenerator, which means the user's application can only consume from the matched Kafka topics.