SAMZA-2465: Task inputs information lost when `RegExTopicGenerator` is enabled or `task.inputs` is specified explicitly in a non-legacy application

Open alnzng opened this issue 5 years ago • 5 comments

Symptom

If a user’s non-legacy application enables RegExTopicGenerator or sets task.inputs explicitly, and also declares input streams in its application descriptor, the user would expect the application to consume from both sources: the input streams declared in the descriptor, and the Kafka topics that match the configured regex patterns (or the streams listed in task.inputs).

However, the current logic appears to override the input information from the application descriptor with the information from RegExTopicGenerator or task.inputs in the config file, so the application can consume only from the matched Kafka topics or the inputs listed in task.inputs.
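A minimal sketch of the symptom, assuming the merge behaves like a last-writer-wins `Map.putAll`. The class name, the `mergeLastWins` helper, and the stream names are hypothetical and only illustrate the behavior described above; this is not Samza's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class TaskInputsOverrideSketch {
    // Hypothetical stand-in for a last-writer-wins config merge (not Samza's implementation).
    static Map<String, String> mergeLastWins(Map<String, String> generated,
                                             Map<String, String> original) {
        Map<String, String> merged = new HashMap<>(generated);
        merged.putAll(original); // a key present in both maps keeps the original config's value
        return merged;
    }

    public static void main(String[] args) {
        // Inputs derived from the application descriptor (hypothetical stream name).
        Map<String, String> generated = new HashMap<>();
        generated.put("task.inputs", "kafka.page-views");

        // Inputs from the config file, e.g. topics matched by a regex (hypothetical names).
        Map<String, String> original = new HashMap<>();
        original.put("task.inputs", "kafka.clicks-us,kafka.clicks-eu");

        // Prints "kafka.clicks-us,kafka.clicks-eu": the descriptor's stream is gone.
        System.out.println(mergeLastWins(generated, original).get("task.inputs"));
    }
}
```

Under this assumption the descriptor's `kafka.page-views` never reaches the final `task.inputs` value, which matches the reported symptom.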

Cause

The task inputs generated from the application descriptor are overridden in the JobNodeConfigurationGenerator.mergeConfig function.

Changes

Merge the generated inputs with the original inputs before calling JobNodeConfigurationGenerator.mergeConfig.
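The idea can be sketched roughly as follows, assuming `task.inputs` is a comma-separated list of streams. The `unionInputs` helper and the stream names are hypothetical; the real change lives around JobNodeConfigurationGenerator and may differ in detail.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class TaskInputsUnionSketch {
    // Hypothetical helper: unions two comma-separated stream lists,
    // dropping duplicates and blanks while keeping first-seen order.
    static String unionInputs(String generatedInputs, String originalInputs) {
        Set<String> union = new LinkedHashSet<>();
        for (String list : new String[] {generatedInputs, originalInputs}) {
            if (list == null) {
                continue; // one side may not define task.inputs at all
            }
            for (String stream : list.split(",")) {
                String trimmed = stream.trim();
                if (!trimmed.isEmpty()) {
                    union.add(trimmed);
                }
            }
        }
        return String.join(",", union);
    }

    public static void main(String[] args) {
        String fromDescriptor = "kafka.page-views";
        String fromConfig = "kafka.clicks-us, kafka.clicks-eu, kafka.page-views";
        // Prints "kafka.page-views,kafka.clicks-us,kafka.clicks-eu"
        System.out.println(unionInputs(fromDescriptor, fromConfig));
    }
}
```

Writing the combined value back into the original config before the override step would leave both sets of inputs in place, whichever side wins the merge.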

Tests

  • [x] All unit tests and integration tests pass

API Changes

None

Upgrade Instructions

None

Usage Instructions

None

alnzng avatar Feb 21 '20 00:02 alnzng

@bkonold can you take a look at this PR?

mynameborat avatar Jun 11 '20 04:06 mynameborat

It is my understanding that we override generated configs to allow for job deployment to be reconfigured without building a new binary of the job. This is useful, for example, when managing issues in production and a job needs to be quickly reconfigured; the job's configuration can be modified and the job redeployed with the same binary vs needing to touch the job's app descriptor and building a new version of the binary.

For that reason it seems contradictory to merge values for the same key between original and generated config...

More generally, this raises the question of how we treat precedence between the original config, rewritten config, and generated config (from the app descriptor). @kw2542 Since you are working on the deployment flow, can you comment on what the touch points are in the system currently for rewriting configs? Do we have a clear picture of what this precedence is now?

bkonold avatar Jun 11 '20 07:06 bkonold

".... input information from the application descriptor will be overridden by the information from RegExTopicGenerator or task.inputs...." Isn't this the desired behavior? If a user provides conflicting values for inputs in the two places, we resolve the conflict in favor of the one in the config. I'm not sure why this is problematic.

rmatharu-zz avatar Jun 11 '20 17:06 rmatharu-zz

> ".... input information from the application descriptor will be overridden by the information from RegExTopicGenerator or task.inputs...." Isn't this the desired behavior? If a user provides conflicting values for inputs in the two places, we resolve the conflict in favor of the one in the config. I'm not sure why this is problematic.

+1; This PR negates - https://github.com/apache/samza/pull/1065

mynameborat avatar Jun 11 '20 17:06 mynameborat

To clarify the issue:

If a user’s non-legacy application enables RegExTopicGenerator and also declares input streams in its application descriptor, the user would expect the application to consume from both the declared input streams and the Kafka topics that match the configured regex pattern. However, the current logic appears to override the input information from the application descriptor with the information from RegExTopicGenerator, so the application can consume only from the matched Kafka topics.

alnzng avatar Jun 11 '20 17:06 alnzng