hive
HIVE-26459: ReduceRecordProcessor: move to using a timeout version of waitForAllInputsReady(TEZ-3302)
What changes were proposed in this pull request?
Use a timeout version of waitForAllInputsReady to avoid a Tez task getting stuck, and this may trigger a new task attempt to keep the job running normally. Please refer to https://issues.apache.org/jira/browse/HIVE-26459 for details.
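For illustration only, here is a minimal, self-contained sketch of the wait semantics described above. It is not the code in this patch and does not call Tez: the class `InputWaitSketch`, the constant `TIMEOUT_DISABLED`, and the use of a `CountDownLatch` to model pending inputs are all hypothetical. It assumes that a negative timeout (the -1 default mentioned in the discussion) keeps the existing unbounded wait, and that a non-negative timeout returns `false` once the time elapses, so the caller can decide how to react.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the wait logic discussed in this PR; not Hive or Tez code. */
final class InputWaitSketch {

    /** Assumption: a negative timeout means "disabled", i.e. wait without a bound. */
    static final long TIMEOUT_DISABLED = -1L;

    /**
     * Waits until the latch modelling "inputs not yet ready" reaches zero.
     *
     * @return true if all inputs became ready, false if the timeout elapsed first
     */
    static boolean waitForAllInputsReady(CountDownLatch pendingInputs, long timeoutMillis) {
        try {
            if (timeoutMillis < 0) {
                // Timeout disabled: block until every input is ready.
                pendingInputs.await();
                return true;
            }
            // Timeout enabled: give up after timeoutMillis and report it to the caller.
            return pendingInputs.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and surface the failure to the caller.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for inputs", e);
        }
    }

    public static void main(String[] args) {
        // All inputs already ready: returns immediately even with the timeout disabled.
        CountDownLatch ready = new CountDownLatch(1);
        ready.countDown();
        System.out.println(waitForAllInputsReady(ready, TIMEOUT_DISABLED)); // prints "true"

        // One input never becomes ready: the bounded wait gives up after 50 ms.
        CountDownLatch stuck = new CountDownLatch(1);
        System.out.println(waitForAllInputsReady(stuck, 50L)); // prints "false"
    }
}
```

What the caller does when the method returns `false` (for example, failing the attempt so that it can be retried) is left out here, because it depends on the actual patch.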
Why are the changes needed?
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Kudos, SonarCloud Quality Gate passed!
103 Bugs
0 Vulnerabilities
33 Security Hotspots
1767 Code Smells
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.
Waiting until all inputs are ready is intentional, and changing this to a timeout-based approach could destabilize things and cause corner-case issues.
@rbalamohan I agree. That's why the timeout configuration is turned off (-1ms) by default. But maybe we can add this configuration as a workaround for some weird problems, like the stuck task in HIVE-26459.
I'm wondering how we could proceed with this, while trying to understand TEZ-3302 in practice at the same time. @zhangbutao, @rbalamohan: can you explain a scenario in which this timeout is dangerous? If so, depending on the risk, we should be able to decide whether to approve this change (disabled by default) or abandon it altogether.
maybe it sounds weird, but I'm fine with an expert-level setting that can even lead to problems when used incorrectly (that's what we have everywhere in HiveConf :) )
I feel that if we can agree on this, that can let us proceed with TEZ-4445 too
@abstractdog Both this PR and https://issues.apache.org/jira/browse/TEZ-4445 address weird problems that occasionally occur in our busy cluster. I had no luck finding the root causes, so I just provided a workaround that adds a timeout configuration. To be honest, I have no idea which specific danger could be introduced by this change, so I disabled it by default. But as you said, and I definitely agree, we can define it as an expert-level setting and let users choose to enable or disable it.
I'd like to hear your opinion too. @rbalamohan
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.