HIVE-26459: ReduceRecordProcessor: move to using a timeout version of waitForAllInputsReady(TEZ-3302)

Open zhangbutao opened this issue 2 years ago • 1 comments

What changes were proposed in this pull request?

Use the timeout version of waitForAllInputsReady to avoid Tez tasks getting stuck. When the timeout expires, this may trigger a new task attempt so that the job keeps running normally. Please refer to https://issues.apache.org/jira/browse/HIVE-26459 for details.
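As a rough illustration of the idea only, the sketch below models a bounded wait with plain JDK classes. `InputGate`, `markInputReady`, and `startOrFail` are hypothetical names, not the Tez or Hive API, and treating a negative timeout as "wait indefinitely" is an assumption made for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a task's set of inputs; not the Tez or Hive API. */
final class InputGate {
    private final CountDownLatch pending;

    InputGate(int inputCount) {
        this.pending = new CountDownLatch(inputCount);
    }

    /** Called once by each input when it becomes ready. */
    void markInputReady() {
        pending.countDown();
    }

    /**
     * Waits for all inputs. A negative timeout (assumed convention) blocks
     * indefinitely; otherwise returns false if the timeout elapses first.
     */
    boolean waitForAllInputsReady(long timeoutMillis) throws InterruptedException {
        if (timeoutMillis < 0) {
            pending.await();
            return true;
        }
        return pending.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Fails the attempt on timeout so that a scheduler could retry it. */
    void startOrFail(long timeoutMillis) throws InterruptedException {
        if (!waitForAllInputsReady(timeoutMillis)) {
            throw new IllegalStateException(
                "Inputs not ready after " + timeoutMillis + " ms");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        InputGate gate = new InputGate(2);
        gate.markInputReady();                               // only one of two inputs arrives
        System.out.println(gate.waitForAllInputsReady(50));  // false: timed out
        gate.markInputReady();                               // second input arrives
        System.out.println(gate.waitForAllInputsReady(50));  // true: all inputs ready
    }
}
```

With an unbounded wait, the first call in `main` would block forever if the second input never arrived; the bounded version returns `false` instead, which lets the caller fail the attempt rather than hang.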

Why are the changes needed?

Does this PR introduce any user-facing change?

NO

How was this patch tested?

zhangbutao avatar Aug 16 '22 09:08 zhangbutao

Kudos, SonarCloud Quality Gate passed!

Bugs: 103 (rating E)
Vulnerabilities: 0 (rating A)
Security Hotspots: 33 (rating E)
Code Smells: 1767 (rating A)

No Coverage information
No Duplication information

sonarqubecloud[bot] avatar Aug 16 '22 10:08 sonarqubecloud[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Oct 17 '22 00:10 github-actions[bot]

Waiting until all inputs are ready is intentional, and changing this to a timeout-based approach could destabilize things and cause corner-case issues.

rbalamohan avatar Oct 24 '22 11:10 rbalamohan

> Waiting until all inputs are ready is intentional, and changing this to a timeout-based approach could destabilize things and cause corner-case issues.

@rbalamohan I agree. That's why the timeout configuration is turned off (-1ms) by default. But maybe we can add this configuration as a workaround for some weird problems, like the stuck task in HIVE-26459.

zhangbutao avatar Oct 24 '22 12:10 zhangbutao

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 2 (rating A)

No Coverage information
No Duplication information

sonarqubecloud[bot] avatar Oct 24 '22 14:10 sonarqubecloud[bot]

I'm wondering how we could proceed with this, while trying to understand TEZ-3302 in practice at the same time. @zhangbutao, @rbalamohan: can you explain a scenario in which this timeout is dangerous? If so, depending on the risk, we should be able to decide whether to approve this change (disabled by default) or abandon it altogether.

maybe it sounds weird, but I'm fine with an expert-level setting that can even lead to problems when used incorrectly (that's what we have everywhere in HiveConf :) )

I feel that if we can agree on this, that can let us proceed with TEZ-4445 too

abstractdog avatar Nov 22 '22 13:11 abstractdog

> I'm wondering how we could proceed with this, while trying to understand TEZ-3302 in practice at the same time. @zhangbutao, @rbalamohan: can you explain a scenario in which this timeout is dangerous? If so, depending on the risk, we should be able to decide whether to approve this change (disabled by default) or abandon it altogether.

> maybe it sounds weird, but I'm fine with an expert-level setting that can even lead to problems when used incorrectly (that's what we have everywhere in HiveConf :) )

> I feel that if we can agree on this, that can let us proceed with TEZ-4445 too

@abstractdog Both this PR and https://issues.apache.org/jira/browse/TEZ-4445 address weird problems that occasionally occur in our busy cluster. I have had no luck finding the root causes, so I just provided a workaround by adding a timeout configuration. To be honest, I have no idea what specific danger this change could introduce, so I disabled it by default. But as you said, and I definitely agree, we can define it as an expert-level setting and let users choose whether to enable it.

I'd like to hear your opinion too. @rbalamohan

zhangbutao avatar Nov 22 '22 16:11 zhangbutao

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.

github-actions[bot] avatar Jan 22 '23 00:01 github-actions[bot]