starrocks icon indicating copy to clipboard operation
starrocks copied to clipboard

[BugFix] fix query stuck issue in spill restore phase

Open silverbullet233 opened this issue 1 year ago • 3 comments
trafficstars

Why I'm doing:

What I'm doing:

Fixes #47673

Root Cause

The spill restore task supports a yield mechanism, and if a task runs for more than a time slice, it will actively trigger a yield.

In order to avoid an excessive number of IO tasks, the restore phase will control the number of simultaneously submitted restore tasks to not exceed a certain threshold. image

_running_restore_tasks will only be updated at the beginning and end of IO tasks, and will not be updated if yield occurs.

In the implementation of YieldRestoreTask:: do_read, there is a possibility that yield_point==total_yield_point_cnt triggers yield at the same time. (suppose total_yield_point_cnt = 1 and the first time runs for more than a time slice.)

image

At this point, yield_ctx.need_yield=true, the defer defined in L226 will not be executed, and _running_restore_tasks will never decrease.

In the hash join probe stage, if a partition has no output and no restore task in progress, a restore task will be submitted to retrieve the data. In some cases, the above bug can cause the restore task to never actually execute, resulting in the query getting stuck.

But I couldn't find a stable way to reproduce this problem. In my environment, tpch q09 may get stuck once after running dozens of times. After this change, I executed it more than a hundred times without any problems.

What type of PR is this:

  • [x] BugFix
  • [ ] Feature
  • [ ] Enhancement
  • [ ] Refactor
  • [ ] UT
  • [ ] Doc
  • [ ] Tool

Does this PR entail a change in behavior?

  • [ ] Yes, this PR will result in a change in behavior.
  • [x] No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • [ ] Parameter changes: default values, similar parameters but with different default values
  • [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
  • [ ] Feature removed
  • [ ] Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • [ ] I have added test cases for my bug fix or my new feature
  • [ ] This pr needs user documentation (for new or modified features or behaviors)
    • [ ] I have added documentation for my new feature or new function
  • [ ] This is a backport pr

Bugfix cherry-pick branch check:

  • [x] I have checked the version labels which the pr will be auto-backported to the target branch
    • [x] 3.3
    • [ ] 3.2
    • [ ] 3.1
    • [ ] 3.0
    • [ ] 2.5

silverbullet233 avatar Jul 01 '24 03:07 silverbullet233