
[multi-stage] allow configurable timeout

Open walterddr opened this issue 3 years ago • 6 comments

This enables timeout configuration at either:

  • the server config level, via the broker timeout or the query runner timeout
    • either config key is fine; one is in nanoseconds, the other in milliseconds
  • or via query options

TODO: add a test covering the query-option override.
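For illustration, here is a minimal sketch of that override order. The class name and config handling are hypothetical, not Pinot's actual constants; it only shows the intent: a per-query option beats the server-level config, with a cluster default as the last resort.

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the override order described above; the names
// are illustrative, not Pinot's actual config keys.
public final class TimeoutResolver {
  private static final long DEFAULT_TIMEOUT_MS = 10_000L; // cluster default

  /** The per-query option wins; otherwise fall back to the server-level config. */
  static long resolveTimeoutMs(Map<String, String> queryOptions, long serverTimeoutNanos) {
    String override = queryOptions.get("timeoutMs"); // per-query override
    if (override != null) {
      return Long.parseLong(override);
    }
    // One server config key is in nanoseconds, the other in milliseconds;
    // normalize the nanosecond one here.
    return serverTimeoutNanos > 0
        ? TimeUnit.NANOSECONDS.toMillis(serverTimeoutNanos)
        : DEFAULT_TIMEOUT_MS;
  }
}
```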

walterddr avatar Oct 11 '22 23:10 walterddr

Codecov Report

Merging #9571 (4afef37) into master (d50f9ee) will decrease coverage by 0.02%. The diff coverage is 84.37%.

@@             Coverage Diff              @@
##             master    #9571      +/-   ##
============================================
- Coverage     70.40%   70.38%   -0.03%     
+ Complexity     5012     4932      -80     
============================================
  Files          1973     1973              
  Lines        105815   105822       +7     
  Branches      16032    16033       +1     
============================================
- Hits          74503    74478      -25     
- Misses        26131    26162      +31     
- Partials       5181     5182       +1     
| Flag | Coverage Δ | |
| --- | --- | --- |
| integration1 | 25.10% <9.37%> (-0.04%) | :arrow_down: |
| integration2 | 24.87% <0.00%> (+0.19%) | :arrow_up: |
| unittests1 | 67.83% <85.71%> (-0.01%) | :arrow_down: |
| unittests2 | 15.81% <75.00%> (+<0.01%) | :arrow_up: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| ...rg/apache/pinot/query/service/QueryDispatcher.java | 84.37% <50.00%> (-0.32%) | :arrow_down: |
| ...requesthandler/MultiStageBrokerRequestHandler.java | 59.45% <75.00%> (+0.30%) | :arrow_up: |
| ...va/org/apache/pinot/query/runtime/QueryRunner.java | 86.74% <100.00%> (+0.16%) | :arrow_up: |
| ...query/runtime/operator/MailboxReceiveOperator.java | 91.52% <100.00%> (-0.15%) | :arrow_down: |
| .../pinot/query/runtime/plan/PhysicalPlanVisitor.java | 96.87% <100.00%> (ø) | |
| ...e/pinot/query/runtime/plan/PlanRequestContext.java | 93.75% <100.00%> (+0.89%) | :arrow_up: |
| ...t/query/runtime/plan/ServerRequestPlanVisitor.java | 79.20% <100.00%> (+0.63%) | :arrow_up: |
| .../runtime/plan/server/ServerPlanRequestContext.java | 100.00% <100.00%> (ø) | |
| ...va/org/apache/pinot/query/service/QueryServer.java | 76.08% <100.00%> (+0.53%) | :arrow_up: |
| ...ntroller/helix/core/minion/TaskMetricsEmitter.java | 34.88% <0.00%> (-51.17%) | :arrow_down: |
| ... and 31 more | | |


codecov-commenter avatar Oct 12 '22 00:10 codecov-commenter

Correct me if I am wrong: the queryTimeoutMs in query options only applies to the query server, and this one applies to the broker?

So the query one can be tuned per query while this one is tuned per table config? That feels a little weird to me.

61yao avatar Oct 13 '22 18:10 61yao

@61yao unfortunately there's no direct equivalence. The current query timeout logic is:

  1. The query timeout can be overridden, in this order of precedence:
     • user config inside the query payload
     • table config
     • cluster config
     • cluster default (10 sec)
  2. The query timeout clock starts at the beginning of query planning:
     • only the remainder of the timeout (i.e. total minus planning duration) is sent to the server as the server timeout.

From a user perspective I think the user-config override should still apply at the top level (from the user's view, a query runs for at most X seconds). However:

  1. There's no direct equivalent of the table-config override:
     • consider a join of 2 tables: which table's override should we apply?
     • do we apply it per stage or per query?
  2. If the user specifies a timeout for the entire query, how do we split it between stages?

Here is my proposal (a sketch follows the list):

  • The user-config override should always be treated as the total time (maybe minus planning time, explained below).
    • With this in mind, all the current timeout configurations during a query should also be treated as total time.
  • All stages get a fixed timeout at an absolute timestamp value = sys.currentTimeNano() + desiredTimeoutNano.
    • If any stage hits this timeout, they all agree to terminate.
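A rough Java sketch of that absolute-deadline idea (the class and method names here are made up for illustration, not the actual PR code):

```java
import java.util.concurrent.TimeUnit;

// Sketch of the absolute-deadline proposal above: every stage shares one
// deadline, so the timeout bounds the whole query rather than resetting
// per stage. Names are illustrative.
public final class StageDeadline {
  private final long _deadlineNanos;

  public StageDeadline(long timeoutMs) {
    _deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
  }

  /** Remaining budget, e.g. to forward to the next stage as its timeout. */
  public long remainingMs() {
    return TimeUnit.NANOSECONDS.toMillis(_deadlineNanos - System.nanoTime());
  }

  /** Any stage observing an expired deadline terminates; all stages agree. */
  public boolean isExpired() {
    return System.nanoTime() >= _deadlineNanos;
  }
}
```

One caveat worth noting: System.nanoTime() is only meaningful within a single JVM, so a deadline shared across servers would either have to be re-derived from a forwarded remaining budget at each hop or be expressed as a wall-clock epoch.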

walterddr avatar Oct 13 '22 19:10 walterddr

Maybe after the proposal is done, we can document this somewhere? Let me file an issue to track this.

61yao avatar Oct 13 '22 20:10 61yao

Do we agree that this is the end state?

The user specifies a query timeout T, which from the user's perspective is the end-to-end query timeout.

If Pinot does not finish processing the query within T, it should terminate execution (and ideally not leave any lingering execution resources).

In the current engine, T can be specified at the instance/cluster level (broker config), the query-option level, or the table level. The last one probably does not make sense for a multi-table query in the multi-stage engine.

The broker uses T (from whichever source it came), spends some time T' in query planning etc., and sends the remaining (T - T') to worker nodes when dispatching the plan. This is what the current engine does.

For the multi-stage engine as well, regardless of how many stages the plan is executed in, X = (T - T') is the max time the subsequent stages have.

If, for example, the leaf stage itself takes more time than X, it should terminate and interrupt its threads; downstream stages (parents) should detect that the timeout has expired, and the mailbox should not expect anything.

So the entire operator pipeline/chain gets terminated if X is exhausted anywhere (a sketch of the budget math follows below).
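A small sketch of that broker-side budget math, using the T / T' / X notation above (planQuery and dispatch are hypothetical stand-ins, not Pinot's actual API):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative broker-side budget computation following the T / T' / X
// notation above; planQuery(...) and dispatch(...) are stand-ins.
public final class BrokerBudgetExample {
  static void execute(String sql, long totalBudgetMs) throws TimeoutException {
    long planStartNanos = System.nanoTime();
    Object plan = planQuery(sql); // planning consumes T'
    long planningMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - planStartNanos);

    long remainingMs = totalBudgetMs - planningMs; // X = T - T'
    if (remainingMs <= 0) {
      throw new TimeoutException("Query timed out during planning");
    }
    dispatch(plan, remainingMs); // every stage shares the same X
  }

  private static Object planQuery(String sql) { return new Object(); }
  private static void dispatch(Object plan, long remainingMs) { /* no-op stub */ }
}
```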

siddharthteotia avatar Oct 14 '22 20:10 siddharthteotia

+1 on @siddharthteotia's comment.

Let's do the following:

  1. Ignore the table-level config for multi-stage (at least for now).
  2. Keep the rest exactly the same: make sure T is the e2e timeout by setting an absolute "must-finish-or-timeout" epoch that all processes check against (illustrated below).
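For illustration, a hypothetical operator-side check against that shared epoch (all names here are stand-ins, not the actual operator API):

```java
// Hypothetical operator that checks the shared "must-finish-or-timeout"
// epoch between blocks; the block-iteration methods are stand-ins.
public abstract class DeadlineAwareOperator {
  protected final long _deadlineNanos;

  protected DeadlineAwareOperator(long deadlineNanos) {
    _deadlineNanos = deadlineNanos;
  }

  /** Drains blocks, bailing out as soon as the shared deadline passes. */
  public final void run() {
    while (hasMoreBlocks()) {
      if (System.nanoTime() >= _deadlineNanos) {
        earlyTerminate("Timed out"); // every process checks the same epoch
        return;
      }
      processNextBlock();
    }
  }

  protected abstract boolean hasMoreBlocks();
  protected abstract void processNextBlock();
  protected abstract void earlyTerminate(String reason);
}
```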

walterddr avatar Oct 14 '22 22:10 walterddr

Readjusted the PR. PTAL.

walterddr avatar Nov 30 '22 00:11 walterddr