pinot
[multi-stage] allow configurable timeout
this enables configuring the timeout on a
- server config level, via the broker timeout or the query runner timeout
- either config key is fine; note that one is in nanoseconds and the other is in milliseconds
- or via query options
TODO: add a test for the query option override.
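Since one server config key is in nanoseconds and the other in milliseconds, the resolution logic has to normalize units before picking a winner. Below is a minimal sketch of such a precedence chain (query option first, then the millisecond key, then the nanosecond key); the key names and the `resolveTimeoutMs` helper are hypothetical illustrations, not the actual Pinot API:

```java
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class TimeoutResolver {
  // Hypothetical config key names, for illustration only.
  static final String QUERY_OPTION_TIMEOUT_MS = "timeoutMs";         // per-query override, millis
  static final String BROKER_TIMEOUT_MS_KEY = "broker.timeoutMs";    // server config, millis
  static final String RUNNER_TIMEOUT_NS_KEY = "runner.timeoutNanos"; // server config, nanos

  /** Resolve the effective timeout in millis: query option > broker config > runner config > default. */
  static long resolveTimeoutMs(Map<String, String> queryOptions, Map<String, String> serverConfig,
      long defaultTimeoutMs) {
    if (queryOptions.containsKey(QUERY_OPTION_TIMEOUT_MS)) {
      return Long.parseLong(queryOptions.get(QUERY_OPTION_TIMEOUT_MS));
    }
    if (serverConfig.containsKey(BROKER_TIMEOUT_MS_KEY)) {
      return Long.parseLong(serverConfig.get(BROKER_TIMEOUT_MS_KEY));
    }
    if (serverConfig.containsKey(RUNNER_TIMEOUT_NS_KEY)) {
      // Normalize the nanosecond key to millis so either config key is interchangeable.
      return TimeUnit.NANOSECONDS.toMillis(Long.parseLong(serverConfig.get(RUNNER_TIMEOUT_NS_KEY)));
    }
    return defaultTimeoutMs;
  }
}
```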
Codecov Report
Merging #9571 (4afef37) into master (d50f9ee) will decrease coverage by
0.02%. The diff coverage is 84.37%.
@@ Coverage Diff @@
## master #9571 +/- ##
============================================
- Coverage 70.40% 70.38% -0.03%
+ Complexity 5012 4932 -80
============================================
Files 1973 1973
Lines 105815 105822 +7
Branches 16032 16033 +1
============================================
- Hits 74503 74478 -25
- Misses 26131 26162 +31
- Partials 5181 5182 +1
| Flag | Coverage Δ | |
|---|---|---|
| integration1 | 25.10% <9.37%> (-0.04%) | :arrow_down: |
| integration2 | 24.87% <0.00%> (+0.19%) | :arrow_up: |
| unittests1 | 67.83% <85.71%> (-0.01%) | :arrow_down: |
| unittests2 | 15.81% <75.00%> (+<0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...rg/apache/pinot/query/service/QueryDispatcher.java | 84.37% <50.00%> (-0.32%) | :arrow_down: |
| ...requesthandler/MultiStageBrokerRequestHandler.java | 59.45% <75.00%> (+0.30%) | :arrow_up: |
| ...va/org/apache/pinot/query/runtime/QueryRunner.java | 86.74% <100.00%> (+0.16%) | :arrow_up: |
| ...query/runtime/operator/MailboxReceiveOperator.java | 91.52% <100.00%> (-0.15%) | :arrow_down: |
| .../pinot/query/runtime/plan/PhysicalPlanVisitor.java | 96.87% <100.00%> (ø) | |
| ...e/pinot/query/runtime/plan/PlanRequestContext.java | 93.75% <100.00%> (+0.89%) | :arrow_up: |
| ...t/query/runtime/plan/ServerRequestPlanVisitor.java | 79.20% <100.00%> (+0.63%) | :arrow_up: |
| .../runtime/plan/server/ServerPlanRequestContext.java | 100.00% <100.00%> (ø) | |
| ...va/org/apache/pinot/query/service/QueryServer.java | 76.08% <100.00%> (+0.53%) | :arrow_up: |
| ...ntroller/helix/core/minion/TaskMetricsEmitter.java | 34.88% <0.00%> (-51.17%) | :arrow_down: |
| ... and 31 more | | |
Correct me if I am wrong: the queryTimeoutMs query option only applies to the query server, and this one applies to the broker?
So the query option can be tuned per query, while this one is tuned via table config? That seems a little odd.
@61yao unfortunately there's no direct equivalence. The current query timeout logic is:
- the query timeout can be overridden, in this order:
  - user config inside the query payload
  - table config
  - cluster config
  - cluster default (10 sec)
- the query timeout clock starts at the beginning of query planning
- only the remaining part of the timeout (e.g. total minus planning duration) is sent to the server as the server timeout

from a user's perspective I think the user-config override should still apply at the top level (from the user's view, a query runs for at most X seconds), however:
- there's no direct equivalent of the table config override
  - consider a join of 2 tables: which table's override should we apply?
- do we apply it per stage or per query?
  - if the user specifies a timeout for the entire query, how do we split it between stages?

here is my proposal:
- the user-config override should always be treated as the total time (maybe minus planning time, I will explain later)
  - with this in mind, all current timeout configurations during the query should also be treated as total time
- all stages share a fixed timeout at an absolute timestamp value = sys.currentTimeNano() + desiredTimeoutNano
  - if any stage hits this timeout, they will all agree to terminate.
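The absolute-deadline idea above can be sketched as a small holder object that every stage consults; the class and method names here are illustrative, not the actual Pinot implementation (and note `System.nanoTime()` is only comparable within one JVM, so a cross-server deadline would need a wall-clock epoch instead):

```java
/**
 * Sketch of a shared absolute deadline: compute one "must-finish-or-timeout"
 * timestamp up front, and have every stage check against it so all stages
 * agree to terminate at the same instant.
 */
public class QueryDeadline {
  private final long deadlineNanos;

  QueryDeadline(long desiredTimeoutNanos) {
    // Fixed absolute timestamp: now + desired timeout.
    this.deadlineNanos = System.nanoTime() + desiredTimeoutNanos;
  }

  /** Remaining budget; zero or negative means the query has timed out. */
  long remainingNanos() {
    return deadlineNanos - System.nanoTime();
  }

  boolean expired() {
    return remainingNanos() <= 0;
  }
}
```

Usage: each operator would poll `expired()` between units of work and terminate (and propagate the timeout to its mailboxes) once it returns true.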
Maybe after the proposal is done, we can document this somewhere? Let me file an issue to track this.
Do we agree that this is the end state?
User specifies a query timeOut T which from user's perspective is the end to end query timeOut.
If Pinot does not finish processing the query within T, it should terminate execution (and ideally not leave any lingering execution resources)
In the current engine, T can be specified at the instance/cluster level (broker config), the query option level, or the table level. The last one probably does not make sense for a multi-table query in the multi-stage engine.
The broker uses T depending on where it comes from, spends some time T' in query planning etc., and sends the remaining (T - T') to the worker nodes when dispatching the plan. This is what the current engine does.
For the multi-stage engine as well, regardless of how many stages the plan is going to be executed in, X = (T - T') is the max time subsequent stages have.
If, for example, the leaf stage itself takes more time than X, then it should terminate / interrupt its threads, and downstream stages (parents) should detect that the timeout has expired; the mailbox should not expect anything.
So, the entire operator pipeline / chain gets terminated if X gets exhausted anywhere
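The "remaining budget" computation described above (the broker spends T' planning, workers get X = T - T') could be sketched as follows; the class and method names are hypothetical, for illustration only:

```java
public class RemainingTimeout {
  /**
   * Given the user-facing end-to-end timeout T (millis) and the time the broker
   * has already spent planning (T' = nowMs - planningStartMs), compute the
   * budget X = T - T' handed to the worker nodes at dispatch time.
   * Returns 0 when the budget is already exhausted, signalling an immediate timeout.
   */
  static long workerBudgetMs(long totalTimeoutMs, long planningStartMs, long nowMs) {
    long elapsedPlanningMs = nowMs - planningStartMs;
    return Math.max(0, totalTimeoutMs - elapsedPlanningMs);
  }
}
```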
+1 on @siddharthteotia 's comment.
let's do the following
- ignore the table level config for multi-stage (at least for now)
- the rest stays exactly the same: make sure T is the e2e timeout by setting an absolute "must-finish-or-timeout" timestamp that all processes check against
readjusted the PR. PTAL