[Bug] Resource Based Tuner is flaky - Does Tuner properly deal with sticky vs non-sticky pollers?
Describe the bug
This test from the TS SDK repo is very flaky (success rate of approx 50%):
test('Worker can run with resource based tuner', async (t) => {
const { createWorker, executeWorkflow } = helpers(t);
const resourceBasedTunerOptions: ResourceBasedTunerOptions = {
targetCpuUsage: 0.6,
targetMemoryUsage: 0.6,
};
const worker = await createWorker({
tuner: {
tunerOptions: resourceBasedTunerOptions,
activityTaskSlotOptions: {
minimumSlots: 2,
maximumSlots: 10,
rampThrottle: 20,
},
},
});
const result = await worker.runUntil(executeWorkflow(successString));
t.is(result, 'success');
});
When it doesn't succeed, this test causes a hang for 60 seconds, after which I see Poll wft timeout printed three times in console, and then a WFT get polled and executed. Describing the TQ while the Worker is stuck shows no poller.
I believe this is due to the fact that the Resource Based Tuner is not aware of sticky vs non-sticky pollers distribution, and therefore, that it is possible to get in a situation where only sticky pollers receives permits at start. Given the default sticky to non-sticky ratio of 0.2, and three pollers (not sure why three, isn't the default two?), we get a probability of polling at least one task from the non-sticky TQ of 48.8% (i.e.: (80%^0 * 20%) + (80%^1 * 20%) + (80%^2 * 20%)), matching the observations.
Any of the following changes fixes (or at least improve) the test:
- Increasing
minimumSlotsonworkflowTaskSlotOptions; - Setting
maxCachedWorkflowsto 0; - Setting
nonStickyToStickyPollRatioto0.9; - Removing
tunnerfrom the Worker options.
Side note
We should maybe consider never to allow more permits for sticky TQ than there are cached workflow executions. In general production environment, that should not make a difference, but in tests and at start time, that might help workers get up to speed more quickly.
Thanks @mjameswh , very useful info. Will incorporate as soon as I get a chance.