[Bug] Resource Based Tuner is flaky - Does Tuner properly deal with sticky vs non-sticky pollers?

Open mjameswh opened this issue 1 year ago • 1 comments

Describe the bug

This test from the TS SDK repo is very flaky (success rate of approx 50%):

test('Worker can run with resource based tuner', async (t) => {
  const { createWorker, executeWorkflow } = helpers(t);
  const resourceBasedTunerOptions: ResourceBasedTunerOptions = {
    targetCpuUsage: 0.6,
    targetMemoryUsage: 0.6,
  };
  const worker = await createWorker({
    tuner: {
      tunerOptions: resourceBasedTunerOptions,
      activityTaskSlotOptions: {
        minimumSlots: 2,
        maximumSlots: 10,
        rampThrottle: 20,
      },
    },
  });
  const result = await worker.runUntil(executeWorkflow(successString));
  t.is(result, 'success');
});

When it doesn't succeed, this test causes a hang for 60 seconds, after which I see Poll wft timeout printed three times in console, and then a WFT get polled and executed. Describing the TQ while the Worker is stuck shows no poller.

I believe this is due to the fact that the Resource Based Tuner is not aware of sticky vs non-sticky pollers distribution, and therefore, that it is possible to get in a situation where only sticky pollers receives permits at start. Given the default sticky to non-sticky ratio of 0.2, and three pollers (not sure why three, isn't the default two?), we get a probability of polling at least one task from the non-sticky TQ of 48.8% (i.e.: (80%^0 * 20%) + (80%^1 * 20%) + (80%^2 * 20%)), matching the observations.

Any of the following changes fixes (or at least improve) the test:

Increasing minimumSlots on workflowTaskSlotOptions;
Setting maxCachedWorkflows to 0;
Setting nonStickyToStickyPollRatio to 0.9;
Removing tunner from the Worker options.

Side note

We should maybe consider never to allow more permits for sticky TQ than there are cached workflow executions. In general production environment, that should not make a difference, but in tests and at start time, that might help workers get up to speed more quickly.

Jul 09 '24 21:07 mjameswh

Thanks @mjameswh , very useful info. Will incorporate as soon as I get a chance.

Jul 10 '24 16:07 Sushisource