
[Bug] Got the response to undefined request

Open d-pylypenko opened this issue 2 years ago • 22 comments

Describe the bug

Sometimes when a child worker process throws an exception, the parent worker process throws the following panic error:

PanicError: flush queue: SoftJobError:
	codec_execute:
	sync_worker_exec:
	sync_worker_exec_payload: LogicException: Got the response to undefined request 10389 in /srv/vendor/temporal/sdk/src/Internal/Transport/Client.php:60

and after it:

PanicError: unknown command CommandType: ChildWorkflow, ID: edfb1479-3d88-407e-a428-7e304e0d7bdf, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition 
[Screenshot: 2022-05-26 at 20:57:30]

Environment/Versions

  • Temporal version: 1.16.2, SDK version: 1.2.0
  • We use Kubernetes

Additional context

We tried to scale the pods so that they could be spread across different zones for fault tolerance. Maybe that is causing these problems.

d-pylypenko avatar May 26 '22 18:05 d-pylypenko

Hey @dmitry-pilipenko 👋🏻. I guess the problem lies in the workers' restarts: SoftJobError indicates an error that leads to a (PHP) process restart. I expect the fix will be released next week.

rustatian avatar May 26 '22 18:05 rustatian

@rustatian Can I do something about it now? This is happening in production now :(

d-pylypenko avatar May 26 '22 18:05 d-pylypenko

@dmitry-pilipenko, I'm not sure what the initial cause of this error is. Could you please turn on debug logging and send me the log file? Especially the entries just before and after this error.

rustatian avatar May 26 '22 18:05 rustatian
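
For context, debug output in RoadRunner is controlled by the `logs` section of `.rr.yaml`. A minimal sketch, with key names as documented for RR v2:

    logs:
      mode: development   # verbose, human-readable output
      level: debug        # emit debug-level entries around the failure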

@dmitry-pilipenko Thank you. Could you please update your RR version? You are using an unsupported one (v2.7.4); try v2.10.2.

rustatian avatar May 26 '22 19:05 rustatian

> @dmitry-pilipenko Thank you. Could you please update your RR version? You are using an unsupported one (v2.7.4); try v2.10.2.

@rustatian The RR update fixed it. Your quick response helped me a lot, thank you!

d-pylypenko avatar May 30 '22 10:05 d-pylypenko

@rustatian this problem is still occurring, but now in cases unknown to me.

Versions: RR 2.10.2, Temporal 1.16.2, SDK 1.3.2

It is probably because we use a wait with a timeout and then throw a custom exception. Example:

        // Wait up to 30 minutes for the answer to arrive.
        yield Temporal::awaitWithTimeout(
            $interval = CarbonInterval::minutes(30),
            fn () => $this->answer !== null
        );
        // The condition never became true, so the timeout elapsed.
        if ($this->answer === null) {
            throw new ReplyTimeout($interval);
        }

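For context, a minimal sketch of how this pattern sits inside a workflow class, written against the official SDK API (`Workflow::awaitWithTimeout`); the `ReplyWorkflow` name, the signal handler, and the mapping from the project's `Temporal::` facade are assumptions based on the snippet above:

    use Carbon\CarbonInterval;
    use Temporal\Workflow;

    #[Workflow\WorkflowInterface]
    class ReplyWorkflow // hypothetical name
    {
        private ?string $answer = null;

        #[Workflow\SignalMethod]
        public function answer(string $answer): void
        {
            $this->answer = $answer;
        }

        #[Workflow\WorkflowMethod]
        public function run()
        {
            // Resolves when the condition becomes true or when the timer
            // fires, whichever happens first.
            yield Workflow::awaitWithTimeout(
                $interval = CarbonInterval::minutes(30),
                fn () => $this->answer !== null
            );

            // Timer fired without a signal: fail the child workflow.
            if ($this->answer === null) {
                throw new ReplyTimeout($interval); // project-specific exception
            }

            return $this->answer;
        }
    }
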
Trace:

PanicError: sync_worker_exec: SoftJobError:
	sync_worker_exec_payload: LogicException: Got the response to undefined request 12445 in /srv/vendor/temporal/sdk/src/Internal/Transport/Client.php:60
Stack trace:
#0 /srv/vendor/temporal/sdk/src/WorkerFactory.php(389): Temporal\Internal\Transport\Client->dispatch()
#1 /srv/vendor/temporal/sdk/src/WorkerFactory.php(261): Temporal\WorkerFactory->dispatch()
#2 /srv/src/Infrastructure/CLI/TemporalWorker.php(67): Temporal\WorkerFactory->run()
#3 /srv/vendor/symfony/console/Command/Command.php(308): App\Infrastructure\CLI\TemporalWorker->execute()
#4 /srv/vendor/symfony/console/Application.php(989): Symfony\Component\Console\Command\Command->run()
#5 /srv/vendor/symfony/console/Application.php(299): Symfony\Component\Console\Application->doRunCommand()
#6 /srv/vendor/symfony/console/Application.php(171): Symfony\Component\Console\Application->doRun()
#7 /srv/vendor/helpcrunch/foundation/src/Runtime/Handler.php(29): Symfony\Component\Console\Application->run()
#8 /srv/vendor/helpcrunch/foundation/src/Runtime/Runner.php(34): Helpcrunch\Foundation\Runtime\Handler->__invoke()
#9 /srv/vendor/autoload_runtime.php(29): Helpcrunch\Foundation\Runtime\Runner->run()
#10 /srv/bin/app(11): require('...')
#11 {main} 
process event for default [panic]:
github.com/temporalio/roadrunner-temporal/aggregatedpool.(*Workflow).OnWorkflowTaskStarted(0xc0007a7b30, 0xc00065ba08?)
	github.com/temporalio/[email protected]/aggregatedpool/workflow.go:153 +0x2e8
go.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc000835c98, 0xc001a4db80, 0x0?, 0x1)
	go.temporal.io/[email protected]/internal/internal_event_handlers.go:815 +0x203
go.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc00079f960, 0xc001b1df50)
	go.temporal.io/[email protected]/internal/internal_task_handlers.go:878 +0xca8
go.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc0006c0210, 0xc001b1df50, 0xc000572300)
	go.temporal.io/[email protected]/internal/internal_task_handlers.go:727 +0x485
go.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc0001131e0, 0xc001b1df50)
	go.temporal.io/[email protected]/internal/internal_task_pollers.go:284 +0x2cd
go.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc0001131e0, {0x15e0ae0?, 0xc001b1df50?})
	go.temporal.io/[email protected]/internal/internal_task_pollers.go:255 +0x6c
go.temporal.io/sdk/internal.(*baseWorker).processTask(0xc000170500, {0x15e06a0?, 0xc0007b8e40})
	go.temporal.io/[email protected]/internal/internal_worker_base.go:398 +0x167
created by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher
	go.temporal.io/[email protected]/internal/internal_worker_base.go:302 +0xb5

Log file: wf-default-5794999ddc-r4h94.log

d-pylypenko avatar Jun 15 '22 17:06 d-pylypenko
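
For orientation, frames #0–#2 of the PHP trace above go through the standard worker bootstrap, which looks roughly like this (a sketch based on the SDK README; the registered class is a placeholder):

    use Temporal\WorkerFactory;

    require __DIR__ . '/vendor/autoload.php';

    // The factory communicates with RoadRunner over its relay pipes.
    $factory = WorkerFactory::create();

    // Register workflow types on the task queue that RR polls.
    $worker = $factory->newWorker('default');
    $worker->registerWorkflowTypes(ReplyWorkflow::class); // placeholder

    // Blocks and dispatches incoming tasks (frames #1 and #0 above).
    $factory->run();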

Did you update RR to v2.10.4?

rustatian avatar Jun 15 '22 17:06 rustatian

Yes, I have updated my last comment and added more details.

d-pylypenko avatar Jun 15 '22 17:06 d-pylypenko

> Yes, I have updated my last comment and added more details.

You didn't update RR: according to the stack trace, you are using temporal plugin v1.4.1, but in RR 2.10.4 it was updated to v1.4.7.

rustatian avatar Jun 15 '22 17:06 rustatian

@rustatian my current RR version is 2.10.2. If I update to 2.10.4 will my problem go away?

d-pylypenko avatar Jun 15 '22 17:06 d-pylypenko

> @rustatian my current RR version is 2.10.2. If I update to 2.10.4 will my problem go away?

Yes, we fixed this problem in the latest version.

rustatian avatar Jun 15 '22 18:06 rustatian

@rustatian the update didn't help :( [Screenshot: 2022-06-16 at 11:58:05]

wf-default-7f9b7d986b-5bqtw.log

d-pylypenko avatar Jun 16 '22 08:06 d-pylypenko

@dmitry-pilipenko Please attach a complete sample (a link to your repo is preferable) in the description so we can reproduce your issue.

like this one for example: https://github.com/Torrion/temporal-worker-pool-leak-test

rustatian avatar Jun 16 '22 10:06 rustatian

@rustatian, I can provide a child workflow that has problems: https://github.com/helpcrunch/temporal/blob/main/workflows.php

d-pylypenko avatar Jun 16 '22 12:06 d-pylypenko

> @rustatian, I can provide a child workflow that has problems: https://github.com/helpcrunch/temporal/blob/main/workflows.php

Please remove the unneeded parts from your code and provide a minimal example that runs with rr, as in the sample I linked. The minimal example should reproduce the bug and contain everything needed to run it (your .rr.yaml should also be included).

rustatian avatar Jun 16 '22 12:06 rustatian
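
For reference, a minimal `.rr.yaml` for a Temporal worker looks roughly like this (a sketch following the roadrunner-temporal samples; the command and address are placeholders):

    rpc:
      listen: tcp://127.0.0.1:6001

    server:
      command: "php worker.php"    # placeholder bootstrap script

    temporal:
      address: "temporal:7233"     # placeholder server address
      activities:
        num_workers: 4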

@dmitry-pilipenko Friendly ping 😃

rustatian avatar Jun 24 '22 11:06 rustatian

@rustatian I haven't been able to reproduce it yet. Our workflows are generated dynamically from user settings, and I'm trying to find what triggers the problem; I haven't managed to do so locally yet. One of the main differences: locally I deploy using docker-compose, and in production using k8s.

The problem appears only for flows where we wait for a signal with a timeout. Do you have any hypotheses that might help?

        yield Temporal::awaitWithTimeout(
            $interval = CarbonInterval::minutes(30),
            fn () => $this->answer !== null
        );
        if ($this->answer === null) {
            throw new ReplyTimeout($interval);
        }
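
The snippet runs inside a child workflow; a minimal sketch of how a parent might start it, using the official SDK API (`Workflow::newChildWorkflowStub`; class names are placeholders):

    use Temporal\Workflow;

    #[Workflow\WorkflowInterface]
    class ParentWorkflow // hypothetical name
    {
        #[Workflow\WorkflowMethod]
        public function run()
        {
            // Starting a child records a ChildWorkflow command in history
            // (the command type the non-determinism panic complains about).
            $child = Workflow::newChildWorkflowStub(ReplyWorkflow::class);

            // ReplyTimeout thrown by the child propagates to the parent here.
            return yield $child->run();
        }
    }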

As noted, it executes in a child workflow. After that, the parent workflow gets this error:

{
    "eventTime": "2022-07-01T10:17:00.000Z",
    "eventType": "WorkflowTaskStarted",
    "eventId": "26",
    "details": {
      "scheduledEventId": "25",
      "identity": "default:3bfa8009-a174-424b-a20c-2eb5f01c93e8",
      "requestId": "3bb264c4-76ac-4962-a6be-8d128650c38c",
      "eventId": "26",
      "eventType": "WorkflowTaskStarted",
      "kvps": [
        {
          "key": "eventTime",
          "value": "Jul 1st 1:17:00 pm"
        },
        {
          "key": "eventId",
          "value": "26"
        },
        {
          "key": "scheduledEventId",
          "value": "25"
        },
        {
          "key": "identity",
          "value": "default:3bfa8009-a174-424b-a20c-2eb5f01c93e8"
        },
        {
          "key": "requestId",
          "value": "3bb264c4-76ac-4962-a6be-8d128650c38c"
        }
      ],
      "eventTime": "Jul 1st 1:17:00 pm"
    },
    "eventTimeDisplay": "Jul 1st 1:17:00 pm",
    "timeElapsedDisplay": "15s",
    "eventSummary": {
      "requestId": "3bb264c4-76ac-4962-a6be-8d128650c38c",
      "eventId": "26",
      "eventType": "WorkflowTaskStarted",
      "kvps": [
        {
          "key": "requestId",
          "value": "3bb264c4-76ac-4962-a6be-8d128650c38c"
        }
      ]
    },
    "eventFullDetails": {
      "scheduledEventId": "25",
      "identity": "default:3bfa8009-a174-424b-a20c-2eb5f01c93e8",
      "requestId": "3bb264c4-76ac-4962-a6be-8d128650c38c",
      "eventId": "26",
      "eventType": "WorkflowTaskStarted",
      "kvps": [
        {
          "key": "scheduledEventId",
          "value": "25"
        },
        {
          "key": "identity",
          "value": "default:3bfa8009-a174-424b-a20c-2eb5f01c93e8"
        },
        {
          "key": "requestId",
          "value": "3bb264c4-76ac-4962-a6be-8d128650c38c"
        }
      ]
    }
  },
  {
    "eventTime": "2022-07-01T10:17:00.000Z",
    "eventType": "WorkflowTaskFailed",
    "eventId": "27",
    "details": {
      "scheduledEventId": "25",
      "startedEventId": "26",
      "cause": "WORKFLOW_TASK_FAILED_CAUSE_NON_DETERMINISTIC_ERROR",
      "failure": {
        "message": "unknown command CommandType: ChildWorkflow, ID: 31ab4b00-a18a-44e8-851a-baf9de182600, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition",
        "source": "GoSDK",
        "stackTrace": "process event for default [panic]:\ngo.temporal.io/sdk/internal.panicIllegalState(...)\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:409\ngo.temporal.io/sdk/internal.(*commandsHelper).getCommand(0x8?, {0x3?, {0xc000bc2d50?, 0x0?}})\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:881 +0x109\ngo.temporal.io/sdk/internal.(*commandsHelper).handleStartChildWorkflowExecutionInitiated(0x7f7cc1a01f18?, {0xc000bc2d50?, 0xc000196000?})\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:1124 +0x29\ngo.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc001373770, 0xc00156aa00, 0xd8?, 0x0)\n\tgo.temporal.io/[email protected]/internal/internal_event_handlers.go:905 +0x6ae\ngo.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0007f9080, 0xc000559680)\n\tgo.temporal.io/[email protected]/internal/internal_task_handlers.go:902 +0xd68\ngo.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000a40c60, 0xc000559680, 0xc000702510)\n\tgo.temporal.io/[email protected]/internal/internal_task_handlers.go:749 +0x485\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc0005b3a00, 0xc000559680)\n\tgo.temporal.io/[email protected]/internal/internal_task_pollers.go:284 +0x2cd\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc0005b3a00, {0x16063c0?, 0xc000559680?})\n\tgo.temporal.io/[email protected]/internal/internal_task_pollers.go:255 +0x6c\ngo.temporal.io/sdk/internal.(*baseWorker).processTask(0xc00067e8c0, {0x1605f80?, 0xc00047f9c0})\n\tgo.temporal.io/[email protected]/internal/internal_worker_base.go:400 +0x167\ncreated by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher\n\tgo.temporal.io/[email protected]/internal/internal_worker_base.go:305 +0xb5",
        "cause": null,
        "applicationFailureInfo": {
          "type": "PanicError",
          "nonRetryable": true,
          "details": null
        },
        "failureInfo": "applicationFailureInfo"
      },
      "identity": "default:3bfa8009-a174-424b-a20c-2eb5f01c93e8",
      "baseRunId": "",
      "newRunId": "",
      "forkEventVersion": "0",
      "binaryChecksum": "23bc61c98cc56611c0691d0c4fd23834",
      "eventId": "27",
      "eventType": "WorkflowTaskFailed",
      "kvps": [
        {
          "key": "eventTime",
          "value": "Jul 1st 1:17:00 pm"
        },
        {
          "key": "eventId",
          "value": "27"
        },
        {
          "key": "scheduledEventId",
          "value": "25"
        },
        {
          "key": "startedEventId",
          "value": "26"
        },
        {
          "key": "cause",
          "value": "WORKFLOW_TASK_FAILED_CAUSE_NON_DETERMINISTIC_ERROR"
        },
        {
          "key": "failure",
          "value": "PanicError: unknown command CommandType: ChildWorkflow, ID: 31ab4b00-a18a-44e8-851a-baf9de182600, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition \nprocess event for default [panic]:\ngo.temporal.io/sdk/internal.panicIllegalState(...)\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:409\ngo.temporal.io/sdk/internal.(*commandsHelper).getCommand(0x8?, {0x3?, {0xc000bc2d50?, 0x0?}})\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:881 +0x109\ngo.temporal.io/sdk/internal.(*commandsHelper).handleStartChildWorkflowExecutionInitiated(0x7f7cc1a01f18?, {0xc000bc2d50?, 0xc000196000?})\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:1124 +0x29\ngo.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc001373770, 0xc00156aa00, 0xd8?, 0x0)\n\tgo.temporal.io/[email protected]/internal/internal_event_handlers.go:905 +0x6ae\ngo.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0007f9080, 0xc000559680)\n\tgo.temporal.io/[email protected]/internal/internal_task_handlers.go:902 +0xd68\ngo.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000a40c60, 0xc000559680, 0xc000702510)\n\tgo.temporal.io/[email protected]/internal/internal_task_handlers.go:749 +0x485\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc0005b3a00, 0xc000559680)\n\tgo.temporal.io/[email protected]/internal/internal_task_pollers.go:284 +0x2cd\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc0005b3a00, {0x16063c0?, 0xc000559680?})\n\tgo.temporal.io/[email protected]/internal/internal_task_pollers.go:255 +0x6c\ngo.temporal.io/sdk/internal.(*baseWorker).processTask(0xc00067e8c0, {0x1605f80?, 0xc00047f9c0})\n\tgo.temporal.io/[email protected]/internal/internal_worker_base.go:400 +0x167\ncreated by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher\n\tgo.temporal.io/[email protected]/internal/internal_worker_base.go:305 +0xb5"
        },
        {
          "key": "identity",
          "value": "default:3bfa8009-a174-424b-a20c-2eb5f01c93e8"
        },
        {
          "key": "baseRunId",
          "value": ""
        },
        {
          "key": "newRunId",
          "value": ""
        },
        {
          "key": "forkEventVersion",
          "value": "0"
        },
        {
          "key": "binaryChecksum",
          "value": "23bc61c98cc56611c0691d0c4fd23834"
        }
      ],
      "eventTime": "Jul 1st 1:17:00 pm"
    },
    "eventTimeDisplay": "Jul 1st 1:17:00 pm",
    "timeElapsedDisplay": "15s",
    "eventSummary": {
      "message": "unknown command CommandType: ChildWorkflow, ID: 31ab4b00-a18a-44e8-851a-baf9de182600, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition",
      "eventId": "27",
      "eventType": "WorkflowTaskFailed",
      "kvps": [
        {
          "key": "message",
          "value": "unknown command CommandType: ChildWorkflow, ID: 31ab4b00-a18a-44e8-851a-baf9de182600, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition"
        }
      ]
    },
    "eventFullDetails": {
      "scheduledEventId": "25",
      "startedEventId": "26",
      "cause": "WORKFLOW_TASK_FAILED_CAUSE_NON_DETERMINISTIC_ERROR",
      "failure": {
        "message": "unknown command CommandType: ChildWorkflow, ID: 31ab4b00-a18a-44e8-851a-baf9de182600, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition",
        "source": "GoSDK",
        "stackTrace": "process event for default [panic]:\ngo.temporal.io/sdk/internal.panicIllegalState(...)\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:409\ngo.temporal.io/sdk/internal.(*commandsHelper).getCommand(0x8?, {0x3?, {0xc000bc2d50?, 0x0?}})\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:881 +0x109\ngo.temporal.io/sdk/internal.(*commandsHelper).handleStartChildWorkflowExecutionInitiated(0x7f7cc1a01f18?, {0xc000bc2d50?, 0xc000196000?})\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:1124 +0x29\ngo.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc001373770, 0xc00156aa00, 0xd8?, 0x0)\n\tgo.temporal.io/[email protected]/internal/internal_event_handlers.go:905 +0x6ae\ngo.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0007f9080, 0xc000559680)\n\tgo.temporal.io/[email protected]/internal/internal_task_handlers.go:902 +0xd68\ngo.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000a40c60, 0xc000559680, 0xc000702510)\n\tgo.temporal.io/[email protected]/internal/internal_task_handlers.go:749 +0x485\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc0005b3a00, 0xc000559680)\n\tgo.temporal.io/[email protected]/internal/internal_task_pollers.go:284 +0x2cd\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc0005b3a00, {0x16063c0?, 0xc000559680?})\n\tgo.temporal.io/[email protected]/internal/internal_task_pollers.go:255 +0x6c\ngo.temporal.io/sdk/internal.(*baseWorker).processTask(0xc00067e8c0, {0x1605f80?, 0xc00047f9c0})\n\tgo.temporal.io/[email protected]/internal/internal_worker_base.go:400 +0x167\ncreated by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher\n\tgo.temporal.io/[email protected]/internal/internal_worker_base.go:305 +0xb5",
        "cause": null,
        "applicationFailureInfo": {
          "type": "PanicError",
          "nonRetryable": true,
          "details": null
        },
        "failureInfo": "applicationFailureInfo"
      },
      "identity": "default:3bfa8009-a174-424b-a20c-2eb5f01c93e8",
      "baseRunId": "",
      "newRunId": "",
      "forkEventVersion": "0",
      "binaryChecksum": "23bc61c98cc56611c0691d0c4fd23834",
      "eventId": "27",
      "eventType": "WorkflowTaskFailed",
      "kvps": [
        {
          "key": "scheduledEventId",
          "value": "25"
        },
        {
          "key": "startedEventId",
          "value": "26"
        },
        {
          "key": "cause",
          "value": "WORKFLOW_TASK_FAILED_CAUSE_NON_DETERMINISTIC_ERROR"
        },
        {
          "key": "failure",
          "value": "PanicError: unknown command CommandType: ChildWorkflow, ID: 31ab4b00-a18a-44e8-851a-baf9de182600, possible causes are nondeterministic workflow definition code or incompatible change in the workflow definition \nprocess event for default [panic]:\ngo.temporal.io/sdk/internal.panicIllegalState(...)\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:409\ngo.temporal.io/sdk/internal.(*commandsHelper).getCommand(0x8?, {0x3?, {0xc000bc2d50?, 0x0?}})\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:881 +0x109\ngo.temporal.io/sdk/internal.(*commandsHelper).handleStartChildWorkflowExecutionInitiated(0x7f7cc1a01f18?, {0xc000bc2d50?, 0xc000196000?})\n\tgo.temporal.io/[email protected]/internal/internal_decision_state_machine.go:1124 +0x29\ngo.temporal.io/sdk/internal.(*workflowExecutionEventHandlerImpl).ProcessEvent(0xc001373770, 0xc00156aa00, 0xd8?, 0x0)\n\tgo.temporal.io/[email protected]/internal/internal_event_handlers.go:905 +0x6ae\ngo.temporal.io/sdk/internal.(*workflowExecutionContextImpl).ProcessWorkflowTask(0xc0007f9080, 0xc000559680)\n\tgo.temporal.io/[email protected]/internal/internal_task_handlers.go:902 +0xd68\ngo.temporal.io/sdk/internal.(*workflowTaskHandlerImpl).ProcessWorkflowTask(0xc000a40c60, 0xc000559680, 0xc000702510)\n\tgo.temporal.io/[email protected]/internal/internal_task_handlers.go:749 +0x485\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).processWorkflowTask(0xc0005b3a00, 0xc000559680)\n\tgo.temporal.io/[email protected]/internal/internal_task_pollers.go:284 +0x2cd\ngo.temporal.io/sdk/internal.(*workflowTaskPoller).ProcessTask(0xc0005b3a00, {0x16063c0?, 0xc000559680?})\n\tgo.temporal.io/[email protected]/internal/internal_task_pollers.go:255 +0x6c\ngo.temporal.io/sdk/internal.(*baseWorker).processTask(0xc00067e8c0, {0x1605f80?, 0xc00047f9c0})\n\tgo.temporal.io/[email protected]/internal/internal_worker_base.go:400 +0x167\ncreated by go.temporal.io/sdk/internal.(*baseWorker).runTaskDispatcher\n\tgo.temporal.io/[email protected]/internal/internal_worker_base.go:305 +0xb5"
        },
        {
          "key": "identity",
          "value": "default:3bfa8009-a174-424b-a20c-2eb5f01c93e8"
        },
        {
          "key": "baseRunId",
          "value": ""
        },
        {
          "key": "newRunId",
          "value": ""
        },
        {
          "key": "forkEventVersion",
          "value": "0"
        },
        {
          "key": "binaryChecksum",
          "value": "23bc61c98cc56611c0691d0c4fd23834"
        }
      ]
    }
  }

d-pylypenko avatar Jul 01 '22 13:07 d-pylypenko

@dmitry-pilipenko 👋🏻 Could it be that you use different RR versions locally and in k8s? We had this issue in past versions.

rustatian avatar Jul 01 '22 13:07 rustatian

> @dmitry-pilipenko 👋🏻 Could it be that you use different RR versions locally and in k8s? We had this issue in past versions.

@rustatian versions are completely identical. k8s: [Screenshot: 2022-07-01 at 16:21:29] [Screenshot: 2022-07-01 at 16:23:06]

docker-compose: [Screenshot: 2022-07-01 at 16:25:04] [Screenshot: 2022-07-01 at 16:25:31]

d-pylypenko avatar Jul 01 '22 13:07 d-pylypenko

@rustatian Now I found a case where there was no await with a timeout in the flow, but it still caused the problem. I exported the workflow logs from the admin: c0818c95-9669-4de5-ab5e-855e2de2f2d8 - e121f3e0-3df8-4e25-b0c0-d0e5de289955.json.zip

d-pylypenko avatar Jul 01 '22 13:07 d-pylypenko

@dmitry-pilipenko Thanks for the logs, but to help you we need to reproduce this issue. Please, as I suggested earlier, create a repository with a reproducible sample that includes your .rr.yaml and a minimal sample app. It can run either in Docker or with rr serve.

rustatian avatar Jul 01 '22 13:07 rustatian

IDK if this may be of help, but I've experienced the same issue twice; both times our pods were lacking available memory. IDK how this happens or whether it would be the same for you.

Zylius avatar Oct 20 '22 10:10 Zylius

> IDK if this may be of help, but I've experienced the same issue twice; both times our pods were lacking available memory. IDK how this happens or whether it would be the same for you.

Do you have a supervisor in your RR configuration?

rustatian avatar Oct 20 '22 10:10 rustatian

Nope, we probably should, but after increasing memory everything is stable, handling 4+ million workflows a day for half a year now. I'll enable it when I get the chance.

Zylius avatar Oct 20 '22 10:10 Zylius

It's not that PHP or RR was leaking memory; we mistakenly set the memory limit too low, constraining the pod.

Zylius avatar Oct 20 '22 10:10 Zylius
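
For reference, a pool supervisor sets per-worker resource limits; a sketch using the option names documented for RR v2 pools (their placement under the temporal plugin's `activities` pool is an assumption, as are the limits shown):

    temporal:
      activities:
        num_workers: 4
        supervisor:
          watch_tick: 1s           # how often the limits are checked
          max_worker_memory: 128   # MB; worker is restarted above this
          exec_ttl: 60s            # hard cap on a single job's execution

With limits like these, RR should restart an over-limit worker itself rather than leaving the kubelet to OOM-kill the whole pod mid-task.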

> Nope, we probably should, but after increasing memory everything is stable, handling 4+ million workflows a day for half a year now. I'll enable it when I get the chance.

wow, those are big numbers 😮

Could it be that the OOM killer is killing the workflow worker?

rustatian avatar Oct 20 '22 10:10 rustatian

Yeah, it should kill the whole pod because of OOM, but before that it gets into a weird state with the undefined request. I haven't investigated it enough to reproduce it :(

Zylius avatar Oct 20 '22 10:10 Zylius

Ok, thanks. Please keep us updated; if we can reproduce this weird issue, we will fix it ASAP.

rustatian avatar Oct 20 '22 10:10 rustatian

I'll try to reproduce it with OOM when I get some free time :D 🙏

Zylius avatar Oct 20 '22 10:10 Zylius