sdk-python icon indicating copy to clipboard operation
sdk-python copied to clipboard

[Bug] Workflow sandbox issues with Protobuf

Open Gr1N opened this issue 1 year ago • 12 comments

What are you really trying to do?

Hey!

I'm doing nothing special. I have one workflow and one activity. I use Protobuf messages as input parameters for both cases. I just expect to be able to execute the workflow and wait for it to be completed.

Describe the bug

The issue happens only in the Kubernetes environment.

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 436, in from_payload
    return google.protobuf.json_format.Parse(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 421, in Parse
    return ParseDict(js, message, ignore_unknown_fields, descriptor_pool,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 446, in ParseDict
    parser.ConvertMessage(js_dict, message, '')
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 487, in ConvertMessage
    self._ConvertFieldValuePair(value, message, path)
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 563, in _ConvertFieldValuePair
    if _IsMapEntry(field):
       ^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py\", line 151, in _IsMapEntry
    field.message_type.GetOptions().map_entry)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"<frozen importlib._bootstrap>\", line 1176, in _find_and_load
  File \"<frozen importlib._bootstrap>\", line 1130, in _find_and_load_unlocked
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/workflow_sandbox/_importer.py\", line 393, in __getitem__
    return self.current[key]
           ~~~~~~~~~~~~^^^^^
KeyError: 'google.protobuf'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 306, in from_payloads
    values.append(converter.from_payload(payload, type_hint))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 442, in from_payload
    raise RuntimeError(f\"Unknown Protobuf type {message_type}\") from err
RuntimeError: Unknown Protobuf type internal.path.to.protobuf.TaskRequest

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 1625, in _convert_payloads
    return self._payload_converter.from_payloads(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/converter.py\", line 308, in from_payloads
    raise RuntimeError(
RuntimeError: Payload at index 0 with encoding json/protobuf could not be converted

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 364, in activate
    self._workflow_input = self._make_workflow_input(start_job)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 896, in _make_workflow_input
    args = self._convert_payloads(start_job.arguments, arg_types)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File \"/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py\", line 1633, in _convert_payloads
    raise RuntimeError(\"Failed decoding arguments\") from err

Minimal Reproduction

Unfortunately, I don't have steps for reproduction. Everything works perfectly in my and my colleague's local environments.

I would love to hear ideas or proposals on what to look into to understand the root cause. The only thing I can state is that if I add the following import, everything works:

with workflow.unsafe.imports_passed_through():
    import google.protobuf

My other Protobuf imports (the result of code generation from Protobuf definitions) are already under unsafe context manager.

Environment/Versions

  • OS and processor: Linux
  • Temporal Version: 1.8.0
  • Yes, Docker and Kubernetes

Additional context

Gr1N avatar Nov 12 '24 16:11 Gr1N

Unfortunately, I don't have steps for reproduction. Everything works perfectly in my and my colleague's local environments.

Ideally we can replicate so we can help debug the problem. It is interesting that the exact same Python code does different things in different environments. It may be a protobuf version issue or an order of operations issue, but a standalone replication may be needed for us to confirm the issue.

The only thing I can state is that if I add the following import, everything works:

Looking at the traceback, it's a bit of an interesting case where we catch KeyError because that's how we're checking a not-found protobuf, but this is actually having an import-level key error for importing google.protobuf. I wonder if it is lazily importing google.protobuf inside of GetOptions there somehow.

I can't really tell the problem without a replication. Can you confirm exact protobuf version (both library and protoc generator) and continually reduce the k8s-only form of replication until it's very small and standalone?

cretz avatar Nov 12 '24 17:11 cretz

Here are my versions:

protobuf 4.25.3
libprotoc 3.19.6

Here is an example of a workflow file:

from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    from temporalio.v1.lorem_pb2 import (
        LoremWorkflowRequest,
        LoremWorkflowResponse,
    )


@workflow.defn(name="lorem-workflow")
class LoremWorkflow:
    @workflow.run
    async def run(self, req: LoremWorkflowRequest) -> LoremWorkflowResponse:
        return LoremWorkflowResponse(text="i am response")

Could it be the issue with my Protobuf definitions and the fact that I use maps (but again, on local, all good)?

message LoremWorkflowRequest {
  map<string, bool> random = 1;
}

Gr1N avatar Nov 12 '24 17:11 Gr1N

Here is an example of a workflow file:

To confirm, the exact same code from Python start to end fails in one of your environments but not another? May need to see the full standalone code (i.e. how you are registering/running worker). There may be an operations order issue.

Could it be the issue with my Protobuf definitions and the fact that I use maps (but again, on local, all good)?

I cannot know to be honest without replicating and debugging. The code you show above should work fine in every environment. I wonder if there are Python version differences or something else in the one environment where this happens that could help us figure out how to reliably replicate?

I think the main goal is to confirm the exact same small code that fails in one environment but works in another. Even something as simple as the order something is imported may affect it which is why exact code matters.

cretz avatar Nov 12 '24 17:11 cretz

Yep, I understand that providing insights without a reproducible example is almost impossible.

However, I figured out how to make my workflow workable.

I had my Protobuf imports under unsafe context manager:

with workflow.unsafe.imports_passed_through():
    from temporalio.v1.lorem_pb2 import (
        LoremWorkflowRequest,
        LoremWorkflowResponse,
    )

I changed that to the following, and now I see no exceptions:

from temporalio.v1.lorem_pb2 import (
    LoremWorkflowRequest,
    LoremWorkflowResponse,
)

What is the right way?

Gr1N avatar Nov 12 '24 19:11 Gr1N

What is the right way?

The first way with the imports passed through is usually better for models because with the second way you're re-importing the models on every single workflow run which costs memory and CPU. Having said that, we automatically mark any import beneath temporalio as pass through I believe, so a bit surprised there is a difference.

cretz avatar Nov 12 '24 20:11 cretz

Sounds like the same issue I have https://stackoverflow.com/questions/79495633/how-do-i-handle-json-protobuf-messages-in-temporal-python-sdk

But I think the problem is that type is not known by the time it gets to the converter for lookup as noted by <unknown> on the message

site-packages\temporalio\converter.py", line 435, in from_payload
    value = _sym_db.GetSymbol(message_type)()
            ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^

The message_type may not be known at that point because of

message_type = payload.metadata.get("messageType", b"<unknown>").decode()

AFAIK python erases type information at runtime so I am not sure how you're supposed to determine that unless there's an undocumented annotation to say what the types are.

trajano avatar Mar 09 '25 16:03 trajano

But I think the problem is that type is not known by the time it gets to the converter for lookup

The problem is that whatever is doing the protobuf conversion is not setting metadata on the payload properly. This would be a bit strange in all but the oldest Python SDK versions or in CLI. Can you show the raw payload in workflow history (can download still encoded history from UI)? How are you setting this value? Are you possibly using CLI?

AFAIK python erases type information at runtime

No, type hints are available at runtime, but regardless, on deserialization we use a messageType metadata value that should be set

Also, instead of an existing GH issue or SO, there are also forums and #python-sdk channels on Slack if easier for you. I think your problem is unrelated to this issue which is about the sandbox.

cretz avatar Mar 10 '25 13:03 cretz

Are you talking about this?

{
  "eventId": "1",
  "eventTime": "2025-03-10T18:11:26.549986494Z",
  "eventType": "WorkflowExecutionStarted",
  "taskId": "1049165",
  "userMetadata": {},
  "workflowExecutionStartedEventAttributes": {
    "workflowType": {
      "name": "GreetingWorkflowProtobuf"
    },
    "taskQueue": {
      "name": "foo",
      "kind": "TASK_QUEUE_KIND_NORMAL"
    },
    "input": {
      "payloads": [
        {
          "message": "Archie"
        }
      ]
    },
    "workflowTaskTimeout": "10s",
    "originalExecutionRunId": "cc2e7a38-8c5e-4ca7-806b-e441d0ad7514",
    "firstExecutionRunId": "cc2e7a38-8c5e-4ca7-806b-e441d0ad7514",
    "attempt": 1,
    "firstWorkflowTaskBackoff": "0s",
    "workflowId": "29e58714-ebe5-4f6b-b277-fe81215c5a58"
  },
  "links": []
}

No, type hints are available at runtime, but regardless, on deserialization we use a messageType metadata value that should be set

I don't see that but it explicitly be set in userMetadata ? I don't recall having to do that when it's Java

trajano avatar Mar 10 '25 18:03 trajano

I don't see that but it explicitly be set in userMetadata? I don't recall having to do that when it's Java

Yes, Java was able to infer, but even modern Java SDK will set that metadata. But newer SDKs require this field even when they can infer. How are you starting this workflow? Any recent SDK version you start it with using a protobuf message will set the proper metadata. Are you using CLI? If so, you'll need something like --input-meta messageType=my.qualified.Proto.

cretz avatar Mar 10 '25 19:03 cretz

@cretz I am starting through Temporal UI, the Start workflow button

trajano avatar Mar 11 '25 01:03 trajano

When starting through the latest UI, you have to select json/protobuf as the encoding, which will then provide a "Message Type" field you must provide with the fully qualified proto message name. Most SDKs require this even if older SDKs/implementations don't for backwards compatibility reasons.

Note, this is unrelated to this GitHub issue which is specifically concerning the sandbox and protobuf. Can use Slack or forums for general help (or another GitHub issue if really needed), or you can open a cloud ticket if you're a cloud customer.

cretz avatar Mar 11 '25 15:03 cretz

@cretz noted, it appears to be in the latest cloud UI, but not in the current dev-server. I'll open up a related issue

trajano avatar Mar 11 '25 17:03 trajano