[Bug] Workflow sandbox issues with Protobuf
What are you really trying to do?
Hey!
I'm doing nothing special. I have one workflow and one activity, and I use Protobuf messages as input parameters in both cases. I just expect to be able to execute the workflow and wait for it to complete.
Describe the bug
The issue happens only in the Kubernetes environment.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/temporalio/converter.py", line 436, in from_payload
    return google.protobuf.json_format.Parse(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py", line 421, in Parse
    return ParseDict(js, message, ignore_unknown_fields, descriptor_pool,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py", line 446, in ParseDict
    parser.ConvertMessage(js_dict, message, '')
  File "/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py", line 487, in ConvertMessage
    self._ConvertFieldValuePair(value, message, path)
  File "/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py", line 563, in _ConvertFieldValuePair
    if _IsMapEntry(field):
       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/protobuf/json_format.py", line 151, in _IsMapEntry
    field.message_type.GetOptions().map_entry)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1130, in _find_and_load_unlocked
  File "/usr/local/lib/python3.11/site-packages/temporalio/worker/workflow_sandbox/_importer.py", line 393, in __getitem__
    return self.current[key]
           ~~~~~~~~~~~~^^^^^
KeyError: 'google.protobuf'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/temporalio/converter.py", line 306, in from_payloads
    values.append(converter.from_payload(payload, type_hint))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/temporalio/converter.py", line 442, in from_payload
    raise RuntimeError(f"Unknown Protobuf type {message_type}") from err
RuntimeError: Unknown Protobuf type internal.path.to.protobuf.TaskRequest

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py", line 1625, in _convert_payloads
    return self._payload_converter.from_payloads(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/temporalio/converter.py", line 308, in from_payloads
    raise RuntimeError(
RuntimeError: Payload at index 0 with encoding json/protobuf could not be converted

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py", line 364, in activate
    self._workflow_input = self._make_workflow_input(start_job)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py", line 896, in _make_workflow_input
    args = self._convert_payloads(start_job.arguments, arg_types)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/temporalio/worker/_workflow_instance.py", line 1633, in _convert_payloads
    raise RuntimeError("Failed decoding arguments") from err
Minimal Reproduction
Unfortunately, I don't have reproduction steps. Everything works perfectly in both my local environment and my colleague's.
I would love to hear ideas or proposals on what to look into to understand the root cause. The only thing I can state is that if I add the following import, everything works:
with workflow.unsafe.imports_passed_through():
    import google.protobuf
My other Protobuf imports (the result of code generation from my Protobuf definitions) are already inside that unsafe context manager.
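For reference, an alternative to the per-file `imports_passed_through` workaround is to mark modules as passed through at the worker level via sandbox restrictions. This is only a sketch based on the SDK's sandbox API; `client`, the task queue name, and `LoremWorkflow` are placeholders for your own setup:

```python
from temporalio.worker import Worker
from temporalio.worker.workflow_sandbox import (
    SandboxedWorkflowRunner,
    SandboxRestrictions,
)

# Pass google.protobuf through the sandbox for every workflow on this worker,
# so the sandbox reuses the host interpreter's copy instead of re-importing it.
runner = SandboxedWorkflowRunner(
    restrictions=SandboxRestrictions.default.with_passthrough_modules(
        "google.protobuf",
    ),
)

worker = Worker(
    client,                      # an existing temporalio.client.Client
    task_queue="my-task-queue",  # placeholder
    workflows=[LoremWorkflow],
    workflow_runner=runner,
)
```

This keeps workflow files free of sandbox-specific import blocks, at the cost of configuring the pass-through in one central place.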
Environment/Versions
- OS and processor: Linux
- Temporal Version: 1.8.0
- Yes, Docker and Kubernetes
Additional context
Unfortunately, I don't have reproduction steps. Everything works perfectly in both my local environment and my colleague's.
Ideally we can replicate so we can help debug the problem. It is interesting that the exact same Python code does different things in different environments. It may be a protobuf version issue or an order of operations issue, but a standalone replication may be needed for us to confirm the issue.
The only thing I can state is that if I add the following import, everything works:
Looking at the traceback, it's a bit of an interesting case: we catch KeyError because that's how we check for a not-found protobuf type, but here the KeyError is actually an import-level error from importing google.protobuf. I wonder if GetOptions is somehow lazily importing google.protobuf there.
I can't really tell the problem without a replication. Can you confirm exact protobuf version (both library and protoc generator) and continually reduce the k8s-only form of replication until it's very small and standalone?
Here are my versions:
protobuf 4.25.3
libprotoc 3.19.6
Here is an example of a workflow file:
from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    from temporalio.v1.lorem_pb2 import (
        LoremWorkflowRequest,
        LoremWorkflowResponse,
    )

@workflow.defn(name="lorem-workflow")
class LoremWorkflow:
    @workflow.run
    async def run(self, req: LoremWorkflowRequest) -> LoremWorkflowResponse:
        return LoremWorkflowResponse(text="i am response")
Could it be the issue with my Protobuf definitions and the fact that I use maps (but again, on local, all good)?
message LoremWorkflowRequest {
  map<string, bool> random = 1;
}
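For reference, a stdlib-only sketch (not Temporal code): protobuf's canonical JSON mapping encodes a `map<string, bool>` field as a plain JSON object, so the `json/protobuf` payload body for a message like the one above would look like this:

```python
import json

# Canonical protobuf JSON for:
#   message LoremWorkflowRequest { map<string, bool> random = 1; }
# Map fields serialize as JSON objects keyed by the stringified map key.
# The field values here are made-up examples.
payload_body = json.dumps({"random": {"feature_a": True, "feature_b": False}})

decoded = json.loads(payload_body)
print(decoded["random"]["feature_a"])  # True
```

Parsing that JSON back into the message is exactly the `json_format.Parse` step that fails in the traceback above, which is why the map-handling code path (`_IsMapEntry`) appears there.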
Here is an example of a workflow file:
To confirm, the exact same code from Python start to end fails in one of your environments but not another? We may need to see the full standalone code (i.e., how you are registering/running the worker). There may be an order-of-operations issue.
Could it be the issue with my Protobuf definitions and the fact that I use maps (but again, on local, all good)?
I cannot know to be honest without replicating and debugging. The code you show above should work fine in every environment. I wonder if there are Python version differences or something else in the one environment where this happens that could help us figure out how to reliably replicate?
I think the main goal is to confirm the exact same small code that fails in one environment but works in another. Even something as simple as the order something is imported may affect it which is why exact code matters.
Yep, I understand that providing insights without a reproducible example is almost impossible.
However, I figured out how to make my workflow work.
I had my Protobuf imports under the unsafe context manager:
with workflow.unsafe.imports_passed_through():
    from temporalio.v1.lorem_pb2 import (
        LoremWorkflowRequest,
        LoremWorkflowResponse,
    )
I changed that to the following, and now I see no exceptions:
from temporalio.v1.lorem_pb2 import (
    LoremWorkflowRequest,
    LoremWorkflowResponse,
)
What is the right way?
What is the right way?
The first way, with the imports passed through, is usually better for models, because with the second way you're re-importing the models on every single workflow run, which costs memory and CPU. Having said that, I believe we automatically mark any import beneath temporalio as pass-through, so I'm a bit surprised there is a difference.
Sounds like the same issue I have https://stackoverflow.com/questions/79495633/how-do-i-handle-json-protobuf-messages-in-temporal-python-sdk
But I think the problem is that the type is not known by the time it gets to the converter for lookup, as indicated by <unknown> on the message:
site-packages\temporalio\converter.py", line 435, in from_payload
    value = _sym_db.GetSymbol(message_type)()
            ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
The message_type may not be known at that point because of
message_type = payload.metadata.get("messageType", b"<unknown>").decode()
AFAIK Python erases type information at runtime, so I am not sure how you're supposed to determine that unless there's an undocumented annotation saying what the types are.
But I think the problem is that type is not known by the time it gets to the converter for lookup
The problem is that whatever is doing the protobuf conversion is not setting metadata on the payload properly. This would be a bit strange in all but the oldest Python SDK versions or in CLI. Can you show the raw payload in workflow history (can download still encoded history from UI)? How are you setting this value? Are you possibly using CLI?
AFAIK python erases type information at runtime
No, type hints are available at runtime, but regardless, on deserialization we use a messageType metadata value that should be set
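To illustrate the point about runtime type hints (a stdlib-only sketch, not Temporal's actual internals): Python keeps function annotations at runtime, and `typing.get_type_hints` resolves them, which is how an SDK can recover a workflow's declared parameter types. The class and function names here are hypothetical:

```python
import typing

class TaskRequest:
    """Stand-in for a generated protobuf message class."""

async def run(self, req: TaskRequest) -> str:
    """Stand-in for a workflow run method; only the annotations matter here."""
    return "ok"

# Annotations survive at runtime; unannotated params (like self) are skipped.
hints = typing.get_type_hints(run)
print(hints["req"] is TaskRequest)  # True
print(hints["return"] is str)       # True
```

So the hints tell the SDK the target class for deserialization, while the `messageType` metadata on the payload tells the converter which registered proto symbol the sender claims it is.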
Also, instead of an existing GH issue or SO, there are also forums and #python-sdk channels on Slack if easier for you. I think your problem is unrelated to this issue which is about the sandbox.
Are you talking about this?
{
  "eventId": "1",
  "eventTime": "2025-03-10T18:11:26.549986494Z",
  "eventType": "WorkflowExecutionStarted",
  "taskId": "1049165",
  "userMetadata": {},
  "workflowExecutionStartedEventAttributes": {
    "workflowType": {
      "name": "GreetingWorkflowProtobuf"
    },
    "taskQueue": {
      "name": "foo",
      "kind": "TASK_QUEUE_KIND_NORMAL"
    },
    "input": {
      "payloads": [
        {
          "message": "Archie"
        }
      ]
    },
    "workflowTaskTimeout": "10s",
    "originalExecutionRunId": "cc2e7a38-8c5e-4ca7-806b-e441d0ad7514",
    "firstExecutionRunId": "cc2e7a38-8c5e-4ca7-806b-e441d0ad7514",
    "attempt": 1,
    "firstWorkflowTaskBackoff": "0s",
    "workflowId": "29e58714-ebe5-4f6b-b277-fe81215c5a58"
  },
  "links": []
}
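For comparison, a hand-built stdlib sketch of roughly what a well-formed `json/protobuf` payload carries (field names follow Temporal's payload shape; the proto message name is hypothetical). Note the `metadata` block, which the payload in the history above lacks:

```python
import base64
import json

# A Temporal payload is data plus metadata. For json/protobuf payloads the
# converter reads metadata["messageType"] to find the proto class to parse into.
data = json.dumps({"message": "Archie"}).encode()
payload = {
    "metadata": {
        # Metadata values are bytes, shown base64-encoded in JSON histories.
        "encoding": base64.b64encode(b"json/protobuf").decode(),
        "messageType": base64.b64encode(b"my.package.GreetingRequest").decode(),
    },
    "data": base64.b64encode(data).decode(),
}

print(base64.b64decode(payload["metadata"]["messageType"]).decode())
```

Without that `messageType` entry, the converter has no way to look up the proto symbol, which matches the `<unknown>` seen earlier.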
No, type hints are available at runtime, but regardless, on deserialization we use a messageType metadata value that should be set
I don't see that; should it be explicitly set in userMetadata? I don't recall having to do that in Java.
I don't see that; should it be explicitly set in userMetadata? I don't recall having to do that in Java.
Yes, Java was able to infer, but even the modern Java SDK will set that metadata, and newer SDKs require this field even when they could infer it. How are you starting this workflow? Any recent SDK version you start it with using a protobuf message will set the proper metadata. Are you using the CLI? If so, you'll need something like --input-meta messageType=my.qualified.Proto.
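For example, a start command along these lines (a sketch; the workflow type, task queue, input, and proto name are placeholders, and the exact flag spelling should be checked against `temporal workflow start --help`):

```shell
# Attach the messageType metadata so the worker's converter can find the proto
temporal workflow start \
  --task-queue foo \
  --type GreetingWorkflowProtobuf \
  --input '{"message": "Archie"}' \
  --input-meta 'messageType=my.package.GreetingRequest'
```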
@cretz I am starting through Temporal UI, the Start workflow button
When starting through the latest UI, you have to select json/protobuf as the encoding, which then provides a "Message Type" field where you must enter the fully qualified proto message name. Most SDKs require this even if older SDKs/implementations don't, for backwards-compatibility reasons.
Note, this is unrelated to this GitHub issue which is specifically concerning the sandbox and protobuf. Can use Slack or forums for general help (or another GitHub issue if really needed), or you can open a cloud ticket if you're a cloud customer.
@cretz noted, it appears to be in the latest cloud UI, but not in the current dev-server. I'll open up a related issue