protobuf
protobuf copied to clipboard
Python generated code is no longer compatible with Cython since 3.20
Version: 4.21.8 Language: Python
Before the 3.20 update, the generated code was statically defining all the messages, and could be given to cython without issues. Since the 3.20 update, those messages are dynamically generated, leading to a cython error.
For example, given this proto file
syntax = "proto3";
message Foo {
bool a = 1;
}
The codegen gives this:
$ protoc --python_out=gen proto/a.proto
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: proto/a.proto
"""Generated protocol buffer code."""
from google.protobuf.internal import builder as _builder
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import symbol_database as _symbol_database
# @@protoc_insertion_point(imports)
_sym_db = _symbol_database.Default()
DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\rproto/a.proto\"\x10\n\x03\x46oo\x12\t\n\x01\x61\x18\x01 \x01(\x08\x62\x06proto3')
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'proto.a_pb2', globals())
if _descriptor._USE_C_DESCRIPTORS == False:
DESCRIPTOR._options = None
_FOO._serialized_start=17
_FOO._serialized_end=33
# @@protoc_insertion_point(module_scope)
This can no longer be compiled by cython:
$ cython gen/proto/a_pb2.py
Error compiling Cython file:
------------------------------------------------------------
...
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'proto.a_pb2', globals())
if _descriptor._USE_C_DESCRIPTORS == False:
DESCRIPTOR._options = None
_FOO._serialized_start=17
^
------------------------------------------------------------
gen/proto/a_pb2.py:23:2: undeclared name not builtin: _FOO
Cython does not dynamically check variables from modifications of globals()
, and thinks the variable is not set.
This isn't really a regression since I suppose you do not support Cython. However, a small change could make this work with native python and cython:
Instead of generating this:
_FOO._serialized_start=17
generating this would fix the issue:
globals()["_FOO"]._serialized_start=17
This is not an easy fix to do with some post processing of the generated files with regexes and seds and stuff like this, but shouldn't be too hard to do in the code generator I suppose.
Would you be OK with such a change? That would be really useful for cython users. I can try to make this change if needed.
What is the use case for this? What is the expected behavior if you let Cython compile the generated code?
The expected behavior is to have the exact same behavior as without the Cython pass. Cython is usually used to improve performances, but it can also be used for obfuscation and protecting the source code.
I see. That seems reasonable. We reference globals()
several times, I could see rewriting this to:
_globals = globals()
_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'proto.a_pb2', _globals)
if _descriptor._USE_C_DESCRIPTORS == False:
DESCRIPTOR._options = None
_globals["_FOO"]._serialized_start=17
We could use _globals
for any case where we are referencing a variable we did not directly assign.
I would worry some about the performance, except that _descriptor._USE_C_DESCRIPTORS
will always be true unless we are using the pure-Python library, which is slow anyway. So I think we are ok there.
Yes that would work well!
Are you working on this @haberman ? I can try to make a PR for it otherwise
I am not working on this currently. PRs welcome.