                        Proposal: protobuf backend
I would like to gauge the community's interest in, and feedback on, the following proposal: add a p4c backend that generates a protobuf dump of a P4 AST (after the common frontend passes have been applied).
The following applications come to mind:
- Since protobufs are language-independent, this would allow writing P4 tools in any programming language, without having to re-implement a P4 frontend.
- Since protobufs can be easily serialized and deserialized, this would allow writing P4 tools as a separate binary, without ever having to touch p4c.
- Overall, this would enable more rapid innovation in the P4 tooling space, by allowing P4 tools to be developed independently of the compiler, in any language.
To give some more context and depth, Google has been developing a number of P4 tools, e.g. p4-constraints, a fuzzer (to be open-sourced soon), and a symbolic interpreter (to be open-sourced soon). So far all our tools are based on compilation artifacts of the bmv2 backend (namely, the P4Info and JSON config files). This has worked well and has all the advantages mentioned above, but it comes with the following downsides:
- The bmv2 backend only has "tentative support" for PSA, according to the bmv2 README.
- The bmv2 backend does not preserve all annotations. (Custom annotations are especially useful in experimenting with new P4 tools.)
- Minor: The bmv2 backend outputs JSON. Compared to protobuf, JSON has the disadvantage of being a dynamic format without a programmatically specified, compile-time-enforced schema. Instead, the schema is given informally in prose.
(In our experience, the final concern is relatively minor compared to the other two.)
I'd love to get your thoughts and feedback on this. Is this a reasonable idea, or should we simply address the downsides of the bmv2 backend? @antoninbas, are these downsides even valid? Would such a backend be useful for researchers such as @hackedy / @ericthewry / @jnfoster?
@jonathan-dilorenzo @kheradmandg
The BMv2 JSON is not the same as the IR of the compiler. The IR of the compiler has a different JSON representation, and the serialization to JSON is generated automatically from the *.def files (you can exercise it with the --toJson flag). The same path could be taken to generate a protobuf-based implementation. I expect this would be a quick project. But note that some of the information is not in the IR, e.g., types are in a separate structure.
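For example, an invocation along the lines of `p4test --toJson ir.json program.p4` should dump the compiler IR of program.p4 as JSON into ir.json (the driver name and argument order here are from memory and may vary; the flag itself is as described above).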
Thanks for the info, I wasn't aware of the --toJson flag. Are there existing mechanisms for dumping the types as well? P4Info comes to mind, but it isn't quite complete, in the sense that it doesn't preserve annotations on type declarations, AFAIK.
It depends on how many types you want. For expressions, the Expression base class has a 'type' field which is optionally populated. It can be populated by calling the typeInference pass with a boolean flag. (This field is not maintained when the IR changes; you have to set it just before you use it.) But other kinds of nodes, e.g., declarations, arguments, etc., have types but no such representation in the IR. Note that the compiler frontend always makes implicit casts explicit and specializes type variables, so in principle the IR has enough information to reconstruct types for everything. Type declarations are always in the IR as well; they are part of the program. P4Info generation does a fair amount of work to collect the type information it needs. But this project could probably be done in a couple of days.
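For illustration, a rough sketch of how a tool might populate those type fields; the pass name (P4::TypeChecking) and the updateExpressions flag are from memory of the current sources and may have drifted:

```cpp
// Rough sketch (names/signatures assumed from the current p4c sources):
// run type checking with updateExpressions = true so that the computed
// types are written back into each Expression's 'type' field.
#include "frontends/common/resolveReferences/referenceMap.h"
#include "frontends/p4/typeChecking/typeChecker.h"
#include "frontends/p4/typeMap.h"

const IR::P4Program* populateExpressionTypes(const IR::P4Program* program) {
    P4::ReferenceMap refMap;
    P4::TypeMap typeMap;
    // The boolean flag asks the pass to store types on Expression nodes;
    // remember that these fields go stale as soon as the IR changes.
    P4::TypeChecking typeChecking(&refMap, &typeMap, /*updateExpressions=*/true);
    return program->apply(typeChecking);
}
```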
I would also be interested in the difference between the p4c IR JSON vs the BMv2 JSON, and why BMv2 decided to go with its own representation.
One possible reason that comes to mind is stability of the representation. How stable is the compiler IR?
Re types, thanks for clarifying, I misunderstood and thought you were referring to type declarations. Not having types everywhere may just be fine.
Think of the compiler IR as a set of "beans", as in "Java beans" - pure data objects. They are declared in a subset of C++ and stored in .def files. You can check the churn on these classes by looking at the history of changes. Modifications are very rare. Additions happen occasionally.
So serializing the compiler IR representation of a program is essentially the same as serializing a set of beans with fixed types - a fixed schema. One can easily write a tool to do that, and that is how the IR JSON is generated. The IR JSON is literally a text representation of the beans' contents.
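For instance, here is a simplified (not verbatim) sketch of such a bean as declared in a .def file, and, roughly, its JSON rendering:

```cpp
// Simplified sketch of an IR "bean" from ir/expression.def
// (illustrative, not the verbatim p4c definition):
class Add : Operation_Binary {
    stringOp = "+";  // the operands come from Operation_Binary's fields
}

// The IR JSON for such a node is just the bean's contents, roughly:
//   { "Node_ID" : 7, "Node_Type" : "Add",
//     "left" : { ... }, "right" : { ... } }
```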
The BMv2 JSON representation predates the P4-16 compiler implementation; the Python P4-14 compiler generated JSON for BMv2. So the P4-16 compiler just adapted that format. The format is well documented, as you noticed.
The IR JSON format is not documented, because we didn't want to freeze the IR representation. If we made mistakes we want to be able to change the IR. But the *.def files are very readable and should be almost self-documenting.
There are two additional resources: the toP4 code, which generates P4 code from the IR, and the dump() function, which can dump IR representations as text (used mostly for debugging from gdb).
All front-end IR classes are in the *.def files in the ir/ directory. Notice that backends can add their own IR classes. This was designed on purpose to allow extensibility. The front-end and midend passes in this repo will never use backend IR nodes, though. The DPDK representation has a completely different parallel IR, for example.
You can also read the parser.ypp code to see how IR classes are created when parsing text; that is also informative.
> Thanks for the info, I wasn't aware of the --toJson flag. Are there existing mechanisms for dumping the types as well?
@smolkaj As an example, the --toJson flag is used by T4P4S with hlir16.
@smolkaj Did you find an existing solution that gave you results you could use for your original problem?
We are still using BMv2 JSON (and v1model ...) for now.
Since my original post, I learned that the protobuf library provides generic parsers/deparsers from/to JSON. So if we were to extend the IR generator to also generate a .proto schema from the *.def files (which I assume should be rather straightforward), we could already export the IR as a protobuf without writing a custom proto serializer, by going through JSON as follows:
IR --(JSON serializer)-> JSON --(generic JSON-to-proto parser)-> parsed proto --(generic proto serializer)-> serialized proto
When we get more serious about transitioning to PSA, this seems like a good approach to explore.
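To make this concrete, here is a minimal sketch of the last two steps of that pipeline, assuming a p4ir.proto schema has already been generated from the *.def files; the p4ir package, the P4Program message, and the file names are all hypothetical:

```cpp
// Minimal sketch of the proposed pipeline, assuming a p4ir.proto schema
// generated from the *.def files. The "p4ir" package, the "P4Program"
// message, and the file names are hypothetical.
#include <fstream>
#include <sstream>

#include <google/protobuf/util/json_util.h>

#include "p4ir.pb.h"  // hypothetical, produced by protoc from p4ir.proto

int main() {
    // 1. Read the JSON dump produced by the IR's JSON serializer (--toJson).
    std::ifstream in("ir.json");
    std::stringstream buf;
    buf << in.rdbuf();

    // 2. Parse the JSON into the generated proto message using protobuf's
    //    generic JSON-to-proto parser.
    p4ir::P4Program program;
    auto status = google::protobuf::util::JsonStringToMessage(buf.str(), &program);
    if (!status.ok()) return 1;

    // 3. Serialize the message with protobuf's generic binary serializer.
    std::ofstream out("ir.pb", std::ios::binary);
    return program.SerializeToOstream(&out) ? 0 : 1;
}
```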