incubator-graphar
incubator-graphar copied to clipboard
[POC]feat(format): Add format protocol definitions files with google protocol buffers
Reason for this PR
as issue #73 describes, currently the C++ library, Java, and scala library implement the format independently and they may have some miss-match between the implementation. It's better to standardize the format with protobuf and the libraries rely on the definition.
The standardized format definition would bring much benefits:
- More clear definition to the C++/Java code
- Make the format cross-language compatibility
- It would not only carry the schema of graph, but also contain the optional metadata like vertex num/edge num, the chunk statistics. ( which we serialize to files directly)
- It can be evolve without change the library code immediately
- arrow and parquet are define their format with IDL too.
But the changes may break the breaking changes to current public APIs if we adapt the implementation to the format protocol.
What changes are included in this PR?
- Add format protocol definitions files with google protocol buffers
- README about the format protocol definitions
Are these changes tested?
yes
Are there any user-facing changes?
not yet
not yet
Based on my experience with Apache Spark itself, protobuf-way may be a painful story:
- you need to incorporate it into CI;
- the generated code is huge (for java it may be thousands of LOC);
- the generated code is unreadable;
- debugging is very hard;
- we need to decide, are we going to store the generated code in git or not (there are pros and cons);
Btw, my first question is, are we able to use buf
? (https://buf.build/docs/introduction)? Apache Spark itself uses buf because it significantly simplifies the process. I guess there shouldn't be a conflict with ASF rules. If we can, I would like to recommend to create buf.work.yaml
and buf.gen.yaml
. You may check examples in Apache Spark (https://github.com/apache/spark/tree/master/connector/connect/common/src/main), or I can do it because I had some experience with it;
My second question, are we going to store generated code in git or not? Apache Spark uses a mixed approach, when they generate JVM-classes on the fly (via protobuf maven plugin) but store generated py files in git. For python they have a cool solution: https://github.com/apache/spark/blob/master/dev/connect-gen-protos.sh, an auto formatter of generated code, that allows to have a readable diff in git PRs;
Based on my experience, Data Science / Network Analysis people and Data Engineers, who are our target auditory, are not familiar with protobuf, so we should also extends the documentation itself.
P.S. It is nice that I was stuck in my work and did not write a lot of code for a new Python API. Because otherwise I would need to rewrite it again with proto :)
@acezen Because it is a very huge change, should we discuss it first in a mailing list? Like voting. Because it is a really big change, pros and cons are known too.
@acezen Because it is a very huge change, should we discuss it first in a mailing list? Like voting. Because it is a really big change, pros and cons are known too.
Hi, @SemyonSinchenko sorry for the misleading,this PR is just a example or show code for format definition. As you said, it should be and I will open a discussion about the proposal and changes (may need some time to list the pros and cons about the protobuf). The goal I want to achieve is we should have some strategies to standardize the format definition, easy to maintain and evolve, make it cross-language compatibility.
Based on my experience with Apache Spark itself, protobuf-way may be a painful story:
- you need to incorporate it into CI;
- the generated code is huge (for java it may be thousands of LOC);
- the generated code is unreadable;
- debugging is very hard;
- we need to decide, are we going to store the generated code in git or not (there are pros and cons);
Btw, my first question is, are we able to use
buf
? (https://buf.build/docs/introduction)? Apache Spark itself uses buf because it significantly simplifies the process. I guess there shouldn't be a conflict with ASF rules. If we can, I would like to recommend to createbuf.work.yaml
andbuf.gen.yaml
. You may check examples in Apache Spark (https://github.com/apache/spark/tree/master/connector/connect/common/src/main), or I can do it because I had some experience with it;My second question, are we going to store generated code in git or not? Apache Spark uses a mixed approach, when they generate JVM-classes on the fly (via protobuf maven plugin) but store generated py files in git. For python they have a cool solution: https://github.com/apache/spark/blob/master/dev/connect-gen-protos.sh, an auto formatter of generated code, that allows to have a readable diff in git PRs;
Based on my experience, Data Science / Network Analysis people and Data Engineers, who are our target auditory, are not familiar with protobuf, so we should also extends the documentation itself.
P.S. It is nice that I was stuck in my work and did not write a lot of code for a new Python API. Because otherwise I would need to rewrite it again with proto :)
Thanks Sem, the comment and advice is very helpful and insightful, buf
is one solution and we do need to discuss about it!
May we add buf
as a building system? It is under Apache 2.0 and actually we need only binaries.
https://github.com/bufbuild/buf
May we add
buf
as a building system? It is under Apache 2.0 and actually we need only binaries. https://github.com/bufbuild/buf
Good idea, I will give a try.
@acezen May we merge it into a separate branch? I want to try to add buf
and some additional scripts, especially for python part.
@acezen May we merge it into a separate branch? I want to try to add
buf
and some additional scripts, especially for python part.
Sure, I can create a branch for the format and merge this to the branch.
@acezen May we merge it into a separate branch? I want to try to add
buf
and some additional scripts, especially for python part.
Hi, Sem, I have created a format-definition-dev
branch and merged the format definition into it. You can develop base on the branch.
The change has merge into format-definition-dev
branch for continue development, close this PR.