incubator-graphar icon indicating copy to clipboard operation
incubator-graphar copied to clipboard

[POC]feat(format): Add format protocol definitions files with google protocol buffers

Open acezen opened this issue 9 months ago • 4 comments

Reason for this PR

as issue #73 describes, currently the C++ library, Java, and scala library implement the format independently and they may have some miss-match between the implementation. It's better to standardize the format with protobuf and the libraries rely on the definition.

The standardized format definition would bring much benefits:

  • More clear definition to the C++/Java code
  • Make the format cross-language compatibility
  • It would not only carry the schema of graph, but also contain the optional metadata like vertex num/edge num, the chunk statistics. ( which we serialize to files directly)
  • It can be evolve without change the library code immediately
  • arrow and parquet are define their format with IDL too.

But the changes may break the breaking changes to current public APIs if we adapt the implementation to the format protocol.

What changes are included in this PR?

  • Add format protocol definitions files with google protocol buffers
  • README about the format protocol definitions

Are these changes tested?

yes

Are there any user-facing changes?

not yet

not yet

acezen avatar May 08 '24 12:05 acezen

Based on my experience with Apache Spark itself, protobuf-way may be a painful story:

  • you need to incorporate it into CI;
  • the generated code is huge (for java it may be thousands of LOC);
  • the generated code is unreadable;
  • debugging is very hard;
  • we need to decide, are we going to store the generated code in git or not (there are pros and cons);

Btw, my first question is, are we able to use buf? (https://buf.build/docs/introduction)? Apache Spark itself uses buf because it significantly simplifies the process. I guess there shouldn't be a conflict with ASF rules. If we can, I would like to recommend to create buf.work.yaml and buf.gen.yaml. You may check examples in Apache Spark (https://github.com/apache/spark/tree/master/connector/connect/common/src/main), or I can do it because I had some experience with it;

My second question, are we going to store generated code in git or not? Apache Spark uses a mixed approach, when they generate JVM-classes on the fly (via protobuf maven plugin) but store generated py files in git. For python they have a cool solution: https://github.com/apache/spark/blob/master/dev/connect-gen-protos.sh, an auto formatter of generated code, that allows to have a readable diff in git PRs;

Based on my experience, Data Science / Network Analysis people and Data Engineers, who are our target auditory, are not familiar with protobuf, so we should also extends the documentation itself.

P.S. It is nice that I was stuck in my work and did not write a lot of code for a new Python API. Because otherwise I would need to rewrite it again with proto :)

SemyonSinchenko avatar May 08 '24 13:05 SemyonSinchenko

@acezen Because it is a very huge change, should we discuss it first in a mailing list? Like voting. Because it is a really big change, pros and cons are known too.

SemyonSinchenko avatar May 08 '24 13:05 SemyonSinchenko

@acezen Because it is a very huge change, should we discuss it first in a mailing list? Like voting. Because it is a really big change, pros and cons are known too.

Hi, @SemyonSinchenko sorry for the misleading,this PR is just a example or show code for format definition. As you said, it should be and I will open a discussion about the proposal and changes (may need some time to list the pros and cons about the protobuf). The goal I want to achieve is we should have some strategies to standardize the format definition, easy to maintain and evolve, make it cross-language compatibility.

acezen avatar May 08 '24 15:05 acezen

Based on my experience with Apache Spark itself, protobuf-way may be a painful story:

  • you need to incorporate it into CI;
  • the generated code is huge (for java it may be thousands of LOC);
  • the generated code is unreadable;
  • debugging is very hard;
  • we need to decide, are we going to store the generated code in git or not (there are pros and cons);

Btw, my first question is, are we able to use buf? (https://buf.build/docs/introduction)? Apache Spark itself uses buf because it significantly simplifies the process. I guess there shouldn't be a conflict with ASF rules. If we can, I would like to recommend to create buf.work.yaml and buf.gen.yaml. You may check examples in Apache Spark (https://github.com/apache/spark/tree/master/connector/connect/common/src/main), or I can do it because I had some experience with it;

My second question, are we going to store generated code in git or not? Apache Spark uses a mixed approach, when they generate JVM-classes on the fly (via protobuf maven plugin) but store generated py files in git. For python they have a cool solution: https://github.com/apache/spark/blob/master/dev/connect-gen-protos.sh, an auto formatter of generated code, that allows to have a readable diff in git PRs;

Based on my experience, Data Science / Network Analysis people and Data Engineers, who are our target auditory, are not familiar with protobuf, so we should also extends the documentation itself.

P.S. It is nice that I was stuck in my work and did not write a lot of code for a new Python API. Because otherwise I would need to rewrite it again with proto :)

Thanks Sem, the comment and advice is very helpful and insightful, buf is one solution and we do need to discuss about it!

acezen avatar May 08 '24 15:05 acezen

May we add buf as a building system? It is under Apache 2.0 and actually we need only binaries. https://github.com/bufbuild/buf

SemyonSinchenko avatar May 28 '24 07:05 SemyonSinchenko

May we add buf as a building system? It is under Apache 2.0 and actually we need only binaries. https://github.com/bufbuild/buf

Good idea, I will give a try.

acezen avatar May 28 '24 08:05 acezen

@acezen May we merge it into a separate branch? I want to try to add buf and some additional scripts, especially for python part.

SemyonSinchenko avatar Jun 06 '24 10:06 SemyonSinchenko

@acezen May we merge it into a separate branch? I want to try to add buf and some additional scripts, especially for python part.

Sure, I can create a branch for the format and merge this to the branch.

acezen avatar Jun 06 '24 10:06 acezen

@acezen May we merge it into a separate branch? I want to try to add buf and some additional scripts, especially for python part.

Hi, Sem, I have created a format-definition-dev branch and merged the format definition into it. You can develop base on the branch.

acezen avatar Jun 06 '24 11:06 acezen

The change has merge into format-definition-dev branch for continue development, close this PR.

acezen avatar Jun 06 '24 11:06 acezen