thriftpy2 icon indicating copy to clipboard operation
thriftpy2 copied to clipboard

Expose optional lineno in thrift modules

Open stiga-huang opened this issue 11 months ago • 6 comments

We have a precommit job to check incompatible thrift changes and add warnings in Gerrit code reviews. E.g. when a patch adds a required field in an existing struct, the job will add a comment at that line saying "this might be incompatible".

Currently, the job doesn't understand the thrift syntax so has lots of false positive warnings. I'd like to improve the job by using the parser of thriftpy2. However, to add a Gerrit comment, I need to locate the symbol in the file. Basically just need the file name and line number. I can't find a way to expose it in thriftpy2. Not sure if I miss anything.

I see yacc.LRParser.parse() which is used in parser.py actually provides a tracking mode:

    def parse(self, input=None, lexer=None, debug=False, tracking=False, tokenfunc=None):

https://github.com/dabeaz/ply/blob/3.11/ply/yacc.py#L803-L809 However, thriftpy2 just uses the default mode so maybe line numbers are not attached in the results: https://github.com/Thriftpy/thriftpy2/blob/e4baaf5014b976bb273bc766ee91ea681059d314/thriftpy2/parser/parser.py#L654-L655

What confused me is in parser.py, there are some variables of line number. E.g. p.lineno here: https://github.com/Thriftpy/thriftpy2/blob/8e226b12750829ee48abc1f883ea0fcf0b13f717/thriftpy2/parser/parser.py#L27-L31 It seems thriftpy2 can still get the line numbers.

So is it doable in current thriftpy2 versions to get the location (fileName + lineno) of each thrift symbols (struct/enum/method)? Any inputs are appreciated!

stiga-huang avatar Dec 30 '24 07:12 stiga-huang

Thriftpy2 is not suitable for parsing work. Maybe you should use Antlr to write your own visitor for parsing thrift metainfo. https://github.com/antlr/grammars-v4/tree/master/thrift

cocolato avatar Dec 31 '24 07:12 cocolato

Thank @cocolato ! There are projects like https://github.com/thrift-labs/thrift-parser using Antlr but they are not that powerful as thriftpy2.

For instance, one of our requirements is just checking methods/structs used in some internal RPCs. There are some structs and enums used in data exchange between Java and C++ codes in the same process. They are allowed to add/remove required fields.

Thriftpy2 is convenient that it parses all thrift files at once and we can easily track methods and structs used in internal RPCs. However, using thrift-parser mentioned above, we have to parse the thrift files one by one and "join" the structs defined in different files.

It'd be nice if the parser in Thriftpy2 can expose the line numbers. I thought it's not that hard since line numbers are already used to report parsing/compilation errors. Do you think it needs complex changes?

stiga-huang avatar Jan 20 '25 10:01 stiga-huang

Could you describe the example of using the parse API that you expect?

Thriftpy2 is convenient that it parses all thrift files at once and we can easily track methods and structs used in internal RPCs. However, using thrift-parser mentioned above, we have to parse the thrift files one by one and "join" the structs defined in different files.

cocolato avatar Jan 21 '25 03:01 cocolato

BTW I don't think that thriftpy parser is the best practice, it uses lots of anonymouse struct (tuples) to represent parsing intermediates, each item semantic of tuple is so indistinct, this makes the parser is not that easy to be extended, I'm happy to see if there are better alternatives.

ethe avatar Jan 21 '25 03:01 ethe

I'm so sorry to be late here. I was planning to write a demo to show how we can use thriftpy2 but got interrupt by other tasks.

Currently, we are using pyparsing with a customized syntax to parse each individual thrift files to get the structs and enums: https://github.com/apache/impala/blob/4.5.0/bin/jenkins/thrift_parser.py#L214

Do this for all the thrift files changed in a commit. Then compare the structs and enums before and after the commit: https://github.com/apache/impala/blob/4.5.0/bin/jenkins/critique-gerrit-review.py#L364-L367

Pyparsing has a method to inject the lineno while parsing any objects: https://github.com/apache/impala/blob/4.5.0/bin/jenkins/thrift_parser.py#L112C22-L112C35

As long as thriftpy can expose the lineno, I think there is a way to use it in our job.

stiga-huang avatar Mar 10 '25 12:03 stiga-huang

Hi @stiga-huang, please check #322 to see if it matches your requirements and help us review it.

aisk avatar Sep 20 '25 11:09 aisk