py-tree-sitter Extracting user-defined identifiers in multiple languages

Hi, there!

I am facing a problem where I need to extract the user-defined identifiers from source code snippets in multiple programming languages. User-defined identifiers (UIDs) here refer to identifiers that the authors define within the code snippet itself. For instance, I need to extract "MAXN", "some_func", "arg", and "var" from the following C snippet, since all of them are defined or declared in it.

#define MAXN 1000
int some_func(int arg) {
    int var = arg;
    return var;
}

The hard part is that each programming language seems to have its own peculiarities. For instance, macros exist in C but not in Java, so I must handle macros specially in C (like "MAXN" in the example above). I am not familiar with all of my target programming languages, so it is very possible that I will miss some of these peculiarities. My question is whether there is a simple or unified approach. To simplify the situation, let's say I need the UID extractor to support C, Java, and Python. Is there a unified way to achieve this? Or do I have to dive into each language and build an individual UID extractor for each one?

Currently, my solution is to pretend that all languages are the same... I traverse the parse tree and mark all nodes whose type contains "definition", "declaration", "declarator", "specification", "specifier", or "param". Then I look at the children of these marked nodes: if a marked node has a child of type "identifier", that identifier is regarded as a UID. The current UID extractor works fine for C/C++ (I wrote some additional rules to support macro extraction by marking the "preproc" nodes). The code is shown below, where "root" is the root_node of the parse tree and "_bytes" is the byte sequence of the code.

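from collections import deque

def get_uids(root, _bytes):
    """Collect identifiers that sit directly under declaration-like nodes."""
    q = deque([root])  # breadth-first traversal; popleft() is O(1)
    ret = set()
    while q:
        node = q.popleft()
        for c in node.children:
            # Heuristic: node types whose name suggests they introduce a name.
            if "declaration" in c.type or "declarator" in c.type \
                    or "specifier" in c.type or "specification" in c.type \
                    or ("preproc" in c.type and "def" in c.type) or "definition" in c.type:
                # Keep only the first identifier-like child of the marked node.
                for grand_c in c.children:
                    if "identifier" in grand_c.type:
                        ret.add(str(_bytes[grand_c.start_byte: grand_c.end_byte], "latin1").strip())
                        break
            # Parameters of function-like macros: every identifier counts.
            if c.type == "preproc_params":
                for grand_c in c.children:
                    if grand_c.type == "identifier":
                        ret.add(str(_bytes[grand_c.start_byte: grand_c.end_byte], "latin1").strip())
            q.append(c)
    return ret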

For sure, this is not a graceful solution... and I need your help...

THANKS A LOT!

Dec 03 '21 12:12 LC-John

Hi @LC-John, do you have a solution for Java and Python to detect user-defined identifiers? Could you please share it if you do? Thanks!!

Dec 17 '21 05:12 anks12297

@anks12297 Sorry, but no... I kind of gave up... I have separated the UID extractor from my downstream application, and I'm currently building my downstream projects. I'm testing them on C/C++ only, so it is not a big problem for me... Hopefully, when I'm done, I'll come back to the UID extractors.

Dec 17 '21 06:12 LC-John

You'd need to write queries for this, tailored to every language. I'm sorry, but treating every language's identifiers as the same, purely based on the node names, makes no sense. Queries are meant to unify this in a sense.

Feb 26 '24 14:02 amaanq
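
For readers landing here, a minimal sketch of what "tailored to every language" could look like. This is not the query API mentioned above. It is a plain per-language table walked by hand, using only the node attributes already used in the original post plus Node.child_by_field_name. The names UID_RULES and get_uids_by_rules are hypothetical, and the node types and field names are my reading of the tree-sitter-c, tree-sitter-java, and tree-sitter-python grammars, so please check them against each grammar's node-types.json before relying on them.

```python
# Per-language rules: node type -> field that holds the defined name.
# None means "every direct child of type 'identifier' is a definition".
# ASSUMPTION: node and field names come from my reading of the grammars.
UID_RULES = {
    "c": {
        "preproc_def": "name",
        "preproc_function_def": "name",
        "preproc_params": None,
        "function_declarator": "declarator",
        "init_declarator": "declarator",
        "parameter_declaration": "declarator",
        "declaration": "declarator",
        "pointer_declarator": "declarator",
        "array_declarator": "declarator",
    },
    "java": {
        "class_declaration": "name",
        "method_declaration": "name",
        "variable_declarator": "name",
        "formal_parameter": "name",
    },
    "python": {
        "function_definition": "name",
        "class_definition": "name",
        "assignment": "left",
        "default_parameter": "name",
        "parameters": None,
    },
}


def get_uids_by_rules(root, source_bytes, language):
    """Return the set of identifiers matched by the rules for `language`."""
    rules = UID_RULES[language]
    uids = set()
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.type not in rules:
            continue
        field = rules[node.type]
        if field is None:
            candidates = node.children
        else:
            # Only the first child with this field is returned, so
            # "int a, b;" would yield just "a" for the declaration node.
            candidates = [node.child_by_field_name(field)]
        for cand in candidates:
            # Nested declarators (e.g. pointers) are skipped here and
            # picked up later when the traversal reaches them.
            if cand is not None and cand.type == "identifier":
                text = source_bytes[cand.start_byte:cand.end_byte]
                uids.add(text.decode("utf8"))
    return uids
```

Called as get_uids_by_rules(tree.root_node, code_bytes, "c"), this should behave much like the heuristic in the original post for simple C snippets, while letting Java and Python get their own rules. A table like this cannot express everything a query can (for example, telling a Python rebinding apart from a first definition), so it is only a starting point.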
