code_tokenize: String literals lost during tokenization for C/C++ code

Open 5c4lar opened this issue 2 years ago • 2 comments

For the following code:

import code_tokenize as ctok
sample = """
#include <stdio.h>
int main() {
    printf("hello world");
}
"""
ctok.tokenize(sample, lang = "cpp")

Output:

[#include, <stdio.h>, , int, main, (, ), {, printf, (, ", ", ), ;, }]

But parsing string literals works fine for Java and Python code. How should I fix this problem?

Jul 20 '22 03:07 5c4lar

Use a custom visitor like this:

class CLeafVisitor(ctok.tokenizer.LeafVisitor):
    def visit_string_literal(self, node):
        self.node_handler(node)
        return False

This seems to fix the problem.
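
To illustrate why returning False from the visitor helps, here is a minimal, self-contained sketch of the idea. It does not use code_tokenize itself: `Node`, `ToyLeafVisitor`, `ToyCLeafVisitor`, and `tokenize_toy` are hypothetical stand-ins, and the tree shape (a `string_literal` node whose only leaf children are the two quote characters) is an assumption that would be consistent with the output reported above. The library's real traversal may differ in detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parse-tree node (not the library's class)."""
    type: str
    text: str
    children: List["Node"] = field(default_factory=list)


class ToyLeafVisitor:
    """Collects leaf tokens; a visit_<type> hook may return False to stop descent."""

    def __init__(self):
        self.tokens = []

    def node_handler(self, node):
        # Record the node's full source text as one token.
        self.tokens.append(node.text)

    def walk(self, node):
        hook = getattr(self, "visit_" + node.type, None)
        if hook is not None and hook(node) is False:
            return  # the hook handled this node; skip its children
        if not node.children:
            self.node_handler(node)  # plain leaf: emit it
            return
        for child in node.children:
            self.walk(child)


class ToyCLeafVisitor(ToyLeafVisitor):
    def visit_string_literal(self, node):
        # Emit the whole literal as a single token instead of its leaves.
        self.node_handler(node)
        return False


def tokenize_toy(root, visitor_cls):
    visitor = visitor_cls()
    visitor.walk(root)
    return visitor.tokens


# Assumed tree for: printf("hello world")
# The string content is not a leaf of its own; only the quotes are.
string_literal = Node(
    "string_literal",
    '"hello world"',
    [Node("quote", '"'), Node("quote", '"')],
)
call = Node(
    "call_expression",
    'printf("hello world")',
    [
        Node("identifier", "printf"),
        Node("lparen", "("),
        string_literal,
        Node("rparen", ")"),
    ],
)

print(tokenize_toy(call, ToyLeafVisitor))   # only the quote leaves survive
print(tokenize_toy(call, ToyCLeafVisitor))  # the literal is kept whole
```

In this toy model the default visitor only ever sees the two quote leaves, while the subclass emits the literal's full text and stops before descending into it.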

Jul 20 '22 03:07 5c4lar

Thank you for this hint!

I will add more custom visitors for the supported languages in the next release. Until then, you can use custom visitors to parse your code. For example, you could use your C visitor as follows:

ctok.tokenize(sample, lang = "cpp", visitors=[CLeafVisitor])

Jul 20 '22 07:07 cedricrupb
