code_tokenize
String literals lost during tokenization of C/C++ code
For the following code:
import code_tokenize as ctok
sample = """
#include <stdio.h>
int main() {
    printf("hello world");
}
"""
ctok.tokenize(sample, lang="cpp")
Output:
[#include, <stdio.h>, , int, main, (, ), {, printf, (, ", ", ), ;, }]
But parsing string literals works fine for Java and Python code. How should I fix this problem?
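Using a custom visitor like this seems to fix the problem:
class CLeafVisitor(ctok.tokenizer.LeafVisitor):
    def visit_string_literal(self, node):
        # Emit the whole string literal as a single token
        self.node_handler(node)
        # Returning False keeps the visitor from descending into the literal's children
        return False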
Thank you for this hint!
I will add more custom visitors for the supported languages in the next release. Until then, you can use custom visitors to parse your code. For example, you could use your C visitor as follows:
ctok.tokenize(sample, lang="cpp", visitors=[CLeafVisitor])
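For anyone wondering why the visitor above helps, here is a toy model of what is probably going on. It is only a sketch and does not use code_tokenize or tree-sitter: `Node`, `ToyLeafVisitor`, `ToyCLeafVisitor` and the node type names are hypothetical stand-ins. It also assumes that, in the C/C++ grammar, a string literal node has only its two quote characters as children, so a plain leaf traversal never sees the text between them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for a syntax-tree node (hypothetical, not the tree-sitter API)."""
    type: str
    text: str
    children: List["Node"] = field(default_factory=list)


class ToyLeafVisitor:
    """Collects the text of every leaf node, depth first."""

    def __init__(self):
        self.tokens = []

    def node_handler(self, node):
        self.tokens.append(node.text)

    def walk(self, node):
        # Call an optional visit_<type> hook; stop here if it returns False.
        hook = getattr(self, "visit_" + node.type, None)
        if hook is not None and hook(node) is False:
            return
        if not node.children:
            self.node_handler(node)
        for child in node.children:
            self.walk(child)


class ToyCLeafVisitor(ToyLeafVisitor):
    def visit_string_literal(self, node):
        self.node_handler(node)  # emit the whole literal as one token
        return False             # do not descend into the quote children


# Assumption: the literal's only children are the opening and closing quotes.
literal = Node("string_literal", '"hello world"',
               [Node("quote", '"'), Node("quote", '"')])
call = Node("call_expression", 'printf("hello world")', [
    Node("identifier", "printf"),
    Node("lparen", "("),
    literal,
    Node("rparen", ")"),
])

plain = ToyLeafVisitor()
plain.walk(call)
fixed = ToyCLeafVisitor()
fixed.walk(call)

print(plain.tokens)  # ['printf', '(', '"', '"', ')']
print(fixed.tokens)  # ['printf', '(', '"hello world"', ')']
```

The plain traversal reproduces the `", "` pair from the reported output, while the subclass treats the literal as a leaf and keeps its full text.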