llama_index
[feature request] GithubRepositoryReader seems to ignore all source code files
It seems like only files with supported parsers are parsed
https://github.com/jerryjliu/gpt_index/blob/e5605c171331f29ef3dc00cc21d2149eaed0af05/gpt_index/readers/github_readers/github_repository_reader.py#L322
It would be great to be able to index source code too
Hi @elyase! Thanks for the feature request.
You can turn the parser off by passing use_parser = False to the constructor of GithubRepositoryReader.
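To make the effect of that flag concrete, here is a rough, simplified model of the behaviour described in this issue. It is not the library's actual code: `PARSERS` and `file_to_text` are hypothetical names made up for illustration, and the skip-when-no-parser rule is assumed from the report above.

```python
import os
from typing import Callable, Dict, Optional

# Hypothetical stand-in for the reader's extension-to-parser table.
PARSERS: Dict[str, Callable[[bytes], str]] = {
    ".pdf": lambda raw: "<text extracted from PDF>",
}


def file_to_text(path: str, raw: bytes, use_parser: bool) -> Optional[str]:
    """Return the text to index for one file, or None if the file is skipped."""
    suffix = os.path.splitext(path)[1].lower()
    if use_parser:
        parser = PARSERS.get(suffix)
        # Assumed behaviour: with parsing on, a file whose extension has no
        # registered parser (e.g. .py) is skipped.
        return parser(raw) if parser else None
    # With parsing off, every file is read as plain UTF-8 text.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Binary content that is not valid UTF-8 cannot be indexed as text.
        return None


source = b"print('hi')\n"
print(file_to_text("src/main.py", source, use_parser=True))
print(file_to_text("src/main.py", source, use_parser=False))
```

In this model the same `.py` file is dropped when `use_parser` is on and kept as plain text when it is off, which is why switching the flag off lets source code through.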
As for a source code parser, I'm not sure what it would actually do. The parser for PNG files tries to extract words from the image, and the parser for audio converts the signal into natural-language text. In both cases the parser turns non-text content into text. Source code is already text, so it can be indexed as-is. That said, a parser for a specific programming language could be useful for providing more context about code in that language.
Do you have anything in mind for a source code file parser?
Passing use_parser = False, as described in the comment above, will enable all files to be parsed.
Going to close for now :)