[feature request] GithubRepositoryReader seems to ignore all source code files

Open elyase opened this issue 1 year ago • 1 comments

It seems like only files with supported parsers are parsed:

https://github.com/jerryjliu/gpt_index/blob/e5605c171331f29ef3dc00cc21d2149eaed0af05/gpt_index/readers/github_readers/github_repository_reader.py#L322

It would be great to be able to index source code too

elyase avatar Mar 01 '23 16:03 elyase

Hi @elyase! Thanks for the feature request.

You can turn the parser off by passing use_parser=False to the constructor of GithubRepositoryReader. As for a source code parser, I can't think of what it would need to do. The parser for PNG files tries to extract words from the image, and the parser for audio converts machine-readable signals into natural language. Compared with those examples, source code already feels like natural language to me. That said, I think a parser for a specific programming language could be beneficial by providing more context for that language.
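For illustration, here is a rough sketch of the use_parser=False suggestion. Only the use_parser flag and the GithubRepositoryReader class name come from this thread; the import path is inferred from the file linked in the issue, and the other constructor arguments (owner, repo, github_token), the GITHUB_TOKEN environment variable, and the load_data(branch=...) call are assumptions that may not match your installed version.

```python
import os


def reader_kwargs(owner: str, repo: str) -> dict:
    """Collect constructor arguments for GithubRepositoryReader.

    use_parser=False is the switch described in this thread. The other
    argument names are assumptions about the constructor.
    """
    return {
        "owner": owner,
        "repo": repo,
        # Assumed: the token is read from a GITHUB_TOKEN environment variable.
        "github_token": os.environ.get("GITHUB_TOKEN"),
        # Turn off the file-type parsers.
        "use_parser": False,
    }


def load_repo(owner: str, repo: str, branch: str = "main"):
    """Build the reader and load documents (needs network access and a token)."""
    # Imported lazily so this sketch can be loaded without the library installed.
    # The module path is inferred from the URL in the issue and may have moved.
    from gpt_index.readers.github_readers.github_repository_reader import (
        GithubRepositoryReader,
    )

    reader = GithubRepositoryReader(**reader_kwargs(owner, repo))
    # Assumed signature: load_data accepts a branch name.
    return reader.load_data(branch=branch)
```

Calling load_repo("jerryjliu", "gpt_index") would then be the usage, assuming the names above hold for your version of the library.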

Do you have anything in mind for a source code file parser?

ahmetkca avatar Mar 06 '23 21:03 ahmetkca

Passing use_parser=False, as described in the comment above, will enable all files to be parsed.

Going to close for now :)

logan-markewich avatar Jul 21 '23 22:07 logan-markewich