DocsGPT
Code parsing
Make sure we can ingest code well
JavaScript, Java, Python
How are you planning to do this? Would be happy to try and help out!
I think this will require some tricky prompt engineering and maybe an alternative to semantic search. But I also think results might improve if we do some extra prep work.
My first suggestion would be to ingest code by just converting everything to txt and feeding it into the vector store. We can ask it some questions, try to find some issues, and use that as a baseline. From there, whenever we make improvements, we will have something to compare against.
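A rough sketch of that baseline step, in case it helps: walk a repo, read every code file as plain text, and cut it into fixed-size chunks ready for whatever vector store is in use. The function name `code_to_text_docs`, the extension list, and the chunk size are all assumptions made for illustration, not something that exists in DocsGPT.

```python
from pathlib import Path

# Assumption: extensions chosen to match the languages listed in this issue.
CODE_EXTENSIONS = {".py", ".js", ".java"}


def code_to_text_docs(root, chunk_chars=1500):
    """Return {"source", "text"} dicts for every code file found under root.

    Hypothetical helper: files are treated as plain text and split into
    fixed-size character chunks, with no awareness of code structure.
    """
    docs = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in CODE_EXTENSIONS:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Naive chunking: slice every chunk_chars characters.
        for start in range(0, len(text), chunk_chars):
            docs.append({"source": str(path), "text": text[start:start + chunk_chars]})
    return docs
```

The chunks deliberately ignore function boundaries, which is what makes this a useful baseline to compare smarter parsing against.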
OpenAI has a nice notebook on parsing Python codebases: https://github.com/openai/openai-cookbook/blob/main/examples/Code_search.ipynb They parse the code, index an embedding for every function, and then retrieve the most relevant functions per query using cosine similarity.
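To make the retrieval step concrete, here is a minimal sketch of ranking functions by cosine similarity. It is not the notebook's code: the embeddings are toy three-dimensional vectors instead of real model output, and `top_functions` and the function names in `toy_index` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything.
    return dot / (norm_a * norm_b)


def top_functions(query_embedding, function_embeddings, k=3):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [
        (name, cosine_similarity(query_embedding, vec))
        for name, vec in function_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy stand-ins for real embeddings (assumption: real ones come from a model).
toy_index = {
    "parse_file": [1.0, 0.0, 0.0],
    "load_config": [0.0, 1.0, 0.0],
    "parse_args": [0.9, 0.1, 0.0],
}
best = top_functions([1.0, 0.0, 0.0], toy_index, k=2)
```

With real embeddings the query vector would come from embedding the user's question with the same model used for the functions.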
The code parsing module would require a new file like scripts/ingest_rst.py, right?
Yeah, retrieving relevant functions is good, but I decided to go a slightly different route: basically take each function and summarise it.
I have some work done, and the results on undocumented, unpopular libraries are good. Yeah, I think we might need an alternative to how we search for relevant functions, but let's see. I have a few more ideas to try and compare before I merge.
I made a scripts/code_docs_gen.py that you can try; it's on the code-ingestion branch: https://github.com/arc53/DocsGPT/tree/code-ingestion
Nice! Using ast to parse is probably better. Makes sense, generating documentation should make search work better.
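For reference, a small sketch of what pulling functions out with the standard-library `ast` module can look like. This is not taken from scripts/code_docs_gen.py; `extract_functions` and the fields it returns are assumptions for illustration, and `ast.get_source_segment` needs Python 3.8 or newer.

```python
import ast


def extract_functions(source):
    """Return name, line number, docstring and code for each function in source.

    Hypothetical helper: also picks up methods and nested functions,
    because ast.walk visits every node in the tree.
    """
    tree = ast.parse(source)
    functions = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append({
                "name": node.name,
                "lineno": node.lineno,
                "docstring": ast.get_docstring(node),  # None if undocumented
                "code": ast.get_source_segment(source, node),
            })
    # ast.walk is breadth-first, so sort to get back file order.
    functions.sort(key=lambda f: f["lineno"])
    return functions
```

Each `code` string could then be sent to a summarisation prompt, and functions whose `docstring` is `None` are the ones that benefit most from generated documentation.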
It could be useful to have another intermediate prompt layer that generates a class summary from the function summaries. Using the program's control flow graph to generate documentation could also help.
Good thinking, we definitely need something for a class summary, but classes sometimes have too many functions, so it's very hard to ingest everything (maybe some summarisation will help).
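One possible way to handle classes with too many functions is to summarise in layers: condense the function summaries in batches, then summarise the condensed results. The sketch below assumes a hypothetical `summarise` callable (any function that takes a prompt string and returns text, for example a thin wrapper around an LLM call); the prompt wording and `batch_size` are placeholders, not tested prompts.

```python
def summarise_class(class_name, function_summaries, summarise, batch_size=10):
    """Build a class summary from per-function summaries, batching if needed.

    summarise is a hypothetical callable(str) -> str supplied by the caller.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each pass shrinks the list")
    if not function_summaries:
        return summarise(f"Class {class_name} has no methods.")

    level = list(function_summaries)
    # Keep condensing until everything fits into a single prompt.
    while len(level) > batch_size:
        level = [
            summarise("Condense these method summaries:\n" + "\n".join(level[i:i + batch_size]))
            for i in range(0, len(level), batch_size)
        ]
    prompt = (
        f"Summarise the class {class_name} from its method summaries:\n"
        + "\n".join(level)
    )
    return summarise(prompt)
```

The trade-off is extra model calls per class in exchange for never exceeding the prompt size, and some detail is lost at every condensing pass.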
In terms of graphs, I was thinking a lot about UML too. I think it would be an extremely useful bit of context, but I just don't know how to represent it in text well enough. I do think maybe we should save them and use them to deliver more context alongside similarity search (this may be the key).
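As one very simple way to put graph information into text, a call graph can be flattened into one sentence per edge and stored next to the function summaries. This is only a sketch of the idea, not a UML or control-flow representation: `call_edges` and `edges_as_text` are hypothetical names, and callees are matched by bare name only, so `obj.c()` and `c()` look the same.

```python
import ast


def call_edges(source):
    """Return sorted (caller, callee) name pairs found in Python source.

    Assumption: name-based matching only, no type or import resolution.
    Calls inside a nested function are attributed to both the inner
    and the outer function.
    """
    tree = ast.parse(source)
    edges = set()
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name):
                    edges.add((func.name, node.func.id))
                elif isinstance(node.func, ast.Attribute):
                    edges.add((func.name, node.func.attr))
    return sorted(edges)


def edges_as_text(edges):
    """Render edges as plain sentences that can sit in a prompt or index."""
    return "\n".join(f"{caller} calls {callee}" for caller, callee in edges)
```

Lines like these could be fetched for whichever functions similarity search returns and appended to the prompt as extra context.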
#129 First version is pushed here