Code parsing

Open dartpain opened this issue 2 years ago • 6 comments

Make sure we can ingest code well

Javascript Java Python

dartpain avatar Feb 05 '23 13:02 dartpain

How are you planning to do this? Would be happy to try and help out!

sidthekidder avatar Feb 07 '23 10:02 sidthekidder

I think this requires some tricky prompt engineering and maybe an alternative to semantic search. But I also think results might improve if we do some extra prep work.

My first suggestion would be to ingest code by just converting everything to txt and feeding it into the vector store. We can ask it some questions, try to find some issues, and use that as a baseline. From then on, when we make improvements, we will have something to compare against.
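A minimal sketch of that baseline step, assuming the code lives in a local directory and that fixed-size character chunks are good enough for a first comparison. The function name `collect_code_as_text`, the extension set, and the chunk size are all hypothetical and not taken from the DocsGPT scripts; feeding the chunks into the vector store is left out.

```python
from pathlib import Path

# Assumed extensions for the three languages named in this issue.
CODE_EXTENSIONS = {".py", ".js", ".java"}


def collect_code_as_text(root, chunk_size=1000):
    """Yield (relative_path, chunk) pairs, treating every source file as plain text."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in CODE_EXTENSIONS:
            text = path.read_text(encoding="utf-8", errors="ignore")
            # Naive fixed-size chunking: no awareness of functions or classes.
            for start in range(0, len(text), chunk_size):
                yield str(path.relative_to(root)), text[start:start + chunk_size]
```

Because the chunks ignore function boundaries, this is only useful as the "something to compare to" mentioned above.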

dartpain avatar Feb 07 '23 10:02 dartpain

OpenAI has a nice notebook on parsing Python codebases: https://github.com/openai/openai-cookbook/blob/main/examples/Code_search.ipynb They parse the code, index an embedding for every function, and then retrieve the most relevant functions per query using cosine similarity.
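To make the ranking step concrete, here is a rough sketch of cosine-similarity retrieval over per-function vectors. It is not the notebook's code: the three-dimensional vectors in `toy_index` are made up for illustration, and in practice both the function vectors and the query vector would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_functions(query_vec, function_vecs, k=3):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in function_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings, standing in for real model output.
toy_index = {
    "parse_file": [0.9, 0.1, 0.0],
    "send_email": [0.0, 0.2, 0.9],
    "tokenize": [0.7, 0.3, 0.1],
}
```

Calling `top_functions([1.0, 0.0, 0.0], toy_index, k=2)` ranks `parse_file` first and `tokenize` second.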

The code parsing module would require a new file like scripts/ingest_rst.py, right?

sidthekidder avatar Feb 08 '23 19:02 sidthekidder

Yeah, relevant functions is good, but I decided to go a slightly different route: basically take each function and summarise it.

I have some work done, and results on undocumented, unpopular libraries are good. Yeah, I think we might need an alternative to how we search for relevant functions, but let's see. I have a few more ideas to try and compare before I merge.

I made a scripts/code_docs_gen.py that you can try; it's on the code-ingestion branch: https://github.com/arc53/DocsGPT/tree/code-ingestion

dartpain avatar Feb 08 '23 19:02 dartpain

Nice! Using ast to parse is probably better. Makes sense, generating documentation would help in searching better.

It could be useful to have another intermediate prompt layer that generates a class summary from the function summaries. Using a program's control flow graph to generate documentation could also help.
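A sketch of that intermediate layer, under the assumption that there is some callable `summarise(prompt)` that wraps the language model and returns a string. That callable, the prompt wording, and the name `summarise_class` are hypothetical.

```python
def summarise_class(class_name, method_sources, summarise):
    """Summarise each method first, then summarise the class from those summaries.

    method_sources maps method name to source text; summarise is any callable
    that takes a prompt string and returns a summary string (assumed here).
    """
    method_summaries = {
        name: summarise("Summarise this function:\n" + source)
        for name, source in method_sources.items()
    }
    bullet_list = "\n".join(
        f"- {name}: {summary}" for name, summary in method_summaries.items()
    )
    class_prompt = (
        f"Summarise the class {class_name} given its method summaries:\n"
        f"{bullet_list}"
    )
    return {"methods": method_summaries, "class": summarise(class_prompt)}
```

The class-level prompt only sees the short method summaries, not the full source, which keeps it small even when a class has many methods.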

sidthekidder avatar Feb 08 '23 20:02 sidthekidder

Good thinking, we definitely need something for a class summary, but classes sometimes have too many functions, and it's just very hard to ingest everything (maybe some summarisation will help).

In terms of graphs, and UML too, I was thinking a lot about it. I think it would be an extremely useful bit of context, but I just don't know how to represent it in text well enough. I do think maybe we should save them and use them to deliver more context alongside similarity search (this may be the key).
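One possible text representation, offered only as a sketch: extract a simple call graph with `ast` and store each edge as a plain sentence that can be saved next to the function summaries. This swaps in a call graph rather than UML or a control flow graph, only handles calls by bare name (not `obj.method()`), and attributes calls inside nested functions to the outer function as well. The helper names are hypothetical.

```python
import ast


def call_edges(source):
    """Return sorted (caller, callee) pairs for calls made by simple name."""
    tree = ast.parse(source)
    edges = set()
    for func in ast.walk(tree):
        if isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    edges.add((func.name, node.func.id))
    return sorted(edges)


def edges_as_text(edges):
    """Render edges as one 'caller calls callee' line each."""
    return "\n".join(f"{caller} calls {callee}" for caller, callee in edges)
```

The resulting lines could be attached as extra context to whatever the similarity search returns for a given function.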

dartpain avatar Feb 09 '23 11:02 dartpain

The first version is pushed here: #129

dartpain avatar Feb 25 '23 19:02 dartpain