feat: add optional code-base summarization via LLM
Sometimes, to understand what exactly a codebase does, we need to summarize the whole thing, so I'd like to add a summarization feature. Shall I work on this feature, using a Meta or Gemini model?
This is something I have in mind, but I'm not convinced yet that we should start using LLMs in gitingest. Currently the extraction logic doesn't involve any LLM, and that has a lot of benefits; I'd like it to stay that way for now. We already have a lot we can improve in "classic" code before we start integrating LLMs.
That being said, if you want to start a PoC of this, I'm indeed very interested to see how it turns out.
Sure, thanks!
I agree that summarisation should be done properly with other tools, which will likely require a full RAG/vector store setup due to the size of the context required. Unless you're using Google's 1M+ context window, it's not a simple task to just pull everything in here.
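For concreteness, here's a minimal sketch of what that RAG-style indexing could look like, assuming `sentence-transformers` and a simple in-memory cosine-similarity search; none of this exists in gitingest today:

```python
# Hypothetical sketch of a RAG-style index over a repo, not gitingest code.
# Assumes `pip install sentence-transformers numpy`.
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def index_repo(repo_root: str, chunk_size: int = 2000):
    """Split each source file into fixed-size chunks and embed them."""
    chunks, sources = [], []
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for i in range(0, len(text), chunk_size):
            chunks.append(text[i : i + chunk_size])
            sources.append(str(path))
    return chunks, sources, np.asarray(model.encode(chunks))

def top_k(query: str, chunks, sources, embeddings, k: int = 5):
    """Return the k chunks closest to the query by cosine similarity."""
    q = model.encode([query])[0]
    scores = embeddings @ q / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q) + 1e-9
    )
    return [(sources[i], chunks[i]) for i in np.argsort(scores)[::-1][:k]]
```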
You both have a valid point here: in the long term, LLMs and vectorisation techniques will be mandatory to achieve the best summary possible.
So let me rephrase what I said earlier: This step will eventually come for gitingest, but I want to stay focused for now on improving the simple "declarative code" ingestion.
The idea behind this is:
- Less overhead, less complexity: the project is still young, and any added complexity (dependencies) should be carefully considered. Right now it's very easy to contribute, but even a simple local Ollama running on CPU would make it harder for some people to onboard the codebase.
- Performance: a summarization step using LLMs or vectorisation would come with a tradeoff in speed, and gitingest is focused on bringing a smooth user experience, so that work will be better approached once we have a proper "profiling & optimisation" workflow going on.
- There are lower-hanging fruits to pick for now: I think we can already push the quality of the digest with simple ingestion logic. We know what codebases look like on average for popular languages, so it's certainly possible to improve based on known patterns before having to rely on models to make finer-grained choices (see the toy sketch after this list).
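To illustrate the kind of "known patterns" meant here, a toy heuristic ranking could look like the following; the priority table is invented for the example, not gitingest's actual logic:

```python
# Toy illustration of pattern-based ranking: pure heuristics, no LLM.
# The priority values are made up for this example.
from pathlib import Path

PRIORITY = {
    "README.md": 0,        # docs usually explain the repo best
    "pyproject.toml": 1,   # project metadata and entry points
    "setup.py": 1,
    "__init__.py": 2,      # package structure
}

def digest_order(repo_root: str) -> list[Path]:
    """Return files ordered by how useful they typically are in a digest."""
    files = [p for p in Path(repo_root).rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: PRIORITY.get(p.name, 9))
```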
In the meantime, feel free to either:
- draft a PoC around this idea; maybe I need to change my mind
- start gathering resources or ideas that could help us once we start working on this milestone
I totally agree with you. Before using a local LLM, we need to consider the perspectives of all types of users. Let's work on the PoC first; then it'll be easier to decide on next steps.
A few suggestions from my end:
- Make this feature optional; local LLMs or Gemini shouldn't have to be a dependency.
- We can use API-based models by letting users supply their own API keys (a small sketch of this follows below).
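A minimal sketch of how both suggestions could combine; the env-var name, the `google-genai` dependency, and the whole function are assumptions for illustration, not existing gitingest code:

```python
# Sketch of an opt-in summarizer: summarization only activates when the user
# supplies their own key, and the LLM client is imported lazily so it never
# becomes a hard dependency.
import os

def maybe_summarize(digest: str) -> str | None:
    api_key = os.environ.get("GEMINI_API_KEY")  # assumed variable name
    if not api_key:
        return None  # no key supplied -> gitingest behaves exactly as today

    # Lazy import keeps the LLM client an optional extra.
    from google import genai  # assumes `pip install google-genai`

    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=f"Summarize this codebase digest:\n\n{digest}",
    )
    return response.text
```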
Very good point. Making it optional is a good approach to this transition, with the option to use API models as well.
Hey, I'm glad this post is so engaging! I'm currently working on a PoC for this. As suggested by @argishh, letting users bring their own API key is very feasible: the target audience is developers, so they likely know how to get an API key, and alternatively we can write instructions for doing so.

However, large repositories can exhaust 22K tokens or even more, which is why Gemini's roughly 1M tokens per minute is more suitable than Groq or other LLM providers. We can also add other optimizations, like a reducing or bucketing algorithm; let's see how it goes.

Right now I'm busy with ICPC, but I have a prototype ready for small repositories, so feel free to check it out and create issues. Once I'm done with ICPC regionals (January 5th), I'll implement the ideas I have shared here for large repositories. Meanwhile, check this out: https://github.com/Sarahkhan20/GitZen (it works for small repositories for now).
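For the bucketing idea, here's one rough sketch of how it could work (my own guess at what was meant, using a crude ~4-characters-per-token estimate rather than a real tokenizer): pack files into buckets under a token budget, summarize each bucket, then summarize the summaries.

```python
# Hypothetical bucketing scheme: pack files into token-budget buckets.
def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude chars-per-token heuristic, not a tokenizer

def bucket_files(files: dict[str, str], budget: int = 20_000):
    """files maps path -> content; yields lists of paths, one list per bucket."""
    bucket, used = [], 0
    for path, content in files.items():
        cost = approx_tokens(content)
        if bucket and used + cost > budget:
            yield bucket
            bucket, used = [], 0
        bucket.append(path)
        used += cost
    if bucket:
        yield bucket  # flush the final partially-filled bucket
```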
@Sarahkhan20 nice work! I've gone through the code, and so far it looks great. I'll try it out next. I'm also interested in knowing how you're planning to optimize it for larger repos. Don't hesitate to reach out if you need any help ideating or implementing.
I agree with @cyclotruc here.
Adding an LLM summarisation feature would be huge, and trust me, it will probably need a lot of testing and time. Before we move on to adding such features, we should make sure that the current version of gitingest is nice and robust, and that we employ the best coding practices moving forward.
Being a Machine Learning Engineer myself, I cannot help but want to work on such features!
Definitely, @joydeep049. That's why @Sarahkhan20 started working on a separate PoC first. It'll take a while before the PoC is functional for larger repos, which would need summarization the most. So there's no rush as of now.
Hello, I'm not sure whether this feature could also include filtering or related-change extraction to replace a vector DB.
Let's say you have a commit and want to extract/summarize the related context instead of the full repo.
Maybe you need both: the full repo summary and a more complete related context of some folders, classes, or a commit (a sketch of the commit-scoped idea follows below).
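As a rough sketch of that commit-scoped idea, a hypothetical helper could collect only the files a commit touched using plain `git diff --name-only`, with no vector DB needed:

```python
# Hypothetical helper for commit-scoped context (assumes the commit has a
# parent); gathers only the files the commit touched.
import subprocess
from pathlib import Path

def commit_context(repo_root: str, commit: str) -> dict[str, str]:
    """Return {relative path: content} for every file changed in `commit`."""
    changed = subprocess.run(
        ["git", "-C", repo_root, "diff", "--name-only", f"{commit}~1", commit],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    context = {}
    for rel in changed:
        path = Path(repo_root) / rel
        if path.is_file():  # skip files deleted by the commit
            context[rel] = path.read_text(errors="ignore")
    return context
```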
Anyway, I'm thinking out loud, but I have the feeling this LLM analysis tool idea could become huge, and probably a separate project by itself.
Hi there! We haven’t seen activity here for 45 days, so I’m marking this issue as stale. If you’d like to keep it open, please leave a comment within 10 days. Thanks!
Hi there! We haven’t heard anything for 10 days, so I’m closing this issue. Feel free to reopen if you’d like to continue the discussion. Thanks!