Language-Specific File Import Context / Nearest Neighbors to File Context Provider
Validations
- [X] I believe this is a way to improve. I'll try to join the Continue Discord for questions
- [X] I'm not able to find an open issue that requests the same enhancement
Problem
Motivation
So, I don't actually use the RAG that much because I find it's not the most accurate for pulling in files.
But when I manually add context, what I do is very easily automated.
When I'm writing a Go package, my folder has a bunch of related files. Most languages are like this.
So I usually open all the files in a folder, and use @open-files as context, and that works great for debugging.
But this is just a "step-out"/networking kind of problem for gluing context together.
I was writing a bug scanner, and realized that I've got a couple steps I could take for improving contextual analysis:
- Phase one of scanning is to just input an entire file and ask if there are bugs.
- Next is to combine adjacent files in the same folder for context.
- But in a language like Go, the imports are typically GitHub URLs, so I could go one step further and add every imported file as context. I could then run the LLM on each pair (my file, imported file) and cut it down to only contain the used chunk of code.
I could then recursively "step-out" to add context as many times as I want.
In a typical codebase, as we keep stepping out and adding more context, eventually we'd just have the entire codebase in one big prompt.
We can't do that because of the context size limits, but conceptually it's a good approximation.
Have you guys considered building a context provider like this? You could build it as:
- a func that gets the list of imports from the target file (could call the LLM again)
- a func that resolves each import to a document (language specific)
- a func that reduces each document to its used sections (could call the LLM again)
So what you're left with is writing a resolver from each import statement to the code it points at.
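A minimal sketch of what those three funcs could look like in Go (every name and signature here is hypothetical and the bodies are stubbed; it's meant to show the shape of the pipeline, not an actual Continue API):

```go
package importcontext

import "context"

// Import is one import statement found in the target file,
// e.g. "github.com/user/repo/pkg" in Go.
type Import struct {
	Path string
}

// Document is the resolved source of an import.
type Document struct {
	Import  Import
	Content string
}

// LLM abstracts whatever model Continue is configured with
// (hypothetical interface; the real binding would live elsewhere).
type LLM interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

// ListImports extracts import statements from a file's source text.
// For Go this could use go/parser; for other languages an LLM call works too.
func ListImports(ctx context.Context, llm LLM, source string) ([]Import, error) {
	// ... parse, or prompt the model for the import list ...
	return nil, nil
}

// Resolve maps one import to its source document. This is the language-specific
// part: the Go module cache, node_modules, fetching a GitHub URL, etc.
func Resolve(ctx context.Context, imp Import) (Document, error) {
	// ... locate and read the imported code ...
	return Document{Import: imp}, nil
}

// Reduce cuts a resolved document down to only the sections the target
// file actually uses, e.g. by prompting the model with the pair
// (target file, imported file).
func Reduce(ctx context.Context, llm LLM, targetSource string, doc Document) (Document, error) {
	// ... ask the model which chunks of doc.Content targetSource uses ...
	return doc, nil
}
```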
If you had 32k context, you could do a lot with this kind of thing I think.
Kinda like a C++ linker resolving #includes, except the end result goes to an LLM for analysis instead of a compiler.
If you had a classifier that could determine if an import is internal or external, you could also parameterize it as n_recursion_steps, and only step into internal imports.
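Continuing the hypothetical sketch above, the recursion might look something like this, with IsInternal standing in for whatever internal/external classifier gets chosen:

```go
// IsInternal reports whether an import points inside the user's own codebase
// (hypothetical classifier: a module-path prefix check, or another LLM call).
func IsInternal(imp Import) bool {
	// ... e.g. compare imp.Path against the local module path ...
	return false
}

// StepOut recursively gathers reduced context for a file's source text,
// descending only into internal imports, at most nRecursionSteps levels deep.
func StepOut(ctx context.Context, llm LLM, source string, nRecursionSteps int) ([]Document, error) {
	if nRecursionSteps <= 0 {
		return nil, nil
	}
	imports, err := ListImports(ctx, llm, source)
	if err != nil {
		return nil, err
	}
	var out []Document
	for _, imp := range imports {
		doc, err := Resolve(ctx, imp)
		if err != nil {
			continue // skip unresolvable imports instead of failing the whole pass
		}
		reduced, err := Reduce(ctx, llm, source, doc)
		if err != nil {
			continue
		}
		out = append(out, reduced)
		// Only step out into our own code; external imports stop here.
		if IsInternal(imp) {
			deeper, err := StepOut(ctx, llm, doc.Content, nRecursionSteps-1)
			if err == nil {
				out = append(out, deeper...)
			}
		}
	}
	return out, nil
}
```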
All this would take some time to execute, and would probably be expensive on an API. But as a SonarQube-type tool with a good enough 33B-70B model, I could see it being really useful.
Ways to reduce context requirements
You'd probably need a selection/weighting algorithm for additional reduction. Assuming the core libraries are baked into the model itself, that leaves it to look at internal libraries.
There's the "in-between" class of libraries which are open-source packages, but not "core". You'd probably decide that things like moment.js or pandas count as "core" libraries even if they're not part of the language itself, just based on the amount of public code on them that could be used as training data.
So for each language, you'd probably need a list of common libraries the model can be assumed to have proficiency in.
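As a sketch, that list could start as a simple per-language allowlist checked before resolution (the entries below are just illustrative guesses, not a curated list):

```go
// assumedKnown lists, per language, libraries the model is assumed to be
// proficient in from training data, so they're skipped during resolution.
var assumedKnown = map[string]map[string]bool{
	"python":     {"pandas": true, "numpy": true},
	"javascript": {"moment": true, "lodash": true},
	"go":         {"fmt": true, "net/http": true},
}

// SkipResolution reports whether an import should be treated as "core"
// and left out of the gathered context.
func SkipResolution(language, importPath string) bool {
	return assumedKnown[language][importPath]
}
```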
Solution
- Implement the context provider as a separate module, probably in a language like Go.
- Provide a binding for Continue to call the module
- Provide a binding for the module to the model specified by Continue's configuration
In this way the algorithm could be applied from the CLI independently of Continue. A modular approach would also prevent such a large component from becoming tightly coupled to the greater Continue codebase, which would result in an overall increase in complexity.
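To make the CLI idea concrete, here's a rough sketch of the standalone entry point, reusing the hypothetical StepOut from the earlier sketch (the flag names, the NewHTTPLLM binding, and the JSON output shape are all assumptions, not anything Continue defines today):

```go
package main

import (
	"context"
	"encoding/json"
	"flag"
	"fmt"
	"os"
)

func main() {
	file := flag.String("file", "", "target file to gather context for")
	depth := flag.Int("depth", 1, "n_recursion_steps: how many times to step out")
	model := flag.String("model", "", "endpoint of the model from Continue's configuration")
	flag.Parse()

	src, err := os.ReadFile(*file)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	llm := NewHTTPLLM(*model) // hypothetical binding to the configured model
	docs, err := StepOut(context.Background(), llm, string(src), *depth)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Emit the reduced documents as JSON on stdout, so Continue (or any
	// other caller) can splice them into a prompt.
	json.NewEncoder(os.Stdout).Encode(docs)
}
```

A plain flags-in/JSON-out contract like this is what would keep the module usable from the CLI and from Continue alike, without either side depending on the other's internals.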