
Generate (Android,..) project and use it as context in StarCoder (for whole life-cycle help)?

Open ai-bits opened this issue 1 year ago • 5 comments

Seeing a lot of typical test examples like a Python counting function or code completion, I'm wondering if there is something in the works to help at the project level. I tried GPT4All with an Android (Studio) sample app and it gave me some code and XML, but it was far from trivial to figure out where to put the pieces.

Taking the GPT-4 Code Interpreter, which accepts file uploads as context and spits out marvelous analysis and graphics, as a reference point: I'm wondering if or when StarCoder will be able to generate a project according to a prompt and/or further use it as advanced context to help across the whole life-cycle.

Thanks for any insights. G.

ai-bits avatar May 29 '23 10:05 ai-bits

This is what I am currently working on: fine-tuning StarCoder on my own project's code in the hope of getting code snippets from my code base back. I have tried splitting my code into datasets, but so far I have not been able to get any recall of my code snippets.

Language models (LMs) always aim to predict the next token, and I am puzzled about fine-tuning an LM on raw NLP context versus {"prompt", "completion"} pairs. If one fine-tunes on unsupervised NLP context, it is more like continuing the pre-training of the LM, which means a very large dataset is required to shift the weights. On the other hand, fine-tuning StarCoder with a small quantity of high-quality {"prompt", "completion"} pairs involves concatenating strings in prepare_sample_text, text = f"Question: {example[input_column_name]}\n\nAnswer: {example[output_column_name]}", into an NLP context. However, "Question" and "Answer" are not sentinel tokens listed in the StarCoder paper, so how does the supervised approach work?
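To make the supervised case concrete, here is a minimal sketch of the concatenation step described above. The column names and the sample dict are illustrative, not taken from the StarCoder codebase; "Question" and "Answer" are indeed plain text, not special sentinel tokens, so the model simply learns during fine-tuning to continue text that follows "Answer:".

```python
def prepare_sample_text(example,
                        input_column_name="prompt",
                        output_column_name="completion"):
    """Concatenate a prompt/completion pair into one training string.

    "Question"/"Answer" are ordinary text markers, not sentinel tokens:
    the LM still just predicts the next token over the whole string.
    """
    return (f"Question: {example[input_column_name]}\n\n"
            f"Answer: {example[output_column_name]}")

sample = {"prompt": "Write a function that doubles x.",
          "completion": "def double(x): return 2 * x"}
print(prepare_sample_text(sample))
```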

I look forward to hearing from experts who can provide insight on my inquiry. @loubnabnl @lvwerra @lewtun @arjunguha

h-clickshift avatar May 30 '23 02:05 h-clickshift

I'm afraid our goals are different. You're talking about fine-tuning or in-context learning for a model running locally on trade-secret company code. I'm asking about a model that can cope with a programming project's tree structure, content, and tooling, which is very different from local code completion or generating a function in a single-file .py or notebook. Regards, G.

ai-bits avatar May 30 '23 10:05 ai-bits

Tooling I'm not sure about, but for content in a tree structure, have you tried using the <filename> tag in your context? According to the white paper, that is how it was trained on all of the code, so it seems to respond well when given multiple files with the filename tag. I have it generating new files just based on the filename I suggest, and it even imports (js) from other "files" I've included in the context.

yeomanse avatar May 30 '23 13:05 yeomanse

text = f"Question: {example[input_column_name]}\n\nAnswer: {example[output_column_name]}"

For your workflow, you probably just have a single column of data, e.g., example["content"]. If so, you can do this:

text = f"{example['content']}"
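For a sense of what happens after that formatting step: fine-tuning scripts typically pack the single-column strings into fixed-length training chunks. The sketch below is a simplified, character-level stand-in for that packing (the real pipeline works on tokens with a tokenizer); the separator and chunk length are illustrative assumptions.

```python
def pack_examples(rows, chunk_len=32, sep="<|endoftext|>"):
    """Concatenate example contents with a separator, then slice the
    result into fixed-length chunks (character-level for illustration)."""
    buffer = sep.join(example["content"] for example in rows)
    return [buffer[i:i + chunk_len] for i in range(0, len(buffer), chunk_len)]

rows = [
    {"content": "def square(x):\n    return x * x\n"},
    {"content": "def cube(x):\n    return x ** 3\n"},
]
chunks = pack_examples(rows)
```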

arjunguha avatar May 30 '23 14:05 arjunguha

@yeomanse Thanks for the suggestion! It seems I need to read up and dig into taming the beast. I will try it and then return here. G.

ai-bits avatar May 30 '23 17:05 ai-bits