Generate (Android,..) project and use it as context in StarCoder (for whole life-cycle help)?
Seeing a lot of typical test examples like a Python counting function or code completion, I'm wondering if there is something in the works to help at the project level. I tried GPT4All with an Android (Studio) sample app, and it gave me some code and XML, but it was far from trivial to find where to put the pieces.
Taking the GPT-4 Code Interpreter, with its file uploads as context and its marvelous analysis and graphics, as a reference point: I'm wondering if or when StarCoder will be able to generate a project according to a prompt and/or then use that project as advanced context to help across the whole life-cycle.
Thanks for any insights. G.
This is what I am currently working on: attempting to fine-tune StarCoder with my own project's code as context, in hopes of obtaining code snippets from within my code base. I have tried splitting my code into datasets, but so far I have not been able to get any recall of my code snippets.
Language models (LMs) always aim to predict the next token, and I am puzzled about fine-tuning an LM on plain NLP context versus {"prompt", "completion"} pairs. If one fine-tunes on unsupervised NLP context, it is more like continuing the pre-training of the LM, which means a very large dataset is required to adjust the weights. On the other hand, fine-tuning StarCoder on a small quantity of high-quality {"prompt", "completion"} pairs involves concatenating the prompt and completion strings into an NLP context with `prepare_sample_text`:

```python
text = f"Question: {example[input_column_name]}\n\nAnswer: {example[output_column_name]}"
```

However, "Question" and "Answer" are not sentinel tokens listed in the StarCoder paper, so how does the supervised approach work?
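To make the contrast concrete, here is a minimal sketch, assuming the `bigcode/starcoder` tokenizer is available on the Hugging Face Hub and using the column names from the snippet above; it shows that "Question"/"Answer" are tokenized as ordinary subwords rather than as special sentinel ids:

```python
from transformers import AutoTokenizer

# Assumption: the bigcode/starcoder checkpoint (may require accepting
# the model license on the Hugging Face Hub).
tok = AutoTokenizer.from_pretrained("bigcode/starcoder")

def prepare_sample_text(example,
                        input_column_name="prompt",
                        output_column_name="completion"):
    # "Question"/"Answer" are plain text here, not special tokens.
    return f"Question: {example[input_column_name]}\n\nAnswer: {example[output_column_name]}"

sample = {"prompt": "Reverse a string in Python.", "completion": "s[::-1]"}
ids = tok(prepare_sample_text(sample))["input_ids"]
# Sentinel tokens such as <fim_prefix> map to single special ids,
# whereas "Question:" is split into several ordinary subword tokens.
print(tok.convert_ids_to_tokens(ids)[:8])
```

In both setups the training objective is the same next-token prediction; only the formatting of the training text differs.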
Looking forward to hearing from experts who can provide insight on my inquiry. @loubnabnl @lvwerra @lewtun @arjunguha
I'm afraid our goals are different and separate. You're talking about fine-tuning or in-context learning for a model running locally on trade-secret company code. I am asking about a model that can cope with a programming project's tree structure, content, and tooling, which is very different from local code completion or generating a function in a single-file .py or notebook. Regards, G.
Tooling I'm not sure about, but for content in a tree structure, have you tried using the `<filename>` tag in your context? According to the white paper, that is how it was trained on all of the code, so it seems to respond well when given multiple files with the filename tag. I have it generating new files based just on the filename I'm suggesting, and it's even importing (js) from other "files" I've included in the context.
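For reference, a minimal sketch of how such a multi-file prompt might be assembled; the file paths and contents below are made up for illustration, and the full metadata format (e.g., `<reponame>`, `<filename>`) is described in the StarCoder paper:

```python
# Sketch: stitching several files into one StarCoder prompt using the
# <filename> sentinel token. Paths and contents are illustrative only.
files = {
    "src/utils/math.js": "export function add(a, b) {\n  return a + b;\n}\n",
    "src/app.js": "",  # left empty so the model generates this file next
}

prompt = "".join(f"<filename>{path}\n{content}" for path, content in files.items())
# The prompt ends right after "<filename>src/app.js\n", so generation
# continues as the body of that file and can import from src/utils/math.js.
print(prompt)
```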
text = f"Question: {example[input_column_name]}\n\nAnswer: {example[output_column_name]}"
For your workflow, you probably just have a single column of data, e.g., example["content"]
. If so, you can do this:
text = f"example['content']"
@yeomanse Thanks for the suggestion! Seems I need to read up and dig into taming the beast. Will try and then return here. G.