
Code Instructions using data augmentation

GravermanDev opened this issue 2 years ago • 11 comments

The idea is to have a notebook that takes code as input and converts it into a set of instructions on how to write it. This can be achieved using existing models from HF, but data quality is crucial here. This would be very useful for generating more data for training.

Example:
Code: performs raycasting
Instructions:

  1. Set up a pygame window
  2. Write a function to cast a ray
  3. Cast rays from the player
  4. Render the result

This is related to #279.
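
A very rough sketch of what the notebook step could look like, using an off-the-shelf summarization pipeline; the checkpoint name and prompt wording below are placeholders, not a recommendation:

```python
# Rough sketch of the proposed augmentation step: feed a code snippet to an
# existing seq2seq summarization model from the HF Hub and keep the generated
# description as the "instruction" side of a training pair.
from transformers import pipeline

# Placeholder checkpoint; any code-summarization model could be swapped in.
summarizer = pipeline("summarization", model="Salesforce/codet5-base-multi-sum")

code = '''
def cast_ray(origin, angle, walls):
    """Return the distance from origin to the nearest wall along angle."""
    ...
'''

summary = summarizer(code, max_length=48)[0]["summary_text"]
pair = {"instruction": f"Write Python code that {summary}", "response": code}
print(pair)
```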

GravermanDev avatar Jan 02 '23 22:01 GravermanDev

Where would the training data for these instructions come from?

yk avatar Jan 02 '23 22:01 yk

The main idea is to use this dataset, as it contains code <-> comment pairs. With some help from summarization models, this can be used most of the time to figure out what a function does.
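
For reference, a minimal sketch of pulling the pairs out of code_search_net with the datasets library; the field names are taken from the dataset card and are an assumption, and the HF copy may not always be available:

```python
# Sketch: extract code <-> comment pairs from code_search_net, skipping
# functions without documentation. Field names follow the dataset card and
# may differ between mirrors.
from datasets import load_dataset

ds = load_dataset("code_search_net", "python", split="train")

pairs = [
    {"code": ex["func_code_string"], "comment": ex["func_documentation_string"]}
    for ex in ds.select(range(1000))  # small slice for a quick look
    if ex["func_documentation_string"].strip()
]
print(pairs[0])
```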

GravermanDev avatar Jan 03 '23 06:01 GravermanDev

Sounds good. Are you interested in working on it, or should we keep it as a todo?

yk avatar Jan 03 '23 09:01 yk

Unfortunately I can't for the next 2 days, so I'll take it if no one completes it within that time.

GravermanDev avatar Jan 03 '23 09:01 GravermanDev

okay, you can assign me to the issue

GravermanDev avatar Jan 04 '23 19:01 GravermanDev

@GravermanDev - hey hey - I know you are busy. Wanted to see if you are still working on this, or whether we can add another person to help you?

huu4ontocord avatar Jan 22 '23 03:01 huu4ontocord

You can add another person, thanks, and sorry for the inconvenience.

GravermanDev avatar Jan 23 '23 00:01 GravermanDev

Hey @ontocord 👋 Can you assign me to this?

I will try out some different code summarisation models that look promising and see how good and reliable the results are. Got any leads on which ones to look at?

What level of abstraction do you think would be good here as the output? It could be anything from a natural-language rendering of each line of the input code to a single-sentence summarisation. I would think that more granular explanations could be better, because it's similar to step-by-step prompting.

Having both the summarisation (e.g. "This program renders an object with raycasting") followed by the steps taken to make that happen might be even better, so the model can associate abstract concepts with the steps needed to realize them. What do you think?

If the models are good at recognizing what a piece of code does but are not verbose enough and only provide short summaries, they could probably be fine-tuned to provide more granular descriptions without too much data, right? Because to correctly classify that a function does raycasting, for example, you'd need to already understand what sort of steps are needed to do that.
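
For illustration, here is a hypothetical record layout that keeps both levels of abstraction; every field name and the serialization below are made up, not a proposal for the final schema:

```python
# Hypothetical record combining a one-sentence summary with granular steps,
# so the model can tie the abstract description to the concrete plan.
example = {
    "summary": "This program renders a scene with raycasting.",
    "steps": [
        "Set up a pygame window.",
        "Write a function that casts a single ray and returns the hit distance.",
        "Cast one ray per screen column from the player's position.",
        "Draw a vertical wall slice whose height shrinks with distance.",
    ],
    "code": "...",  # the original function body would go here
}

# One possible serialization into a single instruction/response pair:
instruction = example["summary"] + "\n" + "\n".join(
    f"{i + 1}. {step}" for i, step in enumerate(example["steps"])
)
record = {"instruction": instruction, "response": example["code"]}
```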

mikastamm avatar Feb 08 '23 21:02 mikastamm

@GravermanDev @ontocord Hello, I'm interested in contributing to this task. In my understanding, you want to create synthetic data. What is the base model for generating synthetic data?

xrsrke avatar Mar 13 '23 09:03 xrsrke

@xrsrke Hello, hit me up on Discord Graverman#0804. I trained a model to do summarization of code (called t5-code-summary on Huggingface) and am working on a dataset of code-related instructions with the expected outputs. Examples include: "write code based on instructions", "write instructions based on code", "write a docstring for this code", "rewrite this code".

I can give you access to the dataset and you can contribute. This dataset will be added directly to the main dataset.
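
Roughly, these are the kinds of pairs the dataset targets. A sketch below; the Hugging Face model ID is an assumption (adjust it to the published checkpoint), and the prompt templates are only illustrative:

```python
# Sketch of the task formats listed above, built around a code-summarization
# model. The model ID is an assumption; replace it with the real checkpoint.
from transformers import pipeline

summarize = pipeline("summarization", model="Graverman/t5-code-summary")  # assumed ID

code = "def add(a, b):\n    return a + b"
summary = summarize(code, max_length=32)[0]["summary_text"]

samples = [
    {"instruction": f"Write code based on these instructions: {summary}",
     "response": code},
    {"instruction": f"Write instructions based on this code:\n{code}",
     "response": summary},
    {"instruction": f"Write a docstring for this code:\n{code}",
     "response": f'"""{summary}"""'},
]
```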

GravermanDev avatar Mar 13 '23 09:03 GravermanDev

Hello everyone!

I recently had some free time, so I decided to give this task a go. At the moment I have generated an instruction dataset in (INSTRUCTION/RESPONSE) format for Python. I have used the code_search_net dataset as discussed previously, but from Kaggle, as at the moment it is not available on HuggingFace.

The dataset contains around 450,000 annotated Python functions. I have split the dataset into two blocks: in the first, the task starts from the summary and builds an instruction whose response is the code; in the second, the expected response is a description of the function or a docstring. For the second block, for obvious reasons, I have removed the docstring from the function in 90% of the samples. To generate the summaries I have used this model.

The dataset can be found here and the notebooks to produce it here. The quality is pretty high and the method can easily be extended to many other languages. Let me know if it's good or if I should change something.
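
For clarity, a rough reconstruction of the two blocks described above; the field names, prompt wording, and the docstring-stripping regex are assumptions rather than the actual notebook code:

```python
# Rough reconstruction of the two blocks described above (illustrative only).
import random
import re

def strip_docstring(code: str) -> str:
    # Naive removal of the first triple-quoted string in the function body.
    return re.sub(r'("""|\'\'\')(?:.|\n)*?\1', "", code, count=1)

def make_samples(func_code: str, summary: str):
    # Block 1: instruction built from the summary, code as the response.
    yield {"INSTRUCTION": f"Write a Python function that {summary}",
           "RESPONSE": func_code}
    # Block 2: code as input, description/docstring as the response;
    # the docstring is stripped from ~90% of the inputs.
    code_in = strip_docstring(func_code) if random.random() < 0.9 else func_code
    yield {"INSTRUCTION": f"Describe what this function does:\n{code_in}",
           "RESPONSE": summary}
```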

Thanks!

Nan-Do avatar May 08 '23 14:05 Nan-Do