Open-Assistant icon indicating copy to clipboard operation
Open-Assistant copied to clipboard

Making code instructions for large codebases

Open GravermanDev opened this issue 2 years ago • 5 comments

My idea is to break down huge, and I mean massive github codebases (like Pytorch or Tensorflow) into step by step instructions. This is more on the ambitious side and will probably need a small team but can totally be worth it.

The first step would be to break down simple code into instructions, example: Q: Can you write raycasting in python using pygame Steps:

  1. Initialize pygame window
  2. set up a ray cast_ray function
  3. set up ray_cast function to cast multiple rays.

The next step would be to take the data from massive codebases and analyze each commit using this method.

this probably needs discussion as it's not an easy task but can be totally be worth it

GravermanDev avatar Jan 02 '23 14:01 GravermanDev

We could break some working code to have questions like. This code has [problem_x] output the corrected code, or this code has problem, which is it? We could do it in a smart way with a taxonomy of bugs that we introduce and have increasingly bugged versions like: "This code has [typing error, undefined input,class methods not referenced to self, memory leak]" and ask assistant to fix or classify errors etc.

The same we could do with codebases with very good unit testing we can get pairs code - testcode and train on that mapping.

furlat avatar Jan 02 '23 14:01 furlat

We would need to start with some web scrapes of websites that have description of a code snippet and the code, plus the final code. We can use this as both instruction->code, instruction->code=>final code. As well as break down final code to instruction, snippet pairs. https://huggingface.co/datasets/code_search_net has comments code paris (comments would need to be cleaned up). But i don't believe it parses the internal comments of the code pieces inside the function itself.

huu4ontocord avatar Jan 02 '23 14:01 huu4ontocord

But i don't believe it parses the internal comments of the code pieces inside the function itself.

It only parses the docstring. Also they had the limitation, that docstring is not always maps to the real code inside function, because someone after updating the function, forget to modify the docstring.

doroshroman avatar Jan 02 '23 20:01 doroshroman

This is a bit too huge to be a single issue. If people want to work on this, please pick a small, defined sub-problem and make a separate issue (maybe link it to this one)

yk avatar Jan 02 '23 21:01 yk

Opened https://github.com/LAION-AI/Open-Assistant/issues/317 for specifical creation of datasets of broken python code + bug type, correct_code

furlat avatar Jan 03 '23 09:01 furlat

Closing old data issue that has not been completed by now.

andreaskoepf avatar Jun 14 '23 08:06 andreaskoepf