Open-Assistant
Making code instructions for large codebases
My idea is to break down huge, and I mean massive, GitHub codebases (like PyTorch or TensorFlow) into step-by-step instructions. This is on the ambitious side and will probably need a small team, but it could well be worth it.
The first step would be to break down simple code into instructions. Example:
Q: Can you write raycasting in Python using pygame?
Steps:
- Initialize the pygame window
- Set up a cast_ray function
- Set up a ray_cast function to cast multiple rays
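One way such a question/steps pair could be stored is sketched below. The `InstructionExample` class, its field names, and the JSON Lines output are assumptions made for illustration, not a format the project has agreed on.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InstructionExample:
    """Hypothetical record: one question plus its ordered steps."""
    question: str
    steps: list[str] = field(default_factory=list)


def to_jsonl_line(example: InstructionExample) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(example))


raycasting = InstructionExample(
    question="Can you write raycasting in Python using pygame?",
    steps=[
        "Initialize the pygame window",
        "Set up a cast_ray function",
        "Set up a ray_cast function to cast multiple rays",
    ],
)

print(to_jsonl_line(raycasting))
```

Keeping the steps as an ordered list would make it easy to attach a code snippet to each step later on.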
The next step would be to take the data from massive codebases and analyze each commit using this method.
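As a rough sketch of what analyzing a commit could mean, the snippet below turns a commit message and its unified diff into one record. The function names and the record layout are hypothetical, and obtaining the message and diff from a repository is left out.

```python
def split_diff(diff_text: str) -> tuple[list[str], list[str]]:
    """Collect added and removed lines from a unified diff.

    Simplification: any line starting with '+++' or '---' is treated as a
    file header, so a changed line whose content starts with '++' or '--'
    would be skipped.
    """
    added, removed = [], []
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue
        if line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-"):
            removed.append(line[1:])
    return added, removed


def commit_to_example(message: str, diff_text: str) -> dict:
    """Use the commit subject line as the instruction for the change."""
    lines = message.strip().splitlines()
    added, removed = split_diff(diff_text)
    return {
        "instruction": lines[0] if lines else "",
        "removed": removed,
        "added": added,
    }


diff = "--- a/f.py\n+++ b/f.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
example = commit_to_example("Fix initial value of x\n\nLonger body.", diff)
print(example)
```

Commit subjects are often terse, so in practice they would probably need cleaning before they read like instructions.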
This probably needs discussion, as it's not an easy task, but it could well be worth it.
We could break some working code to produce questions like "This code has [problem_x]; output the corrected code" or "This code has a problem; which is it?". We could do it in a smart way, with a taxonomy of bugs that we introduce and increasingly bugged versions, like "This code has [typing error, undefined input, class methods not referenced to self, memory leak]", and ask the assistant to fix or classify the errors.
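A minimal sketch of bug injection, assuming Python 3.9+ (for `ast.unparse`) and covering only two entries of such a taxonomy. The names `BUG_TAXONOMY`, `DropSelf`, `RenameParam`, and `inject_bug` are made up for this illustration.

```python
import ast


class DropSelf(ast.NodeTransformer):
    """Replace the first `self.<attr>` with a bare `<attr>`."""

    def __init__(self):
        self.done = False

    def visit_Attribute(self, node):
        self.generic_visit(node)
        is_self = isinstance(node.value, ast.Name) and node.value.id == "self"
        if not self.done and is_self:
            self.done = True
            return ast.copy_location(ast.Name(id=node.attr, ctx=node.ctx), node)
        return node


class RenameParam(ast.NodeTransformer):
    """Rename one non-self parameter in a signature but not in the body."""

    def __init__(self):
        self.done = False

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        if not self.done:
            for arg in node.args.args:
                if arg.arg != "self":
                    arg.arg += "_renamed"
                    self.done = True
                    break
        return node


# Hypothetical taxonomy: bug label -> transformer that introduces it.
BUG_TAXONOMY = {"missing_self": DropSelf, "undefined_input": RenameParam}


def inject_bug(source: str, bug_type: str) -> dict:
    """Return a (bug_type, broken_code, correct_code) record."""
    transformer = BUG_TAXONOMY[bug_type]()
    broken = ast.unparse(transformer.visit(ast.parse(source)))
    if not transformer.done:
        raise ValueError(f"could not inject {bug_type!r} into this code")
    return {"bug_type": bug_type, "broken_code": broken, "correct_code": source}


SOURCE = """
class Counter:
    def __init__(self, start):
        self.value = start

    def bump(self, step):
        self.value += step
        return self.value
"""

print(inject_bug(SOURCE, "missing_self")["broken_code"])
```

Note that `ast.unparse` drops comments and normalizes formatting, so the broken code differs from the original in more than the bug itself. Unparsing the untouched tree as well would give a fairer pair.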
We could do the same with codebases that have very good unit testing: extract (code, test code) pairs and train on that mapping.
We would need to start with some web scrapes of websites that have a description of a code snippet and the snippet itself, plus the final code. We can use this as both instruction->code and instruction->code=>final code, as well as break down final code into (instruction, snippet) pairs. https://huggingface.co/datasets/code_search_net has comment-code pairs (the comments would need to be cleaned up). But I don't believe it parses the internal comments of the code pieces inside the function itself.
> But I don't believe it parses the internal comments of the code pieces inside the function itself.
It only parses the docstring. They also had the limitation that the docstring does not always match the real code inside the function, because someone may update the function and forget to modify the docstring.
This is a bit too big to be a single issue. If people want to work on this, please pick a small, well-defined sub-problem and make a separate issue (maybe link it to this one).
Opened https://github.com/LAION-AI/Open-Assistant/issues/317 specifically for creating datasets of broken Python code + bug type + correct_code.
Closing old data issue that has not been completed by now.