Open-Assistant
Multi-language code competition / challenge dataset
Similar to #330.
Applications like Codewars and LeetCode give you prompts with a use case, sample data, and expected results. They offer this for many languages, let you see working solutions, and track each function's runtime and memory usage.
What would be required:
- a prompt ("code challenge")
- test cases
- code validation
What would be optional to measure:
- performance
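The required pieces listed above (a prompt, test cases, code validation) plus the optional performance measurement could be sketched roughly like this. It is only a minimal sketch, not a settled design: `TestCase`, `Challenge`, and `validate` are hypothetical names, it only handles Python solutions that read stdin and write stdout, and a plain subprocess is not a sandbox, so untrusted code would need real isolation.

```python
import subprocess
import sys
import time
from dataclasses import dataclass, field


@dataclass
class TestCase:
    stdin: str            # input fed to the program
    expected_stdout: str  # output the program must print


@dataclass
class Challenge:
    prompt: str  # the "code challenge" text
    test_cases: list[TestCase] = field(default_factory=list)


def validate(solution_code: str, challenge: Challenge, timeout_s: float = 5.0):
    """Run a Python solution against every test case.

    Returns (all_passed, total_runtime_seconds). The runtime includes
    interpreter start-up, so it is only a coarse performance signal.
    """
    total = 0.0
    for case in challenge.test_cases:
        start = time.perf_counter()
        try:
            # NOTE: assumption for illustration; this is NOT a secure sandbox.
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=case.stdin,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, total + timeout_s
        total += time.perf_counter() - start
        # Compare output ignoring leading/trailing whitespace.
        if result.returncode != 0 or result.stdout.strip() != case.expected_stdout.strip():
            return False, total
    return True, total
```

For example, a challenge with the prompt "Read two integers and print their sum." and the test cases `TestCase("1 2\n", "3")` and `TestCase("10 -4\n", "6")` would accept the solution `"a, b = map(int, input().split())\nprint(a + b)"` and reject `"print(0)"`. Memory usage is left out here because measuring it portably needs more machinery.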
Why is this useful?
Language models are used to create and explain code. By giving the model prompts with many working implementations, we can let the assistant:
- generate high-quality code
- convert one programming language to another
Semi-related to #279.
Hi, are you proposing to actually hold a competition, or to scrape data from a competition?
@extreme4all we will just seed our model from a pretrained language model from codefor this and see the result. Do you have any suggestions for existing public datasets?
> Hi, are you proposing to actually hold a competition, or to scrape data from a competition?
Both. Maintaining our own dataset would probably give the highest quality for the purposes of the project. Initially, questions and solutions can be scraped from competitions, but do they allow us to do this?
> @extreme4all we will just seed our model from a pretrained language model from codefor this and see the result. Do you have any suggestions for existing public datasets?
I don't know of any public dataset. What is codefor?
@theblackcat102 I found this dataset: https://github.com/codereport/LeetCode
This one also looks promising: https://huggingface.co/datasets/bigscience/xP3/tree/main/code
@extreme4all would you want to add this dataset to ours in instruction->answer form? We would need to make sure the code is executable.
@momegas another issue deals with xP3, but no one has been assigned to it. Would you be interested in working on it?
I'm not sure what the outcome of this issue is, though. A well-curated dataset uploaded to a Hugging Face repo?
> @extreme4all would you want to add this dataset to ours in instruction->answer form? We would need to make sure the code is executable.
Indeed, I think it would be valuable to have such a dataset in instruction -> answer form.
And indeed, validating that the code is executable and passes some tests is required.
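As a rough illustration of the instruction -> answer idea, the conversion could look something like the sketch below. The record layout (`prompt`, `solutions`, `language`, `code`), the function name `to_instruction_pairs`, and the instruction wording are all assumptions made up for this example, not the format of any existing dataset. The `passes_tests` callable is a stand-in for whatever validation step ends up being used.

```python
from typing import Callable, Iterable


def to_instruction_pairs(
    problems: Iterable[dict],
    passes_tests: Callable[[dict, str], bool],
) -> list[dict]:
    """Turn problems with candidate solutions into instruction -> answer rows.

    Each problem dict is assumed (hypothetically) to look like:
    {"prompt": str, "solutions": [{"language": str, "code": str}, ...]}
    """
    rows = []
    for problem in problems:
        for solution in problem["solutions"]:
            # Keep only code that the supplied validator accepts.
            if not passes_tests(problem, solution["code"]):
                continue
            rows.append({
                "instruction": (
                    f"Solve the following problem in {solution['language']}:\n\n"
                    f"{problem['prompt']}"
                ),
                "answer": solution["code"],
            })
    return rows
```

Keeping validation behind a callable means the same conversion works whether solutions are checked by running them locally, by a judge service, or not at all during a dry run (`lambda problem, code: True`).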
I have two ideas:
- use code_contests, which was collected by DeepMind from Codeforces/Aizu/AtCoder/CodeChef/HackerEarth and was used to train AlphaCode (Apache 2.0 / CC BY 4.0 license)
- scrape problems from LeetCode (2500+ problems; includes problem descriptions, input/output test cases, and solutions implemented in different programming languages, which would need to be parsed from posts in the discussion sections)
We are now using code_contests and LeetCode data.