Open-Assistant multi-language code competition / challenge dataset

similar to #330

Applications like Codewars & LeetCode give you prompts with a use case, sample data & expected results. They offer this for many languages, let you see the working solutions, and track function runtime & memory usage.

What would be required:

a prompt ("code challenge")
test cases
code validation

What would be optional to measure:

performance
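
To make the pieces above concrete, here is a minimal sketch of what one record and its validation could look like. The names `Case`, `CodeChallenge` and `validate` are hypothetical and not part of any existing Open-Assistant code, and the sample challenge is only an illustration. It runs the solution in-process, so it assumes trusted code.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Case:
    args: tuple      # positional arguments passed to the solution
    expected: Any    # expected return value


@dataclass
class CodeChallenge:
    prompt: str                                   # the "code challenge" text
    tests: List[Case] = field(default_factory=list)


def validate(challenge: CodeChallenge, solution: Callable[..., Any]) -> dict:
    """Run every test case; report how many passed and the total runtime."""
    passed = 0
    start = time.perf_counter()
    for case in challenge.tests:
        try:
            if solution(*case.args) == case.expected:
                passed += 1
        except Exception:
            pass  # a crashing solution simply fails that test case
    elapsed = time.perf_counter() - start  # optional performance measurement
    return {"passed": passed, "total": len(challenge.tests), "seconds": elapsed}


# Hypothetical sample challenge, for illustration only.
two_sum = CodeChallenge(
    prompt="Return the indices of the two numbers in nums that add up to target.",
    tests=[
        Case(args=([2, 7, 11, 15], 9), expected=[0, 1]),
        Case(args=([3, 2, 4], 6), expected=[1, 2]),
    ],
)


def two_sum_solution(nums, target):
    seen = {}  # value -> index of values visited so far
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []


report = validate(two_sum, two_sum_solution)
```

A solution would count as validated when `report["passed"] == report["total"]`. Memory usage is left out here; the standard-library `tracemalloc` module would be one way to add it.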

Why is this useful?

Language models are used to create & explain code. By giving the assistant prompts with many working implementations, we can let it:

generate high-quality code
convert code from one programming language to another

Jan 04 '23 22:01 extreme4all

semi related to #279

Jan 04 '23 22:01 extreme4all

Hi, are you proposing to actually hold a competition, or to scrape data from a competition?

Jan 05 '23 00:01 huu4ontocord

@extreme4all we will just seed our model from a pretrained language model from codefor this and see the result. Do you have any suggestions for existing public datasets?

Jan 05 '23 06:01 theblackcat102

Hi, are you proposing to actually hold a competition, or to scrape data from a competition?

Both. Maintaining our own dataset would probably give the highest quality for the purposes of the project. Initially, questions & solutions can be scraped from competitions, but do they allow us to do this?

@extreme4all we will just seed our model from a pretrained language model from codefor this and see the result. Do you have any suggestions for existing public datasets?

I don't know of any public dataset. What is codefor?

Jan 07 '23 13:01 extreme4all

@theblackcat102 I found this dataset: https://github.com/codereport/LeetCode

Jan 12 '23 13:01 extreme4all

Also, this one looks promising: https://huggingface.co/datasets/bigscience/xP3/tree/main/code

Jan 19 '23 23:01 momegas

@extreme4all would you want to add this dataset to our dataset in instruction->answer form? We would need to make sure the code is executable.

Jan 22 '23 03:01 huu4ontocord

@momegas another issue deals with xP3, but no one has been assigned to it. Would you be interested in working on it?

Jan 22 '23 03:01 huu4ontocord

I'm not sure what the outcome of this issue is, though. A well-curated dataset uploaded to a Hugging Face repo?

Jan 23 '23 20:01 momegas

@extreme4all would you want to add this dataset to our dataset in instruction->answer form? We would need to make sure the code is executable.

Indeed, I think it would be valuable to have such a dataset in instruction -> answer form.

And indeed, validating that the code is executable and passes some tests is required.
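
As a rough sketch of that idea, something like the following could filter solutions before they go into instruction -> answer pairs. It assumes competition-style programs that read stdin and write stdout, and it assumes Python solutions only. The `problem` dict layout and the helper names `runs_and_passes` and `to_instruction_answer` are made up for illustration and are not the schema of any of the datasets mentioned in this thread. It also does no sandboxing, which real scraped code would need.

```python
import subprocess
import sys


def runs_and_passes(source: str, stdin_text: str, expected_stdout: str,
                    timeout: float = 5.0) -> bool:
    """Run `source` as a Python program in a child process and compare stdout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hanging solution as failing
    return (result.returncode == 0
            and result.stdout.strip() == expected_stdout.strip())


def to_instruction_answer(problem: dict) -> list:
    """Keep only solutions that pass every test; emit instruction/answer pairs."""
    pairs = []
    for solution in problem["solutions"]:
        if all(runs_and_passes(solution, t["input"], t["output"])
               for t in problem["tests"]):
            pairs.append({"instruction": problem["description"],
                          "answer": solution})
    return pairs


# Hypothetical problem record, for illustration only.
problem = {
    "description": "Read two integers from standard input and print their sum.",
    "tests": [
        {"input": "1 2\n", "output": "3\n"},
        {"input": "-5 5\n", "output": "0\n"},
    ],
    "solutions": [
        "a, b = map(int, input().split())\nprint(a + b)",  # correct
        "a, b = map(int, input().split())\nprint(a - b)",  # wrong answer
        "print(undefined_name)",                            # not executable
    ],
}

pairs = to_instruction_answer(problem)
```

Only the first solution survives the filter here; the wrong-answer and non-executable ones are dropped. Supporting other languages would mean swapping the `[sys.executable, "-c", source]` command for the matching compiler or interpreter.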

Jan 24 '23 18:01 extreme4all

I have two ideas:

use code_contests, which was collected from Codeforces/Aizu/AtCoder/CodeChef/HackerEarth by DeepMind and was used for pretraining AlphaCode (Apache 2.0 / CC BY 4.0 license)
scrape problems from LeetCode (2500+ problems; includes problem descriptions, input/output test cases, and solutions implemented in different programming languages, which need to be parsed from posts in the discussion sections)

Jan 31 '23 08:01 zirui

We are now using code_contests and LeetCode data.

Jun 02 '23 09:06 olliestanley