Open-Assistant
Suggestion: Proposal for a curated Dataset for problem solving and coding.
Hello, I was working on creating an exciting dataset to fine-tune some of the newly available LLMs. Unfortunately, the prompts are too big for the resources I have at my disposal, so I thought a dataset like this could be interesting for a project like this one.
The dataset description:
- The dataset is composed of submissions to LeetCode's contests.
  - This data is publicly accessible.
- For each contest, the first accepted solution in each of several popular languages is taken.
  - This way, we can be sure that the source code is correct.
- In total, there are 4535 unique submissions and 895 unique problems.
- The included languages are:
  - C, C++, Java, JavaScript, Python 2, Python 3, Rust.
- The dataset has been cleaned to ensure formatting consistency and the correctness of the present numerical values.
- The dataset is in instruction, input, and output JSON format.
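A minimal sketch of how records in an instruction, input, and output layout could be read and turned into prompts. The sample record, the assumption that the file is a JSON array of objects, and the helper names `load_records` and `to_prompt` are hypothetical and only for illustration; the actual file in the repository may be organized differently.

```python
import json

# Hypothetical sample in the instruction/input/output layout; the real
# field contents in the repository may differ.
SAMPLE = """
[
  {
    "instruction": "Solve the following problem in Python 3.",
    "input": "Given an array of integers, return the sum of its elements.",
    "output": "def solve(nums):\\n    return sum(nums)"
  }
]
"""

REQUIRED_KEYS = {"instruction", "input", "output"}


def load_records(text):
    """Parse a JSON array and keep only records that have all three fields."""
    records = json.loads(text)
    return [r for r in records if isinstance(r, dict) and REQUIRED_KEYS.issubset(r)]


def to_prompt(record):
    """Join instruction and input into a single prompt string."""
    return f"{record['instruction']}\n\n{record['input']}"


records = load_records(SAMPLE)
print(len(records), "record(s) loaded")
print(to_prompt(records[0]))
```

Filtering on the three keys up front means a malformed record is dropped before it reaches a training pipeline rather than failing partway through.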
The dataset is available at https://github.com/Nan-Do/LeetCodeContestsDataset. Let me know if this is of interest to the project; I could keep the dataset up to date if that is the case.
That sounds interesting, especially since our models are not really good at coding. Does anyone have the time to add this?
I'm happy to work on this.
Well, the dataset is already built. I mean the file could be fed directly into a training pipeline and it should work. What would you need to do/improve?
As my main goal was to fine-tune other models, I only used the last two questions of each contest (which are the most difficult ones) and only one solution for each language.
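The selection rule described here could be sketched roughly as follows. The record fields (`contest`, `question_index`, `language`, `submitted_at`, `accepted`) and the function name `select_submissions` are hypothetical, chosen only for illustration; this is not the code that was actually used to build the dataset.

```python
from collections import defaultdict


def select_submissions(submissions, questions_per_contest=2):
    """Keep the earliest accepted submission per (contest, question, language),
    restricted to the last `questions_per_contest` questions of each contest."""
    # Collect the question indices seen in each contest.
    questions = defaultdict(set)
    for s in submissions:
        questions[s["contest"]].add(s["question_index"])

    # Assumption: the highest-numbered questions are the hardest ones.
    hardest = {
        contest: set(sorted(idxs)[-questions_per_contest:])
        for contest, idxs in questions.items()
    }

    best = {}
    for s in submissions:
        if not s["accepted"] or s["question_index"] not in hardest[s["contest"]]:
            continue
        key = (s["contest"], s["question_index"], s["language"])
        # Keep only the earliest accepted submission for this key.
        if key not in best or s["submitted_at"] < best[key]["submitted_at"]:
            best[key] = s
    return list(best.values())


# Made-up records, only to exercise the function.
submissions = [
    {"contest": "c1", "question_index": 1, "language": "Rust", "submitted_at": 5, "accepted": True},
    {"contest": "c1", "question_index": 3, "language": "Rust", "submitted_at": 9, "accepted": True},
    {"contest": "c1", "question_index": 3, "language": "Rust", "submitted_at": 7, "accepted": True},
    {"contest": "c1", "question_index": 4, "language": "Rust", "submitted_at": 8, "accepted": False},
    {"contest": "c1", "question_index": 4, "language": "C", "submitted_at": 12, "accepted": True},
]
selected = select_submissions(submissions)
```

Raising `questions_per_contest`, or keeping several submissions per key instead of one, would be the natural knobs for making the dataset bigger.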
If it is required/interesting to make it bigger, for example by adding the rest of the questions or by getting more submissions per language (as not all solutions follow the same approach), I could do it without too much trouble.
I could also add submissions from other competitive programming platforms like Codeforces or AtCoder.
Anyways, I'll probably keep working on it, keeping it updated with the new contests that come up and probably adding more sources.
Someone else is working on something similar to this (https://github.com/LAION-AI/Open-Assistant/pull/2494). I guess this could be considered completed.
I think it should be reopened, but to integrate the datasets.
I would also like to contribute a dataset for Clojure/Script
Edit: I'll be pulling data from here: https://github.com/oxalorg/4clojure-solutions-archive/tree/main/deduped/solutions
And combining it with the problem metadata here: https://github.com/oxalorg/4ever-clojure/blob/main/src/app/data.cljc
I can try to follow the format that is used here: https://www.kaggle.com/datasets/erichartford/leetcode-solutions
But if there's any more specific format you'd like us to use, let me know. Thanks!
@johnmn3 The idea of this dataset, at least the one I proposed in this issue, is to add code submitted to competitive programming platforms, and 4clojure doesn't quite fit that category. For context, I have my own set of solutions for 4clojure on my GitHub. I don't know if @ehartford has something else to say about this or if there is a broader interest in pushing more code sources into training.
This dataset is specifically for LeetCode questions and answers with detailed explanations.
So in my opinion, other code-related datasets should be added independently of leet10k.
@Nan-Do It's competitive in the sense that you can submit your own answers and view other people's answers. And people compete to complete the most answers. The frontend has been resurrected here: https://4clojure.oxal.org/
Apologies if this issue was the wrong place to drop the suggestion. I can open a new issue or add this somewhere else if desired.
This dataset doesn't have detailed explanations for each answer, but it does have a decent English description of each problem, with dozens or hundreds of answers per problem. Let me know if that's not an ideal dataset and/or if there's anything I can do to clean it up better.
I'm sure lots of developer communities will want to do the same thing, helping OA help their communities, and I've heard y'all are looking into ways folks can more easily help out in that regard? Whatever the recommended path for contributing programming language training data will be, let me know and I can try to document my process for others to follow as well if that would help.
Thank you all for this amazing work!
@johnmn3 It's cool that the front-end has been resurrected, and Clojure is a cool language that I personally enjoy, but the idea of the dataset is that it contains code written by people with years of experience, solving somewhat original and difficult problems, under the very particular set of conditions that is a contest. That's why I don't consider other kinds of submissions relevant for the dataset, not even my own.
Now, it's not my decision, but I think that Clojure and other Lisp languages deserve to be included in the training sources, so I'd invite you to open another issue explaining your idea for the dataset and what has been done so far. I wouldn't mind working on that if required.
Best
May I suggest that another source of excellent code with explanations could be Kaggle:
https://www.kaggle.com/code
I guess this can be considered done, as the 10k LeetCode dataset has already been merged.