Open-Assistant
Using the huge ManySStuBs4J code bugs dataset
The ManySStuBs4J dataset is a large open-source dataset of single-statement bugs mined from GitHub.
It is easy to put it into dialogue form:
User: Find the bug in the following code: {INITIAL_CODE}
Reply: The bugfix can be described as follows: {COMMIT_MESSAGE}
The fixed code is: {FIXED_CODE}
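A minimal Python sketch of filling that template; the field names (`initial_code`, `commit_message`, `fixed_code`) are illustrative, not the actual dataset column names.

```python
# Hypothetical dialogue template matching the format described above.
TEMPLATE = (
    "User: Find the bug in the following code:\n"
    "{initial_code}\n"
    "Reply: The bugfix can be described as follows:\n"
    "{commit_message}\n"
    "The fixed code is:\n"
    "{fixed_code}\n"
)

def make_prompt(initial_code, commit_message, fixed_code):
    """Fill the template with one bugfix record."""
    return TEMPLATE.format(
        initial_code=initial_code,
        commit_message=commit_message,
        fixed_code=fixed_code,
    )

example = make_prompt(
    "if (x = 1) {",                       # buggy: assignment instead of comparison
    "Fix assignment used as condition",   # commit message
    "if (x == 1) {",                      # fixed code
)
```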
It would be a substantial boost to our dataset as far as code is concerned.
We need someone to help with this. It looks very straightforward.
@ontocord I got the wrong-to-fixed-code part done. Getting the commit messages is going to be hard because they have to be fetched from GitHub.
This generates prompts_{num}.txt files in the generated_bugfix_prompts folder,
which you can download directly or convert into a more specialized format: Colab Notebook
It is already very useful in this state, in my opinion, but if someone good at using the GitHub API, or web scraping in general, could add the commit messages, it would be even better.
Here is the same code as the notebook in plain text:
# -*- coding: utf-8 -*-
"""bugs_dataset.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/16Fr4iW0x1JdPC8x6GiBb6b4YtEr1CDGH
"""
#!wget https://zenodo.org/record/5845439/files/tssb_data_3M.zip?download=1
#!mv tssb_data_3M.zip?download=1 tssb_data_3M.zip
#!unzip tssb_data_3M.zip
import pandas
FILENUM = 32
# Each shard is a gzipped JSON-lines file; pandas handles the decompression.
table = pandas.read_json(f"tssb_data_3M/file-{FILENUM}.jsonl.gz", lines=True)
table  # preview the shard (notebook cell output)
import re
TEMPLATE = \
"""User: Find the bug in the following code:
```
{}
```
Reply: The fixed code is:
```
{}
```
"""
def remove_starting_plus_minus(text):
    # Strip the leading diff marker ("+" for added lines, "-" for removed ones).
    if text.startswith("+") or text.startswith("-"):
        return text[1:]
    else:
        return text

def remove_extraneous_diff_info(text):
    # Drop hunk headers such as "@@ -1,3 +1,3 @@".
    pattern = "@@.*@@"
    return re.sub(pattern, "", text)

def clean(text):
    return remove_extraneous_diff_info(remove_starting_plus_minus(text))
def write_prompts(num, table):
    with open(f"generated_bugfix_prompts/prompts_{num}.txt", "w+") as f:
        for index, row in table.iterrows():
            # Dropping the "+" lines leaves the buggy (pre-fix) side of the
            # diff; dropping the "-" lines leaves the fixed side.
            buggy = '\n'.join(clean(line) for line in row['diff'].split("\n") if not line.startswith("+"))
            fixed = '\n'.join(clean(line) for line in row['diff'].split("\n") if not line.startswith("-"))
            f.write(TEMPLATE.format(buggy, fixed))
#!mkdir generated_bugfix_prompts
for num in range(33):
table = pandas.read_json(f"tssb_data_3M/file-{num}.jsonl.gz", lines=True)
write_prompts(num, table)
#!zip -r generated_bugfix_prompts.zip generated_bugfix_prompts
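To make the diff-splitting rule in the notebook explicit: dropping the "+" lines yields the buggy side and dropping the "-" lines yields the fixed side. A minimal standalone sketch of the same logic:

```python
import re

def clean(line):
    # Strip the leading diff marker, then drop any "@@ ... @@" hunk header.
    if line.startswith(("+", "-")):
        line = line[1:]
    return re.sub(r"@@.*@@", "", line)

def split_diff(diff):
    """Return (buggy, fixed) code from a unified-diff snippet."""
    lines = diff.split("\n")
    buggy = "\n".join(clean(l) for l in lines if not l.startswith("+"))
    fixed = "\n".join(clean(l) for l in lines if not l.startswith("-"))
    return buggy, fixed

buggy, fixed = split_diff("-int x = 0;\n+int x = 1;\n return x;")
```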
Can you push the generated_bugfix_prompts to HF as Parquet, please? See the guide we have in the data directory.
And check in your notebook. We will find someone to finish it :)
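A minimal sketch of that Parquet conversion; the two-column instruction/response layout and column names are an assumption here, so the guide in the data directory is authoritative.

```python
import pandas as pd

def prompts_to_dataframe(pairs):
    # Hypothetical two-column layout; verify the expected column names
    # against the guide in the data directory before uploading.
    return pd.DataFrame(
        [{"INSTRUCTION": f"Find the bug in the following code:\n{buggy}",
          "RESPONSE": f"The fixed code is:\n{fixed}"}
         for buggy, fixed in pairs]
    )

df = prompts_to_dataframe([("if (x = 1)", "if (x == 1)")])
# Writing Parquet requires pyarrow or fastparquet to be installed:
# df.to_parquet("bugfix_prompts.parquet", index=False)
```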
@ontocord This is the notebook: Colab Notebook
If you run it, you can download the prompts generated so far; it only takes around 20 minutes.
This is the pull request: https://github.com/LAION-AI/Open-Assistant/pull/1410
Tell me if it is correct; it is the first pull request I have made.
I think I can help to get commit messages from GitHub, but it may not be done too quickly, as there is a rate limit imposed by the GitHub API.
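Fetching a single commit message through the GitHub REST API can be sketched like this (hypothetical helper, stdlib only; unauthenticated calls are limited to 60 requests per hour, authenticated ones to 5,000, so a token is needed for bulk work):

```python
import json
import urllib.request

API = "https://api.github.com/repos/{owner}/{repo}/commits/{sha}"

def commit_url(owner, repo, sha):
    # REST endpoint for one commit; the response JSON carries the
    # message under commit.message.
    return API.format(owner=owner, repo=repo, sha=sha)

def fetch_commit_message(owner, repo, sha, token=None):
    """Fetch one commit message, optionally authenticated."""
    req = urllib.request.Request(commit_url(owner, repo, sha))
    req.add_header("Accept", "application/vnd.github+json")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return data["commit"]["message"]
```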
@zirui @ontocord here is the pull asking to add the notebook: https://github.com/LAION-AI/Open-Assistant/pull/1425
It says something weird about pre-commit, but I cannot run pre-commit locally because of a version incompatibility.
@andreaskoepf and the others who handle PRs will review. Thank you!
@ontocord I managed to run pre-commit by installing it with conda rather than with snap.
I'm working on getting commit messages from GitHub. Due to the GitHub API rate limit of 5,000 requests per hour, the process is slow, and I am trying to find a way to speed it up.
Please assign this issue to me. @caridorc-tergiliti @ontocord
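The pacing implied by that limit can be sketched as follows: 5,000 requests per hour allows one request every 0.72 seconds (all names here are hypothetical helpers).

```python
import time

REQUESTS_PER_HOUR = 5000  # authenticated GitHub API limit

def min_interval(requests_per_hour=REQUESTS_PER_HOUR):
    # Seconds to wait between requests to stay under the hourly cap.
    return 3600.0 / requests_per_hour

def paced(items, interval=None):
    """Yield items no faster than one per `interval` seconds."""
    if interval is None:
        interval = min_interval()
    for item in items:
        start = time.monotonic()
        yield item
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
```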
@zirui Thanks for your effort, if you are querying the GitHub API for commits, it might be worth it to also get more context for the code, i.e. more lines before and after the bug as the dataset only contains a few lines near the bug (but I am not sure if this is worth the extra effort, your call).
Another possibility is downloading the GitHub repositories and using a Python git library to get the commit data from them locally. There are only 8,266 repositories even though there are over 3 million bugfixes, so it may be feasible to clone each repository with its full history and query it locally for the bugfix commits.
Link to the PR: https://github.com/LAION-AI/Open-Assistant/pull/1425
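Querying cloned repositories locally avoids the API limit entirely. A hypothetical sketch using plain git subprocess calls (`repo_path` is assumed to be an already-cloned checkout); `git show -U<n>` also covers the earlier idea of grabbing more context lines around the bug:

```python
import subprocess

def commit_message_cmd(sha):
    # Prints only the full commit message of a single commit.
    return ["git", "log", "-1", "--format=%B", sha]

def context_cmd(sha, lines=10):
    # Widens the diff context to `lines` lines around each change, giving
    # more surrounding code than the dataset's few-line snippets.
    return ["git", "show", f"-U{lines}", sha]

def local_commit_message(repo_path, sha):
    """Read one commit message from an already-cloned repository (no API calls)."""
    out = subprocess.run(
        commit_message_cmd(sha), cwd=repo_path,
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```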
Thanks @RiccardoRiglietti. Your suggestion of downloading all the GitHub repositories seems like a good idea; I will look into whether this method is more efficient. (It may be affected by the network environment; in my area, downloading a full Git history can sometimes take a long time...)
@zirui I have assigned you to this. Soon we should be able to get #1425 merged and then you may be able to work on top of that existing notebook
Reopening to track the commit issue addition on top of the work in #1425
Ok, after finishing the work of retrieving all the COMMIT_MESSAGEs for the code samples, I will create a new HF dataset and update the existing notebook.
Looks like this issue is moving along nicely @zirui and @caridorc-tergiliti
Hi all @zirui and @caridorc-tergiliti
Are we good with this issue? Any results?
Sorry for the late reply.
I have created two HF datasets following this OA README:
- The ManySStuBs4J dataset extended with commit info: TSSB-3M-ext
- An instruction dataset: TSSB-3M-instructions
After further filtering and checking, I will create a pull request for the OA repository this week.