Open-Assistant
Using the huge ManySStuBs4J code bugs dataset
The ManySStuBs4J dataset is a large open-source dataset of single-statement bugs mined from GitHub.
It is easy to put it into dialogue form:
User: Find the bug in the following code: {INITIAL_CODE}
Reply: The bugfix can be described as follows: {COMMIT_MESSAGE}
The fixed code is: {FIXED_CODE}
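A minimal Python sketch of filling that template; the field names (`initial_code`, `commit_message`, `fixed_code`) are illustrative, not the actual dataset column names.

```python
# Hypothetical dialogue template matching the format described above.
TEMPLATE = (
    "User: Find the bug in the following code:\n"
    "{initial_code}\n"
    "Reply: The bugfix can be described as follows:\n"
    "{commit_message}\n"
    "The fixed code is:\n"
    "{fixed_code}\n"
)

def make_prompt(initial_code, commit_message, fixed_code):
    """Fill the template with one bugfix record."""
    return TEMPLATE.format(
        initial_code=initial_code,
        commit_message=commit_message,
        fixed_code=fixed_code,
    )

example = make_prompt(
    "if (x = 1) {",                       # buggy: assignment instead of comparison
    "Fix assignment used as condition",   # commit message
    "if (x == 1) {",                      # fixed code
)
```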
It would be a substantial boost to our dataset as far as code is concerned.
We need someone to help with this. It looks very straightforward.
@ontocord I got the wrong-to-fixed-code part done. Getting the commit messages is going to be hard because they have to be fetched from GitHub.
This generates prompts_{num}.txt files in the generated_bugfix_prompts folder,
which you can download directly or convert into a more specialized format: Colab Notebook
It is already very useful in this state, in my opinion, but if someone good at using the GitHub API, or web scraping in general, could add the commit messages, it would be even better.
Here is the same code as the notebook in plain text:
# -*- coding: utf-8 -*-
"""bugs_dataset.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/16Fr4iW0x1JdPC8x6GiBb6b4YtEr1CDGH
"""
#!wget https://zenodo.org/record/5845439/files/tssb_data_3M.zip?download=1
#!mv tssb_data_3M.zip?download=1 tssb_data_3M.zip
#!unzip tssb_data_3M.zip
import pandas
FILENUM = 32
# Each shard is a gzipped JSON-lines file; pandas handles the decompression.
table = pandas.read_json(f"tssb_data_3M/file-{FILENUM}.jsonl.gz", lines=True)
table  # preview the shard (notebook cell output)
import re
TEMPLATE = \
"""User: Find the bug in the following code:
```
{}
```
Reply: The fixed code is:
```
{}
```
"""
def remove_starting_plus_minus(text):
    # Strip the leading diff marker ("+" for added lines, "-" for removed ones).
    if text.startswith("+") or text.startswith("-"):
        return text[1:]
    else:
        return text

def remove_extraneous_diff_info(text):
    # Drop hunk headers such as "@@ -1,3 +1,3 @@".
    pattern = "@@.*@@"
    return re.sub(pattern, "", text)

def clean(text):
    return remove_extraneous_diff_info(remove_starting_plus_minus(text))
def write_prompts(num, table):
    with open(f"generated_bugfix_prompts/prompts_{num}.txt", "w+") as f:
        for index, row in table.iterrows():
            # Dropping the "+" lines leaves the buggy (pre-fix) side of the
            # diff; dropping the "-" lines leaves the fixed side.
            buggy = '\n'.join(clean(line) for line in row['diff'].split("\n") if not line.startswith("+"))
            fixed = '\n'.join(clean(line) for line in row['diff'].split("\n") if not line.startswith("-"))
            f.write(TEMPLATE.format(buggy, fixed))
#!mkdir generated_bugfix_prompts
for num in range(33):
table = pandas.read_json(f"tssb_data_3M/file-{num}.jsonl.gz", lines=True)
write_prompts(num, table)
#!zip -r generated_bugfix_prompts.zip generated_bugfix_prompts
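To make the diff-splitting rule in the notebook explicit: dropping the "+" lines yields the buggy side and dropping the "-" lines yields the fixed side. A minimal standalone sketch of the same logic:

```python
import re

def clean(line):
    # Strip the leading diff marker, then drop any "@@ ... @@" hunk header.
    if line.startswith(("+", "-")):
        line = line[1:]
    return re.sub(r"@@.*@@", "", line)

def split_diff(diff):
    """Return (buggy, fixed) code from a unified-diff snippet."""
    lines = diff.split("\n")
    buggy = "\n".join(clean(l) for l in lines if not l.startswith("+"))
    fixed = "\n".join(clean(l) for l in lines if not l.startswith("-"))
    return buggy, fixed

buggy, fixed = split_diff("-int x = 0;\n+int x = 1;\n return x;")
```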
Can you push the generated_bugfix_prompts to HF as Parquet, please? See the guide we have in the data directory.
And check in your notebook. We will find someone to finish it :)
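A minimal sketch of that Parquet conversion; the two-column instruction/response layout and column names are an assumption here, so the guide in the data directory is authoritative.

```python
import pandas as pd

def prompts_to_dataframe(pairs):
    # Hypothetical two-column layout; verify the expected column names
    # against the guide in the data directory before uploading.
    return pd.DataFrame(
        [{"INSTRUCTION": f"Find the bug in the following code:\n{buggy}",
          "RESPONSE": f"The fixed code is:\n{fixed}"}
         for buggy, fixed in pairs]
    )

df = prompts_to_dataframe([("if (x = 1)", "if (x == 1)")])
# Writing Parquet requires pyarrow or fastparquet to be installed:
# df.to_parquet("bugfix_prompts.parquet", index=False)
```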
@ontocord This is the notebook: Colab Notebook
If you run it, you can download the prompts generated so far; it only takes around 20 minutes.
This is the pull request: https://github.com/LAION-AI/Open-Assistant/pull/1410
Tell me if it is correct; it is the first pull request I have made.
I think I can help to get commit messages from GitHub, but it may not be done too quickly, as there is a rate limit imposed by the GitHub API.
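Fetching a single commit message through the GitHub REST API can be sketched like this (hypothetical helper, stdlib only; unauthenticated calls are limited to 60 requests per hour, authenticated ones to 5,000, so a token is needed for bulk work):

```python
import json
import urllib.request

API = "https://api.github.com/repos/{owner}/{repo}/commits/{sha}"

def commit_url(owner, repo, sha):
    # REST endpoint for one commit; the response JSON carries the
    # message under commit.message.
    return API.format(owner=owner, repo=repo, sha=sha)

def fetch_commit_message(owner, repo, sha, token=None):
    """Fetch one commit message, optionally authenticated."""
    req = urllib.request.Request(commit_url(owner, repo, sha))
    req.add_header("Accept", "application/vnd.github+json")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return data["commit"]["message"]
```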
@zirui @ontocord here is the pull asking to add the notebook: https://github.com/LAION-AI/Open-Assistant/pull/1425
It says something weird about pre-commit, but I cannot run pre-commit locally because of a version incompatibility.
@andreaskoepf and the others who handle PRs will review. Thank you!
@ontocord I managed to run pre-commit by installing it with conda rather than with snap.
I'm working on getting commit messages from GitHub. Due to the GitHub API rate limit of 5,000 requests per hour, the process is slow, and I am trying to find a way to speed it up.
Please assign this issue to me. @caridorc-tergiliti @ontocord
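The pacing implied by that limit can be sketched as follows: 5,000 requests per hour allows one request every 0.72 seconds (all names here are hypothetical helpers).

```python
import time

REQUESTS_PER_HOUR = 5000  # authenticated GitHub API limit

def min_interval(requests_per_hour=REQUESTS_PER_HOUR):
    # Seconds to wait between requests to stay under the hourly cap.
    return 3600.0 / requests_per_hour

def paced(items, interval=None):
    """Yield items no faster than one per `interval` seconds."""
    if interval is None:
        interval = min_interval()
    for item in items:
        start = time.monotonic()
        yield item
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
```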
@zirui Thanks for your effort, if you are querying the GitHub API for commits, it might be worth it to also get more context for the code, i.e. more lines before and after the bug as the dataset only contains a few lines near the bug (but I am not sure if this is worth the extra effort, your call).
Another possibility is downloading the GitHub repositories and using a Python git library to get the commit data from them locally. There are only 8,266 repositories even though there are over 3 million bugfixes, so it may be feasible to clone each repository with its full history and query it locally for the bugfix commits.
Link to the PR: https://github.com/LAION-AI/Open-Assistant/pull/1425
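Querying cloned repositories locally avoids the API limit entirely. A hypothetical sketch using plain git subprocess calls (`repo_path` is assumed to be an already-cloned checkout); `git show -U<n>` also covers the earlier idea of grabbing more context lines around the bug:

```python
import subprocess

def commit_message_cmd(sha):
    # Prints only the full commit message of a single commit.
    return ["git", "log", "-1", "--format=%B", sha]

def context_cmd(sha, lines=10):
    # Widens the diff context to `lines` lines around each change, giving
    # more surrounding code than the dataset's few-line snippets.
    return ["git", "show", f"-U{lines}", sha]

def local_commit_message(repo_path, sha):
    """Read one commit message from an already-cloned repository (no API calls)."""
    out = subprocess.run(
        commit_message_cmd(sha), cwd=repo_path,
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```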
Thanks @RiccardoRiglietti. Your suggestion of downloading all the GitHub repositories seems like a good idea; I will look into whether this method is more efficient. (It may be affected by the network environment; in my area, downloading a full Git history can sometimes take a long time...)
@zirui I have assigned you to this. Soon we should be able to get #1425 merged and then you may be able to work on top of that existing notebook
Reopening to track the commit issue addition on top of the work in #1425
Ok, after finishing the work of retrieving all the COMMIT_MESSAGEs for the code samples, I will create a new HF dataset and update the existing notebook.
Looks like this issue is moving along nicely @zirui and @caridorc-tergiliti
Hi all @zirui and @caridorc-tergiliti
Are we good with this issue? Any results?
Sorry for the late reply.
I have created two HF datasets following this OA README:
- The ManySStuBs4J dataset extended with commit info: TSSB-3M-ext
- An instruction dataset: TSSB-3M-instructions
After further filtering and checking, I will create a pull request for the OA repository this week.