Open-Assistant
Harvest GitHub issues for finding solutions to bugs
Sometimes it is not enough to search Stack Overflow for a solution to a problem; sometimes the solution can be found in GitHub issues instead. The structure of an issue also lends itself naturally to a question-answer style.
GitHub provides a REST API for retrieving the issues of a particular repository.
Also, I think that after ingesting the issues data, we can filter by closed status.
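For instance, listing the closed issues of a single repository is a one-request call. A minimal sketch (the repository name is just a placeholder, and this snippet is not part of the harvesting script below, which uses aiohttp instead of requests):

import requests

# List the closed issues of one repository via the GitHub REST API.
resp = requests.get(
    "https://api.github.com/repos/LAION-AI/Open-Assistant/issues",
    params={"state": "closed"},
    headers={"Accept": "application/vnd.github+json"},
)
for issue in resp.json():
    print(issue["number"], issue["title"])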
For example, the following issue.
Related to #279
@doroshroman, I recommend you try to make this into a notebook and share it in notebooks/!
It would be very helpful for a bigger issue (bigger in difficulty, that is) that I started yesterday, #279. Thanks for the idea.
import aiohttp
import asyncio
from dotenv import load_dotenv
import os
import json
from pathlib import Path

load_dotenv()

GITHUB_REPOS_FILENAME = 'github_repos_names.txt'
GITHUB_ISSUES_FILENAME = 'github_issues.json'
VISITED_GITHUB_REPOS_FILENAME = 'visited_github_repos.txt'
API_LIMIT = 4000

# Create the issues file explicitly: json.load does not play well with a file
# opened in "a" mode, so the file is created up front instead.
Path(GITHUB_ISSUES_FILENAME).touch(exist_ok=True)
with open(GITHUB_REPOS_FILENAME, 'r') as file:
    GITHUB_REPOS = file.read().splitlines()

try:
    with open(VISITED_GITHUB_REPOS_FILENAME, 'r') as file:
        VISITED_REPOS = file.read().splitlines()
    # Every run recomputes GITHUB_REPOS so that only repos which have not been
    # visited yet are ingested, limited to API_LIMIT per run.
    GITHUB_REPOS = list(set(GITHUB_REPOS) - set(VISITED_REPOS))[:API_LIMIT]
    if len(GITHUB_REPOS) == 0:
        print("ALL DATA SUCCESSFULLY INGESTED!")
        exit()
except FileNotFoundError:
    # First run: nothing has been visited yet, but still respect the API limit.
    VISITED_REPOS = []
    GITHUB_REPOS = GITHUB_REPOS[:API_LIMIT]
GITHUB_API_TOKEN = os.environ["GITHUB_API_TOKEN"]
GITHUB_REPOS_URL = "https://api.github.com/repos/"

headers = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {GITHUB_API_TOKEN}",
    "X-GitHub-Api-Version": "2022-11-28",
}
def generate_chunks_unvisited(visited_repos, repos, chunk_len=100):
    """Yield chunks of at most `chunk_len` repos that have not been visited yet."""
    chunk = []
    for repo in repos:
        if repo not in visited_repos:
            chunk.append(repo)
        if len(chunk) == chunk_len:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
def append_to_json(filename, data):
    """Append `data` (a list of per-repo issue lists) to a JSON file holding one big list."""
    with open(filename, 'r+', encoding='utf-8') as file:
        if os.stat(filename).st_size == 0:
            if isinstance(data, dict):
                json.dump([data], file, indent=4)
            else:
                json.dump(data, file, indent=4)
        else:
            data_json = json.load(file, strict=False)
            data_json.extend(data)
            file.seek(0)
            json.dump(data_json, file, indent=4)
def append_to_file(filename, data):
    """Append one repo name per line; the trailing newline keeps later appends separated."""
    with open(filename, 'a+', encoding='utf-8') as file:
        file.write('\n'.join(data) + '\n')
async def fetch(url, session):
    # Only the first page of closed issues is fetched (30 issues per page by default).
    url = f"{GITHUB_REPOS_URL}{url}/issues?state=closed"
    async with session.get(url) as response:
        resp = await response.json()
        return resp
async def main(visited_repos, repos):
    async with aiohttp.ClientSession(headers=headers) as session:
        for repo_chunk in generate_chunks_unvisited(visited_repos, repos):
            tasks = [asyncio.ensure_future(fetch(repo_url, session)) for repo_url in repo_chunk]
            responses = await asyncio.gather(*tasks, return_exceptions=True)

            # Write to file in batches.
            batch_size = 100
            for i in range(0, len(responses), batch_size):
                # Keep only real issue lists; exceptions from gather and error
                # payloads (e.g. rate-limit messages, which are dicts) are dropped.
                responses_batch = [
                    resp for resp in responses[i:i + batch_size]
                    if isinstance(resp, list)
                ]
                if not responses_batch:
                    continue

                visited_repos_batch = set()
                for repo_issues in responses_batch:
                    for issue in repo_issues:
                        if "repository_url" in issue:
                            repo = '/'.join(issue["repository_url"].split('/')[-2:])
                            visited_repos_batch.add(repo)

                append_to_json(GITHUB_ISSUES_FILENAME, responses_batch)
                append_to_file(VISITED_GITHUB_REPOS_FILENAME, visited_repos_batch)

# Top-level await works inside a notebook; in a plain script use asyncio.run(main(...)) instead.
await main(VISITED_REPOS, GITHUB_REPOS)
@GravermanDev So, that is basically it. The main problem is the API limit: GitHub allows only 5,000 requests per hour per authenticated user. Also, after processing the repos, I've collected about 69k repository names from the code_search_net dataset. So, in order to ingest all closed issues, this script needs to be executed roughly 14 or 15 times, with a break of an hour between runs.
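Before each rerun, the remaining request budget can be checked against GitHub's rate-limit endpoint. A small sketch (not part of the script above; it assumes the same GITHUB_API_TOKEN environment variable):

import os
import requests

# Query the GitHub rate-limit endpoint to see how many core API requests remain.
resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_API_TOKEN']}",
    },
)
core = resp.json()["resources"]["core"]
print(f"{core['remaining']} requests left, limit resets at epoch {core['reset']}")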
Okay! Looks good to me, very useful!
Created a repository for this: https://github.com/doroshroman/github_issues I'll continue to collect the other parts and dump the full data somewhere other than GitHub, because of the Git LFS size limitation.
This is very cool @doroshroman and @GravermanDev
@doroshroman see here for adding datasets: https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/data/datasets.md
@doroshroman - will we be able to convert some of these into instruction->answer or question->answer? Would be awesome if we could.
@ontocord Here is the raw dataset. The format is the following:
[
    {
        "issue_url": "https://api.github.com/repos/paulirish/speedline/issues/92",
        "issue_title": "Create issues ",
        "comments": [
            {
                "url": "https://api.github.com/repos/paulirish/speedline/issues/comments/882629228",
                "html_url": "https://github.com/paulirish/speedline/issues/92#issuecomment-882629228",
                "issue_url": "https://api.github.com/repos/paulirish/speedline/issues/92",
                "id": 882629228,
                "node_id": "IC_kwDOA0JEEM40m9ps",
                "user": {
                    "login": "LGNDDOLLABOUTIQUE",
                    "id": 78938698,
                    "node_id": "MDQ6VXNlcjc4OTM4Njk4",
                    "avatar_url": "https://avatars.githubusercontent.com/u/78938698?v=4",
                    "gravatar_id": "",
                    "url": "https://api.github.com/users/LGNDDOLLABOUTIQUE",
                    "html_url": "https://github.com/LGNDDOLLABOUTIQUE",
                    "followers_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/followers",
                    "following_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/following{/other_user}",
                    "gists_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/gists{/gist_id}",
                    "starred_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/starred{/owner}{/repo}",
                    "subscriptions_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/subscriptions",
                    "organizations_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/orgs",
                    "repos_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/repos",
                    "events_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/events{/privacy}",
                    "received_events_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/received_events",
                    "type": "User",
                    "site_admin": false
                },
                "created_at": "2021-07-19T15:11:22Z",
                "updated_at": "2021-07-19T15:11:22Z",
                "author_association": "NONE",
                "body": "Hey Paul sorry if I'm bugging you , but as an Expert, can you please take a close look at my work and tell me if I could meet the Federal standard Trading and Banking Sites while I'm still working on building. Thank you",
                "reactions": {
                    "url": "https://api.github.com/repos/paulirish/speedline/issues/comments/882629228/reactions",
                    "total_count": 0,
                    "+1": 0,
                    "-1": 0,
                    "laugh": 0,
                    "hooray": 0,
                    "confused": 0,
                    "heart": 0,
                    "rocket": 0,
                    "eyes": 0
                },
                "performed_via_github_app": null
            }
        ]
    }
]
So, the next tasks are:
- filter this dataset into question -> answer format (a rough sketch follows below);
- add it as a dataset to Open-Assistant.
Can someone do this instead of me?
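A minimal sketch of the first step, pairing each issue title with the body of its first comment. The pairing heuristic and the to_question_answer name are only a proposal, not something already in the repo, and the file path in the usage line is hypothetical:

import json

def to_question_answer(raw_path):
    """Convert the raw dump shown above into question -> answer pairs.
    Heuristic only: the issue title is the question and the first comment
    body is the answer; real filtering (bots, empty bodies, duplicates)
    would still be needed."""
    with open(raw_path, encoding='utf-8') as f:
        issues = json.load(f)
    pairs = []
    for issue in issues:
        comments = issue.get("comments") or []
        question = issue.get("issue_title", "").strip()
        answer = comments[0].get("body", "").strip() if comments else ""
        if question and answer:
            pairs.append({"question": question, "answer": answer})
    return pairs

# Usage: pairs = to_question_answer("github_issues.json")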
Ok. Can you ping the Discord to see if someone can take over for you, @doroshroman? cc me and I can help find someone too.
Hello, could someone brief me about this issue and what needs to be done? I can take a look and see what I can come up with. Thanks!
In the Excel sheet this issue is marked as needing a new assignee. If that is still the case, I can help with this and be assigned to it.
@RiccardoRiglietti I think this issue stalled; would you still be interested in working on it?
@zirui has done all the hard work of scraping commit messages and has put the finished dataset on Hugging Face at https://huggingface.co/datasets/zirui3/TSSB-3M-ext, so I think this issue is done thanks to him and can be closed as completed.
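If anyone wants to build on it, the dataset should be loadable directly from the Hub with the datasets library (assuming it stays public; the split and column names depend on how it was exported):

from datasets import load_dataset

# Pull the scraped dataset straight from the Hugging Face Hub.
ds = load_dataset("zirui3/TSSB-3M-ext")
print(ds)  # shows the available splits and columns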