
Harvest GitHub issues for finding solutions to bugs

doroshroman opened this issue 2 years ago • 16 comments

Sometimes it is not enough to find a solution to a problem on Stack Overflow; sometimes you can find it in GitHub issues instead. The structure of an issue also lends itself well to a question-answer style.

So, GitHub has a REST API for retrieving the issues of a particular repository.

Also, I think that after ingesting the issue data, we can filter by closed status.
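
For illustration, a minimal sketch of such a request against the issues endpoint (the repository name here is just a placeholder, and requests is used instead of an async client to keep it short):

import requests

# placeholder repository; any "owner/repo" works the same way
owner, repo = "pallets", "flask"

response = requests.get(
    f"https://api.github.com/repos/{owner}/{repo}/issues",
    headers={"Accept": "application/vnd.github+json"},
    params={"state": "closed", "per_page": 100},  # closed issues only, max page size
)
response.raise_for_status()

for issue in response.json():
    # this endpoint also returns pull requests; skip them
    if "pull_request" not in issue:
        print(issue["number"], issue["title"])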

For example, the following issue.

doroshroman avatar Jan 03 '23 09:01 doroshroman

Related to #279

yk avatar Jan 03 '23 10:01 yk

@doroshroman, I recommend you try to make this into a notebook and share it in notebooks/! It would be very helpful for a bigger issue (bigger in difficulty, that is) that I started yesterday, #279. Thanks for the idea!

GravermanDev avatar Jan 03 '23 14:01 GravermanDev


import aiohttp
import asyncio
from dotenv import load_dotenv
import os
import json
from pathlib import Path

load_dotenv()


GITHUB_REPOS_FILENAME = 'github_repos_names.txt'
GITHUB_ISSUES_FILENAME = 'github_issues.json'
VISITED_GITHUB_REPOS_FILENAME = 'visited_github_repos.txt'
API_LIMIT = 4000

# create the file explicitly up front: opening it later in "a" mode makes json loading behave strangely,
# so we touch it here and then work with "r+"
file = Path(GITHUB_ISSUES_FILENAME)
file.touch(exist_ok=True)


with open(GITHUB_REPOS_FILENAME, 'r') as file:
    GITHUB_REPOS = file.read().splitlines()
    
try:
    with open(VISITED_GITHUB_REPOS_FILENAME, 'r') as file:
        VISITED_REPOS = file.read().splitlines()
except FileNotFoundError:
    VISITED_REPOS = []

# every run picks up only the repos that have not been visited yet, capped at API_LIMIT
GITHUB_REPOS = list(set(GITHUB_REPOS) - set(VISITED_REPOS))[:API_LIMIT]
if len(GITHUB_REPOS) == 0:
    print("ALL DATA SUCCESSFULLY INGESTED!")
    exit()



GITHUB_API_TOKEN = os.environ["GITHUB_API_TOKEN"]


GITHUB_REPOS_URL = "https://api.github.com/repos/"

headers = {
    "Accept": "application/vnd.github+json",
    "Authorization" : f"Bearer {GITHUB_API_TOKEN}",
    "X-GitHub-Api-Version": "2022-11-28"
}


def generate_chunks_unvisited(visited_repos, repos, chunk_len=100):
    """Yield the unvisited repos in chunks of at most chunk_len."""
    chunk = []
    for repo in repos:
        if repo not in visited_repos:
            chunk.append(repo)
        if len(chunk) == chunk_len:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def append_to_json(filename, data):
    with open(filename, 'r+', encoding='utf-8') as file:
        if os.stat(filename).st_size == 0:
            # empty file: write the data as a fresh JSON list
            json.dump(data if isinstance(data, list) else [data], file, indent=4)
        else:
            data_json = json.load(file, strict=False)
            data_json.extend(data if isinstance(data, list) else [data])

            file.seek(0)
            json.dump(data_json, file, indent=4)
            file.truncate()


def append_to_file(filename, data):
    with open(filename, 'a+', encoding='utf-8') as file:
        # one entry per line; the trailing newline keeps entries from separate runs apart
        file.writelines(f"{line}\n" for line in data)
        

async def fetch(repo, session):
    url = f"{GITHUB_REPOS_URL}{repo}/issues?state=closed"
    async with session.get(url) as response:
        resp = await response.json()
    return resp


async def main(visited_repos, repos):
    tasks = []

    async with aiohttp.ClientSession(headers=headers) as session:

        # schedule one request per unvisited repo, accumulating tasks across all chunks
        for chunk in generate_chunks_unvisited(visited_repos, repos):
            tasks.extend(asyncio.ensure_future(fetch(repo_url, session)) for repo_url in chunk)

        responses = await asyncio.gather(*tasks, return_exceptions=True)

        # write to file in batches
        batch_size = 100
        for i in range(0, len(responses), batch_size):
            responses_batch = [resp for resp in responses[i:i + batch_size] if resp is not None and not isinstance(resp, Exception)]
            if not responses_batch:
                continue

            # record which repos this batch covered so they are skipped on the next run
            visited_repos_batch = set()
            for repo_issues in responses_batch:
                for issue in repo_issues:
                    if isinstance(issue, dict) and "repository_url" in issue:
                        repo = '/'.join(issue["repository_url"].split('/')[-2:])
                        visited_repos_batch.add(repo)

            append_to_json(GITHUB_ISSUES_FILENAME, responses_batch)
            append_to_file(VISITED_GITHUB_REPOS_FILENAME, visited_repos_batch)


await main(VISITED_REPOS, GITHUB_REPOS)
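
The top-level await on the last line assumes a notebook (or another environment with a running event loop); run as a plain script, the entry point would instead look roughly like this:

if __name__ == "__main__":
    # asyncio.run creates the event loop that the notebook otherwise provides
    asyncio.run(main(VISITED_REPOS, GITHUB_REPOS))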

doroshroman avatar Jan 03 '23 20:01 doroshroman

@GravermanDev So, that is basically it. The main problem is API_LIMIT: GitHub only allows about 5,000 requests per hour per authenticated user. Also, after processing the repos, I've collected about 69k repos from the code_search_net dataset. So, in order to ingest all closed issues, this script needs to be executed ~14 or 15 times, with an hour-long break between runs.
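
To avoid guessing when the quota is back, a helper like the following (a sketch, not part of the script above) could query GitHub's rate-limit endpoint between runs; that call does not count against the quota:

import os
import time
import requests

def wait_for_rate_limit(token):
    resp = requests.get(
        "https://api.github.com/rate_limit",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    core = resp.json()["resources"]["core"]
    if core["remaining"] == 0:
        # "reset" is a Unix timestamp for when the hourly window rolls over
        time.sleep(max(0, core["reset"] - time.time()) + 1)

wait_for_rate_limit(os.environ["GITHUB_API_TOKEN"])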

doroshroman avatar Jan 03 '23 20:01 doroshroman

Okay! Looks good to me, very useful!

GravermanDev avatar Jan 03 '23 21:01 GravermanDev

Created a repository for this: https://github.com/doroshroman/github_issues. I'll continue to collect the other parts and dump the full data somewhere other than GitHub, because of the Git LFS limitation.

doroshroman avatar Jan 05 '23 14:01 doroshroman

This is very cool @doroshroman and @GravermanDev

huu4ontocord avatar Jan 10 '23 05:01 huu4ontocord

@doroshroman see here for adding datasets: https://github.com/LAION-AI/Open-Assistant/blob/main/docs/docs/data/datasets.md

yk avatar Jan 10 '23 11:01 yk

@doroshroman - will we be able to convert some of these into instruction->answer or question->answer? Would be awesome if we could.

huu4ontocord avatar Jan 15 '23 05:01 huu4ontocord

@ontocord Here is the raw dataset. The format is the following:

[
    {
        "issue_url": "https://api.github.com/repos/paulirish/speedline/issues/92",
        "issue_title": "Create issues ",
        "comments": [
            {
                "url": "https://api.github.com/repos/paulirish/speedline/issues/comments/882629228",
                "html_url": "https://github.com/paulirish/speedline/issues/92#issuecomment-882629228",
                "issue_url": "https://api.github.com/repos/paulirish/speedline/issues/92",
                "id": 882629228,
                "node_id": "IC_kwDOA0JEEM40m9ps",
                "user": {
                    "login": "LGNDDOLLABOUTIQUE",
                    "id": 78938698,
                    "node_id": "MDQ6VXNlcjc4OTM4Njk4",
                    "avatar_url": "https://avatars.githubusercontent.com/u/78938698?v=4",
                    "gravatar_id": "",
                    "url": "https://api.github.com/users/LGNDDOLLABOUTIQUE",
                    "html_url": "https://github.com/LGNDDOLLABOUTIQUE",
                    "followers_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/followers",
                    "following_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/following{/other_user}",
                    "gists_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/gists{/gist_id}",
                    "starred_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/starred{/owner}{/repo}",
                    "subscriptions_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/subscriptions",
                    "organizations_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/orgs",
                    "repos_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/repos",
                    "events_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/events{/privacy}",
                    "received_events_url": "https://api.github.com/users/LGNDDOLLABOUTIQUE/received_events",
                    "type": "User",
                    "site_admin": false
                },
                "created_at": "2021-07-19T15:11:22Z",
                "updated_at": "2021-07-19T15:11:22Z",
                "author_association": "NONE",
                "body": "Hey Paul sorry if I'm bugging you , but as an Expert, can you please take a close look at my work and tell me if I could meet the Federal standard Trading and Banking Sites while I'm still working on building. Thank you",
                "reactions": {
                    "url": "https://api.github.com/repos/paulirish/speedline/issues/comments/882629228/reactions",
                    "total_count": 0,
                    "+1": 0,
                    "-1": 0,
                    "laugh": 0,
                    "hooray": 0,
                    "confused": 0,
                    "heart": 0,
                    "rocket": 0,
                    "eyes": 0
                },
                "performed_via_github_app": null
            }
        ]
    }
]

doroshroman avatar Jan 18 '23 23:01 doroshroman

So, the next tasks are:

  1. filter this dataset into a question -> answer format (see the sketch below)
  2. add it as a dataset to Open-Assistant.

Can someone do this instead of me?
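
For step 1, a minimal sketch of the conversion, assuming the raw format shown above; pairing the issue title with the first comment is only one possible heuristic, and the file names are placeholders:

import json

with open("github_issues.json", "r", encoding="utf-8") as f:
    issues = json.load(f)

qa_pairs = []
for issue in issues:
    comments = issue.get("comments", [])
    if not comments:
        continue  # nothing to use as an answer
    qa_pairs.append(
        {
            "question": issue["issue_title"],
            "answer": comments[0]["body"],  # naive: treat the first comment as the answer
        }
    )

with open("github_issues_qa.json", "w", encoding="utf-8") as f:
    json.dump(qa_pairs, f, indent=4)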

doroshroman avatar Jan 19 '23 00:01 doroshroman

Ok. Can you ping the Discord to see if someone can take over for you, @doroshroman? cc me and I can help find someone too.

huu4ontocord avatar Jan 20 '23 00:01 huu4ontocord

Hello, could someone brief me about this issue and what needs to be done? I can take a look and see what I can come up with. Thanks!

DhruvSondhi avatar Feb 06 '23 16:02 DhruvSondhi

In the Excel sheet this issue is marked as needing a new assignee. If that is still the case, I can help with this and be assigned to it.

ricostynha1 avatar Apr 17 '23 19:04 ricostynha1

@RiccardoRiglietti I think this issue stalled, would you be still interested to work on it?

andreaskoepf avatar May 05 '23 10:05 andreaskoepf

@zirui has done all the hard work of scraping commit messages and has put the finished dataset on HuggingFace at https://huggingface.co/datasets/zirui3/TSSB-3M-ext So I think this issue is done thanks to him and can be closed as completed.
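
For anyone picking it up, a minimal sketch for loading that dataset with the Hugging Face datasets library (the split and column names are whatever the dataset card says, not something this thread specifies):

from datasets import load_dataset

ds = load_dataset("zirui3/TSSB-3M-ext")
print(ds)  # shows the available splits, columns, and row counts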

RiccardoRiglietti avatar May 05 '23 12:05 RiccardoRiglietti