
Data Collection for deep AQAtik

Open LongyuZhang opened this issue 4 years ago • 12 comments

To automate the data collection process for deep AQAtik, we need to investigate and work on the following functions:

  • [x] Collect all open issue contents in related repos, e.g. openjdk-tests/issues
  • [x] Based on the issue contents, collect the corresponding original test outputs from the TRSS database or Jenkins output, if they exist.
  • [x] After storing all existing issue contents, continuously monitor and collect new issues in these repos.
  • [ ] Link data collection with the ML model training program, so that when a new issue is created, we can trigger ML training if needed.

Related issue: https://github.com/adoptium/aqa-test-tools/issues/355

LongyuZhang avatar Apr 29 '21 14:04 LongyuZhang

FYI @smlambert @llxia

LongyuZhang avatar Apr 29 '21 14:04 LongyuZhang

FYI @avishreekh

LongyuZhang avatar Apr 29 '21 14:04 LongyuZhang

Thank you @LongyuZhang!

Collect all open issue contents in related repos, e.g. openjdk-tests/issues

We can use the Issues API provided by GitHub to list the issues of a repository.

After storing all existing issue contents, continuously monitor and collect new issues in these repos.

To collect new issues, we could save the latest updated timestamp each time we query. On the next query, we pass this timestamp to the Issues API through the since parameter, which returns only issues created or updated after that time. So we maintain a variable that stores the latest timestamp and use it for each new query.
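A minimal sketch of this polling idea, using only the Python standard library. The helper names (`build_issues_url`, `fetch_issues`, `only_issues`, `latest_timestamp`) are hypothetical and not part of any existing tool; pagination, authentication, and rate limiting are left out.

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timezone

API_ROOT = "https://api.github.com"
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601 in UTC, as used by the GitHub API


def build_issues_url(owner, repo, since=None, state="open", per_page=100):
    """Build the 'list repository issues' URL, optionally restricted by `since`.

    `since` must be a timezone-aware datetime.
    """
    params = {"state": state, "per_page": per_page}
    if since is not None:
        params["since"] = since.astimezone(timezone.utc).strftime(TIME_FORMAT)
    query = urllib.parse.urlencode(params)
    return f"{API_ROOT}/repos/{owner}/{repo}/issues?{query}"


def fetch_issues(owner, repo, since=None):
    """Fetch one page of issues (network call; pagination omitted for brevity)."""
    request = urllib.request.Request(
        build_issues_url(owner, repo, since),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def only_issues(items):
    """Drop pull requests, which this endpoint returns alongside issues."""
    return [item for item in items if "pull_request" not in item]


def latest_timestamp(issues, previous=None):
    """Return the newest `updated_at` among `issues`, falling back to `previous`."""
    stamps = [
        datetime.strptime(issue["updated_at"], TIME_FORMAT).replace(tzinfo=timezone.utc)
        for issue in issues
    ]
    if previous is not None:
        stamps.append(previous)
    return max(stamps) if stamps else None
```

Each polling round would call `fetch_issues(owner, repo, since=last_seen)` and then set `last_seen = latest_timestamp(issues, last_seen)` before the next round.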

Please let me know your thoughts on this @LongyuZhang @llxia @smlambert. Thank you!

avishreekh avatar Apr 30 '21 14:04 avishreekh

I talked with @LongyuZhang; below are some of the details:

We should query git repos at an appropriate frequency (every 30 mins?).

  • The relationship should be stored in a DB (e.g., MongoDB):
[ { "url": "https://api.github.com/repos/octocat/Hello-World/issues/1347",
    "repository_url": "https://api.github.com/repos/octocat/Hello-World",
    "number": 1347,
    "state": "open",
    "title": "Found a bug",
    "created_at": "2011-04-10T20:09:31Z",
    "updated_at": "2014-03-03T18:58:10Z",
    "issue_content_path": "/path to the content file/issueContent/<repo name>_<issue#>.txt",
    "test_output_path": "/path to the content file/testOutput/<repo name>_<issue#>.txt"
    },....
   ]
  • Text data (the git issue content file and the test output file) should be stored on the file system.
  • We can use since to limit the git query to issues created/updated after a certain time, matching our query interval.
  • Optional: if the data is too large, we can use a label to narrow down the search (e.g., label="test failure").
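As a rough illustration of the record layout above, the sketch below derives one DB record from an issue object returned by the GitHub API. The function name `build_record` and the `base_dir` argument are assumptions made for this example; `base_dir` stands in for the "/path to the content file" placeholder, whose real value is not given here.

```python
from pathlib import PurePosixPath


def build_record(issue, base_dir):
    """Map a GitHub issue dict to the relationship record stored in the DB.

    `base_dir` is a hypothetical root directory for the text files.
    """
    # The repo name is the last segment of repository_url,
    # e.g. ".../repos/octocat/Hello-World" -> "Hello-World".
    repo_name = issue["repository_url"].rstrip("/").split("/")[-1]
    file_name = f"{repo_name}_{issue['number']}.txt"
    base = PurePosixPath(base_dir)
    return {
        "url": issue["url"],
        "repository_url": issue["repository_url"],
        "number": issue["number"],
        "state": issue["state"],
        "title": issue["title"],
        "created_at": issue["created_at"],
        "updated_at": issue["updated_at"],
        "issue_content_path": str(base / "issueContent" / file_name),
        "test_output_path": str(base / "testOutput" / file_name),
    }
```

Only the two paths are derived; all other fields are copied unchanged from the API response, so the record stays small while the large text lives on the file system.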

In summary:

Step1: figure out the git query using since
Step2: write a job that queries git periodically
Step3: split the returned data into issue content and test output, and store both as files in the file system
Step4: store the relationship and data in the DB. If an issue is updated, the data in the DB should be updated accordingly
Step5: trigger the ML model training program to read /path to the content file/testOutput/<repo name>_<issue#>.txt
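The update rule in Step4 could look roughly like the sketch below. It uses a plain dict as a stand-in for the DB so the example stays self-contained; `upsert_record` is a hypothetical helper, and records are assumed to carry the repository_url, number, and updated_at fields from the example record. With MongoDB, the equivalent operation would presumably be an upsert keyed on the same two fields (for instance pymongo's `update_one(..., upsert=True)`).

```python
def upsert_record(store, record):
    """Insert a new record or refresh an existing one.

    `store` is a dict standing in for the DB collection. Records are keyed
    by (repository_url, number), which identifies an issue across repos.
    Returns "inserted", "updated", or "unchanged".
    """
    key = (record["repository_url"], record["number"])
    existing = store.get(key)
    if existing is None:
        store[key] = dict(record)
        return "inserted"
    # Timestamps share one fixed UTC format (YYYY-MM-DDTHH:MM:SSZ),
    # so plain string comparison orders them chronologically.
    if record["updated_at"] > existing["updated_at"]:
        existing.update(record)
        return "updated"
    return "unchanged"
```

The return value tells the caller whether the stored files need rewriting and whether Step5 should be triggered for this issue.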

llxia avatar Apr 30 '21 16:04 llxia

Thank you for the elaborate discussion @llxia. Please let me know if I can work on this.

avishreekh avatar May 04 '21 13:05 avishreekh

Please go ahead. Thanks a lot for working on this!

llxia avatar May 04 '21 13:05 llxia

I was wondering if we could use GitHub webhooks instead of polling the API. That way, we will be notified automatically whenever a new issue is added, and we won't have to keep polling for it.

Please let me know your thoughts on this.

Thank you

avishreekh avatar May 04 '21 13:05 avishreekh

It is a good idea to use GitHub webhooks to monitor new issues, but for the initial collection of existing issues, the Issues API may work better. We can try to use them separately for these two purposes if possible. Thanks.
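For reference, the receiving side of a webhook could be sketched as below, leaving out the HTTP server itself. GitHub sends the event type in the X-GitHub-Event header and, when a secret is configured, an HMAC-SHA256 signature of the raw body in the X-Hub-Signature-256 header. The function names and the set of actions handled are assumptions made for this example.

```python
import hashlib
import hmac
import json

# Issue actions treated as "new or changed content" in this sketch (an assumption).
RELEVANT_ACTIONS = {"opened", "edited", "reopened"}


def verify_signature(secret, body, signature_header):
    """Check the X-Hub-Signature-256 header against the raw request body.

    `secret` and `body` are bytes; `signature_header` is the header value.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)


def extract_issue(event_name, body):
    """Return the issue dict from an 'issues' event, or None if not relevant."""
    if event_name != "issues":
        return None
    payload = json.loads(body)
    if payload.get("action") not in RELEVANT_ACTIONS:
        return None
    return payload["issue"]
```

The issue object in the payload has the same fields as one returned by the Issues API, so the initial collection and the webhook path could share the same storage code.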

LongyuZhang avatar May 04 '21 14:05 LongyuZhang

I agree. Since we need to query multiple repos, I think the git API is more flexible and easier to use. It is also a good idea to keep an eye on alternatives (e.g., webhooks, GitHub workflows, etc.), so we know their advantages and disadvantages.

llxia avatar May 04 '21 14:05 llxia

Thank you @LongyuZhang @llxia! I will first try to implement the initial collection of issues using the Issues API and poll for new issues using the since parameter. The Webhook integration can be done later if it is found to be a better alternative. I will also look for other alternatives in the meantime.

Please let me know if this sounds like a good strategy to begin with or if any modifications are needed.

Thank you.

avishreekh avatar May 04 '21 16:05 avishreekh

Sounds good! Thanks @avishreekh

LongyuZhang avatar May 04 '21 16:05 LongyuZhang

For now, we are querying git for issues. But please keep in mind that we may not be limited to git issues. The data could come from other bug-tracking systems.
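One way to keep that door open is to hide each system behind a small common interface, sketched below. `IssueSource`, `StaticSource`, and `collect` are hypothetical names for illustration only; a real GitHub-backed source would implement the same method on top of the Issues API.

```python
from abc import ABC, abstractmethod


class IssueSource(ABC):
    """Anything that can report issues updated after a given timestamp."""

    name = "unknown"

    @abstractmethod
    def fetch_updated_since(self, since):
        """Return a list of issue dicts updated after `since` (None = all)."""


class StaticSource(IssueSource):
    """In-memory source used here only to demonstrate the interface."""

    def __init__(self, name, issues):
        self.name = name
        self._issues = list(issues)

    def fetch_updated_since(self, since):
        # Timestamps are ISO 8601 UTC strings in one format,
        # so string comparison is chronological.
        return [
            issue for issue in self._issues
            if since is None or issue["updated_at"] > since
        ]


def collect(sources, since=None):
    """Gather issues from every source, tagging each with its origin."""
    collected = []
    for source in sources:
        for issue in source.fetch_updated_since(since):
            collected.append({"source": source.name, **issue})
    return collected
```

The rest of the pipeline would then depend only on `collect`, not on which bug-tracking system produced the issue.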

llxia avatar May 05 '21 13:05 llxia