Open-Assistant
Data set: Python code and unit test pairings
Re: #279, it should be pretty possible to automatically create a huge data set made of functions and unit tests that cover the functions.
This could work in both directions: "write me some code that passes these tests" and "write me some tests for this code", with the former being useful for automating software development (take a feature description and write acceptance criteria, take acceptance criteria and write tests, take tests and write code that passes the tests).
At the risk of solutionizing early, here are some thoughts on steps to build this data set:
Identify candidate projects
- Find repos with CI files that contain the word "pytest"
- Replace that `pytest` line with something else (e.g. `python3 -m injection_test`)
- Run the CI job. If the step runs, this is a good candidate because we can automatically switch out the pytest line for our own code (see the sketch after this list).
- Add the repo to the list of candidates, along with metadata about the project (license, last commit date etc)
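A minimal sketch of that pytest-swap check, assuming the repo uses GitHub Actions and PyYAML is available; `python3 -m injection_test` is just the placeholder runner name from the list above, and real workflow files would need more careful handling (matrix jobs, composite actions, etc.):

```python
# Sketch: detect a pytest step in a GitHub Actions workflow and swap it for our
# own runner. Assumes PyYAML is installed; the replacement command is a placeholder.
import yaml

REPLACEMENT_COMMAND = "python3 -m injection_test"  # hypothetical runner name from above

def swap_pytest_step(workflow_path: str) -> bool:
    """Return True if a step invoking pytest was found and replaced in place."""
    with open(workflow_path) as f:
        workflow = yaml.safe_load(f)

    found = False
    for job in (workflow.get("jobs") or {}).values():
        for step in job.get("steps", []):
            if "pytest" in step.get("run", ""):
                step["run"] = REPLACEMENT_COMMAND
                found = True

    if found:
        with open(workflow_path, "w") as f:
            yaml.safe_dump(workflow, f, sort_keys=False)
    return found
```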
Identify tests and the code executed by them
- Use `pytest --discover` to discover tests.
- Run `pytest` for each test, one at a time, and have it output a `.cover` file (see the sketch after this list)
- Extract the code for the test (introspect/ast/pickle it for later?)
- Snag the `.cover` file and the code executed by it
- Save this metadata for the next step
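A rough sketch of the per-test step, with two deliberate substitutions: pytest's collection flag is `--collect-only` rather than `--discover`, and coverage.py's JSON report is used here instead of trace-module `.cover` files because it is easier to post-process. Assumes pytest and coverage are installed in the target repo's environment:

```python
# Sketch: collect test node IDs, then run each test under coverage.py and keep
# the executed lines per file for the pairing step.
import json
import subprocess

def collect_test_ids(repo_dir: str) -> list[str]:
    """List test node IDs, e.g. 'tests/test_foo.py::test_bar'."""
    out = subprocess.run(
        ["pytest", "--collect-only", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return [line for line in out.stdout.splitlines() if "::" in line]

def coverage_for_test(repo_dir: str, test_id: str) -> dict[str, list[int]]:
    """Run a single test under coverage.py and return executed lines per file."""
    subprocess.run(["coverage", "run", "-m", "pytest", test_id], cwd=repo_dir)
    subprocess.run(["coverage", "json", "-o", "coverage.json"], cwd=repo_dir)
    with open(f"{repo_dir}/coverage.json") as f:
        report = json.load(f)
    return {path: info["executed_lines"] for path, info in report["files"].items()}
```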
Heuristics for extracting executed code
We need to figure out which parts are the function under test, and which parts are setup steps. The best strategy will depend on the repo we're working with, but here are some general strategies:
- The test name should mention the function/method under test (see the sketch after this list)
- The directory structure should be similar
- Tests in the same file should test the same module
- The structure of the tests should be "setup steps, blank line, execution step, blank line, validation steps" - if so we can get the candidate function name from the execution step.
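The first heuristic (the test name mentions the function under test) could be sketched like this; it's purely illustrative, `candidate_function` is a hypothetical helper, and real matching would need to combine several of the heuristics above:

```python
# Sketch of the name-matching heuristic: strip the test_ prefix and look for a
# function of that name in the candidate module's source via ast.
import ast

def candidate_function(test_name: str, module_source: str) -> ast.FunctionDef | None:
    """Return the function definition the test name appears to target, if any."""
    target = test_name.removeprefix("test_")  # e.g. test_parse_config -> parse_config
    for node in ast.walk(ast.parse(module_source)):
        if isinstance(node, ast.FunctionDef) and node.name == target:
            return node
    return None
```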
Filtering the work to be done
This will create a metric forkton of data work and data, so we'll need some way to filter dupes. Here are some ideas:
- Look at how repos are forked from each other, only use the one with the latest commit head.
- Hash the strings of the tests, and early-out if we have seen them previously (sketched after this list)
- Hash the functions under test
- Throw out all functions that don't have full coverage; if we combine all the .cover files and there's uncovered lines, then the tests aren't good enough and shouldn't be included in the output data.
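A tiny sketch of that hash-based early-out, assuming the seen-set lives in memory; in a real pipeline it would be persisted alongside the rest of the metadata:

```python
# Sketch: dedupe tests by hashing their normalized source text.
import hashlib

seen_hashes: set[str] = set()

def is_duplicate(test_source: str) -> bool:
    """Return True if an identical test body has already been processed."""
    digest = hashlib.sha256(test_source.strip().encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```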
This is very cool! I like the idea, and it could produce very high-quality data, which is always nice.
Hello,
I'm starting to work on the "Identify candidate projects" step.
I am limited to 5000 requests per day with my account, and 60 requests per day per IP without an account.
I will start off by writing the code to get the top 100 projects and filter them, then see how to scale it.
We could also use grep.app, but I will look into that later if it is needed.
EDIT: For now I am going to focus on Python repos that have more than 200 stars, 25,103 in total.
@theol-git really cool, thanks. It looks like we can get 1000 results from the GraphQL search API at a time, so 5000 requests should be enough to get the vast majority of Python project details, just not the code. Unless I'm reading the docs wrong.
Then I guess the next step should be a bit of research to figure out which ones have unit tests and CI, and what those CI platforms are? I dunno. But yeah getting a list of the most important projects is definitely the first step.
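For reference, a rough sketch of the kind of GraphQL repository search being discussed, assuming a `GITHub_TOKEN`-style environment variable (here `GITHUB_TOKEN`) and the `requests` package; the `stars:>200 language:Python` qualifier mirrors the filter mentioned above, and pagination/rate-limit handling is left out:

```python
# Sketch: one page of GitHub GraphQL repository search results.
import os
import requests

QUERY = """
query($cursor: String) {
  search(query: "language:Python stars:>200", type: REPOSITORY, first: 100, after: $cursor) {
    pageInfo { endCursor hasNextPage }
    nodes {
      ... on Repository {
        nameWithOwner
        stargazerCount
        pushedAt
        licenseInfo { spdxId }
      }
    }
  }
}
"""

def fetch_repo_page(cursor: str | None = None) -> dict:
    """Fetch one page of repository search results from the GraphQL API."""
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"cursor": cursor}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["search"]
```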
Really great idea team!
@bitplane I was actually using the REST API, but the GraphQL API does seem much better for this use case, so I will be changing to that.
As for the filtering, GitHub allows us to search repos for specific files, so I am currently looking for any file in the .github/workflows directory that contains pytest, then filtering on those. This first part should not take too long in terms of dev work.
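A minimal sketch of that workflow-file search using the REST code search endpoint, again assuming a `GITHUB_TOKEN` environment variable and `requests`; pagination and rate-limit handling are omitted:

```python
# Sketch: find files under .github/workflows that mention pytest via code search.
import os
import requests

def search_pytest_workflows(page: int = 1) -> list[dict]:
    """Return one page of code search results for pytest in workflow files."""
    resp = requests.get(
        "https://api.github.com/search/code",
        params={"q": "pytest path:.github/workflows", "page": page, "per_page": 100},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    )
    resp.raise_for_status()
    return resp.json()["items"]
```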
Also, when gathering data, please include the license info. Or better yet, don't include any GPL-type code, as we want any code our model outputs to be unrestricted and to respect the license wishes of the software developers. Thank you!
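A sketch of the license filter being asked for here, keyed on the SPDX ID already present in the repo metadata; the deny-list below is only an example, and the actual policy is a project decision:

```python
# Example deny-list of copyleft license families; whether to allow LGPL, MPL,
# etc. is a policy choice, not settled here.
COPYLEFT_SPDX_PREFIXES = ("GPL-", "AGPL-", "LGPL-")

def is_license_ok(spdx_id: str | None) -> bool:
    """Reject repos with missing license info or a copyleft license."""
    if not spdx_id or spdx_id == "NOASSERTION":
        return False
    return not spdx_id.startswith(COPYLEFT_SPDX_PREFIXES)
```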
I am interested in helping with this. I could start working on extracting and matching code and tests.
@theol-git Do you have a list of repos using pytest that you can share with me? It doesn't have to be complete.
Also, what other programming languages are we planning to support? This could be done for other languages as well.
@mikastamm I am currently really busy with IRL stuff.
I do have a list, though, and will get it to you ASAP.
@theol-git friendly reminder, in case you forgot about it :)
@mikastamm I have time and will look into it today :)
Removed myself from this for now; I'm still interested but don't have the time at the moment with other tasks. I'm still subscribed though, and if you need any help/input, ping me here or on Discord :)
@mikastamm Sorry for not being responsive, I've been handling IRL stuff. I had time to look into it today and have written a bit of code that queries the GitHub API; it can currently check whether a repo's pipeline has a file with "pytest" inside.
I implemented simple checks to see whether pytest is actually called, but I am currently gathering data to cover all the edge cases (every way pytest can be invoked).
I will continue to work on this tonight so that it can gather the data automatically, and will leave it running during the night.
Closing old data issue.