Open-Assistant
Data set: Python code and unit test pairings
Re: #279, it should be pretty possible to automatically create a huge data set made of functions and unit tests that cover the functions.
This could work in both directions: "write me some code that passes these tests" and "write me some tests for this code", with the former being useful for automating software development (take a feature description and write acceptance criteria, take acceptance criteria and write tests, take tests and write code that passes the tests).
At the risk of solutionizing early, here are some thoughts on steps to build this data set:
Identify candidate projects
- Find repos with CI files that contain the word "pytest"
- Replace that `pytest` line with something else (e.g. `python3 -m injection_test`)
- Run the CI job. If the step runs, this is a good candidate because we can automatically switch out the pytest line for our own code (see the sketch after this list).
- Add the repo to the list of candidates, along with metadata about the project (license, last commit date etc)
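A minimal sketch of that pytest-swap check, assuming the repo uses GitHub Actions and PyYAML is available; `python3 -m injection_test` is just the placeholder runner name from the list above, and real workflow files would need more careful handling (matrix jobs, composite actions, etc.):

```python
# Sketch: detect a pytest step in a GitHub Actions workflow and swap it for our
# own runner. Assumes PyYAML is installed; the replacement command is a placeholder.
import yaml

REPLACEMENT_COMMAND = "python3 -m injection_test"  # hypothetical runner name from above

def swap_pytest_step(workflow_path: str) -> bool:
    """Return True if a step invoking pytest was found and replaced in place."""
    with open(workflow_path) as f:
        workflow = yaml.safe_load(f)

    found = False
    for job in (workflow.get("jobs") or {}).values():
        for step in job.get("steps", []):
            if "pytest" in step.get("run", ""):
                step["run"] = REPLACEMENT_COMMAND
                found = True

    if found:
        with open(workflow_path, "w") as f:
            yaml.safe_dump(workflow, f, sort_keys=False)
    return found
```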
Identify tests and the code executed by them
- Use `pytest --discover` to discover tests.
- Run `pytest` for each test, one at a time, and have it output a `.cover` file (see the sketch after this list)
- Extract the code for the test (introspect/ast/pickle it for later?)
- Snag the `.cover` file and the code executed by it
- Save this metadata for the next step
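A rough sketch of the per-test step, with two deliberate substitutions: pytest's collection flag is `--collect-only` rather than `--discover`, and coverage.py's JSON report is used here instead of trace-module `.cover` files because it is easier to post-process. Assumes pytest and coverage are installed in the target repo's environment:

```python
# Sketch: collect test node IDs, then run each test under coverage.py and keep
# the executed lines per file for the pairing step.
import json
import subprocess

def collect_test_ids(repo_dir: str) -> list[str]:
    """List test node IDs, e.g. 'tests/test_foo.py::test_bar'."""
    out = subprocess.run(
        ["pytest", "--collect-only", "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return [line for line in out.stdout.splitlines() if "::" in line]

def coverage_for_test(repo_dir: str, test_id: str) -> dict[str, list[int]]:
    """Run a single test under coverage.py and return executed lines per file."""
    subprocess.run(["coverage", "run", "-m", "pytest", test_id], cwd=repo_dir)
    subprocess.run(["coverage", "json", "-o", "coverage.json"], cwd=repo_dir)
    with open(f"{repo_dir}/coverage.json") as f:
        report = json.load(f)
    return {path: info["executed_lines"] for path, info in report["files"].items()}
```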
Heuristics for extracting executed code
We need to figure out which parts are the function under test, and which parts are setup steps. The best strategy will depend on the repo we're working with, but here are some general strategies:
- The test name should mention the function/method under test (see the sketch after this list)
- The directory structure should be similar
- Tests in the same file should test the same module
- The structure of the tests should be "setup steps, blank line, execution step, blank line, validation steps" - if so we can get the candidate function name from the execution step.
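The first heuristic (the test name mentions the function under test) could be sketched like this; it's purely illustrative, `candidate_function` is a hypothetical helper, and real matching would need to combine several of the heuristics above:

```python
# Sketch of the name-matching heuristic: strip the test_ prefix and look for a
# function of that name in the candidate module's source via ast.
import ast

def candidate_function(test_name: str, module_source: str) -> ast.FunctionDef | None:
    """Return the function definition the test name appears to target, if any."""
    target = test_name.removeprefix("test_")  # e.g. test_parse_config -> parse_config
    for node in ast.walk(ast.parse(module_source)):
        if isinstance(node, ast.FunctionDef) and node.name == target:
            return node
    return None
```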
Filtering the work to be done
This will create a metric forkton of data work and data, so we'll need some way to filter dupes. Here are some ideas:
- Look at how repos are forked from each other, only use the one with the latest commit head.
- Hash the strings of the tests, and early-out if we have seen them previously (sketched after this list)
- Hash the functions under test
- Throw out all functions that don't have full coverage; if we combine all the .cover files and there's uncovered lines, then the tests aren't good enough and shouldn't be included in the output data.
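A tiny sketch of that hash-based early-out, assuming the seen-set lives in memory; in a real pipeline it would be persisted alongside the rest of the metadata:

```python
# Sketch: dedupe tests by hashing their normalized source text.
import hashlib

seen_hashes: set[str] = set()

def is_duplicate(test_source: str) -> bool:
    """Return True if an identical test body has already been processed."""
    digest = hashlib.sha256(test_source.strip().encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```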
This is very cool! I like the idea, and it could produce very high-quality data, which is always nice.
Hello,
I'm starting to work on the "Identify candidate projects" step.
I am limited to 5000 requests per day with my account, and 60 requests per day per IP without an account.
I will start off by writing the code to get the top 100 projects and filter them, then see how to scale it.
We could also use grep.app, but I will look into that later if it is needed.
EDIT: For now I am going to focus on Python repos that have more than 200 stars, 25,103 in total.
@theol-git really cool, thanks. It looks like we can get 1000 results from the GraphQL search API at a time, so 5000 requests should be enough to get the vast majority of Python project details, just not the code. Unless I'm reading the docs wrong.
Then I guess the next step should be a bit of research to figure out which ones have unit tests and CI, and what those CI platforms are? I dunno. But yeah getting a list of the most important projects is definitely the first step.
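For reference, a rough sketch of the kind of GraphQL repository search being discussed, assuming a `GITHub_TOKEN`-style environment variable (here `GITHUB_TOKEN`) and the `requests` package; the `stars:>200 language:Python` qualifier mirrors the filter mentioned above, and pagination/rate-limit handling is left out:

```python
# Sketch: one page of GitHub GraphQL repository search results.
import os
import requests

QUERY = """
query($cursor: String) {
  search(query: "language:Python stars:>200", type: REPOSITORY, first: 100, after: $cursor) {
    pageInfo { endCursor hasNextPage }
    nodes {
      ... on Repository {
        nameWithOwner
        stargazerCount
        pushedAt
        licenseInfo { spdxId }
      }
    }
  }
}
"""

def fetch_repo_page(cursor: str | None = None) -> dict:
    """Fetch one page of repository search results from the GraphQL API."""
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"cursor": cursor}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    )
    resp.raise_for_status()
    return resp.json()["data"]["search"]
```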
Really great idea team!
@bitplane I was actually using the REST API, but the GraphQL API does seem much better for this use case, so I will be changing to that.
As for the filtering, GitHub allows us to search repos for specific files, so I am currently looking for any file in the .github/workflows directory that contains pytest, then filtering on those. This first part should not take too long in terms of dev work.
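A minimal sketch of that workflow-file search using the REST code search endpoint, again assuming a `GITHUB_TOKEN` environment variable and `requests`; pagination and rate-limit handling are omitted:

```python
# Sketch: find files under .github/workflows that mention pytest via code search.
import os
import requests

def search_pytest_workflows(page: int = 1) -> list[dict]:
    """Return one page of code search results for pytest in workflow files."""
    resp = requests.get(
        "https://api.github.com/search/code",
        params={"q": "pytest path:.github/workflows", "page": page, "per_page": 100},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    )
    resp.raise_for_status()
    return resp.json()["items"]
```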
Also, when gathering data, please include the license info. Or better yet, don't include any GPL-type code, as we want any code our model outputs to be unrestricted and to respect the license wishes of the software developers. Thank you!
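A sketch of the license filter being asked for here, keyed on the SPDX ID already present in the repo metadata; the deny-list below is only an example, and the actual policy is a project decision:

```python
# Example deny-list of copyleft license families; whether to allow LGPL, MPL,
# etc. is a policy choice, not settled here.
COPYLEFT_SPDX_PREFIXES = ("GPL-", "AGPL-", "LGPL-")

def is_license_ok(spdx_id: str | None) -> bool:
    """Reject repos with missing license info or a copyleft license."""
    if not spdx_id or spdx_id == "NOASSERTION":
        return False
    return not spdx_id.startswith(COPYLEFT_SPDX_PREFIXES)
```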
I am interested in helping with this. I could start working on extracting and matching code and tests.
@theol-git Do you have a list of repos using pytest that you can share with me? It doesn't have to be complete.
Also, what other programming languages are we planning to support? This could be done for other languages as well.
@mikastamm I am currently really busy with IRL stuff.
I do have a list, though, and will get it to you ASAP.
@theol-git friendly reminder, in case you forgot about it :)
@mikastamm I have time and will look into it today :)
Removed myself from this for now; I'm still interested but don't have the time at the moment with other tasks. I'm still subscribed though, and if you need any help/input, ping me here or on Discord :)
@mikastamm Sorry for not being responsive, I've been handling IRL stuff. I had time to look into it today and have written a bit of code that queries the GitHub API; it can currently check whether a repo's pipeline has a file with "pytest" inside.
I implemented simple checks to see whether pytest is actually called, but I am currently gathering data to cover all the edge cases (every way pytest can be invoked).
I will continue to work on this tonight so that it can gather the data automatically, and will leave it running during the night.
Closing old data issue.