spack-infrastructure
spack-infrastructure copied to clipboard
Add scripts to populate and query db with build statistics
This PR adds two new scripts:
generate_builds_stats.py Fetches and caches api endpoint results and job artifacts, and uses them to populate two tables in an sqlite3 database: pipelines and builds. See that script for the table definitions.
If you run the above script many times, only new things are fetched and added to the database, and all the data cached on the filesystem allow the subsequent runs to finish quickly.
query_build_stats.py Just gives some example queries against the database. I'm not very familiar with SQL, so the queries don't yet produce exactly what we want, but hopefully the information is there for someone who has stronger sql-fu. If not, we can tweak what we're gathering and storing.
Eventually, we may want to consider standing up a service in the cluster backed by some persistent storage where we can keep the database growing, which will allow us to quickly formulate and issue queries against a long history of stored builds. Right now, I'm just running the script locally and growing a database on my laptop.
Just noticed, it looks like the project (or maybe just this directory) needs a .gitignore with an entry to avoid adding pycache files.
This is awesome!
Not sure if this is what you were going for, but I think something like this will give you every spec that had built more than once in a given PR, along with the corresponding PR:
SELECT pipelines.pr_number AS pr,
builds.full_hash AS hash,
COUNT(*) AS count
FROM builds
INNER JOIN pipelines
ON builds.pipeline_id == pipelines.pipeline_id
WHERE pr IS NOT NULL
GROUP BY pr, hash
ORDER BY count DESC
Or maybe you want to work out how many times a build had to be retried in a PR (not counting the initial attempt):
SELECT pr,
SUM(count) - COUNT(*) AS retries
FROM (
SELECT pipelines.pr_number AS pr,
COUNT(*) AS count
FROM builds
INNER JOIN pipelines
ON builds.pipeline_id == pipelines.pipeline_id
WHERE pr IS NOT NULL
GROUP BY pr, builds.full_hash
)
GROUP BY pr
ORDER BY retries DESC
Buyer beware: I have not tested any of these queries. They almost certainly have bugs/typos! 😄