versatile-data-kit
Track downloads of vdk
What is the feature request? What problem does it solve?
PyPI has pretty nice stats about downloads.
But we cannot differentiate user downloads from CICD-caused downloads - https://pypistats.org/packages/quickstart-vdk
It would be nice if we could tell them apart.
Suggested solution
Try to identify which downloads are caused by our CI/CD and either exclude them or label them.
Triaged. We will keep it, as it is a feature that is in line with the current VDK roadmap and one we would like to address.
From the PyPI FAQ
What about downloads due to CI/CD tools? Downloads from CI/CD tools are included in all metrics. There is currently no easy way to attribute downloads to build/deployment tools.
I have added a monthly download badge to each of the readmes, including each plugin. I was not able to distinguish user downloads from CI/CD downloads, but that seems to be the case for others as well, as the FAQ excerpt above shows.
https://github.com/vmware/versatile-data-kit/pull/2983
I tried approaching the CI/CD download issue from a few different angles and landed on using the BigQuery dataset. Here is the query I am running:
#standardSQL
SELECT
  file.project,
  COUNT(*) AS download_count
FROM
  `bigquery-public-data.pypi.file_downloads`
WHERE
  file.project IN ('vdk-core',
    'vdk-control-cli',
    'vdk-heartbeat',
    'airflow-provider-vdk',
    'quickstart-vdk',
    'vdk-audit',
    'vdk-control-api-auth',
    'vdk-dag',
    'vdk-data-sources',
    'vdk-gdp-execution-id',
    'vdk-huggingface',
    'vdk-ingest-file',
    'vdk-ipython',
    'vdk-jupyter',
    'vdk-lineage',
    'vdk-meta-jobs',
    'vdk-oracle',
    'vdk-postgres',
    'vdk-server',
    'vdk-smarter',
    'vdk-sqlite',
    'vdk-test-utils')
  AND DATE(timestamp) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH) AND CURRENT_DATE()
  AND details.installer.name NOT LIKE '%pip/x.y.z Python/x.y.z Linux/x.y.z%'
  AND details.installer.name NOT LIKE '%actions/setup-python%'
GROUP BY
  file.project
ORDER BY
  file.project ASC
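Since the package list is long and will change as plugins are added, the same query could be generated from a list instead of being edited by hand. The following is only a minimal sketch: `build_download_query` and its parameters are hypothetical names, while the table and column names are copied from the query above. The exclusion patterns are passed through unchanged, so whether they match any rows still depends on what `details.installer.name` actually contains.

```python
from typing import Iterable

# Table name copied from the query above.
PYPI_DOWNLOADS_TABLE = "bigquery-public-data.pypi.file_downloads"


def build_download_query(
    projects: Iterable[str],
    excluded_installer_patterns: Iterable[str] = (),
    months: int = 1,
) -> str:
    """Return the download-count query as a string (hypothetical helper).

    Values are interpolated without escaping, so only pass trusted,
    hard-coded package names and patterns.
    """
    project_list = ", ".join(f"'{name}'" for name in projects)
    conditions = [
        f"file.project IN ({project_list})",
        "DATE(timestamp) BETWEEN "
        f"DATE_SUB(CURRENT_DATE(), INTERVAL {int(months)} MONTH) AND CURRENT_DATE()",
    ]
    # One NOT LIKE condition per pattern, mirroring the original WHERE clause.
    conditions += [
        f"details.installer.name NOT LIKE '{pattern}'"
        for pattern in excluded_installer_patterns
    ]
    where_clause = "\n  AND ".join(conditions)
    return (
        "#standardSQL\n"
        "SELECT\n  file.project,\n  COUNT(*) AS download_count\n"
        f"FROM\n  `{PYPI_DOWNLOADS_TABLE}`\n"
        f"WHERE\n  {where_clause}\n"
        "GROUP BY\n  file.project\n"
        "ORDER BY\n  file.project ASC"
    )


query = build_download_query(
    ["vdk-core", "quickstart-vdk"],
    excluded_installer_patterns=["%actions/setup-python%"],
)
print(query)
```

The returned string can be pasted into the BigQuery console or passed to whichever client is already in use.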
Here is the result
The estimate is 3.2 GB processed per run. At $6.25 per TiB processed, that comes out to about $0.018 per run.
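For reference, the arithmetic behind that figure can be sketched as follows. It assumes "3.2 GB" means 3.2 × 10^9 bytes and that the $6.25 per TiB price applies to the whole scan; `estimate_query_cost` is a hypothetical helper, not part of any BigQuery client.

```python
TIB_IN_BYTES = 2**40          # 1 TiB in bytes
PRICE_PER_TIB_USD = 6.25      # price per TiB processed, as quoted above


def estimate_query_cost(bytes_processed: float,
                        price_per_tib: float = PRICE_PER_TIB_USD) -> float:
    """Return the estimated cost in USD for scanning `bytes_processed` bytes."""
    return bytes_processed / TIB_IN_BYTES * price_per_tib


# Assumption: 3.2 GB is read as 3.2 * 10**9 bytes.
cost = estimate_query_cost(3.2e9)
print(f"${cost:.3f} per run")  # prints "$0.018 per run"
```

If the 3.2 figure is in GiB instead, the same calculation gives roughly $0.020 per run, so the order of magnitude is unchanged.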
Importantly, the result doesn't match the page on PyPI stats. What would you recommend?
@yonitoo Thanks for your comments on the PR, I would appreciate your feedback on this.
Is this closed by #2983? Also, thanks to @yonitoo for fixing the closing tags.