Track downloads of vdk

Open antoniivanov opened this issue 2 years ago • 5 comments

What is the feature request? What problem does it solve?

PyPI has pretty nice stats about downloads.

But we cannot differentiate user downloads from downloads caused by our CI/CD: https://pypistats.org/packages/quickstart-vdk

It would be nice if we could.

Suggested solution

Try to identify which downloads are caused by our CI/CD, and then either exclude them or label them.

antoniivanov, Jan 26 '23 14:01

Triaged. We will keep it, since this feature is in line with the current VDK roadmap and we would like to address it.

sabadzhiev, Jul 11 '23 11:07

From the PyPI FAQ:

What about downloads due to CI/CD tools?

Downloads from CI/CD tools are included in all metrics. There is currently no easy way to attribute downloads to build/deployment tools.

chrfoyer, Dec 22 '23 21:12

I have added a monthly download badge to each of the READMEs, including the one for each plugin. I was not able to distinguish user downloads from CI/CD downloads, but that seems to be the case for others as well, as the comment above shows.

https://github.com/vmware/versatile-data-kit/pull/2983

chrfoyer, Dec 22 '23 23:12

I approached the CI/CD download issue from a few different angles and landed on using the public BigQuery dataset. Here is the query I am running:

#standardSQL
SELECT
  file.project,
  COUNT(*) as download_count
FROM
  `bigquery-public-data.pypi.file_downloads`
WHERE
  file.project IN ('vdk-core',
  'vdk-control-cli',
  'vdk-heartbeat', 
  'airflow-provider-vdk',
  'quickstart-vdk',
  'vdk-audit',
  'vdk-control-api-auth',
  'vdk-dag',
  'vdk-data-sources',
  'vdk-gdp-execution-id',
  'vdk-huggingface',
  'vdk-ingest-file',
  'vdk-ipython',
  'vdk-jupyter',
  'vdk-lineage',
  'vdk-meta-jobs',
  'vdk-oracle',
  'vdk-postgres',
  'vdk-server',
  'vdk-smarter',
  'vdk-sqlite',
  'vdk-test-utils') AND
  DATE(timestamp) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH) AND CURRENT_DATE() AND
  details.installer.name NOT LIKE '%pip/x.y.z Python/x.y.z Linux/x.y.z%' AND
  details.installer.name NOT LIKE '%actions/setup-python%'
GROUP BY
  file.project
ORDER BY
  file.project ASC
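
A query like the one above could also be generated from a list of package names, which makes it easier to keep the plugin list in sync. The following is a minimal Python sketch, not the query that was actually run. The function name `build_download_query` and its parameters are hypothetical. The `details.ci` filter is an assumption: the public table appears to expose a boolean `details.ci` column populated from what the installer reports about itself, but that should be verified against the current table schema before relying on it.

```python
import re

# Public PyPI download table referenced in the query above.
TABLE = "bigquery-public-data.pypi.file_downloads"

# Conservative pattern for package names, so nothing unexpected
# is interpolated into the SQL string.
_NAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def build_download_query(projects, days=30, exclude_ci=True):
    """Return a Standard SQL string that counts downloads per project."""
    if not projects:
        raise ValueError("at least one project name is required")
    for name in projects:
        if not _NAME_RE.fullmatch(name):
            raise ValueError(f"unexpected project name: {name!r}")

    in_list = ", ".join(f"'{name}'" for name in projects)
    filters = [
        f"file.project IN ({in_list})",
        # Filtering on the timestamp column limits how much data is scanned.
        f"DATE(timestamp) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL {int(days)} DAY) "
        "AND CURRENT_DATE()",
    ]
    if exclude_ci:
        # ASSUMPTION: details.ci is TRUE for self-reported CI installs and
        # NULL otherwise. IS NOT TRUE keeps the NULL rows, which a plain
        # NOT LIKE or != comparison would silently drop.
        filters.append("details.ci IS NOT TRUE")

    where = "\n  AND ".join(filters)
    return (
        "SELECT file.project, COUNT(*) AS download_count\n"
        f"FROM `{TABLE}`\n"
        f"WHERE {where}\n"
        "GROUP BY file.project\n"
        "ORDER BY file.project ASC"
    )


if __name__ == "__main__":
    print(build_download_query(["vdk-core", "quickstart-vdk"]))
```

Even if such a flag is available, it would only catch installers that identify themselves as running in CI, so the resulting count is at best a lower bound on CI/CD traffic.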

Here is the result: [image]

BigQuery estimates that the query processes 3.2 GB per run. At $6.25 per TiB processed, that comes out to about $0.018 per run.
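
The per-run cost can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the $6.25 per TiB rate quoted above. The helper name `query_cost_usd` is made up for illustration, and whether the 3.2 GB estimate means 10^9 or 2^30 bytes per GB is an assumption, so both readings are shown.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def query_cost_usd(bytes_processed, usd_per_tib=6.25):
    """On-demand cost of a query that scans `bytes_processed` bytes."""
    return bytes_processed / TIB * usd_per_tib


# ASSUMPTION: the unit behind "3.2 GB" is ambiguous, so compute both.
cost_decimal = query_cost_usd(3.2 * 10 ** 9)  # GB as 10^9 bytes
cost_binary = query_cost_usd(3.2 * 2 ** 30)   # GB as 2^30 bytes

print(f"decimal GB: ${cost_decimal:.4f} per run")
print(f"binary GB:  ${cost_binary:.4f} per run")
```

Either reading gives roughly two cents per run. The $0.018 figure corresponds to the decimal interpretation.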

Importantly, the result doesn't match the numbers shown on the PyPI Stats page. What would you recommend? [image]

@yonitoo Thanks for your comments on the PR, I would appreciate your feedback on this.

chrfoyer, Jan 02 '24 21:01

Is this closed by #2983? Also, thanks to @yonitoo for fixing the closing tags.

chrfoyer, Feb 18 '24 21:02