crates.io
Archive old entries in `version_downloads` table
We should find a place to archive daily download counts and drop old entries from `version_downloads`. We only ever query for the last 90 days of recent downloads. We could upload a CSV of the previous day's downloads to S3 as part of a daily background job.
Currently, the `version_downloads` table consumes 4241 MB and its primary key index consumes 1825 MB. Reducing the size of this table should greatly reduce cache pressure on our database server (with 4 GB of RAM) and will make the size of our experimental database dumps much more practical.
Summary from the team meeting today:
- we already only export the last 90 days in the database dump
- we need a way to convert older entries in the database to CSV files (or similar)
- we need a public place to store these exported files
- afterwards we can remove the old data from the database
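The steps in this summary could be sketched roughly as follows. This is only an illustration, not the crates.io implementation: the function names, the in-memory row format, and the one-file-per-day naming are all assumptions made for the example. Only the 90-day window and the `date,version_id,downloads` columns come from this thread.

```python
import csv
import io
from collections import defaultdict
from datetime import date, timedelta

RETENTION_DAYS = 90  # from the thread: only the last 90 days are queried


def split_by_cutoff(rows, today):
    """Split (date, version_id, downloads) rows into (keep, archive).

    Rows strictly older than `today - RETENTION_DAYS` go to `archive`.
    """
    cutoff = today - timedelta(days=RETENTION_DAYS)
    keep, archive = [], []
    for row in rows:
        (archive if row[0] < cutoff else keep).append(row)
    return keep, archive


def to_daily_csv(rows):
    """Group rows by day and render one CSV document per day.

    Returns a dict mapping a hypothetical file name (YYYY-MM-DD.csv)
    to the CSV text for that day.
    """
    by_day = defaultdict(list)
    for day, version_id, downloads in rows:
        by_day[day].append((version_id, downloads))

    files = {}
    for day, entries in by_day.items():
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["date", "version_id", "downloads"])
        for version_id, downloads in sorted(entries):
            writer.writerow([day.isoformat(), version_id, downloads])
        files[f"{day.isoformat()}.csv"] = buf.getvalue()
    return files
```

In a real job, the `archive` rows would only be deleted from the database after the corresponding files had been uploaded and verified.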
Any update on this? Specifically, how can older data be accessed?
I can provide a complete version_downloads table dating back to 2014-11-11 to whoever sends me a preferred way to share a large file. Currently the CSV is 120,554,803 rows and 2.3 GB, or 361 MB gzipped.
Thanks for your help. If you can upload it to any file sharing service, that would be really helpful. Gdrive/Dropbox/Onedrive or any service you prefer.
https://send.vis.ee/download/6030078658da7a07/#QzIAS1VImWg0p5WfEAi9Dw
$ zcat version_downloads.csv.gz | (head; tail)
date,version_id,downloads
2014-11-11,6,7
2014-11-11,9,1
2014-11-11,10,1
2014-11-11,12,1
2014-11-11,13,1
2014-11-11,15,1
2014-11-11,16,1
2014-11-11,17,1
2014-11-11,20,1
2022-08-10,599691,6
2022-08-10,599692,6
2022-08-10,599693,6
2022-08-10,599694,4
2022-08-10,599695,5
2022-08-10,599696,5
2022-08-10,599697,5
2022-08-10,599698,4
2022-08-10,599699,4
2022-08-10,599700,4
Data from the last day is obviously partial because the day is not over yet.
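For anyone working with a file of this shape, here is a small sketch of aggregating it without loading all rows into memory. It assumes only the `date,version_id,downloads` header shown above. The function names and the per-month grouping are choices made for the example.

```python
import csv
import gzip
from collections import Counter


def monthly_totals(lines):
    """Sum downloads per YYYY-MM from an iterable of CSV text lines.

    Expects the header `date,version_id,downloads` as in the sample above.
    """
    totals = Counter()
    for row in csv.DictReader(lines):
        totals[row["date"][:7]] += int(row["downloads"])
    return totals


def monthly_totals_from_gzip(path):
    """Stream a gzipped CSV from disk, decompressing it line by line."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return monthly_totals(handle)
```

Calling `monthly_totals_from_gzip("version_downloads.csv.gz")` would process the file one row at a time, which matters for a file with more than 120 million rows.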
@dtolnay Thank you so much!!
@dtolnay
Hi! My name is Tak-Ho Lee, and I am conducting research on open-source sustainability at the School of Computer Science at CMU, under Dr. Christian Kaestner. Carol Nichols directed me here.
I want to gather project data, including the repository link, download counts, etc. As the issue mentions, the DB dump only has the past 90 days, so I was wondering if I could receive the CSV you're hosting (the link says it has expired).
Hi @tlee0818, we just published a dataset for research purposes at Nature Scientific Data that does include downloads (with parsed repo URLs, commits, and much more) until September:
https://www.nature.com/articles/s41597-022-01819-z
Metadata here
And full data here
Feel free to reach out for more info!
Hi @wschuell, I was exploring the sample dataset but had trouble finding where monthly download counts exist. Could I have pointers to find it?
Thanks!
@tlee0818 I'm replying here now, but the discussion should probably continue elsewhere to avoid spamming this issue; you can create an issue on this repo or you can easily find my academic email on csh.ac.at.
Assuming you downloaded the dataset from figshare:
The raw data (by version and day -- corresponding to the data in the original dumps that we could complete thanks to C. Nichols) is in the package_version_downloads table of the SQLite DB, or in the corresponding CSV (careful, the id column does not match the official dumps). Refer to the README for where to find the files. To get the monthly downloads globally, you can follow this notebook with time_window='month' set at the beginning; to get them by crate, you can run this query on the SQLite DB:
SELECT sum(pvd.downloads) AS dl_cnt,date(downloaded_at,'start of month') AS month,p.name FROM package_version_downloads pvd
INNER JOIN package_versions pv
ON pv.id=pvd.package_version
INNER JOIN packages p
ON p.id=pv.package_id
GROUP BY p.name,p.id,month
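To try the query above without downloading the dataset, the sketch below runs it against a tiny in-memory SQLite database. The table and column names are taken from the query itself. The column types, the sample rows, the crate names `alpha` and `beta`, and the added `ORDER BY` are assumptions for illustration and may not match the real schema.

```python
import sqlite3

# Same query as above, with an ORDER BY added so the output is deterministic.
MONTHLY_BY_CRATE = """
SELECT sum(pvd.downloads) AS dl_cnt,
       date(pvd.downloaded_at, 'start of month') AS month,
       p.name
FROM package_version_downloads pvd
INNER JOIN package_versions pv ON pv.id = pvd.package_version
INNER JOIN packages p ON p.id = pv.package_id
GROUP BY p.name, p.id, month
ORDER BY p.name, month
"""


def build_toy_db():
    """Create a minimal, hypothetical version of the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE packages (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE package_versions (id INTEGER PRIMARY KEY, package_id INTEGER);
        CREATE TABLE package_version_downloads (
            package_version INTEGER, downloaded_at TEXT, downloads INTEGER);
        INSERT INTO packages VALUES (1, 'alpha'), (2, 'beta');
        INSERT INTO package_versions VALUES (10, 1), (11, 1), (20, 2);
        INSERT INTO package_version_downloads VALUES
            (10, '2022-07-03', 5),
            (11, '2022-07-20', 7),
            (10, '2022-08-01', 2),
            (20, '2022-08-10', 4);
        """
    )
    return conn


def monthly_downloads(conn):
    """Return (dl_cnt, month, name) tuples, one per crate and month."""
    return conn.execute(MONTHLY_BY_CRATE).fetchall()
```

Each month is reported as its first day (for example `2022-07-01`), because `date(..., 'start of month')` truncates the timestamp rather than formatting it.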
- https://github.com/rust-lang/crates.io/pull/8596 has implemented a background worker job which archives old (90+ days) data from the `version_downloads` table to S3
- yesterday we archived everything from 2014-11-11 to 2024-02-14 to S3
- the S3 bucket is not publicly reachable (yet)
> the S3 bucket is not publicly reachable (yet)

this was fixed last week and the archive is now publicly available at https://static.crates.io/archive/version-downloads/ :)
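A rough sketch of reading from the public archive follows. Only the base URL comes from this thread. The per-day `YYYY-MM-DD.csv` file naming and the presence of a header row are assumptions that should be checked against the actual archive before relying on them.

```python
import csv
from datetime import date

# Base URL as posted above.
ARCHIVE_BASE = "https://static.crates.io/archive/version-downloads/"


def archive_url(day):
    """Build the URL for one day's file.

    ASSUMPTION: one file per day named YYYY-MM-DD.csv; this thread does
    not confirm the naming scheme.
    """
    return f"{ARCHIVE_BASE}{day.isoformat()}.csv"


def parse_archive(text):
    """Parse CSV text into a list of dicts keyed by the header row.

    ASSUMPTION: the file starts with a header row; no particular column
    names are relied on here.
    """
    return list(csv.DictReader(text.splitlines()))
```

The text for one day could then be fetched with, for example, `urllib.request.urlopen(archive_url(date(2024, 2, 14))).read().decode()` and passed to `parse_archive`.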