
Archive old entries in `version_downloads` table

Open jtgeibel opened this issue 3 years ago • 11 comments

We should find a place to archive daily download counts, and drop old entries from version_downloads. We only ever query for the last 90 days of recent downloads. We could upload a CSV of the previous day's downloads to S3 as part of a daily background job.

Currently, the version_downloads table consumes 4241 MB and its primary key index consumes 1825 MB. Reducing the size of this table should greatly reduce cache pressure on our database server (with 4 GB of RAM) and will make the size of our experimental database dumps much more practical.

jtgeibel avatar Apr 01 '21 02:04 jtgeibel

Summary from the team meeting today:

  • we already only export the last 90 days in the database dump
  • we need a way to convert older entries in the database to CSV files (or similar)
  • we need a public place to store these exported files
  • afterwards we can remove the old data from the database
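
The export-then-delete steps in the list above can be sketched roughly as follows. This is only an illustration, not the crates.io implementation: it uses an in-memory SQLite table as a stand-in for the real database, the function name `archive_old_downloads` and the `keep_days` parameter are hypothetical, and the upload to a public storage location is left out (the CSV text is simply returned per day).

```python
import csv
import io
import sqlite3
from datetime import date, timedelta


def archive_old_downloads(conn, today, keep_days=90):
    """Export rows older than `keep_days` as one CSV string per day, then delete them.

    Hypothetical helper: assumes a `version_downloads(date, version_id, downloads)`
    table with ISO-formatted date strings.
    """
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    rows = conn.execute(
        "SELECT date, version_id, downloads FROM version_downloads "
        "WHERE date < ? ORDER BY date, version_id",
        (cutoff,),
    ).fetchall()

    buffers, writers = {}, {}
    for day, version_id, downloads in rows:
        if day not in buffers:
            buffers[day] = io.StringIO()
            writers[day] = csv.writer(buffers[day], lineterminator="\n")
            writers[day].writerow(["date", "version_id", "downloads"])
        writers[day].writerow([day, version_id, downloads])

    # Delete only after every CSV has been built; a real job would upload first.
    with conn:
        conn.execute("DELETE FROM version_downloads WHERE date < ?", (cutoff,))
    return {day: buf.getvalue() for day, buf in buffers.items()}


def demo():
    # Toy data standing in for the real table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE version_downloads (date TEXT, version_id INTEGER, downloads INTEGER)"
    )
    conn.executemany(
        "INSERT INTO version_downloads VALUES (?, ?, ?)",
        [("2014-11-11", 6, 7), ("2014-11-11", 9, 1), ("2022-08-10", 599691, 6)],
    )
    archives = archive_old_downloads(conn, today=date(2022, 8, 10))
    remaining = conn.execute("SELECT COUNT(*) FROM version_downloads").fetchone()[0]
    return archives, remaining
```

Building all CSVs before deleting anything keeps the sketch safe against a failed export; a production job would presumably also confirm the upload before removing rows.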

Turbo87 avatar Sep 17 '21 13:09 Turbo87

Any update on this? Specifically, how can older data be accessed?

nikhilpatel0 avatar Aug 10 '22 05:08 nikhilpatel0

I can provide a complete version_downloads table dating back to 2014-11-11 to whoever sends me a preferred way to share a large file. Currently the CSV has 120,554,803 rows and is 2.3 GB, or 361 MB gzipped.

dtolnay avatar Aug 10 '22 05:08 dtolnay

Thanks for your help. If you can upload it to any file sharing service, that would be really helpful. Gdrive/Dropbox/Onedrive or any service you prefer.

nikhilpatel0 avatar Aug 10 '22 05:08 nikhilpatel0

https://send.vis.ee/download/6030078658da7a07/#QzIAS1VImWg0p5WfEAi9Dw

$ zcat version_downloads.csv.gz | (head; tail)
date,version_id,downloads
2014-11-11,6,7
2014-11-11,9,1
2014-11-11,10,1
2014-11-11,12,1
2014-11-11,13,1
2014-11-11,15,1
2014-11-11,16,1
2014-11-11,17,1
2014-11-11,20,1
2022-08-10,599691,6
2022-08-10,599692,6
2022-08-10,599693,6
2022-08-10,599694,4
2022-08-10,599695,5
2022-08-10,599696,5
2022-08-10,599697,5
2022-08-10,599698,4
2022-08-10,599699,4
2022-08-10,599700,4

Data from the last day is obviously partial because the day is not over yet.
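
For anyone working with a file in this shape, here is a minimal sketch of streaming the gzipped CSV and summing downloads per day without loading everything into memory. It assumes only the three-column header shown above (`date,version_id,downloads`); the function name `downloads_per_day` is made up for illustration, and the demo uses a tiny in-memory sample instead of the real file.

```python
import csv
import gzip
import io
from collections import Counter


def downloads_per_day(gz_stream):
    """Sum the `downloads` column per `date` from a gzipped CSV.

    `gz_stream` may be a path or a binary file object, as accepted by gzip.open.
    """
    totals = Counter()
    with gzip.open(gz_stream, mode="rt", newline="") as text:
        for row in csv.DictReader(text):
            totals[row["date"]] += int(row["downloads"])
    return dict(totals)


def demo():
    # Three rows copied from the sample output above, compressed in memory.
    sample = (
        "date,version_id,downloads\n"
        "2014-11-11,6,7\n"
        "2014-11-11,9,1\n"
        "2022-08-10,599691,6\n"
    )
    return downloads_per_day(io.BytesIO(gzip.compress(sample.encode("utf-8"))))
```

With the real file, passing the path `version_downloads.csv.gz` instead of the in-memory buffer should work the same way.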

dtolnay avatar Aug 10 '22 06:08 dtolnay

@dtolnay Thank you so much!!

nikhilpatel0 avatar Aug 10 '22 06:08 nikhilpatel0

@dtolnay

Hi! My name is Tak-Ho Lee, and I am conducting research on open-source sustainability at the School of Computer Science at CMU, under Dr. Christian Kaestner. Carol Nichols directed me here.

I want to gather project data, including the repository link, download counts, etc. As the issue mentions, the DB dump only has the past 90 days, so I was wondering if I could receive the CSV you're hosting (the link says it has expired).

tlee0818 avatar Nov 30 '22 19:11 tlee0818

Hi @tlee0818, we just published a dataset for research purposes at Nature Scientific Data that does include downloads (with parsed repo URLs, commits, and much more) through September: https://www.nature.com/articles/s41597-022-01819-z
Metadata here and full data here. Feel free to reach out for more info!

wschuell avatar Nov 30 '22 20:11 wschuell

Hi @wschuell, I was exploring the sample dataset but had trouble finding where the monthly download counts are. Could you point me to them?

Thanks!

tlee0818 avatar Dec 08 '22 22:12 tlee0818

@tlee0818 I'm replying here for now, but the discussion should probably continue elsewhere to avoid spamming this issue; you can create an issue on this repo, or you can easily find my academic email on csh.ac.at.

Assuming you downloaded the dataset from figshare:

The raw data (by version and day -- corresponding to the data in the original dumps that we could complete thanks to C. Nichols) is in the package_version_downloads table of the SQLite DB, or in the corresponding CSV (careful, the id column does not match the official dumps). Refer to the README for where to find the files. To get the monthly downloads globally, you can follow this notebook with time_window='month' at the beginning, or by crate you can run this query on the SQLite DB:

SELECT sum(pvd.downloads) AS dl_cnt,
       date(downloaded_at, 'start of month') AS month,
       p.name
FROM package_version_downloads pvd
INNER JOIN package_versions pv ON pv.id = pvd.package_version
INNER JOIN packages p ON p.id = pv.package_id
GROUP BY p.name, p.id, month
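
To try the query without downloading the full dataset, here is a small sketch that runs it from Python against a toy in-memory SQLite database. The three-table schema below is inferred only from the column names used in the query and is an assumption, not the dataset's actual schema; the crate names and numbers are invented sample data, and an `ORDER BY` is appended purely to make the output order deterministic.

```python
import sqlite3

MONTHLY_QUERY = """
SELECT sum(pvd.downloads) AS dl_cnt,
       date(downloaded_at, 'start of month') AS month,
       p.name
FROM package_version_downloads pvd
INNER JOIN package_versions pv ON pv.id = pvd.package_version
INNER JOIN packages p ON p.id = pv.package_id
GROUP BY p.name, p.id, month
ORDER BY p.name, month
"""


def monthly_downloads(conn):
    """Return (dl_cnt, month, name) rows, one per crate and calendar month."""
    return conn.execute(MONTHLY_QUERY).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema, reduced to the columns the query touches.
    conn.executescript(
        """
        CREATE TABLE packages (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE package_versions (id INTEGER PRIMARY KEY, package_id INTEGER);
        CREATE TABLE package_version_downloads (
            package_version INTEGER, downloaded_at TEXT, downloads INTEGER
        );
        INSERT INTO packages VALUES (1, 'serde'), (2, 'rand');
        INSERT INTO package_versions VALUES (10, 1), (11, 1), (20, 2);
        INSERT INTO package_version_downloads VALUES
            (10, '2022-08-01', 5),
            (11, '2022-08-15', 3),
            (10, '2022-09-02', 4),
            (20, '2022-08-03', 2);
        """
    )
    return monthly_downloads(conn)
```

Note that downloads from different versions of the same crate (versions 10 and 11 here) are summed into a single monthly row, which is what the two joins are for.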

wschuell avatar Dec 09 '22 00:12 wschuell

  • https://github.com/rust-lang/crates.io/pull/8596 has implemented a background worker job which archives old (90+ days) data from the version_downloads table to S3
  • yesterday we archived everything from 2014-11-11 to 2024-02-14 to S3
  • the S3 bucket is not publicly reachable (yet)

Turbo87 avatar May 16 '24 07:05 Turbo87

> the S3 bucket is not publicly reachable (yet)

This was fixed last week, and the archive is now publicly available at https://static.crates.io/archive/version-downloads/ :)

Turbo87 avatar Aug 24 '24 08:08 Turbo87