Experimental database dumps changelog

Open pietroalbini opened this issue 4 years ago • 12 comments

This is a low-traffic issue tracking all changes to the experimental database dumps. We recommend subscribing to this issue to get notified whenever we change the contents of the dumps.

pietroalbini avatar May 14 '21 15:05 pietroalbini

The next crates.io deploy (happening in the next few days) will include the following changes to the database dumps:

  • PR #3612: The textsearchable_index_col column will be removed from crates.csv, as that column is an implementation detail of crates.io's search. Users importing the database dumps into a PostgreSQL database will not be affected by this change, as a trigger will populate that column at import time.
  • PR #3611: The version_downloads.csv file will only include the last 90 days of data instead of full day-to-day historical data. Cumulative download counts are still available in crates.csv and versions.csv.
  • PR #3549: The version_authors.csv file will be removed, as that data was deleted from the crates.io database too.

We also plan to make the following changes in the future:

  • Issue #3479: all the data from version_downloads.csv will be moved out of the database dump into separate files, one for each day. This will allow clients interested in this data to download it separately.

pietroalbini avatar May 14 '21 15:05 pietroalbini

Two relevant changes were just deployed:

  • https://github.com/rust-lang/crates.io/pull/5077 and
  • https://github.com/rust-lang/crates.io/pull/5074

Turbo87 avatar Aug 16 '22 21:08 Turbo87

  • https://github.com/rust-lang/crates.io/pull/8155 will delete the badges table

Turbo87 avatar Feb 19 '24 13:02 Turbo87

  • https://github.com/rust-lang/crates.io/pull/8232 added a new crate_downloads table, which is supposed to replace the crates.downloads column soon. this was done for performance reasons to reduce the amount of bloat in the crates table from the regular downloads column updates. at the moment the data should be in sync, but if everything works out we will stop writing to the crates.downloads column in the near future and eventually remove it.

Turbo87 avatar Mar 06 '24 08:03 Turbo87

  • as mentioned in the last update, https://github.com/rust-lang/crates.io/pull/8295 is going to disable writes to the crates.downloads column. we will keep the column around for now to avoid unnecessary schema churn, but once the system has shown the expected performance benefits we will most likely remove the column completely.

Turbo87 avatar Mar 13 '24 09:03 Turbo87

  • once https://github.com/rust-lang/crates.io/pull/8233 is merged and deployed it will remove the crates.downloads column. please use the crate_downloads table instead.
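
For anyone migrating a script from crates.downloads to the crate_downloads table, here is a minimal sketch of the join using the CSV files from the dump. The column names (`id` and `name` in crates.csv, `crate_id` and `downloads` in crate_downloads.csv) and the helper name `downloads_by_crate_name` are assumptions for illustration, not something confirmed in this thread, so please check them against the header row of the actual files.

```python
import csv
import io


def downloads_by_crate_name(crates_file, crate_downloads_file):
    """Map crate name -> total downloads by joining the two CSV tables.

    Assumed columns (not confirmed in this thread):
      crates.csv:          id, name
      crate_downloads.csv: crate_id, downloads
    """
    # Build an id -> name lookup from the crates table.
    names = {row["id"]: row["name"] for row in csv.DictReader(crates_file)}

    totals = {}
    for row in csv.DictReader(crate_downloads_file):
        name = names.get(row["crate_id"])
        if name is not None:  # skip rows without a matching crate
            totals[name] = int(row["downloads"])
    return totals


# Tiny made-up sample standing in for the real CSV files.
crates_sample = io.StringIO("id,name\n1,serde\n2,rand\n")
crate_downloads_sample = io.StringIO("crate_id,downloads\n1,100\n2,42\n")
sample_totals = downloads_by_crate_name(crates_sample, crate_downloads_sample)
print(sample_totals)
```

With the real dump you would pass file objects opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory samples.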

Turbo87 avatar Apr 12 '24 11:04 Turbo87

  • https://github.com/rust-lang/crates.io/pull/8484 will introduce a new experimental default_versions table with a mapping from crates to their "default" version, which will be shown by the frontend and used in e.g. reverse dependency queries.
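
As a rough illustration of how such a mapping could be consumed from the CSV files, the sketch below resolves each crate's default version to a version number. It assumes default_versions.csv has `crate_id` and `version_id` columns and versions.csv has `id` and `num` columns; those names and the helper `default_version_numbers` are hypothetical, so please verify them against the actual dump.

```python
import csv
import io


def default_version_numbers(default_versions_file, versions_file):
    """Map crate id -> version number of that crate's default version.

    Assumed columns (not confirmed in this thread):
      default_versions.csv: crate_id, version_id
      versions.csv:         id, num
    """
    # version_id -> crate_id for every default version.
    wanted = {
        row["version_id"]: row["crate_id"]
        for row in csv.DictReader(default_versions_file)
    }

    result = {}
    for row in csv.DictReader(versions_file):
        crate_id = wanted.get(row["id"])
        if crate_id is not None:  # this version is some crate's default
            result[crate_id] = row["num"]
    return result


# Tiny made-up sample standing in for the real CSV files.
default_versions_sample = io.StringIO("crate_id,version_id\n1,11\n2,20\n")
versions_sample = io.StringIO(
    "id,crate_id,num\n10,1,1.0.0\n11,1,1.0.1\n20,2,0.8.5\n"
)
sample_defaults = default_version_numbers(default_versions_sample, versions_sample)
print(sample_defaults)
```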

Turbo87 avatar Apr 25 '24 12:04 Turbo87

  • https://github.com/rust-lang/crates.io/pull/8748 added an experimental ZIP file artifact at https://static.crates.io/db-dump.zip. the advantage of this file is that you do not have to decompress the entire archive if you only need a single database table CSV file. unlike the tarball, the ZIP file does not have a top-level datetime path prefix; otherwise the files should contain the exact same data.
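
A minimal sketch of reading a single table from the ZIP file with Python's standard `zipfile` module, which only decompresses the member that is opened. It assumes the archive has already been downloaded locally, and it looks the member up by file name rather than hard-coding a path, since the exact directory layout inside the archive is not stated here. The helper name `read_table_from_zip` is made up for this example.

```python
import csv
import io
import zipfile


def read_table_from_zip(zip_source, table_name):
    """Return the rows of one table CSV, decompressing only that member.

    `zip_source` is a path or a binary file-like object for the archive.
    """
    file_name = f"{table_name}.csv"
    with zipfile.ZipFile(zip_source) as archive:
        # Match by file name so no directory layout has to be assumed.
        matches = [
            name for name in archive.namelist()
            if name == file_name or name.endswith("/" + file_name)
        ]
        if not matches:
            raise KeyError(f"no {file_name} in archive")
        with archive.open(matches[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a small in-memory archive standing in for the downloaded file.
sample_archive = io.BytesIO()
with zipfile.ZipFile(sample_archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data/crates.csv", "id,name\n1,serde\n")
    zf.writestr("data/versions.csv", "id,crate_id,num\n10,1,1.0.0\n")
sample_archive.seek(0)

sample_rows = read_table_from_zip(sample_archive, "crates")
print(sample_rows)
```

For large tables, iterating over the `csv.DictReader` inside the `with` block instead of building a list would avoid holding every row in memory.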

Turbo87 avatar Jun 10 '24 14:06 Turbo87