
Data dumps feedback

Open kornelski opened this issue 5 years ago • 6 comments

Overall, having full data access is great! However, the implementation could be improved:

  1. The tarball format is problematic. It doesn't allow random access, so to extract only the interesting parts it's necessary to download and decompress most of the archive. A ZIP archive or individually gzipped files would be more convenient for selective consumption.

  2. crates.csv is needed to map crate names to crates.io's internal IDs, so parsing this file is required to make sense of the rest of the data. However, this file also contains the bodies of README files, which are relatively large. It'd be nice to publish READMEs separately. It also has textsearchable_index_col, which is Postgres-specific and redundant.

  3. version_downloads.csv is the largest file, and it will keep growing. It'd be nice to shard this data by time (e.g. a separate file per year, or even per day). I would like to fetch download data daily, but I'd rather download one day of data each day, not all days every day.
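The random-access difference behind point 1 can be sketched in Python; the paths and member names below are hypothetical, not the dump's actual layout:

```python
import tarfile
import zipfile

def read_from_tarball(tar_path: str, member: str) -> bytes:
    # A .tar.gz has no index: the archive must be decompressed and
    # scanned sequentially, so every entry stored before `member`
    # is read past before it is found.
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.extractfile(member).read()

def read_from_zip(zip_path: str, member: str) -> bytes:
    # A ZIP archive has a central directory, so a single member can
    # be located and decompressed without touching the other entries.
    with zipfile.ZipFile(zip_path) as z:
        return z.read(member)
```

Both return the same bytes; the difference is that the ZIP path only decompresses the requested member, which matters once the archive holds hundreds of megabytes of unrelated tables.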

kornelski avatar Dec 30 '19 01:12 kornelski

Thanks a lot for the feedback!

In the first implementation of the database dumps, we planned to support one way of using the data – by importing it back to an empty Postgres database. One of the use cases we want to support is local testing with realistic data during development of crates.io itself. We intentionally chose the CSV export format instead of a format specific to Postgres to allow other uses of the data as well, but we did not put any effort in facilitating other ways of using the data. At least so far – we are happy to make improvements based on your feedback. :)

  1. It should be relatively straightforward to also provide ZIP archives, or we could switch to ZIP instead of tarballs altogether. So far we have advertised the dumps as "experimental", so we are free to make incompatible changes.

    Making individually gzipped files available adds considerable complexity, since we would need a way to advertise the list of files that comprise the full database dump (which may change over time). Currently we simply have a single fixed location in our S3 bucket that gets overwritten every day.

  2. The files are database dumps. Each file corresponds to a table in the database, and given the other use cases we want to support I don't think we will fundamentally depart from this approach. However, we may mitigate the two issues you mention anyway. First, the readme column in the crates table is likely going to go away – readme files are stored per version rather than per crate. And we can probably simply disable exporting textsearchable_index_col and tell Postgres to regenerate it after import if desired.

  3. I completely agree that it is kind of ridiculous that the version_downloads table accounts for two thirds of the whole size of the dump, while it stays largely the same every day. Right now I don't have a good idea how to address this without fundamentally changing the way the exports work – maybe we need a second, completely different export format aimed at different use cases?
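The per-day sharding suggested above could be sketched roughly like this; the column names (`date`, etc.) are assumptions about the dump's CSV layout, not a description of how crates.io actually generates exports:

```python
import csv
import os
from collections import defaultdict

def shard_by_date(rows, date_field="date"):
    """Group version_downloads rows by their day, so each day can be
    published as a separate file and consumers fetch only new days."""
    shards = defaultdict(list)
    for row in rows:
        shards[row[date_field]].append(row)
    return shards

def write_daily_shards(in_path: str, out_dir: str) -> None:
    # Split the monolithic version_downloads.csv into one CSV per day.
    os.makedirs(out_dir, exist_ok=True)
    with open(in_path, newline="") as f:
        reader = csv.DictReader(f)
        fields = reader.fieldnames
        shards = shard_by_date(reader)
    for day, rows in shards.items():
        with open(os.path.join(out_dir, f"{day}.csv"), "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)
```

With this layout a daily consumer downloads one small file per day instead of the full, ever-growing table.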

Could you please explain a bit more in what way you want to use the data? The best way to support lib.rs would be some kind of event stream interface, I guess?

smarnach avatar Jan 02 '20 11:01 smarnach

For lib.rs I'm interested only in data that is not available in .crate archives/Cargo.toml. For example, I don't need the dependencies, authors, categories and keywords tables, because I already have that data from the primary source. I almost don't need README files, except for crates that use a ../README.md path for the README, because in that case the README file is missing from the .crate archive.

I started by extracting download numbers. Currently I use only aggregated download numbers per crate per day, and it's OK if the data is up to a week stale.
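That aggregation requires joining two tables from the dump, since version_downloads is keyed by version ID rather than crate ID. A minimal sketch, assuming the columns are named `id`, `crate_id`, `version_id`, `date`, and `downloads` (an assumption about the CSV layout):

```python
import csv
from collections import defaultdict

def crate_downloads_per_day(versions_csv: str, downloads_csv: str):
    """Roll per-version daily downloads up to per-crate daily totals."""
    # versions.csv maps each version ID to its crate ID.
    version_to_crate = {}
    with open(versions_csv, newline="") as f:
        for row in csv.DictReader(f):
            version_to_crate[row["id"]] = row["crate_id"]
    # Sum downloads per (crate_id, date) across all of a crate's versions.
    totals = defaultdict(int)
    with open(downloads_csv, newline="") as f:
        for row in csv.DictReader(f):
            crate = version_to_crate.get(row["version_id"])
            if crate is not None:
                totals[(crate, row["date"])] += int(row["downloads"])
    return totals
```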

I'm planning to use crate ownership data next. The data dump saves a lot of requests to the website API, and it has GitHub IDs (the website API doesn't expose the IDs). However, I'll probably still fall back to the website API for newly published crates, because I want to display correct ownership on crate pages as soon as I create them. Changes in ownership would be useful as some kind of event stream, since they can happen independently of crate releases.

I'm also interested in knowing who published each version of the crate (not Cargo.toml authors, but GitHub account/token owner), so that I can judge which users are most active.

kornelski avatar Jan 02 '20 20:01 kornelski

Thanks for the details! I currently don't have much time to work on this, but I plan to do so in the next few months. I hope to get feedback from other people as well – I should probably ask on URLO.

We are currently working on an auditing table for crate owner changes. Maybe we can add some kind of event stream interface for selected database tables. This could cover ownership changes and download counts, which would address two of your points.

The information available via the API and in the database dumps is almost identical, but the database dumps publish a few additional database columns.

The information about who published a crate version is available in the crates.io database, but currently it's not made public. I'm open to publishing this data, since it could be valuable for auditing dependencies as well.

smarnach avatar Jan 02 '20 21:01 smarnach

> The information about who published a crate version is available in the crates.io database, but currently it's not made public. I'm open to publishing this data, since it could be valuable for auditing dependencies as well.

I'm confused, versions.published_by is currently public?

carols10cents avatar Jan 03 '20 13:01 carols10cents

@carols10cents You are right – it's exposed in both the database dumps and the API (e.g. https://crates.io/api/v1/crates/rand/0.7.1 returns a published_by key). I think I got confused by the unused versions_published_by table.

smarnach avatar Jan 05 '20 20:01 smarnach

I'm interested in using the dumps in https://github.com/rust-secure-code/cargo-supply-chain and we're also facing issues 1 and 2 from the original post.

The data we need is just the list of owners for each crate, about 3.5 MB all told. However, we have to download the entire 250 MB archive to get to that info, and download and parse a very large file with READMEs in it to get the mapping from crate names to crate IDs.
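Even streaming just the name-to-ID mapping out of crates.csv takes some care, because the embedded README bodies can exceed Python's default CSV field-size limit. A sketch, assuming the dump has `id`, `name`, and `readme` columns (column names are an assumption):

```python
import csv

def crate_name_to_id(crates_csv: str) -> dict:
    """Build the crate-name -> internal-ID map from crates.csv,
    discarding the bulky README column as rows stream by."""
    # README bodies can exceed csv's default 128 KiB field limit,
    # so raise it before parsing.
    csv.field_size_limit(2**31 - 1)
    mapping = {}
    with open(crates_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            mapping[row["name"]] = row["id"]
    return mapping
```

This keeps memory bounded to one row at a time, but it still has to read (and throw away) every README byte, which is exactly the inefficiency the comment above is pointing at.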

Shnatsel avatar Oct 10 '20 13:10 Shnatsel

https://developer.fastly.com/learning/concepts/purging/#forwarding-purge-requests

17351269 avatar May 29 '24 17:05 17351269

https://github.com/rust-lang/simpleinfra/pull/421 was merged today, which means that https://static.crates.io/db-dump.zip is now available :)

Turbo87 avatar Jun 10 '24 14:06 Turbo87