
Provide a BigQuery Public Dataset of OSV data

andrewpollock opened this issue 3 years ago · 6 comments

An interesting conversation today spawned the idea of providing OSV data via a BigQuery Public Dataset.

Apparently there's a path from Cloud Datastore to BigQuery. Something to explore further in the future.

andrewpollock avatar Dec 06 '22 23:12 andrewpollock

I initially thought https://cloud.google.com/bigquery/docs/loading-data-cloud-datastore may be an easy way to achieve this, but our underlying Datastore entity format does not match the OSV schema exactly, so it may not be easily usable.

oliverchang avatar Dec 07 '22 04:12 oliverchang

Having this both as a day-partitioned time series and as just the latest snapshot would be brilliant.

MikeMoore63 avatar May 02 '24 06:05 MikeMoore63
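For illustration, assuming such a day-partitioned table existed (the project, dataset, and table names below are made up, not anything osv.dev actually provides), "just latest" would fall out as a query, or a view, over the newest ingestion-date partition:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Each day's snapshot lands in its own ingestion-date partition, so the
# latest snapshot is simply the newest partition.
latest = client.query("""
    SELECT *
    FROM `my-project.osv.vulnerabilities`
    WHERE _PARTITIONDATE = (
      SELECT MAX(_PARTITIONDATE)
      FROM `my-project.osv.vulnerabilities`)
""").result()
```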

This issue has not had any activity for 60 days and will be automatically closed in two weeks

github-actions[bot] avatar Jul 27 '24 18:07 github-actions[bot]

I took a bit of a look, and another easy path would be loading from GCS... would there be appetite to add another exporter (or extend the existing exporter) to generate NDJSON files? Those could then easily be loaded into BigQuery from GCS.

katzj avatar Nov 25 '24 16:11 katzj
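A hedged sketch of that loading side, assuming the exporter had already written NDJSON files to a GCS bucket (the bucket, project, dataset, and table names are made up for illustration):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # Infer the schema from the JSON records.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# A single-wildcard URI lets one job pick up all per-ecosystem files.
client.load_table_from_uri(
    "gs://example-osv-exports/*.ndjson.gz",
    "my-project.osv.vulnerabilities",
    job_config=job_config,
).result()  # Block until the load job finishes.
```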

Interesting. So if I'm reading https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json and http://jsonlines.org/ correctly, this is a matter of emitting a single (or perhaps per-ecosystem) file with all of the records concatenated, one per line, and optionally compressed? This doesn't seem like it would be particularly difficult to modify the exporter to do.

andrewpollock avatar Nov 25 '24 20:11 andrewpollock
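For what it's worth, the file-writing side really is small. A minimal sketch, where the record iterables and the per-ecosystem file naming are assumptions rather than the exporter's actual structure:

```python
import gzip
import json

def write_ndjson_export(records, path):
    """Write an iterable of OSV records (dicts) as gzip-compressed NDJSON."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            # NDJSON is just one JSON object per line, no enclosing array.
            f.write(json.dumps(record) + "\n")

# e.g. one file per ecosystem, named like "PyPI.ndjson.gz":
# write_ndjson_export(pypi_records, "PyPI.ndjson.gz")
```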

I am doing something like this in my own use case, and have been for several years, and over that time I've had to deal with two areas. The first is that BigQuery does not let you easily change a column's type: in schema 1.58 the credits[].type field changed to an array (credits[].type[]), so I had to coerce the single-value entries from pre-1.58 records into arrays, bringing the older data forward to the newer schema.

The other area has been the "database_specific" and "ecosystem_specific" fields, which I serialise to JSON and load as strings, since it is unclear whether their free-form contents would remain valid for BigQuery. BigQuery constrains the characters allowed in column names (https://cloud.google.com/bigquery/docs/schemas#column_names); the OSV schema itself conforms, but I would assume names outside the OSV schema's governance are a possible challenge. Otherwise this approach has worked for me with no issues. I just wanted to make sure the constraints on field names were clear.

MikeMoore63 avatar Nov 26 '24 06:11 MikeMoore63
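To make those two workarounds concrete, here is a hedged sketch of a pre-load normalization step. The field locations follow the OSV schema (database_specific at the top level, and database_specific/ecosystem_specific under affected[]); the schema-1.58 credits[].type change is taken from the comment above as-is.

```python
import json

def normalize_for_bigquery(record):
    """Bring an OSV record (a dict) into a BigQuery-friendly shape."""
    # Pre-1.58 records carry credits[].type as a single string, newer ones as
    # an array; BigQuery can't change a column's type in place, so coerce the
    # old form forward to the new one.
    for credit in record.get("credits", []):
        if isinstance(credit.get("type"), str):
            credit["type"] = [credit["type"]]

    # database_specific / ecosystem_specific hold arbitrary JSON whose keys
    # need not be valid BigQuery column names, so store them as JSON strings.
    def stringify(obj, field):
        if isinstance(obj.get(field), (dict, list)):
            obj[field] = json.dumps(obj[field])

    stringify(record, "database_specific")
    for affected in record.get("affected", []):
        stringify(affected, "database_specific")
        stringify(affected, "ecosystem_specific")
    return record
```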