Some "all.zip" files do not contain all files

Open martin-bpw opened this issue 1 year ago • 2 comments

Describe the bug We have recently noticed that the all.zip files are missing some of the JSON files that are present in the corresponding folders listed at https://storage.googleapis.com/osv-vulnerabilities/index.html, sometimes a few and sometimes quite a lot. This looks like a bug, as the spec says:

https://google.github.io/osv.dev/data/#data-dumps

...This bucket contains individual entries of the format gs://osv-vulnerabilities/<ECOSYSTEM>/<ID>.json as well as a zip containing all vulnerabilities for each ecosystem at gs://osv-vulnerabilities/<ECOSYSTEM>/all.zip....

To Reproduce Compare the number of files in each all.zip with the number of files in the corresponding folder.

Expected behaviour The number of files in the zip should be the same as in the folder.
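The comparison can be sketched in Python as below. This is a minimal sketch, not the script used for the report: the helper names `zip_json_names` and `compare_counts` are hypothetical, and the folder listing is assumed to have been fetched separately as a list of bare file names.

```python
import io
import zipfile


def zip_json_names(zip_bytes: bytes) -> set[str]:
    """Return the names of all .json members in an all.zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {name for name in archive.namelist() if name.endswith(".json")}


def compare_counts(zip_bytes: bytes, folder_names: list[str]) -> dict:
    """Compare zip contents against a folder listing (hypothetical helper).

    folder_names is assumed to hold bare file names such as "EXAMPLE-1.json",
    obtained separately from the bucket listing.
    """
    in_zip = zip_json_names(zip_bytes)
    in_folder = {name for name in folder_names if name.endswith(".json")}
    return {
        "zip": len(in_zip),
        "folder": len(in_folder),
        "missing_from_zip": sorted(in_folder - in_zip),
    }
```

Returning the missing names along with the two counts makes it easier to see which records differ, not only how many.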

Additional context I created a short report on the current state (7 September 2024):

AlmaLinux/: ZIP: 3086, FOLDER: 3087
AlmaLinux:8/: ZIP: 2353, FOLDER: 2355
AlmaLinux:9/: ZIP: 733, FOLDER: 735
Alpine/: ZIP: 3432, FOLDER: 3531
Alpine:v3.10/: ZIP: 1487, FOLDER: 1490
Alpine:v3.11/: ZIP: 1596, FOLDER: 1605
Alpine:v3.12/: ZIP: 1706, FOLDER: 1748
Alpine:v3.13/: ZIP: 1796, FOLDER: 1839
Alpine:v3.14/: ZIP: 1917, FOLDER: 1966
Alpine:v3.15/: ZIP: 2034, FOLDER: 2084
Alpine:v3.16/: ZIP: 2126, FOLDER: 2179
Alpine:v3.17/: ZIP: 2238, FOLDER: 2325
Alpine:v3.18/: ZIP: 2242, FOLDER: 2339
Alpine:v3.19/: ZIP: 2292, FOLDER: 2302
Alpine:v3.2/: ZIP: 301, FOLDER: 305
Alpine:v3.20/: ZIP: 2277, FOLDER: 2287
Alpine:v3.3/: ZIP: 464, FOLDER: 470
Alpine:v3.4/: ZIP: 659, FOLDER: 663
Alpine:v3.5/: ZIP: 805, FOLDER: 809
Alpine:v3.6/: ZIP: 881, FOLDER: 887
Alpine:v3.7/: ZIP: 1034, FOLDER: 1039
Alpine:v3.8/: ZIP: 1188, FOLDER: 1195
Alpine:v3.9/: ZIP: 1319, FOLDER: 1322
Android/: ZIP: 2120, FOLDER: 2476
Bitnami/: ZIP: 4406, FOLDER: 7711
CRAN/: ZIP: 10, FOLDER: 10
Chainguard/: ZIP: 13193, FOLDER: 13193
DWF/: ZIP: 0, FOLDER: 30
Debian/: ZIP: 17194, FOLDER: 18171
Debian:10/: ZIP: 1830, FOLDER: 8712
Debian:11/: ZIP: 7223, FOLDER: 7236
Debian:12/: ZIP: 6518, FOLDER: 6537
Debian:13/: ZIP: 6056, FOLDER: 6164
Debian:3.0/: ZIP: 727, FOLDER: 773
Debian:3.1/: ZIP: 649, FOLDER: 653
Debian:4.0/: ZIP: 669, FOLDER: 670
Debian:5.0/: ZIP: 733, FOLDER: 736
Debian:6.0/: ZIP: 1152, FOLDER: 1152
Debian:7/: ZIP: 1796, FOLDER: 1796
Debian:8/: ZIP: 1826, FOLDER: 1826
Debian:9/: ZIP: 1568, FOLDER: 1568
GIT/: ZIP: 31694, FOLDER: 57517
GSD/: ZIP: 7, FOLDER: 37
GitHub Actions/: ZIP: 19, FOLDER: 20
Go/: ZIP: 3472, FOLDER: 3473
Hackage/: ZIP: 19, FOLDER: 19
Hex/: ZIP: 30, FOLDER: 30
JavaScript/: ZIP: 1, FOLDER: 1
Linux/: ZIP: 15909, FOLDER: 15910
Maven/: ZIP: 5075, FOLDER: 5076
NuGet/: ZIP: 1367, FOLDER: 1373
OSS-Fuzz/: ZIP: 3588, FOLDER: 3588
Packagist/: ZIP: 4046, FOLDER: 4047
Pub/: ZIP: 10, FOLDER: 13
PyPI/: ZIP: 13982, FOLDER: 13985
Rocky Linux/: ZIP: 1333, FOLDER: 1333
Rocky Linux:8/: ZIP: 1008, FOLDER: 1008
Rocky Linux:9/: ZIP: 327, FOLDER: 327
ecosystems.txt: ZIP: 0, FOLDER: 1
index.html: ZIP: 0, FOLDER: 1
RubyGems/: ZIP: 1653, FOLDER: 1653
SwiftURL/: ZIP: 35, FOLDER: 35
UVI/: ZIP: 1, FOLDER: 1
Ubuntu/: ZIP: 5446, FOLDER: 39883
Ubuntu:14.04:LTS/: ZIP: 1593, FOLDER: 10370
Ubuntu:16.04:LTS/: ZIP: 1483, FOLDER: 11363
Ubuntu:18.04:LTS/: ZIP: 1700, FOLDER: 3411
Ubuntu:20.04:LTS/: ZIP: 1763, FOLDER: 9928
Ubuntu:22.04:LTS/: ZIP: 1015, FOLDER: 8177
Ubuntu:23.10/: ZIP: 274, FOLDER: 274
Ubuntu:24.04:LTS/: ZIP: 133, FOLDER: 6081
Ubuntu:Pro:14.04:LTS/: ZIP: 554, FOLDER: 4826
Ubuntu:Pro:16.04:LTS/: ZIP: 972, FOLDER: 20630
Ubuntu:Pro:18.04:LTS/: ZIP: 517, FOLDER: 15030
Ubuntu:Pro:20.04:LTS/: ZIP: 134, FOLDER: 2011
Ubuntu:Pro:22.04:LTS/: ZIP: 89, FOLDER: 1402
Ubuntu:Pro:24.04:LTS/: ZIP: 6, FOLDER: 771
Wolfi/: ZIP: 8224, FOLDER: 8224
crates.io/: ZIP: 1461, FOLDER: 1461
icons/: ZIP: 0, FOLDER: 4
npm/: ZIP: 19047, FOLDER: 19052

It might be related to the timestamps, as a certain pattern can be spotted:

[Two screenshots attached to the original issue]

martin-bpw avatar Sep 07 '24 11:09 martin-bpw

This discrepancy needs to be documented in the FAQ in the short term and fixed in our exporter in the longer term (#2329 touches on this a little as well).

Essentially, the all.zip files are canonical. The individual records in GCS are not. A record may have existed and been exported at some point in the past but no longer exist, and such stale files do not (currently) get cleaned up.
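For consumers, treating all.zip as the source of truth could look like the following minimal sketch. It assumes the archive bytes have already been downloaded (the documented location is gs://osv-vulnerabilities/<ECOSYSTEM>/all.zip); `load_records` is a hypothetical helper name, not part of any OSV tooling.

```python
import io
import json
import zipfile


def load_records(zip_bytes: bytes) -> dict:
    """Parse every .json member of an all.zip archive, keyed by member name."""
    records = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Skip anything that is not a vulnerability record.
            if name.endswith(".json"):
                records[name] = json.loads(archive.read(name))
    return records
```

Reading from the archive avoids picking up individual files in the bucket that are no longer exported.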

https://github.com/google/osv.dev/blob/f240c5a8c6ee3a1bd8110b3068d98a40c9e6b5f2/docker/exporter/exporter.py#L86-L134 is the relevant code. One possible solution is to add a deletion run at the end, or some other reverse check.

There's some conceptual similarity with code added to the importer in https://github.com/google/osv.dev/pull/2030
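The reverse check suggested above could be sketched roughly as follows. This is an illustration under stated assumptions, not the exporter's actual code: `find_stale_objects` is a hypothetical helper, the bucket listing and the set of IDs exported in the current run are assumed to be available as plain Python collections, and the deletion itself is left out.

```python
def find_stale_objects(bucket_names, exported_ids, ecosystem):
    """Return object names under <ecosystem>/ whose ID was not exported this run.

    bucket_names: iterable of object names such as "PyPI/EXAMPLE-1.json".
    exported_ids: iterable of vulnerability IDs written by the current export.
    """
    prefix = ecosystem + "/"
    expected = {f"{prefix}{vuln_id}.json" for vuln_id in exported_ids}
    return sorted(
        name
        for name in bucket_names
        if name.startswith(prefix)
        and name.endswith(".json")  # leaves all.zip and other files alone
        and name not in expected
    )
```

The returned names are the candidates for a deletion run; anything outside the ecosystem prefix or not ending in .json is ignored.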

andrewpollock avatar Sep 09 '24 00:09 andrewpollock

Thank you very much for the quick feedback. It already helps to know that all.zip is the preferred source.

martin-bpw avatar Sep 09 '24 08:09 martin-bpw

I think that, with the recent work @hogo6002 did to adjust how our exporting works, we may almost be able to call this "done".

I think a review and refresh of what is stated at https://google.github.io/osv.dev/data/#data-dumps is all that is necessary.

andrewpollock avatar Nov 10 '24 23:11 andrewpollock

Actually, @hogo6002 already made the necessary documentation changes in #2784, so I think we can call this done.

andrewpollock avatar Nov 11 '24 04:11 andrewpollock

Thank you, nice!

martin-bpw avatar Nov 11 '24 12:11 martin-bpw

Thanks for the clarifications, @andrewpollock @martin-bpw

fazleyazdan avatar Feb 13 '25 11:02 fazleyazdan