Implement a "zip" kvstore driver
zip is not on the list of KvStore drivers. It should be similar to zarr.storage.ZipStore.
The main need for zip store is for use in JavaScript and WebAssembly, where a single file can be passed around as a blob, and a directory is very inconvenient (to say the least).
This was first mentioned in https://github.com/google/tensorstore/issues/36#issuecomment-1294043315 (near the end of discussion).
Can I do something to help this along?
You could certainly take a stab at implementing it. The closest existing driver is the neuroglancer_uint64_sharded driver:
https://github.com/google/tensorstore/blob/master/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc
And we'd be happy to provide guidance.
Due to the need to operate asynchronously, most likely existing zip libraries would not be directly usable. Personally I would probably use the Python standard library zipfile.py implementation as a reference.
The simplest way to implement write support would be the same as for neuroglancer_uint64_sharded: just rewrite the entire zip file for each update. Using tensorstore transactions, you can ensure the zip file is only written once. Append support could be added later, but would be an added complication.
As I believe that typical use for zip files is to write once, "rewrite the entire zip file for each update" should be totally fine.
I will first deal with JSON metadata (image origin, spacing, etc.) before I can pay attention to ZIP support. I have never dealt with the Bazel build system before, so I know nothing about it. minizip-ng uses CMake; is that a problem for integrating it into tensorstore? With extra compression methods disabled, it should not need further dependencies.
TensorStore already depends on zlib, bzip2, and liblzma, so supporting those compression methods would not require any additional dependencies.
To integrate minizip-ng we would need a bazel build for it, see e.g. the build for zlib: https://github.com/google/tensorstore/tree/master/third_party/net_zlib
There we have listed the source files explicitly, but in many cases you can just use the glob function (https://bazel.build/reference/be/functions#glob).
Then you can run https://github.com/google/tensorstore/blob/master/third_party/update_third_party_bzl.py.
For our cmake build we can either use the native cmake build provided by the package, if there is one, or just use the bazel one converted to cmake via our bazel_to_cmake tool. In most cases it works better to just use the bazel build even if there is a native cmake build, because most cmake builds aren't designed to work properly as subprojects.
I'm not sure how helpful minizip-ng will be, though, given that it appears to be based on a synchronous stream interface but for tensorstore we will need an asynchronous interface in order to be able to efficiently read multiple archive members in parallel. For writing, assuming we write the entire file at once, it could probably be used.
The only other library I know of is libzip. It also uses CMake, so I expect it will be no easier to integrate. I doubt it is asynchronous either. Is there a more suitable option for tensorstore?
What I had in mind was to just implement the zip support "from scratch" (but still using zlib, etc. for the actual compression).
Basically for reading:
- Locate and decode the central directory, which contains the list of all archive members. This would be cached using tensorstore's AsyncCache framework. There is some complexity in supporting zip64, etc. but overall I don't think it is too complicated to decode, especially when using e.g. the existing Python zipfile.py as a reference.
- When there is a request to read a key, find it in the central directory, read the corresponding byte range, and then decode it if compressed.
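To make the decoding step concrete, here is a rough standalone sketch (ignoring zip64, multi-disk archives, and error handling) of locating the end-of-central-directory record in a fully buffered archive. A real driver would instead issue a suffix byte-range read through the base kvstore:

```cpp
// Sketch only: locate the End of Central Directory (EOCD) record of a
// non-zip64 archive held in a byte buffer.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>

struct EocdInfo {
  uint16_t num_entries;  // total entries in the central directory
  uint32_t cd_size;      // size of the central directory in bytes
  uint32_t cd_offset;    // offset of the central directory from file start
};

inline uint16_t ReadLE16(const char* p) {
  return static_cast<uint16_t>(static_cast<unsigned char>(p[0]) |
                               static_cast<unsigned char>(p[1]) << 8);
}
inline uint32_t ReadLE32(const char* p) {
  uint32_t v = 0;
  for (int i = 3; i >= 0; --i) v = (v << 8) | static_cast<unsigned char>(p[i]);
  return v;
}

// The EOCD is at least 22 bytes and is followed only by an optional comment
// of up to 65535 bytes, so scanning the last ~64 KiB backwards finds it.
std::optional<EocdInfo> FindEocd(std::string_view data) {
  static constexpr char kSig[4] = {0x50, 0x4b, 0x05, 0x06};  // "PK\x05\x06"
  if (data.size() < 22) return std::nullopt;
  size_t start = data.size() >= 22 + 65535 ? data.size() - 22 - 65535 : 0;
  for (size_t i = data.size() - 22 + 1; i-- > start;) {
    if (std::memcmp(data.data() + i, kSig, 4) == 0) {
      const char* p = data.data() + i;
      return EocdInfo{ReadLE16(p + 10), ReadLE32(p + 12), ReadLE32(p + 16)};
    }
  }
  return std::nullopt;
}
```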
For writing:
- Using an in-memory buffer, write each key, either compressed or not.
- Append the central directory and related footers at the end.
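And a correspondingly rough sketch of the write path, producing a single stored (uncompressed) entry followed by the central directory and footer. Deflate compression, timestamps, zip64, and error handling are all omitted; zlib (which tensorstore already bundles) supplies the CRC:

```cpp
// Sketch only: build a complete single-entry zip archive in memory.
// Extending it to many entries means repeating the local-header step and
// emitting one central directory record per entry.
#include <zlib.h>  // for crc32()

#include <cstdint>
#include <string>

namespace {
void PutLE16(std::string& out, uint16_t v) {
  out.push_back(static_cast<char>(v & 0xff));
  out.push_back(static_cast<char>(v >> 8));
}
void PutLE32(std::string& out, uint32_t v) {
  for (int i = 0; i < 4; ++i)
    out.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}
}  // namespace

std::string MakeSingleEntryZip(const std::string& name,
                               const std::string& data) {
  std::string out;
  const uint32_t crc = static_cast<uint32_t>(
      crc32(crc32(0L, Z_NULL, 0), reinterpret_cast<const Bytef*>(data.data()),
            static_cast<uInt>(data.size())));
  const uint32_t size = static_cast<uint32_t>(data.size());

  // Local file header (signature PK\x03\x04).
  const uint32_t local_offset = static_cast<uint32_t>(out.size());
  PutLE32(out, 0x04034b50);
  PutLE16(out, 20);  // version needed to extract
  PutLE16(out, 0);   // flags
  PutLE16(out, 0);   // method 0 = stored
  PutLE32(out, 0);   // mod time/date (left zero here)
  PutLE32(out, crc);
  PutLE32(out, size);  // compressed size (== size when stored)
  PutLE32(out, size);  // uncompressed size
  PutLE16(out, static_cast<uint16_t>(name.size()));
  PutLE16(out, 0);  // extra field length
  out += name;
  out += data;

  // Central directory record (signature PK\x01\x02).
  const uint32_t cd_start = static_cast<uint32_t>(out.size());
  PutLE32(out, 0x02014b50);
  PutLE16(out, 20);  // version made by
  PutLE16(out, 20);  // version needed
  PutLE16(out, 0);   // flags
  PutLE16(out, 0);   // method
  PutLE32(out, 0);   // mod time/date
  PutLE32(out, crc);
  PutLE32(out, size);
  PutLE32(out, size);
  PutLE16(out, static_cast<uint16_t>(name.size()));
  PutLE16(out, 0);             // extra field length
  PutLE16(out, 0);             // comment length
  PutLE16(out, 0);             // disk number start
  PutLE16(out, 0);             // internal attributes
  PutLE32(out, 0);             // external attributes
  PutLE32(out, local_offset);  // offset of local header
  out += name;
  const uint32_t cd_size = static_cast<uint32_t>(out.size()) - cd_start;

  // End of central directory (signature PK\x05\x06).
  PutLE32(out, 0x06054b50);
  PutLE16(out, 0);  // this disk
  PutLE16(out, 0);  // disk with central directory
  PutLE16(out, 1);  // entries on this disk
  PutLE16(out, 1);  // total entries
  PutLE32(out, cd_size);
  PutLE32(out, cd_start);
  PutLE16(out, 0);  // comment length
  return out;
}
```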
Implementing zip "from scratch" seems like a lot of work. Would it make sense to add basic, low-performance support first, and, if there are resources later, replace it with a better-performing implementation?
The zlib build setup looks simple. Libraries with dependencies look more complicated, e.g. libAOM or gRPC. A reasonable way to start helping me would be for you to add minizip-ng to tensorstore's build system, and to point me to the places in the existing code where I need to make changes to add zip support.
Yes, you could make a simpler version that always reads the entire zip file into memory. Then you could easily use an existing library like minizip-ng.
To add support to tensorstore, you could start by duplicating kvstore/neuroglancer_uint64_sharded to kvstore/zip and then modifying it to use the zip format. The non-simple version would be very similar indeed to neuroglancer_uint64_sharded --- the simple version would be simpler.
Agreed that adding a third-party dependency can be challenging --- not sure if we will have time to do that for minizip-ng though.
You don't need to add all third-party dependencies. Just enable, via configuration, the ones which are already bundled (zlib, zstd, etc.) and correctly point minizip-ng to them.
I would like to get started on this, as a way to accomplish https://github.com/InsightSoftwareConsortium/ITKIOOMEZarrNGFF/issues/24. Any chance @jbms or @laramiel will add minizip-ng to tensorstore's build system? We don't need a lot of the stuff from it.
This commit (not pushed to the main branch) shows how it can be done:
https://github.com/jbms/tensorstore/commit/5d8ce407e3be5ade26f1976b3d2d39d551463874
This seems to work on Linux, I haven't tested on Windows but you can probably get it to work with some minor tweaks if necessary. It could use some refinement for the additional optional functionality but should be enough for you to get started.
I am giving it a try.
I just pushed a new version to this branch https://github.com/jbms/tensorstore/tree/add-minizip-ng
I had forgotten to specify hdrs and strip_include_prefix.
This does not seem to work. I tried adding #include "mz.h" in neuroglancer_uint64_sharded.cc, which produces:
dzenan@corista:~/zarr/ts-deb$ ninja && ctest -j3
[1/11] Building CXX object CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_...nsorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o
FAILED: CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o
/usr/bin/c++ -DCHROMIUM_ZLIB_NO_CHROMECONF -I/home/dzenan/zarr/ts-deb -I/home/dzenan/zarr/tensorstore -I/home/dzenan/zarr/ts-deb/_deps/absl-src -I/home/dzenan/zarr/ts-deb/_deps/riegeli-build -I/home/dzenan/zarr/ts-deb/_deps/riegeli-src -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-build -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-build/include -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-src -I/home/dzenan/zarr/ts-deb/_deps/nlohmann_json-src/include -I/home/dzenan/zarr/ts-deb/_deps/half-build -I/home/dzenan/zarr/ts-deb/_deps/half-build/include -I/home/dzenan/zarr/ts-deb/_deps/half-src -I/home/dzenan/zarr/ts-deb/_deps/half-src/include -I/home/dzenan/zarr/ts-deb/_deps/zlib-build -I/home/dzenan/zarr/ts-deb/_deps/zlib-build/. -I/home/dzenan/zarr/ts-deb/_deps/zlib-src -I/home/dzenan/zarr/ts-deb/_deps/zlib-src/. -I/home/dzenan/zarr/ts-deb/_deps/zlib-build/contrib/optimizations -I/home/dzenan/zarr/ts-deb/_deps/zlib-src/contrib/optimizations -g -fPIC -Wno-deprecated-declarations -Wno-sign-compare -Wno-unused-but-set-parameter -Wno-maybe-uninitialized -Wno-unknown-warning-option -fsized-deallocation -MD -MT CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o -MF CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o.d -o CMakeFiles/tensorstore_kvstore_neuroglancer_uint64_sharded.dir/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc.o -c /home/dzenan/zarr/tensorstore/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc
/home/dzenan/zarr/tensorstore/tensorstore/kvstore/neuroglancer_uint64_sharded/neuroglancer_uint64_sharded.cc:49:10: fatal error: mz.h: No such file or directory
49 | #include "mz.h"
| ^~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Error message on Windows is similar.
Also, there is no mention of minizip in the target list or in the _deps sub-directory inside the build directory.
You need to add "@minizip_ng//:minizip_ng" to the deps list of the specific target (see the BUILD file) in order for it to be available.
Doing that gets me further. This adds a MINIZIP_minizip_ng target to the build system. But it doesn't build:
Build started...
1>------ Build started: Project: MINIZIP_minizip_ng, Configuration: Release x64 ------
1>mz_strm_bzip.c
1>C:\Misc\tensorstore-vs22\_deps\minizip-src\mz_strm_bzip.c(15,10): fatal error C1083: Cannot open include file: 'bzlib.h': No such file or directory
1>mz_strm_lzma.c
1>C:\Misc\tensorstore-vs22\_deps\minizip-src\mz_strm_lzma.c(15,10): fatal error C1083: Cannot open include file: 'lzma.h': No such file or directory
1>Generating Code...
1>Done building project "MINIZIP_minizip_ng.vcxproj" -- FAILED.
========== Build: 0 succeeded, 1 failed, 4 up-to-date, 0 skipped ==========
========== Elapsed 00:01.176 ==========
Adding
"//tensorstore/internal/compression:bzip2_compressor",
"//tensorstore/internal/compression:zlib_compressor",
to deps does not help. See my fork.
The minizip_ng.BUILD.bazel file needed to be updated to add the bzip2 and lzma dependencies. I pushed an updated version.
I decided to base my zip kvstore driver off of the memory driver. It is way simpler.
I now have a version of the code which can read a zarr file from a zip store. Now is a great time for feedback.
Great that you have an initial working version --- I know the tensorstore codebase is rather complicated.
A few high-level comments:
- We would want a zip driver in tensorstore to work with any underlying kvstore (file, http, gcs, memory, etc.), for accessing the zip file, rather than just directly accessing the zip file using the local filesystem and only supporting that.
- The memory driver stores its data in the Context because there is no other place to persist the data, and the whole point is to store the data in memory. However, I see that you are using a similar pattern for the zip driver: https://github.com/dzenanz/tensorstore/blob/81156bf313a8d26028420e76e5a3e237c097f5a0/tensorstore/kvstore/zip/zip_key_value_store.cc#L347 Was that intended to solve a specific problem you ran into, or did you just keep that because you based the driver off of the memory kvstore driver?
I do want to support an in-memory zip file, but I don't know how to support all the kvstore backends.
There are still many leftovers from the memory driver, but I intend to remove them over time (the values member of ZipEncapsulator and related methods). Judging by the comments in tensorstore, I thought that context/resource is a recommended pattern.
You are very welcome to make simplifications/improvements after I have a functional zip support. It would be even better if you took over this work now 😄
I have basic write support now, via https://github.com/dzenanz/tensorstore/commit/9b51deea1868e085ae08089f515d1c40bdb973a9. It needs to be coupled with deleting an existing zip file to work properly.
I now have an implementation in my fork which allows ITK's zarr IO to function. It can read and write zips, both from disk and in memory. It does not support updating or deleting keys. I assume it also begs for simplification and refactoring, as I was trying to avoid any complication from the tensorstore machinery (which I am clueless about).
Can someone take a look at it here: https://github.com/dzenanz/tensorstore/commit/03143ad3a162ab8a2d4bb952dfd23445c38994f0. How close or far is it from being accepted as a contribution? What are the most urgent cleanups?
The key missing piece for us to be able to accept it as a contribution is for it to work with any base kvstore, rather than accessing the filesystem directly --- that would allow, for example, someone to use it with a zip file stored on an HTTP server, GCS, etc.
Ideally, it would only read the portions of the zip file that are actually required, e.g. the central directory and the specific entries that are requested. However, I expect that will be quite challenging to support with minizip-ng.
A simpler alternative would be to always read the entire zip file into memory, and then manipulate it with minizip-ng as an in-memory zip. That would mean that the interface allows an arbitrary base kvstore, and in the future we could add partial I/O support.
I suppose the key question is whether this would actually serve your needs, or if the partial I/O, as minizip-ng provides for the local filesystem, is important. Potentially we could use mmap with the file kvstore driver to effectively support partial I/O while still treating the entire zip as in-memory.
While reading an entire zip into memory would be fine for smaller files, it would totally defeat the point of supporting large, multi-resolution images. I don't know whether minizip-ng would support files larger than 2GB (in general). It would not support them for in-memory operations. So the point of large images might be moot.
I still haven't started looking into supporting cloud storage. What additional restrictions would the zip kvstore need in order to support layering on top of GCS or an HTTP server?
An incremental approach would work for us. Start with what we have (it is now in https://github.com/InsightSoftwareConsortium/tensorstore), and expand support to more kvstores (e.g. HTTP, GCS, S3) later.
A possible later improvement would be to read the central directory and JSON files into memory, and to read chunks on demand via file offsets (assuming no zip compression is applied).
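For stored (uncompressed) entries, that on-demand offset is cheap to compute. A rough sketch (using the local header's own length fields, which can differ from the central directory's copy of them):

```cpp
// Sketch: compute where a stored entry's data begins, given the 30-byte
// fixed portion of its local file header (fetched via a byte-range read at
// the offset recorded in the central directory).
#include <cstdint>

inline uint16_t ReadLE16(const unsigned char* p) {
  return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

uint64_t EntryDataOffset(uint64_t local_header_offset,
                         const unsigned char header30[30]) {
  const uint16_t name_len = ReadLE16(header30 + 26);   // file name length
  const uint16_t extra_len = ReadLE16(header30 + 28);  // extra field length
  return local_header_offset + 30 + name_len + extra_len;
}
```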
We already have cloud support on our TODO list, here.
I would like to do a bit of cleanup and create a pull request. @jbms @laramiel Can you make some comments? Earlier questions up in the thread: https://github.com/google/tensorstore/issues/57#issuecomment-1507223279. We already merged the changes in ITK's Zarr module.
I do think "zip" support is a very useful addition to tensorstore, and I'm sure many other users would also make use of it.
But a key design principle of Tensorstore is to support local storage and cloud storage equally. The issue isn't about adding support for specific cloud storage protocols, like https, gcs, or s3, to the zip driver. Rather, the issue is that the zip kvstore driver should be an adapter on top of a base kvstore, like the neuroglancer_uint64_sharded or ocdbt kvstores, and be agnostic to the underlying storage:
https://google.github.io/tensorstore/kvstore/neuroglancer_uint64_sharded/index.html https://google.github.io/tensorstore/kvstore/ocdbt/index.html
For example, accessing "path/in/zip" within a zip file at c:\my\zip\file.zip would use a spec of:
{"driver": "zip", "path": "path/in/zip", "base": {"driver":"file", "path": "c:\\my\\zip\\file.zip"}}
while to access gcs, it would be:
{"driver": "zip", "path": "path/in/zip", "base": {"driver":"gcs", "bucket": "my-bucket", "path": "path/to/file.zip"}}
The zip driver itself would merely use the kvstore::Driver interface and need not concern itself with what the underlying storage is at all.
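For illustration only (the "zip" driver in this sketch is hypothetical, and the exact header paths may differ), C++ usage would then be completely storage-agnostic:

```cpp
// Hypothetical usage sketch: the {"driver": "zip", ...} spec is invented
// here (the driver does not exist yet); kvstore::Open, kvstore::Read, and
// TENSORSTORE_ASSIGN_OR_RETURN are existing tensorstore APIs.
#include "absl/status/status.h"
#include "tensorstore/kvstore/kvstore.h"
#include "tensorstore/util/result.h"

absl::Status ReadFromZip() {
  // The base kvstore could just as well be "gcs", "http", or "memory".
  TENSORSTORE_ASSIGN_OR_RETURN(
      auto store,
      tensorstore::kvstore::Open(
          {{"driver", "zip"},
           {"base", {{"driver", "file"}, {"path", "/tmp/file.zip"}}}})
          .result());
  TENSORSTORE_ASSIGN_OR_RETURN(
      auto read_result,
      tensorstore::kvstore::Read(store, "path/in/zip").result());
  // read_result.value holds the entry's bytes as an absl::Cord.
  return absl::OkStatus();
}
```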
Your current implementation is a great proof of concept for adding zip support to tensorstore, but broadly the issues with it currently are:
- The JSON spec does not match other kvstore in tensorstore that serve as adapters on some underlying kvstore, and doesn't lend itself to support other underlying storage in the future.
- The implementation is also specifically designed around accessing the local filesystem directly, and can't really be incrementally changed to do all I/O through the kvstore::Driver interface.
Thank you - now it is a lot clearer to me what is desired.
That sounds like a lot of additional effort. For the foreseeable future, we will keep these changes in our fork: https://github.com/InsightSoftwareConsortium/tensorstore. Thank you for your support during this development!