Add TIFF Key-Value Store Implementation (take 2)
This PR adds a new (read-only) key-value store driver for accessing tiles and strips within TIFF image files. This is my second attempt, now that I've found the time to work on the previous feedback on #138. I drew heavily upon the zip kvstore implementation and your prior feedback (thanks!), so I hope things are more faithful to TensorStore patterns.
The basic idea is that this kvstore sits on top of a base kvstore and maps IFDs/tiles/strips to the following key format: `tile/<id>/<row>/<col>`.
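For concreteness, here's a minimal sketch of the key construction, assuming row-major tile coordinates (the helper name and exact formatting are illustrative, not the PR's actual code):

```cpp
#include <cstdint>
#include <string>

#include "absl/strings/str_format.h"

// Illustrative only: builds a key like "tile/0/2/3" for IFD 0, tile row 2,
// tile column 3. The real key construction in the driver may differ.
std::string TiffTileKey(uint32_t ifd, uint32_t row, uint32_t col) {
  return absl::StrFormat("tile/%d/%d/%d", ifd, row, col);
}
```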
My plan is to get feedback on whether this is something suitable to be merged. If the response is positive, I'll work on any feedback I receive in parallel with building out the tiff driver that will pair with this kvstore.
Looking forward to your feedback on this implementation!
P.S. I couldn't for the life of me get subsystem logging (i.e. `ABSL_LOG_IF`) to work. I would appreciate some advice on how to get that working.
This is great --- I haven't had a chance to review your implementation in detail, but at a high level this is exactly the right implementation strategy.
Do you have a reason to use this tiff kvstore driver on its own? Otherwise I'd be inclined to eliminate the JSON registration for it, such that it can only be created internally (by the to-be-written Tiff TensorStore driver).
Regarding logging: what have you tried? Are you testing with bazel? With bazel you can do `--test_env=TENSORSTORE_VERBOSE_LOGGING=tiff`.
> This is great --- I haven't had a chance to review your implementation in detail, but at a high level this is exactly the right implementation strategy.
>
> Do you have a reason to use this tiff kvstore driver on its own? Otherwise I'd be inclined to eliminate the JSON registration for it, such that it can only be created internally (by the to-be-written Tiff TensorStore driver).
Thanks for that feedback. No, I don't think there is any reason to use the tiff kvstore on its own. I can eliminate the JSON registration.
> Regarding logging: what have you tried? Are you testing with bazel? With bazel you can do `--test_env=TENSORSTORE_VERBOSE_LOGGING=tiff`
Yes, I'm using bazel, though I'm not terribly familiar with it. That works! I had previously tried setting the environment variable directly, since that worked for non-subsystem logging in tests, but it wasn't working for conditional logging. Thank you.
I'm going to start chipping away at the tiff driver as I await detailed feedback on this kvstore.
A follow-up question for the team:
As I was working on the driver, I ran into a bit of an issue. Right now, the kvstore driver in this PR traverses the TIFF IFD linked list, collects the necessary metadata, and stores it in a directory cache (very much like ZIP). While building the TensorStore driver, I needed a way to get that information to it so that it can issue the appropriate calls to the kvstore.
To do that, I created a `GetParseResult` function returning `Future<std::shared_ptr<const tensorstore::internal_tiff_kvstore::TiffParseResult>>`, which the driver can use to get the `TiffParseResult` struct from the kvstore. I was going to use this in my driver's `MetadataCache` to grab the struct. This is where I ran into an issue.
The `MetadataCache` implemented in kvs_backed_chunk_driver.h assumes that the metadata lives under a specific key specified by `GetMetadataStorageKey`. The parent `KvsBackedCache` then reads that entire value and hands it off to `DecodeMetadata`. This is not going to work for me, since my tiff kvstore keys are mapped to tiles/strips (e.g. `tile/<IFD>/<tile_pos_x>/<tile_pos_y>`).
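Paraphrasing the relevant interface as I understand it (abbreviated and approximate, not the verbatim declarations from kvs_backed_chunk_driver.h):

```cpp
// Abbreviated paraphrase: metadata is assumed to live at one storage key.
class MetadataCache /* : public internal::AsyncCache, ... */ {
 public:
  // Maps a metadata cache entry to the single key holding the metadata
  // (e.g. a ".zarray"-style key for zarr).
  virtual std::string GetMetadataStorageKey(std::string_view entry_key) = 0;
  // Decodes the entire value that was read from that key.
  virtual Result<MetadataPtr> DecodeMetadata(std::string_view entry_key,
                                             absl::Cord encoded_metadata) = 0;
};
```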
The first idea that came to mind was to override `MetadataCache::Entry::DoRead` and make the `GetParseResult` call there, but `DoRead` is marked `final` by `KvsBackedCache`.
Right now, a few options come to mind:

- Implement a custom `TiffMetadataCache` that derives directly from `AsyncCache`. This is a bit of extra work, but I'm fine with it if that's your recommendation. However, the thought crossed my mind: if I'm going to implement a custom `MetadataCache`, why not move all of the IFD traversal + metadata parsing into that class and eliminate the kvstore entirely? Personally, I think the driver + kvstore option is cleaner and will make it easier to generalize to multi-file TIFF, but since the thought crossed my mind, I wanted to get your opinion.
- Create a special key in the tiff kvstore which, when read, returns the metadata as an `absl::Cord` (see the sketch after this list). This is probably the easiest to implement, but I don't like the idea of having to check for a special key every time `Read` is called. I can certainly get over it if this is what you recommend.
- Forgo a metadata cache entirely and just grab the `TiffParseResult` in my driver's `Open`/`OpenState`, storing the shared pointer somewhere like the `TiffDriver` itself or the `DataCache`. This is the least paradigmatic option in my opinion, and it doesn't respect the metadata staleness bound and general state management in TensorStore, but in practice I don't think this will be a real issue with a read-only TIFF format. Nonetheless, I don't like it, but figured I'd put it out there anyway.
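To make the second option concrete, here's roughly what intercepting a reserved key might look like (all names in this sketch are hypothetical, not code from this PR):

```cpp
// Hypothetical sketch: a reserved "metadata" key returns the serialized
// TiffParseResult; every other key takes the normal tile/strip path.
Future<kvstore::ReadResult> TiffKeyValueStore::Read(Key key,
                                                    ReadOptions options) {
  if (key == "metadata") {
    // Serialize the cached TiffParseResult into an absl::Cord and return it.
    return ReadMetadata(std::move(options));  // hypothetical helper
  }
  // Otherwise parse "tile/<ifd>/<row>/<col>" and read those bytes.
  return ReadTileOrStrip(std::move(key), std::move(options));
}
```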
Which of these solutions do you recommend? Or perhaps you have a different suggestion altogether?
Thank you in advance!
The closest existing similar code is probably `neuroglancer_uint64_sharded.cc`.
The neuroglancer implementation uses several kvstores; you could do something similar, where a `TiffMetadataKvStore` returns metadata entries for a tiff. This could use the same underlying kvstore for access to the tiff file.
You have made some good observations --- kvs_backed_chunk_driver indeed currently assumes a zarr-like structure.
Ultimately it could be nice if we refactored things so that some of the logic of kvs_backed_chunk_driver could be shared, but doing that refactor may be tricky and I'm not sure how helpful it will be given the nature of the tiff driver.
Instead I'd suggest:
- Implement a `TiffChunkCache` that inherits from `KvsBackedChunkCache` and uses your tiff chunk kvstore to read chunks.
- Implement an `internal::Driver` (using `RegisteredDriver`) that contains a `CachePtr<TiffChunkCache>` and also e.g. a `PinnedCacheEntry` pointer to the relevant entry in your tiff dir cache (rough sketch below). This will duplicate a small amount of logic from kvs_backed_chunk_driver, but that will also help highlight where any refactoring could be beneficial.
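Roughly the shape I have in mind, as a sketch only (base classes and template arguments are elided/approximate, and `TiffDirectoryCache` stands in for whatever you've called your dir cache):

```cpp
// Sketch only; exact tensorstore base classes / template details elided.
class TiffChunkCache : public internal::KvsBackedChunkCache {
  // Decodes tile/strip chunks read through the tiff chunk kvstore.
};

class TiffDriver : public internal::RegisteredDriver<TiffDriver /*, ... */> {
  internal::CachePtr<TiffChunkCache> cache_;
  // Pinned entry in your existing tiff dir cache holding the parsed IFDs.
  internal::PinnedCacheEntry<TiffDirectoryCache> dir_entry_;
};
```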
I think your existing tiff dir cache is sufficient and you don't need any additional metadata cache.
Thank you both for your feedback. I'll take a stab at this and report back with any questions.
Hello again. I've developed a driver per your suggestions, and I believe I have a pretty solid foundation with a reasonable amount of feature support. There are some outstanding items and polish remaining, but I think it makes sense to pause here for feedback, both in case major changes are required and to gauge the likelihood of getting this merged.
A few notes on the driver implementation:
- I changed the tiff kvstore to work off of a linear chunk index rather than an x/y grid (see the sketch after this list). This made it easier to handle `SamplesPerPixel > 1`, but it did offload more work onto `GetChunkStorageKey`, so I did a bit of optimization there.
- I haven't worked on transaction support as I haven't fully wrapped my head around the nuts and bolts of how that works yet. Is this a requirement for a merge? I saw, for example, that the single image driver does not support transactions, so I assume not.
- Same as above but for storage stats.
- I did not commit the test file binaries (generated by `generate.py`). I'll wait until you give me clearance.
- I waffled a bit on how much to build out the metadata property. I ended up putting options under a separate `tiff` property, as they mainly correspond to the interpretation of the data. I can revisit if you have strong opinions.
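On the linear-index point above, this is the kind of mapping I mean (struct and function names here are illustrative; the actual layout in the PR may differ):

```cpp
#include <cstdint>

// Row-major mapping from per-IFD tile coordinates to the linear chunk index
// used in the storage key. With SamplesPerPixel > 1 stored planarly, a plane
// dimension can fold into the same linear index.
struct IfdGrid {
  uint64_t tile_cols;  // number of tile columns across the image
};

uint64_t LinearChunkIndex(const IfdGrid& grid, uint64_t row, uint64_t col) {
  return row * grid.tile_cols + col;
}
```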
I think that's it for now. Looking forward to any feedback!