OME-TIFF Support Draft for Feedback
Hello devs,
Thanks for contributing such a wonderful project to the community. This is a draft PR proposing a different approach to integrating OME-TIFF support than #62. This approach is compatible with the broad requirements of TensorStore, namely support for any kvstore driver. It is definitely incomplete as it stands (though it is a minimal working example), but before proceeding any further with the work I was hoping to get validation that this approach would make an acceptable contribution. My interest is providing read support for a popular legacy standard so that such data can then be converted to, or used with, TensorStore.
Here are some of the key things I am looking for feedback on.
I created a KVS-backed read streambuf
This allows one to leverage any existing library (libtiff, for example) that supports streambufs in an off-the-shelf manner (though there are other considerations, such as thread safety). The KvsReadStreambuf issues the necessary requests (with buffering) to the kvstore driver. (Note: I implemented this approach before basic zip support was contributed; the zip driver could benefit from the same strategy.)
One aspect of this that is not paradigmatic is that the streambuf can trigger secondary reads from the kvstore driver, which is inherently serial. I considered scenarios in which those read requests would instead be carried out by various (nested) parallel calls made from the kvstore itself, but ultimately those calls would need to be blocking (e.g., traversing a linked list to find the appropriate chunk offset), which I think would result in the same performance outcome anyway.
I see some advantages and disadvantages to this approach. The main advantage is obvious: it allows much easier integration of complex formats that have existing libraries. The main disadvantages I can see are, first, that libraries may not use the streambuf efficiently, particularly when seeking, which may affect performance; buffering in the streambuf mitigates this to some extent. The other is the serial-request issue outlined above, but especially for formats like OME-TIFF I don't see a parallel approach being more efficient, just more complex, bearing in mind that once metadata is collected, chunk reads are parallelized. The alternative, of course, is hand-rolling our own implementation, which would either be very basic and missing many features, or a full-blown standards-compliant implementation, which is quite onerous.
Blocker: the current blocker is that, as far as I can tell, there is no way to get the total size of an object through the kvstore interface. This becomes quite a problem both when filling the buffer and when handling seeks that are relative to the end of the file.
My request: would it be possible to add a get-size method to the kvstore driver? With that, I think I'd have everything needed for this approach to work successfully. Byte-range requests are already validated against the object size, so returning the size specifically should be straightforward.
I added optional byte-range requests to the KVS-backed chunk driver
Pretty straightforward, but I'm just looking for approval on this.
File-level (not chunk-level) metadata
I opted to implement the metadata cache at the level of a TIFF file, storing chunk info in one global map that spans IFDs. Other than being less paradigmatic, I don't see much value gained for the added complexity of keying it at the level of chunks. I'm looking for feedback on whether you're okay with this, since there could be other considerations I'm missing.
Next steps
Assuming this general strategy is approved, I have a number of more detailed questions about how things work that I would appreciate input on. I will then close this draft and open a proper PR once things are ready to be formally reviewed for inclusion.
Thanks in advance, Hythem
Tagging @jbms and @laramiel for visibility.
Thanks!
It may take us a bit longer to get back to you; we've discussed this a little, but don't yet have consensus on what we're going to recommend.
Where we agree: our current TIFF reader is somewhat inadequate for your goals, and the TIFF file format is challenging to work with in a streaming fashion because it is structured roughly as a linked list of sequential fields. I think we would like a TIFF reader to be able to access chunked/striped data (even from non-OME-TIFF files), as well as to support multiple images, etc. It appears that would require reading most of the metadata from a TIFF file in a somewhat random-access fashion and tracking the appropriate file metadata for each image, similar to the ZIP driver.
However, as mentioned, we have not achieved consensus among ourselves. At a high level, one concern is that the streambuf implementation is synchronous, which somewhat goes against most of our implementation philosophy so far. I am thinking about updating the ZIP driver to expand support by using minizip and implementing an asynchronous retry loop when the data is unavailable. If we migrate to C++20, it may be possible to achieve the same result with coroutines and co_await.
A few other comments to add to what @laramiel said:
As far as I can tell, this is a plain tiff driver, the OME metadata isn't used. That is actually better, because it would be better to layer OME metadata support on top rather than integrating it into the base driver, but then it doesn't make sense to call the driver "ometiff".
The same basic approach could be used with riegeli::Reader instead of streambuf --- riegeli is basically an improved C++ input/output stream API. It is still synchronous but has many improvements, including the ability to report errors (via absl::Status) without the use of C++ exceptions. If your basic approach here is to be used, we would almost surely want to use riegeli instead of streambuf.
In your current implementation, none of the strip/tile decoding is actually being done with libtiff, and currently only uncompressed and zstd are supported. That allows each tile/strip to be read in parallel, which is important, but then it seems that libtiff is really just serving to decode the header structure, which isn't all that complicated anyway as far as I can tell. Unless there is a way to utilize libtiff for the strip/tile decoding, it seems to me that it may make sense to not use libtiff at all. Then the metadata can be read asynchronously, somewhat similarly to how we do it for the zip driver.
Thank you both for your feedback. I completely understand the need to take time to think through the most appropriate path forward.
Regarding broader TIFF support, I should have clarified my initial PR note a bit better. Going into this, I did not know to what extent the OME part would need to be coupled with the TIFF part. After exploring a few different approaches, I settled on the one in the PR. @jbms is correct that no OME metadata is used in my PR thus far, and this could and should work as a general TIFF base, with specialized metadata readers for various specifications, though my interest at the moment is in OME alone. So a general solution is eminently doable, and those interested in supporting other specs could do so with relative ease. The ometiff name is currently a misnomer, and I can rename it.
As for the streambuf/synchronous reading issue, I designed it such that only the metadata reading relies on that component (via libtiff). As @jbms noted, I store the offsets and sizes of each chunk, and the data-reading part is then fully asynchronous and happens within TensorStore, which I think is nice. I implemented raw (uncompressed) and one compression scheme (zstd) just to make sure both would work, and indeed they do. I had tried doing the decoding in libtiff, but it got quite messy handling multiple TIFF objects.
Lastly, the reason I used streambuf is that it's supported by libtiff (I'm not sure how riegeli::Reader would be?). It's correct that in the current setup it's used only to retrieve metadata. While in the simple cases I've tested so far it's fairly easy to read the TIFF metadata ourselves (and indeed I started off doing that), there are dozens of TIFF tags, BigTIFF vs. standard TIFF, etc., which we get to support for "free" with libtiff, in a standards-compliant way. So that's the tradeoff as I see it: more limited support if everything is implemented natively in TensorStore, versus something that doesn't adhere to the design philosophy but unlocks a pretty large space of images.
Whatever you all decide, I am happy to accommodate that as best as I can. I am also happy to be part of the conversation if so desired.
For riegeli::Reader you would need to adapt the tiff read/seek functions as I have done in the current tiff driver:
https://github.com/google/tensorstore/blob/master/tensorstore/internal/image/tiff_reader.cc
Thank you! I forgot about that.