Batches to zarr
Add `BatchGenerator.to_zarr` and `BatchGenerator.from_zarr` to make it possible to save generated batches to zarr and later load them back from zarr. By chunking along the batch dimension, this enables fast data loading at training time.
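For context, a minimal sketch of the round trip these methods enable. The method names come from this PR, but the exact call signatures and chunking defaults are assumptions:

```python
import xarray as xr
import xbatcher

# Any gridded dataset will do; the tutorial dataset is just for illustration.
ds = xr.tutorial.open_dataset("air_temperature")
gen = xbatcher.BatchGenerator(ds, input_dims={"lat": 10, "lon": 10})

# Persist every generated batch to a zarr store, chunked along the batch
# dimension so individual batches can be read independently later.
gen.to_zarr("batches.zarr")

# At training time, skip batch generation entirely and stream the
# pre-computed batches back from disk.
for batch in xbatcher.BatchGenerator.from_zarr("batches.zarr"):
    ...  # feed `batch` into the training loop
```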
Codecov Report
Merging #40 (08a9e94) into main (34ca3f9) will decrease coverage by 3.04%. The diff coverage is 85.71%.
| Coverage Diff | main | #40 | +/- |
|---|---|---|---|
| Coverage | 100.00% | 96.95% | -3.05% |
| Files | 3 | 3 | |
| Lines | 134 | 164 | +30 |
| Branches | 30 | 38 | +8 |
| Hits | 134 | 159 | +25 |
| Misses | 0 | 2 | +2 |
| Partials | 0 | 3 | +3 |

| Impacted Files | Coverage Δ |
|---|---|
| xbatcher/generators.py | 95.65% <85.71%> (-4.35%) :arrow_down: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Powered by Codecov. Last update 34ca3f9...08a9e94.
@leifdenby - thanks for opening this PR, and apologies that the review wasn't picked up sooner.
My main question/comment is whether we want to serialize some of the `BatchGenerator`'s attributes in the Zarr dataset. It seems like, without too much effort, we could effectively reconstruct the full `BatchGenerator` attribute namespace.
Also, as an aside, this fits nicely within the caching APIs element of the xbatcher roadmap: https://xbatcher.readthedocs.io/en/latest/roadmap.html#caching-apis. We hadn't gotten there yet, but I'm glad to see this moving.
> My main question/comment is whether we want to serialize some of the `BatchGenerator`'s attributes in the Zarr dataset. It seems like, without too much effort, we could effectively reconstruct the full `BatchGenerator` attribute namespace.
Yes, that sounds like a good idea. Do you have a list of attributes in mind? Looking at `BatchGenerator.__init__`, maybe I should start with `input_dims`, `input_overlap`, `batch_dims` and `concat_input_dims`. I don't think it makes sense to store `preload_batch`, and of course not the source dataset `ds`. What do you think?
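For concreteness, a rough sketch of what that serialization could look like. The parameter names come from `BatchGenerator.__init__`, but the helper functions and the `xbatcher_` attribute prefix here are hypothetical, not part of this PR:

```python
import json
import xarray as xr

# Parameters worth round-tripping, per the discussion above. `preload_batch`
# and the source dataset `ds` are deliberately left out.
SERIALIZED_PARAMS = ("input_dims", "input_overlap", "batch_dims", "concat_input_dims")

def write_generator_params(batch_ds: xr.Dataset, gen) -> xr.Dataset:
    """Stash the generator's parameters in the dataset attrs before to_zarr."""
    for name in SERIALIZED_PARAMS:
        # JSON keeps dicts (e.g. {"lat": 10}) and bools intact through zarr attrs.
        batch_ds.attrs[f"xbatcher_{name}"] = json.dumps(getattr(gen, name))
    return batch_ds

def read_generator_params(store) -> dict:
    """Recover the parameters so a BatchGenerator can be reconstructed."""
    ds = xr.open_zarr(store)
    return {
        name: json.loads(ds.attrs[f"xbatcher_{name}"])
        for name in SERIALIZED_PARAMS
        if f"xbatcher_{name}" in ds.attrs
    }
```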
> Also, as an aside, this fits nicely within the caching APIs element of the xbatcher roadmap: https://xbatcher.readthedocs.io/en/latest/roadmap.html#caching-apis. We hadn't gotten there yet, but I'm glad to see this moving.
Great! :) I've used this a few times now and it works well for me.
> Looking at `BatchGenerator.__init__`, maybe I should start with `input_dims`, `input_overlap`, `batch_dims` and `concat_input_dims`.
Yes, this is exactly what I was thinking.
Also, looking at the code coverage report, it looks like we're in pretty good shape but could use a bit more testing on the edge cases. I'll leave a few more comments in the code to highlight areas that could use a test.
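As a starting point for those edge cases, a hypothetical round-trip test; that `from_zarr` yields batches equal to the ones generated on the fly is an assumption about the intended semantics:

```python
import xarray as xr
import xbatcher

def test_zarr_roundtrip(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    ds = xr.Dataset({"foo": (("x",), list(range(100)))})
    gen = xbatcher.BatchGenerator(ds, input_dims={"x": 10})

    store = str(tmp_path / "batches.zarr")
    gen.to_zarr(store)
    restored = xbatcher.BatchGenerator.from_zarr(store)

    # Every reloaded batch should match the batch generated on the fly;
    # assert_equal ignores attrs, which the zarr round trip may add to.
    for original, reloaded in zip(gen, restored):
        xr.testing.assert_equal(original, reloaded)
```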