zarr-python
Allow disabling filling of missing chunks
This is a first stab at solving #486 by allowing the filling of missing chunks to be disabled.
TODO:
- [x] Add unit tests and/or doctests in docstrings
- [x] Add docstrings and API docs for any new/modified user-facing classes and functions
- [x] Changes documented in docs/release.rst
- [x] Test coverage is 100% (Coveralls passes)
Not sure about the following todo's:
- [ ] New/modified features documented in docs/tutorial.rst
- [ ] Docs build locally (e.g., run `tox -e docs`)
- [ ] AppVeyor and Travis CI passes
I've added the initialization of the `self._fill_missing_chunk` attribute.
Regarding tests: Where should I add them? In `zarr/tests/test_core.py`?
I'd go for:
- Ensure that `_fill_missing_chunks` is set after `Array` initialization.
- Ensure that, after removing a chunk from a store (sufficient to use the default `DictStore`?), the `KeyError`
  - is raised for `_fill_missing_chunks=False`
  - is not raised otherwise
Regarding the structure: I'm not at all familiar with the internal design of zarr-python. Is this decentralized conditional raising really the way to go, or should this be abstracted away?
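For what it's worth, the "decentralized conditional raising" can be sketched as a small wrapper around chunk retrieval. This is illustrative only: `get_chunk`, its signature, and the plain-dict store are assumptions, not zarr's actual internals.

```python
# Illustrative sketch, not zarr's real code: on a missing chunk key, either
# substitute fill data or let the KeyError propagate to the caller.
def get_chunk(store, key, fill_chunk, fill_missing_chunks=True):
    try:
        return store[key]
    except KeyError:
        if fill_missing_chunks:
            return fill_chunk  # behave as if the chunk were all fill_value
        raise  # fill_missing_chunks=False: surface the missing chunk
```

The proposed tests then reduce to: pop a key from a dict-backed store and assert that `KeyError` is raised only when `fill_missing_chunks=False`.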
Going to re-open to try to get travis green. Coveralls will stay red until a test is added.
Codecov Report
Merging #489 (c7a2b16) into master (0fc6a1e) will increase coverage by 0.00%. The diff coverage is 100.00%.

```diff
@@           Coverage Diff           @@
##           master     #489   +/-   ##
=======================================
  Coverage   99.94%   99.94%
=======================================
  Files          32       33    +1
  Lines       11256    11285   +29
=======================================
+ Hits        11250    11279   +29
  Misses          6        6
```

| Impacted Files | Coverage Δ |
|---|---|
| zarr/core.py | 100.00% <100.00%> (ø) |
| zarr/tests/test_missing.py | 100.00% <100.00%> (ø) |
Hi @willirath, I've updated this branch and all existing tests are passing. Are you still interested in taking it forward?
Thanks for pinging me. I'm still interested. It'll take a few days, though.
@willirath just checking if you've had any time to work on this lately. This functionality would be super helpful and thanks for filing the PR! I'm happy to try to work on these tests if you'd like someone else to push it forward.
Hello @willirath! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2022-03-07 22:22:58 UTC
@bolliger32 I've found some time to continue with this today. (:+1: for pinging me again. Adding some urgency usually helps finish this kind of work.)
@joshmoore and @jrbourbeau I've added two tests relying on the `zarr.create.array` function for creating arrays with a `MemoryStore` from which it's easy to pop chunks.
From my PoV, this is ready for review.
@willirath amazing! Many thanks! 🙏
Pinging for review again.
Thanks for the ping, @willirath, definitely time. I personally don't foresee too much review capacity during the holidays, but will leave the tab open. Obviously, if anyone else could jump in, that'd be wonderful! 🎆
Just in case you missed them, #853 and the preceding #738 from @d-v-b might be of interest.
Ah, nice. With a clearer 2022 head, I notice just how complementary this is to @d-v-b's #753 and @jni's #853. In Zarr 2.11, the default will become to not serialize empty (i.e. `fill_value`-only) chunks. With this PR, the user can prevent empty chunks from being deserialized. :+1: I'm going to update the branch in order to trigger another round of tests.
(I slightly wonder if there's not a need to unify settings/options/arguments but that's likely out of scope)
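To make the complementarity concrete, the write side (`write_empty_chunks`, new in 2.11) can be sketched roughly like this. `maybe_store_chunk` and the plain-mapping store are illustrative assumptions, not zarr's actual implementation:

```python
import numpy as np

# Illustrative sketch of the write-side rule (not zarr's real code): with
# write_empty_chunks=False, a chunk equal to fill_value everywhere is not
# stored (and any stale copy is dropped); this PR covers the read side.
def maybe_store_chunk(store, key, chunk, fill_value, write_empty_chunks=True):
    if not write_empty_chunks and np.all(chunk == fill_value):
        store.pop(key, None)  # skip/delete the empty chunk
        return
    store[key] = chunk
```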
Still green after an update to `2.11.1`.

Probably the biggest question from my side is what else should fall into the `set_option` category, to know if the design is right.
The more I come back to this, @willirath, the more I feel that either values like `write_empty_chunks` should also be part of `set_options`, or `fill_missing_chunks` should be an `__init__` argument like `write_empty_chunks`.

(This is orthogonal to whether these values should actually be `.zarray` metadata; in that case, they might need to be `@property` values like `fill_value`.)
> The more I come back to this, @willirath, the more I feel that either values like `write_empty_chunks` should also be part of `set_options`, or `fill_missing_chunks` should be an `__init__` argument like `write_empty_chunks`.

I have a need for this feature (see also https://github.com/zarr-developers/zarr-python/issues/486#issuecomment-1677116689). It seems that adding `fill_missing_chunks` as an `__init__` argument rather than introducing a new `set_options` mechanism is the simplest way forward. Happy to provide a PR if no one else is working on it.
Agreed with setting this parameter in `__init__`. We don't need to add `set_options` for this. PR would be welcome @tomwhite
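A minimal sketch of that agreed design, with illustrative names throughout (this is not zarr's real `Array` class): the flag is captured in `__init__` and consulted on every chunk lookup.

```python
# Hypothetical sketch of fill_missing_chunks as a plain __init__ argument;
# class and method names are illustrative, not zarr's actual internals.
class Array:
    def __init__(self, store, fill_value=0, fill_missing_chunks=True):
        self._store = store
        self._fill_value = fill_value
        self._fill_missing_chunks = fill_missing_chunks

    def _get_chunk(self, key):
        try:
            return self._store[key]
        except KeyError:
            if self._fill_missing_chunks:
                return self._fill_value  # pretend the chunk is all fill_value
            raise  # let the missing chunk surface to the caller
```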
Bumping this if still relevant. We are having issues with the "missing rectangles" with S3, similar to https://github.com/pangeo-data/pangeo/issues/691.
This seems to persist when using `fsspec` with `fs.get_mapper("s3://*.zarr")`, but seems to go away if we use `s3fs.S3FileSystem` and `fs.get_mapper(...)`. It seems like a rate-limit issue on S3, just like with GCS. I am wondering if there was an upstream `s3fs` fix as well (as mentioned in the above issue thread for `gcsfs`).
It would be nice to fix upstream in Zarr if I'm understanding the issue correctly.
EDIT:
We were able to fix this issue on our AWS SageMaker instances a while back by using this at the top of our scripts:
```python
import boto3
import s3fs

session = boto3.Session()
credentials = session.get_credentials()
fs = s3fs.S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
)
```
I just fixed the most recent manifestation of this issue, which was showing up in our AWS Batch jobs run through Kedro with `distributed.LocalCluster()` on on-demand EC2 instances reading from S3. I used a similar setup as above, but was also testing adding `skip_instance_cache`. It seems to have fully resolved our issues.
```python
session = boto3.Session()
credentials = session.get_credentials()
fs = s3fs.S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
    # This seems to help with writing
    # https://github.com/pydata/xarray/issues/3831#issuecomment-1768393788
    skip_instance_cache=True,
)
```
The above code replaces the following setup we were using. I would assume `fsspec` calls `s3fs` under the hood when `s3` is declared as the protocol, and I don't think this is a credential timeout issue, since it was happening on 3-minute jobs (although many were being launched in parallel at once and targeting similar zarr stores, which is why I think it is a rate-limiting issue). So I'm surprised that just switching to `s3fs` fixes the problem for us.

```python
fs = fsspec.filesystem(self._protocol, **self._storage_options)
```
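As background on why `skip_instance_cache` could matter here (this is my reading of fsspec's documented instance caching, not something verified against these jobs): `fsspec.filesystem(...)` caches filesystem instances keyed by protocol and constructor arguments, so parallel jobs built with identical options share one instance and its internal state. The caching itself is easy to demonstrate with fsspec's in-memory filesystem:

```python
import fsspec

# fsspec caches filesystem instances by protocol + constructor arguments,
# so repeated calls with the same arguments return the very same object.
a = fsspec.filesystem("memory")
b = fsspec.filesystem("memory")
assert a is b  # same cached instance

# skip_instance_cache=True bypasses the cache and builds a fresh instance.
c = fsspec.filesystem("memory", skip_instance_cache=True)
assert a is not c
```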