zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

Allow disabling filling of missing chunks

Open willirath opened this issue 5 years ago • 17 comments

This is a first stab at solving #486 by overriding filling of missing chunks.

TODO:

  • [x] Add unit tests and/or doctests in docstrings
  • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [x] Changes documented in docs/release.rst
  • [x] Test coverage is 100% (Coveralls passes)

Not sure about the following todo's:

  • [ ] New/modified features documented in docs/tutorial.rst
  • [ ] Docs build locally (e.g., run tox -e docs)
  • [ ] AppVeyor and Travis CI passes

willirath avatar Oct 22 '19 14:10 willirath

I've added the initialization of the self._fill_missing_chunk attribute.

Regarding tests: Where should I add it? zarr/tests/test_core.py

I'd go for:

  • Ensure that _fill_missing_chunks is set after Array initialization.
  • Ensure that removing a chunk from a store (sufficient to use the default DictStore?) the KeyError
    • is raised for _fill_missing_chunks=False
    • is not raised otherwise

Regarding the structure: I'm not at all familiar with the internal design of zarr-python. Is this decentralized conditional raising really the way to go, or should this be abstracted away?

willirath avatar Oct 25 '19 09:10 willirath

Going to re-open to try to get travis green. Coveralls will stay red until a test is added.

joshmoore avatar Apr 27 '20 10:04 joshmoore

Codecov Report

Merging #489 (c7a2b16) into master (0fc6a1e) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #489   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files          32       33    +1     
  Lines       11256    11285   +29     
=======================================
+ Hits        11250    11279   +29     
  Misses          6        6           
Impacted Files Coverage Δ
zarr/core.py 100.00% <100.00%> (ø)
zarr/tests/test_missing.py 100.00% <100.00%> (ø)

codecov[bot] avatar Feb 18 '21 16:02 codecov[bot]

Hi @willirath, I've updated this branch and all existing tests are passing. Are you still interested in taking it forward?

joshmoore avatar Feb 18 '21 17:02 joshmoore

Thanks for pinging me. I'm still interested. It'll take a few days, though.

willirath avatar Feb 19 '21 08:02 willirath

@willirath just checking if you've had any time to work on this lately. This functionality would be super helpful and thanks for filing the PR! I'm happy to try to work on these tests if you'd like someone else to push it forward.

bolliger32 avatar Apr 20 '21 23:04 bolliger32

Hello @willirath! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2022-03-07 22:22:58 UTC

pep8speaks avatar Apr 21 '21 09:04 pep8speaks

@bolliger32 I've found some time to continue with this today. (:+1: for pinging me again. Adding some urgency usually helps finishing this kind of work.)

@joshmoore and @jrbourbeau I've added two tests relying on the zarr.create.array function for creating arrays with a MemoryStore from which it's easy to pop chunks.

From my PoV, this is ready for review.

willirath avatar Apr 21 '21 19:04 willirath

@willirath amazing! Many thanks! 🙏

bolliger32 avatar Apr 21 '21 19:04 bolliger32

Pinging for review again.

willirath avatar Dec 23 '21 11:12 willirath

Thanks for the ping, @willirath, definitely time. I personally don't foresee too much review capacity during the holidays, but will leave the tab open. Obviously, if anyone else could jump in, that'd be wonderful! 🎆

Just in case you missed them, #853 and the proceeding #738 from @d-v-b might be of interest.

joshmoore avatar Dec 23 '21 15:12 joshmoore

Ah, nice. With a clearer 2022 head, I notice just how complementary this is to @d-v-b's #753 and @jni's #853. In Zarr 2.11, the default will become to not serialize empty (i.e. fill_value_only) chunks. With this PR, the user can prevent empty chunks from being deserialized. :+1: I'm going to update the branch in order to trigger another round of tests.

(I slightly wonder if there's not a need to unify settings/options/arguments but that's likely out of scope)

joshmoore avatar Jan 06 '22 12:01 joshmoore

Still green after an update to 2.11.1.

Probably the biggest question from my side is what else should fall into the set_option category to know if the design is right.

joshmoore avatar Mar 07 '22 22:03 joshmoore

The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.

(This is orthogonal to whether these values should actually be .zarray metadata, and in that case, they might should be @property values like fill_value)

joshmoore avatar May 04 '22 11:05 joshmoore

The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.

I have a need for this feature (see also https://github.com/zarr-developers/zarr-python/issues/486#issuecomment-1677116689). It seems that adding fill_missing_chunks as an __init__ argument rather than introducing a new set_options mechanism is the simplest way forward. Happy to provide a PR if no one else is working on it.

tomwhite avatar Aug 14 '23 11:08 tomwhite

The more I come back to this, @willirath, the more I feel that either values like write_empty_chunks should also be part of set_options or fill_missing_chunks should be an __init__ argument like write_empty_chunks.

I have a need for this feature (see also #486 (comment)). It seems that adding fill_missing_chunks as an __init__ argument rather than introducing a new set_options mechanism is the simplest way forward. Happy to provide a PR if no one else is working on it.

Agreed with setting this parameter in __init__. We don't need to add set_options for this. PR would be welcome @tomwhite

d-v-b avatar Feb 13 '24 17:02 d-v-b

Bumping this if still relevant. We are having issues with the "missing rectangles" with S3, similar to https://github.com/pangeo-data/pangeo/issues/691.

This seems to persist with using fsspec with fs.get_mapper("s3://*.zarr"), but seems to go away if we use s3fs.S3FileSystem and fs.get_mapper(...). It seems like a rate-limit issue on S3, just like with GCS. I am wondering if there was an upstream s3fs fix as well (as mentioned in the above issue thread for gcsfs).

It would be nice to fix upstream in Zarr if I'm understanding the issue correctly.


EDIT:

We were able to fix this issue on our AWS Sagemaker instances awhile back with using this in the header:

import boto3
import s3fs
session = boto3.Session()
credentials = session.get_credentials()
fs = s3fs.S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
)

I just fixed the most recent manifestation of this issue that was showing up on our AWS Batch jobs through Kedro with distributed.LocalCluster() on on-demand EC2 instances reading from S3. I used a similar setup as above, but was testing adding the skip_instance_cache as well. It seems to have fully resolved our issues.

session = boto3.Session()
credentials = session.get_credentials()
fs = s3fs.S3FileSystem(
    key=credentials.access_key,
    secret=credentials.secret_key,
    token=credentials.token,
    # This seems to help with writing 
    # https://github.com/pydata/xarray/issues/3831#issuecomment-1768393788
    skip_instance_cache=True,
)

The above code replaces the following setup we were using. I would assume fsspec calls s3fs under the hood when s3 is declared as the protocol. And I don't think this is a credential timeout issue since this was happening on 3-minute jobs (although many were being launched in parallel at once and targeting similar zarr stores, hence why I thin kit is a rate-limiting issue). So I'm surprised that just switching to s3fs fixes the problem for us.

fs = fsspec.filesystem(self._protocol, **self._storage_options)

riley-brady avatar Mar 14 '24 22:03 riley-brady