
How to prevent Zarr from returning NaN for missing chunks?

Open willirath opened this issue 6 years ago • 39 comments

Is there a way of preventing Zarr from returning NaNs if a chunk is missing?

Background of my question: We're seeing problems either with copying data to GCS or with GCS failing to reliably serve all chunks of a Zarr store.

In arr below, there are two types of NaN-filled chunks returned by Zarr.

from dask import array as darr
import numpy as np

arr = darr.from_zarr("gs://pangeo-data/eNATL60-BLBT02X-ssh/sossheig/")

First, there's a chunk that is completely flagged missing in the data (the chunk is over land in an ocean dataset) but present on GCS (https://console.cloud.google.com/storage/browser/_details/pangeo-data/eNATL60-BLBT02X-ssh/sossheig/0.0.0), and Zarr correctly finds all items marked as invalid:

np.isnan(arr.blocks[0, 0, 0]).mean().compute()
# -> 1.0

Then, there's a chunk (https://console.cloud.google.com/storage/browser/_details/pangeo-data/eNATL60-BLBT02X-ssh/sossheig/0.7.3) that is not present (at the time of writing this, I get a "load failed" and a tracking id from GCS) and Zarr returns all items marked invalid as well:

np.isnan(arr.blocks[0, 7, 3]).mean().compute()
# -> 1.0

How do I make Zarr raise an Exception on the latter?

cc: @auraoupa related: pangeo-data/pangeo#691

willirath avatar Oct 19 '19 11:10 willirath

Looks like it's necessary to override https://github.com/zarr-developers/zarr-python/blob/e7708c948d2c0ff91863d675c1072b0dfe9ce2a6/zarr/core.py#L1581 et al to make zarr raise on non-existing chunks?

willirath avatar Oct 19 '19 14:10 willirath

Have you tried setting the fill value?

jakirkham avatar Oct 19 '19 14:10 jakirkham

Yes this is not currently supported, but would be straightforward to add. I've actually been thinking myself recently that this would be useful for very similar reasons, i.e., when you want to ensure that an exception is raised when trying to access data from a missing chunk.

One way to achieve this could be to add a mechanism for activating this behaviour explicitly. E.g., something like:

z = ...  # some zarr array
z.set_options(fill_missing_chunk=False)

Thoughts and suggestions welcome.
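Until an option like that exists, one workaround is to wrap the store so that absent keys raise something other than KeyError, which zarr would otherwise translate into fill values. This is an editor's sketch under that assumption; RaiseOnMissingStore is a hypothetical helper, not part of zarr:

```python
from collections.abc import MutableMapping


class RaiseOnMissingStore(MutableMapping):
    """Wrap a zarr v2 store so missing keys fail loudly.

    Hypothetical helper: zarr treats KeyError as "chunk uninitialized"
    and substitutes fill_value, so we raise RuntimeError instead.
    """

    def __init__(self, store):
        self._store = store

    def __getitem__(self, key):
        try:
            return self._store[key]
        except KeyError:
            # Raise a non-KeyError so zarr cannot silently fill the chunk.
            raise RuntimeError(f"store key not found: {key}")

    # Delegate membership tests so zarr's "does .zarray exist?" style
    # probes keep working without triggering RuntimeError.
    def __contains__(self, key):
        return key in self._store

    def __setitem__(self, key, value):
        self._store[key] = value

    def __delitem__(self, key):
        del self._store[key]

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)
```

Opening an array as zarr.open(RaiseOnMissingStore(inner_store)) should then make reads touching a truly absent chunk raise rather than return fill values; note this also disables legitimate fill-value behaviour for intentionally absent chunks.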


alimanfoo avatar Oct 19 '19 22:10 alimanfoo

Just to say that I still think this is a valid thing to address in zarr, but there could/should also be some work upstream in fsspec to ensure that KeyError is only ever raised by the store in the case where it is certain that the underlying file/object does not exist. If some other problem is encountered, such as a transient error attempting to read the file/object, then fsspec should raise some other kind of error, which would get propagated up through zarr. I've raised an issue here: https://github.com/intake/filesystem_spec/issues/255

alimanfoo avatar Mar 24 '20 17:03 alimanfoo

cc @martindurant

jakirkham avatar Mar 29 '20 21:03 jakirkham

(was fixed in https://github.com/intake/filesystem_spec/pull/259 )

martindurant avatar Apr 01 '20 12:04 martindurant

Thanks @martindurant for the upstream fix.

I think I will reopen this issue, however, as there may still be use cases where you want to change zarr's behaviour. E.g., you may know that you definitely do have some missing chunks in the data, and you want to make sure you don't accidentally request any data from a region overlapping a missing chunk.

alimanfoo avatar Apr 01 '20 12:04 alimanfoo

Hi @alimanfoo and @willirath - thanks for raising this. Just ran across this issue while using zarr on google cloud. With huge jobs there are always I/O failures, but I never would have expected this behavior:

import xarray as xr
import numpy as np
import os

ds = xr.Dataset({
    'x': [0, 1],
    'y': [0, 1],
    'myarr': (('x', 'y'), [[0., np.nan], [2., 3.]]),
})

ds.chunk({'x': 1, 'y': 1}).to_zarr('myzarr.zarr')

# chunk file disappears due to bad write, storage failure, gremlins...
os.remove('myzarr.zarr/myarr/1.0')

# I would LOVE for this to return an error
read_ds = xr.open_zarr('myzarr.zarr').compute()

# instead the error is here
xr.testing.assert_equal(ds, read_ds)

This seems pretty problematic when NaNs are actually meaningful: without additional inspection of the filesystem and zarr metadata, it's impossible to know whether a NaN is a real NaN, a failed write from a previous step, or just the chaos monkeys that live on cloud systems generally.

Is there a patch you can recommend as a workaround? I'm less familiar with the zarr API as I've only used it via xarray.
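For a directory-backed store, one way to audit for dropped chunk files is to enumerate the chunk keys the array should have from its .zarray metadata and check them against the filesystem. This is an editor's sketch; missing_chunks is a hypothetical helper that assumes the zarr v2 on-disk layout with default "."-separated chunk keys, and it cannot distinguish intentionally uninitialized chunks from failed writes:

```python
import itertools
import json
import math
import os


def missing_chunks(array_dir):
    """Return chunk keys that a zarr v2 array directory should contain
    but does not (hypothetical helper; zarr treats absent chunks as
    uninitialized, so "missing" may be intentional)."""
    with open(os.path.join(array_dir, ".zarray")) as f:
        meta = json.load(f)
    # number of chunks along each dimension
    counts = [math.ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])]
    missing = []
    for idx in itertools.product(*(range(n) for n in counts)):
        key = ".".join(map(str, idx))
        if not os.path.exists(os.path.join(array_dir, key)):
            missing.append(key)
    return missing
```

Run against the example above, missing_chunks("myzarr.zarr/myarr") would report the chunk that was removed, which at least lets you fail a pipeline step when chunks you expected to exist are gone.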

delgadom avatar Apr 20 '21 23:04 delgadom

Did you see https://github.com/zarr-developers/zarr-python/pull/489#issuecomment-823656711, @delgadom? Perhaps give that some testing to help drive it forward?

joshmoore avatar Apr 21 '21 07:04 joshmoore

The fsspec mapper is designed to let you specify which basic exceptions get turned into KeyError (i.e., missing data) and which do not. I cannot quite seem to make it work for this use case though; I expect the code needs a little work. Also, there is a difference between creating an fsspec mapper instance and passing it to open_zarr versus passing a "file://..." URL and opening it in zarr (which uses FSStore, a wrapper with its own exception-catching mechanism).

I think the bug is in zarr _chunk_getitems calling .getitems(ckeys, on_error="omit") - it should raise for any error except KeyError.
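The intended semantics described above could be sketched like this. This is an editor's illustration, not zarr's actual _chunk_getitems: only KeyError means "chunk missing, use the fill value"; anything else propagates to the caller:

```python
def getitems_strict(store, keys):
    """Fetch chunk payloads for the given keys (illustrative sketch).

    Only KeyError is treated as "chunk genuinely absent" and omitted
    from the result, so the caller can substitute fill_value for it.
    Any other exception (I/O failure, transient network error, ...)
    is re-raised instead of being silently swallowed.
    """
    out = {}
    for key in keys:
        try:
            out[key] = store[key]
        except KeyError:
            # genuinely absent chunk -> caller substitutes fill_value
            pass
    return out
```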

martindurant avatar Apr 21 '21 13:04 martindurant

@martindurant : do you assume #489 is unneeded then?

joshmoore avatar Apr 21 '21 13:04 joshmoore

If we are willing to have people use fsspec for this kind of use case, then it can be fixed in fsspec and zarr's fsspec-specific code. This is an alternative option, one that fsspec would potentially benefit from too (although zarr is probably the only main user of the mapper interface). Of course, I don't mind if there's another way to achieve the same goal.

martindurant avatar Apr 21 '21 13:04 martindurant

I think relying on fsspec to catch these kinds of intermittent read errors from object stores will do.

But I also see use cases where being able to raise the KeyError is necessary (think of a RedisStore which drops keys after some time) and just filling in .fill_value may not be the best thing to do.

willirath avatar Apr 21 '21 19:04 willirath

If I'm interpreting the fsspec code correctly, it is properly catching these errors and raising a KeyError in response (FileNotFoundError is caught by default in FSMap objects), and then #489 is allowing for a flag as to what you want to do with that KeyError in the context of opening a zarr store. So if I'm understanding this correctly, #489 is needed, while the fsspec behavior doesn't necessarily need changing for this case. Does that sound right?

(This is separate from the issue @martindurant brought up about open_zarr handling "file://" URLs and FSMap objects differently.)

If we are willing to have people use fsspec for this kind of use case, then it can be fixed in fsspec and zarr's fsspec-specific code. This is an alternative option, one that fsspec would potentially benefit from too (although zarr is probably the only main user of the mapper interface). Of course, I don't mind if there's another way to achieve the same goal.

I'm also not sure if there's a more fsspec-specific way that #489 should be implemented or whether the current approach makes sense. Does catching KeyError make sense regardless of what MutableMapping is used when instantiating an array, or is this something unique to fsspec?

bolliger32 avatar Apr 21 '21 20:04 bolliger32

Correct, I am saying that working in the fsspec code is a possible alternative to #489 - you could get the behaviour without that PR, but the fsspec code would need fixing, I don't think it's quite right now.

But yes, in general, KeyError is supposed to mean "use the fill value" when reading chunks. I'm not sure if this is explicitly documented.

martindurant avatar Apr 21 '21 20:04 martindurant

I read https://github.com/zarr-developers/zarr-python/blob/master/docs/spec/v2.rst to say that you should get a None if there's a KeyError when fill_value is None:

fill_value: A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used. ... There is no need for all chunks to be present within an array store. If a chunk is not present then it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it was uniformly filled with the value of the "fill_value" field in the array metadata. If the "fill_value" field is null then the contents of the chunk are undefined.

But trying out:

import zarr
import json
import numpy as np
import os

z = zarr.storage.DirectoryStore("fill_value_test")
a = zarr.creation.create(shape=(2, 2), chunks=(1, 1), store=z,
                         fill_value=None, dtype=np.int8)
a[:] = 1
with open("fill_value_test/.zarray") as o:
    meta = json.load(o)
print(f"fill={meta.get('fill_value', 'missing')}")
print(f"value={a[0][1]}")
print(f"value={a[0][0]}")
print(f"value={a[:]}")

os.remove("fill_value_test/0.0")

# re-read the now-missing chunk: with fill_value=None (null in the
# metadata) the spec says its contents are undefined
print(f"value after removal={a[0][0]}")

that's not the behavior I see. Newly created arrays seem to get random values for the missing chunk.

joshmoore avatar Apr 22 '21 06:04 joshmoore

@willirath: did you look into whether the behavior of a null fill value was behaving as you would have expected?

joshmoore avatar Apr 26 '21 06:04 joshmoore

@joshmoore No I didn't check this.

willirath avatar Apr 26 '21 06:04 willirath

FYI I tried to use fsspec's missing_exceptions, but it no longer works from commit 4e633ad9aa434304296900790c4c65e0fa0dfa12 onwards.

Here's the code I used to reproduce this:

import fsspec
import zarr

fs = fsspec.filesystem("file")

# create an array with no chunks on disk
mapper = fs.get_mapper("tmp.zarr")
za = zarr.open(mapper, mode="w", shape=(3, 3), chunks=(2, 2))

# ensure no exceptions are converted to KeyError
mapper = fs.get_mapper("tmp.zarr", missing_exceptions=())

# following should fail since chunks are missing
print(zarr.open(mapper, mode="r")[:])

tomwhite avatar Aug 14 '23 11:08 tomwhite

Thanks for reporting this Tom. I think I see why 4e633ad caused this problem. Now that all FSMap objects are coerced to FSStore objects, it looks like we have redundant mechanisms for dealing with exceptions. FSStore introduced the additional exceptions parameter, which is not being propagated correctly when we coerce from FSMap:

https://github.com/zarr-developers/zarr-python/blob/f542fca7d0d42ee050e9a49d57ad0f5346f62de3/zarr/storage.py#L1425-L1430

I thought I could make your example work by doing the following

store = zarr.storage.FSStore("tmp.zarr", exceptions=(), missing_exceptions=())

try:
    store['0.0']
except FileNotFoundError as e:
    print(type(e))
except KeyError as e:
    print(type(e))

At this point, we can verify that we are not raising a KeyError. However, it still fails to raise an error when accessing the array!

a2 = zarr.open(store, mode="r")
print(a2[:])

I have not been able to track down why that is.

rabernat avatar Aug 14 '23 13:08 rabernat

I suppose that coercing an FSMap->FSStore should do its best to get all available options; but we should still figure out why the exception is not bubbling up. Is zarr doing an explicit exists() call?

martindurant avatar Aug 14 '23 13:08 martindurant

I suppose that coercing an FSMap->FSStore should do its best to get all available options

One problem here is that there is redundant handling of exceptions in FSStore. The logic is implemented in the mapper object store.map via the missing_exceptions parameter, and then again at the FSStore level via the exceptions attribute (see __getitem__ code above).

rabernat avatar Aug 14 '23 14:08 rabernat

Ah you are right - if the user is coming with a ready made mapper, then the store simply won't see those exceptions. In that case, the higher level exceptions never get used unless they were exclusive - missing_exceptions=() seems right, then (but having them the same wouldn't hurt either).

martindurant avatar Aug 14 '23 14:08 martindurant

I'm a bit confused after reading this issue if improvements are still needed in both fsspec and zarr-python or just zarr-python (e.g., https://github.com/zarr-developers/zarr-python/pull/489). Is anyone able to clarify whether https://github.com/zarr-developers/zarr-python/pull/489 would be expected to work, or if https://github.com/zarr-developers/zarr-python/issues/486#issuecomment-1677116689 is blocking that PR? For reference, I believe this could help us solve issues with accessing our downscaled CMIP6 data on Planetary Computer (e.g., https://github.com/carbonplan/cmip6-downscaling/issues/323).

maxrjones avatar Feb 13 '24 16:02 maxrjones

We were talking about the mapper interface and FSStore, both of which are within zarr-python. fsspec's exception handling in FSMap is stable, and the question above is how zarr should handle it when given one of these rather than creating its own via FSStore (the latter is now the normal path, but the former still works).

I suppose for complete control, you can always do

fsm = FSMap(..., missing_exceptions=())
store = FSStore(fsm, missing_exceptions=(...))  # check call, I'm not sure how to pass this
group = zarr.open(store)

Also note that some storage backends differentiate between types of error. In particular, referenceFS raises ReferenceNotReachable(RuntimeError) for a key that should be there, because it's in the reference list, but for some maybe intermittent reason, failed to load.

martindurant avatar Feb 13 '24 16:02 martindurant

I'm a bit confused after reading this issue if improvements are still needed in both fsspec and zarr-python or just zarr-python (e.g., #489). Is anyone able to clarify whether #489 would be expected to work, or if #486 (comment) is blocking that PR? For reference, I believe this could help us solve issues with accessing our downscaled CMIP6 data on Planetary Computer (e.g., carbonplan/cmip6-downscaling#323).

I think there's still a problem in Zarr that needs fixing, whether or not #489 is added.

In the following code, FSStore will always pass on_error="omit" to fsspec, regardless of the setting of exceptions or missing_exceptions:

https://github.com/zarr-developers/zarr-python/blob/3db41760e18fb0a69b5066e8c7aba9752a8c474e/zarr/storage.py#L1414-L1421

Not sure what the fix is though...

tomwhite avatar Feb 15 '24 15:02 tomwhite

I'm so confused about this. I'm trying to run this example notebook, sometimes it returns data, and sometimes NaNs, even when querying the same spatiotemporal bounds on different calls! I'm not sure where to ask this, since it may be an S3 thing, but if someone knows the answer: why is this not consistent? How do I ensure my calls receive data?

I'm a Zarr-head, and believe this should be the easiest way to access this CF-compliant data, but how is this usable if you're playing Russian roulette trying to access data and coming up with blanks most of the time?

I'm not sure I'd rather see errors cancelling an entire operation, especially if some data is coming through. But I would like to be able to differentiate between missing data and understand what's hanging up S3 calls.

fsspec v2024.3.1, zarr v2.17.1

openSourcerer9000 avatar Mar 28 '24 17:03 openSourcerer9000

I'm not sure I'd rather see errors cancelling an entire operation, especially if some data is coming through. But I would like to be able to differentiate between missing data and understand what's hanging up S3 calls.

We really don't have a way of doing that. In another project I work on, dask-awkward, we have implemented IO reports that list all the failures, but that's in the context of dask tasks. Zarr only allows for data or nothing; you would need to have at least one other special value to show failure.

martindurant avatar Mar 28 '24 17:03 martindurant

Can you explain what you mean by special value?

How does anyone actually use this in practice? Hit the bucket a hundred times waiting to see if data ever appears? The NaNs are not consistent. I'm querying four datasets in the same bucket, one after another, in a pandas apply operation with no wait time. One time I get hit-miss-miss-miss (hit returning data and miss returning NaNs). I try again a while later and get miss-miss-miss-hit. It doesn't seem like an API timeout issue, or it would consistently be hit-miss-miss-miss. How many times would I have to run this to be confident that the middle two datasets are actually empty?
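One pragmatic mitigation, pending a real fix in the stack, is to re-read the same region a few times and require the NaN mask to be stable: transient fetch failures tend to move around between attempts, while genuinely missing data does not. This is an editor's sketch; stable_read is a hypothetical helper, not a zarr or xarray API, and a persistent outage would still fool it:

```python
import numpy as np


def stable_read(read_fn, attempts=3):
    """Call read_fn() several times and fail if the NaN pattern changes,
    which suggests transient fetch failures rather than real gaps
    (hypothetical helper; read_fn is any zero-argument callable that
    returns the same region as a NumPy array)."""
    first = read_fn()
    mask = np.isnan(first)
    for _ in range(attempts - 1):
        again = read_fn()
        if not np.array_equal(mask, np.isnan(again)):
            raise RuntimeError(
                "NaN pattern differs between reads; "
                "suspect transient storage errors"
            )
    return first
```

For example, stable_read(lambda: ds.myvar.sel(...).values) would raise when repeated reads disagree, instead of silently handing back a partially failed fetch.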

openSourcerer9000 avatar Mar 28 '24 18:03 openSourcerer9000

Using an FSMap directly with xarray open_dataset (or indeed zarr.open) is still allowed but deprecated - zarr will construct an FSStore over it, which is why any arguments you are passing are probably getting lost.

ds_single = xr.open_dataset(
    url,
    backend_kwargs=dict(
        storage_options=dict(
            anon=True, exceptions=(FileNotFoundError,)
        ),
        consolidated=True,
    ),
)

martindurant avatar Mar 28 '24 18:03 martindurant