cf-xarray icon indicating copy to clipboard operation
cf-xarray copied to clipboard

Support `flag_masks`

Open dcherian opened this issue 4 years ago • 24 comments

From http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#flags

int basin(lat, lon);
       standard_name: region;
       flag_values: 1, 2, 3;
       flag_meanings:"atlantic_arctic_ocean indo_pacific_ocean global_ocean";
data:
   basin: 1, 1, 1, 1, 2, ..... ;

It'd be nice to provide .cf.__eq__ and allow

ds.where(basin.cf == "atlantic_arctic_ocean")  # becomes ds.where(basin == 1)

or ds.where(basin.cf.flag == "atlantic_arctic_ocean"))?

We could do something similar for .isin too

I don't understand flag_masks but we could probably do something similar

  byte sensor_status_qc(time, depth, lat, lon) ;
    sensor_status_qc:long_name = "Sensor Status" ;
    sensor_status_qc:_FillValue = 0b ;
    sensor_status_qc:valid_range = 1b, 63b ;
    sensor_status_qc:flag_masks = 1b, 2b, 4b, 8b, 16b, 32b ;
    sensor_status_qc:flag_meanings = "low_battery processor_fault
                                      memory_fault disk_fault
                                      software_fault
                                      maintenance_required" ;

dcherian avatar Apr 13 '21 02:04 dcherian

Am I right, that this does not work yet? I am about to work over at https://github.com/jbusecke/cmip6_preprocessing/issues/119 for a bit, and just wanted to check.

jbusecke avatar Jul 28 '21 15:07 jbusecke

No but you can implement it!

How are you planning to use it? in .where?

dcherian avatar Jul 28 '21 15:07 dcherian

Yes. This is a prototype of the code I am writing right now:

from cmip6_preprocessing.regionmask import merged_mask, _default_merge_dict

def basin_mask(ds):
    basins = regionmask.defined_regions.natural_earth.ocean_basins_50
    basins = merged_mask(basins, ds)
    
    # add cf compatible metadata
    basins.name = 'basin'
    flag_meanings = [m.replace(' ','_').lower() for m in list(_default_merge_dict().keys())]
    flag_values = np.arange(len(flag_meanings)).astype(str)
    basins.attrs.update({'flag_values':','.join(flag_values), 'flag_meanings':' '.join(flag_meanings)})
    return basins

def mask_basin(ds, basin, drop=False):
    mask = basin_mask(ds)
    
    # this should be done with cf_xarray once https://github.com/xarray-contrib/cf-xarray/issues/199 is implemented.
    # but for now some hardcoded hack...
    flag_meanings = mask.attrs['flag_meanings'].split(' ')
    flag_values = mask.attrs['flag_values'].split(',')
    flag_zipped = zip(flag_meanings, flag_values)
    print(flag_zipped)
    if basin not in flag_meanings:
        raise ValueError(f'Did not find basin [{basin}] in given values [{flag_meanings}]')
    else:
        flag_matched = [f for f in flag_zipped if f[0] == basin][0]
    
    print(flag_matched)
    ds_masked = ds.where(mask==int(flag_matched[1]), drop=drop)
    ds_masked.attrs['basin_masked'] = flag_matched[0]
    
    return ds_masked

This works very nicely in my work context (with any CMIP6 model):

ds = data_dict['CMIP.MIROC.MIROC6.historical.Omon.gn']
# basin_test = basin_mask(ds)
ds_masked = mask_basin(ds, 'southern_ocean')
plt.figure()
ds.thetao.isel(time=0, lev=0, member_id=0).plot()
plt.figure()
ds_masked.thetao.isel(time=0, lev=0, member_id=0).plot()

image

image

I would like to replace these lines

    flag_meanings = mask.attrs['flag_meanings'].split(' ')
    flag_values = mask.attrs['flag_values'].split(',')
    flag_zipped = zip(flag_meanings, flag_values)
    print(flag_zipped)
    if basin not in flag_meanings:
        raise ValueError(f'Did not find basin [{basin}] in given values [{flag_meanings}]')
    else:
        flag_matched = [f for f in flag_zipped if f[0] == basin][0]
    
    print(flag_matched)
    ds_masked = ds.where(mask==int(flag_matched[1]), drop=drop)

with e.g.

ds_masked = ds.where(mask.cf == basin)

Thought?

jbusecke avatar Jul 28 '21 15:07 jbusecke

Looks nice. I think I like the more explicit version with a CfFlags class:

  1. mask.cf.flag == basin and
  2. mask.cf.flag.isin(["atlantic_ocean", "pacific_ocean"])

We should also add a flags section to the repr.

PR?

dcherian avatar Jul 28 '21 16:07 dcherian

I dont think I can put one in today (real WTF energy over here haha). But would love to do this at a later point!

jbusecke avatar Jul 28 '21 16:07 jbusecke

Leaving this open for flag masks. Contributions welcome!

dcherian avatar Jul 29 '21 16:07 dcherian

Oh nice, didn't know about this - maybe I could use this for 2D masks in regionmask somehow. (It's always a bit a bummer that there is no way to identify regions by name for 2D masks.)

mathause avatar Mar 17 '22 10:03 mathause

Can you provide an example of what you want?

The documentation is decent: https://cf-xarray.readthedocs.io/en/latest/flags.html

dcherian avatar Mar 17 '22 12:03 dcherian

@mathause if we could work functionality like in my example above into regionmask that would be amazing. Together with this I might be able to completely remove the regionmask module from cmip6_preprocessing (I would really like that!). Let me know if you would like some review, comments or anything else that would help.

jbusecke avatar Mar 17 '22 17:03 jbusecke

Can you provide an example of what you want?

regionmask encodes the region a grid point with a number. So this would allow to also know the name of the region going along with the number. Very similar to the example given in the documentation. So I guess flag_masks is not required for this (but I am not entirely sure I correctly understand this part.)

if we could work functionality like in my example above into regionmask that would be amazing.

@jbusecke: which part do you mean? Should we discuss this in regionmask/regionmask#346?

mathause avatar Mar 30 '22 13:03 mathause

I think flag masks are more for quality control stuff but I don't really understand them either

dcherian avatar Mar 30 '22 20:03 dcherian

@jbusecke: which part do you mean? Should we discuss this in https://github.com/regionmask/regionmask/issues/346?

Sorry this was not clear. What I meant was that if both https://github.com/nvkelso/natural-earth-vector/issues/697 and https://github.com/regionmask/regionmask/issues/346 would work, I would not have to do any of this in cmip6_pp (or maybe just provide a thin wrapper. That would be amazing! Ill move over to your issue and ref this for further discussion.

jbusecke avatar Mar 31 '22 14:03 jbusecke

Hi! I needed to extract flags masks recently and delved into it a bit. I never used cf-xarray but I am up for working on adding support for flag_masks. I stumbled upon this bit of code that can unpack (binary) flags values into a array of booleans. I guess it can adapted for our uses, and maybe expanded directly for dask arrays. This method of unpacking all flags at once can inflate data (in my works it goes from a single NC_SHORT variable to 16 boolean arrays). Maybe I can work on extracting only the flags the user is interested in.

So if no one is already on that I can start working on a PR !

Descanonge avatar Jul 13 '22 15:07 Descanonge

A PR is very welcome!

extract flags masks

What does this mean? You need a time series of booleans for each bit in the mask?

@rsignell-usgs, @ocefpaf: do you have suggestions on what might be useful here?

dcherian avatar Jul 13 '22 15:07 dcherian

extract flags masks

What does this mean? You need a time series of booleans for each bit in the mask?

The flag variable codes for as many boolean masks as indicated by the flag_masks attribute. So instead of unpacking everything maybe we could only retrieve one boolean mask (with the same dimensions as the flag variable) based on its corresponding flag_meanings.

Descanonge avatar Jul 13 '22 16:07 Descanonge

So for

    sensor_status_qc:flag_masks = 1b, 2b, 4b, 8b, 16b, 32b ;
    sensor_status_qc:flag_meanings = "low_battery processor_fault
                                      memory_fault disk_fault
                                      software_fault
                                      maintenance_required" ;

So you want sensor_status_qc.cf == "memory_fault" to give a boolean time series?

dcherian avatar Jul 13 '22 16:07 dcherian

For example 3.5 it should return a boolean array with dimensions (time, depth, lat, lon), as for byte sensor_status_qc(time, depth, lat, lon).

Descanonge avatar Jul 13 '22 16:07 Descanonge

Ah yes that's right. I would be happy to merge a PR with this functionality =) > The core code for flag variables is https://github.com/xarray-contrib/cf-xarray/blob/1f590210a1e9d1443c6b3f273fede5adc42085b2/cf_xarray/accessor.py#L924-L935

dcherian avatar Jul 13 '22 16:07 dcherian

I'll start working on that then. I still have to see how to make it work with example 3.6 which combines flag_masks and flag_values...

Descanonge avatar Jul 13 '22 16:07 Descanonge

@DocOtak Are "flag masks" something you deal with at CCHDO? Do you have opinions on how cf-xarray can provide a useful interface for taking advantage of them?

dcherian avatar Jul 28 '22 15:07 dcherian

Pinging @kwilcox here b/c he may be interested in this.

ocefpaf avatar Jul 28 '22 18:07 ocefpaf

@dcherian We are still using the WOCE QC flags (all three versions of them) so flag masks aren’t something we deal with. Flag masks as I read them, are basically a “compression” technique for encoding multiple states and the allowed combinations. Basically, I think the most useful interface would be one that hides flag masks and just lets me ask if a flag value is set and not care how it is encoded.

Given the following example from the CF document:

  byte sensor_status_qc(time, depth, lat, lon) ;
    sensor_status_qc:long_name = "Sensor Status" ;
    sensor_status_qc:standard_name = "status_flag" ;
    sensor_status_qc:_FillValue = 0b ;
    sensor_status_qc:valid_range = 1b, 15b ;
    sensor_status_qc:flag_masks = 1b, 2b, 12b, 12b, 12b ;
    sensor_status_qc:flag_values = 1b, 2b, 4b, 8b, 12b ;
    sensor_status_qc:flag_meanings =
         "low_battery
          hardware_fault
          offline_mode calibration_mode maintenance_mode" ;

There are three independent states values and I might expect to be able to do boolean logic on these:

  • low_battery (true/false so only one bit needed)
  • hardware_fault (true/false so only one bit needed)
  • *_mode (with 3 possible values, two bits needed)

The masks are basically which bits you look at to determine the value for these states (I didn’t look, but assume the bitwise and has precedence over the equality test):

  • low_battery -> where (variable value) & 1b == 1b
  • hardware_fault -> where (variable value) & 2b == 2b
  • offline_mode -> where (variable value) & 12b == 4b
  • calibration_mode -> where (variable value) & 12b == 8b
  • maintenance_mode -> where (variable value) & 12b == 12b

I suspect the easiest thing to understand as a user would be just some “virtual” boolean variables with a similar property interface as xr.Dataset, assuming the flag meanings are compatible with being properties, just fallback to the usual “getter” protocol if not. (I’m also guessing these should be somewhat lazy in their evaluation):

  • ds.cf.flag.low_battery => (some boolean array)
  • ds.cf.flag.offline_mode => (some boolean array)

This would let me do the flag logic myself and not really need to figure out a (new?) query interface…

Things with low battery and a hardware fault:

  • ds.cf.flag.low_battery & ds.cf.flag.hardware_fault

Things in maintenance_mode:

  • ds.cf.flag.maintenance_mode

This combination is not allowed by the flag masks/values and would result in an all false, I wouldn’t expect it to throw:

  • ds.cf.offline_mode & ds.cf.calibration_mode

It can get verbose… things with a low battery an no other issues:

  • ds.cf.flag.low_battery & ~ds.cf.flag.low_battery & ~(ds.cf.flag.offline_mode | ds.cf.flag.calibration_mode | ds.cf.flag. maintenance_mode)

DocOtak avatar Jul 28 '22 23:07 DocOtak

ds.cf.flag.low_battery => (some boolean array)

I like this idea.

ds.cf.flags could return a dictionary or namedtuple mapping "flag_meaning" to a boolean mask.

dcherian avatar Aug 09 '22 21:08 dcherian

Thank you @DocOtak, this is well explained, and I also like the proposed interface. I can look into implementing it.

One thing though, the interface implemented so far had inequalities supported and I am not sure in what cases it could be useful to write

ds.cf > 'some_flag'

I did not port it for flag_masks as it make things a bit more complicated. But maybe someone here has a use case for it ?

Descanonge avatar Aug 09 '22 22:08 Descanonge

For those interested, #354 is almost ready but could use some testing for correctness and feedback on whether the current API is useful.

Thanks to @Descanonge !

The current design is

da.cf.flags -> Dataset

which I quite like.

dcherian avatar Feb 23 '23 04:02 dcherian

This is great to hear, excited to see it go live!

Thanks for your work on this!

mcastells avatar Apr 25 '23 19:04 mcastells

Can you test out #354 for your use case please?

dcherian avatar Apr 25 '23 19:04 dcherian

Sure, I'll give it a shot.

mcastells avatar Apr 25 '23 21:04 mcastells