Feature Request: support for xarray objects
First, fantastic work! provenance has a lot of features I am looking for 😀.
Second, I would like to extend provenance's functionality to support one of my use cases, namely tracking the provenance of xarray objects. Currently, using provenance with xarray objects results in an error:
```python
In [1]: import provenance as p
   ...:
   ...: p.load_config({'blobstores':
   ...:                {'disk': {'type': 'disk',
   ...:                          'cachedir': 'artifacts',
   ...:                          'read': True,
   ...:                          'write': True,
   ...:                          'read_through_write': False,
   ...:                          'delete': True}},
   ...:                'artifact_repos':
   ...:                {'local': {'type': 'postgres',
   ...:                           'db': 'postgresql://localhost/provenance-basic-example',
   ...:                           'store': 'disk',
   ...:                           'read': True,
   ...:                           'write': True,
   ...:                           'create_db': True,
   ...:                           'read_through_write': False,
   ...:                           'delete': True}},
   ...:                'default_repo': 'local'})
Out[1]: <provenance.repos.Config at 0x116021fd0>

In [2]: import xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm')

In [4]: ds
Out[4]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 ...
    yc       (y, x) float64 ...
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 ...
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...

In [5]: @p.provenance
   ...: def anomaly(ds, groupby='time.year'):
   ...:     group = ds.groupby(groupby)
   ...:     clim = group.mean()
   ...:     return ds - clim
   ...:

In [6]: anom = anomaly(ds.Tair)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/opt/miniconda3/envs/sandbox/lib/python3.8/site-packages/xarray/core/common.py in __setattr__(self, name, value)
    261         try:
--> 262             object.__setattr__(self, name, value)
    263         except AttributeError as e:

AttributeError: 'DataArray' object has no attribute '_provenance_metadata'

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-6-a0084989764c> in <module>
----> 1 anom = anomaly(ds.Tair)

~/devel/ncar/provenance/provenance/core.py in wrapped(f)
    680     if tags:
    681         _custom_fields['tags'] = tags
--> 682     f._provenance_metadata = {
    683         'version': version,
    684         'name': name,

~/opt/miniconda3/envs/sandbox/lib/python3.8/site-packages/xarray/core/common.py in __setattr__(self, name, value)
    268         ):
    269             raise
--> 270         raise AttributeError(
    271             "cannot set attribute %r on a %r object. Use __setitem__ style"
    272             "assignment (e.g., `ds['name'] = ...`) instead of assigning variables."

AttributeError: cannot set attribute '_provenance_metadata' on a 'DataArray' object. Use __setitem__ styleassignment (e.g., `ds['name'] = ...`) instead of assigning variables.
```
I would like to help with this, but first I want to confirm whether this is something the provenance devs would be willing to support, and whether I may be missing something.
Hi @andersy005, in this case I think the problem is that you need to *call* the provenance decorator: instead of `@p.provenance` it should be `@p.provenance()`.
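For reference, the corrected decoration of your function from above should look like this (only the decorator line changes):

```python
import provenance as p

@p.provenance()  # note the parentheses: p.provenance must be called to build the decorator
def anomaly(ds, groupby='time.year'):
    group = ds.groupby(groupby)
    clim = group.mean()
    return ds - clim
```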
More generally, I haven't tested xarray with provenance. By default all objects will be pickled with joblib, which is probably not what you want for xarray. To avoid the default pickling logic you should register a serializer for the xarray types so that the IO functions provided by xarray are used instead. Take a look at this example of how pandas DataFrames are set up to use Parquet instead of pickle: https://github.com/bmabey/provenance/blob/d946b583e5cbe30c7a2cb2f6c74eaec2ef4d09ab/provenance/serializers.py#L78-L88
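Stripped down, that pandas setup amounts to something like the following sketch (an illustration, not the exact library code; it assumes pandas' `to_parquet`/`read_parquet` and the `register_serializer` call shown at the link):

```python
import pandas as pd
import provenance as p

# Illustrative sketch of a Parquet-backed serializer registration,
# modeled on the linked pandas example.
def pd_df_parquet_dump(df, filename, **kwargs):
    return df.to_parquet(filename, **kwargs)

def pd_df_parquet_load(filename, **kwargs):
    return pd.read_parquet(filename, **kwargs)

p.serializers.register_serializer('pd_df_parquet', pd_df_parquet_dump,
                                  pd_df_parquet_load, classes=[pd.DataFrame])
```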
Thank you for getting back to me @bmabey!
> To avoid the default pickling logic you should register a serializer for the xarray type so you can use the provided IO functions for xarray.
👍
I tried registering xarray serializers, but I ran into `TypeError: save_global() missing 1 required positional argument: 'obj'`. Do you happen to know whether this has something to do with cloudpickle, or is it something else I may be missing? I am on cloudpickle v1.4.1.
```python
import xarray as xr
import provenance as p

ds = xr.tutorial.open_dataset('rasm')

# Register xarray serializers via netCDF
def xr_dataset_netcdf_dump(ds, filename, **kwargs):
    return ds.to_netcdf(filename, **kwargs)

def xr_dataset_netcdf_load(filename, **kwargs):
    return xr.open_dataset(filename, **kwargs)

p.serializers.register_serializer('xr_dataset', xr_dataset_netcdf_dump, xr_dataset_netcdf_load,
                                  classes=[xr.Dataset])

@p.provenance()
def anomaly(ds, groupby='time.year'):
    """Compute annual anomalies"""
    group = ds.groupby(groupby)
    clim = group.mean()
    return ds - clim
```

```python
%%time
anom = anomaly(ds.Tair)
```
Stacktrace:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

<boltons.funcutils.FunctionBuilder-4> in anomaly(ds, groupby)

~/devel/ncar/provenance/provenance/core.py in _provenance_wrapper(*args, **kargs)
    274             inputs['filehash'] = value_id
    275
--> 276         input_hashes, input_artifact_ids = hash_inputs(inputs, repos.get_check_mutations(), func_info)
    277
    278         id = create_id(input_hashes, **func_info['identifiers'])

~/devel/ncar/provenance/provenance/core.py in hash_inputs(inputs, check_mutations, func_info)
    106
    107     for k, v in inputs['kargs'].items():
--> 108         h, artifacts = hash(v, hasher=ah.artifact_hasher())
    109         kargs[k] = h
    110     for a in artifacts:

~/devel/ncar/provenance/provenance/hashing.py in hash(obj, hasher, hash_name, coerce_mmap)
    279         hasher = Hasher(hash_name=hash_name)
    280
--> 281     return hasher.hash(obj)
    282
    283

~/devel/ncar/provenance/provenance/artifact_hasher.py in hash(self, obj)
     41
     42     def hash(self, obj):
---> 43         return (h.NumpyHasher.hash(self, obj), self.artifacts.values())
     44
     45

~/devel/ncar/provenance/provenance/hashing.py in hash(self, obj)
     79     def hash(self, obj):
     80         try:
---> 81             self.dump(obj)
     82         except pickle.PicklingError as e:
     83             e.args += ('PicklingError while hashing %r: %r' % (obj, e),)

~/opt/miniconda3/envs/sandbox/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    546     def dump(self, obj):
    547         try:
--> 548             return Pickler.dump(self, obj)
    549         except RuntimeError as e:
    550             if "recursion" in e.args[0]:

TypeError: save_global() missing 1 required positional argument: 'obj'
```
@andersy005 The serializers that you wrote look good. For context, pickle (well, cloudpickle) is still being used to compute the hash of the object that is used as a key in the database. Do you know if xarray has a more custom and efficient way of computing hashes for a dataset? For example, zarr `Array`s have a `digest` method that returns a digest/hash of the data. If xarray has something similar we should use that, since it would be faster and more reliable than the default hasher we have taken from joblib. (To override provenance's default hashing behavior you would register a `value_repr` function for the `Dataset` type. Let me know if there is a digest available for xarray and I can provide an example of how to do this if it would be helpful.)
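In the meantime, here is a rough sketch of the kind of `value_repr` function I have in mind. Treat it as hypothetical: the digest scheme and the object-dtype fallback are assumptions, not tested against provenance.

```python
import hashlib

def xr_dataset_value_repr(ds):
    """Hypothetical value_repr: digest the raw bytes of each variable
    instead of pickling the whole Dataset."""
    digest = hashlib.sha256()
    for name in sorted(ds.variables):
        var = ds.variables[name]
        digest.update(str(name).encode())
        if var.dtype == object:
            # Object-dtype arrays (e.g. cftime coordinates) have no stable
            # raw bytes, so fall back to a string representation.
            digest.update(str(var.values.tolist()).encode())
        else:
            digest.update(var.values.tobytes())
    return digest.hexdigest()
```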
As a sanity test... can you save your xarray Dataset using joblib? If so, it may be that with newer versions this would all just work.
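Something along these lines should be enough to check (loading the data into memory first, since lazily-backed datasets may not pickle cleanly):

```python
import joblib
import xarray as xr

# Load into memory so no open file handles get in the way of pickling.
ds = xr.tutorial.open_dataset('rasm').load()

# Round-trip the Dataset through joblib and compare.
joblib.dump(ds, 'rasm.joblib')
ds2 = joblib.load('rasm.joblib')
assert ds2.identical(ds)
```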