Feature Request: support for xarray objects
First, fantastic work! provenance has a lot of features I am looking for 😀.
Second, I would like to extend provenance's functionality to support one of my use cases, namely tracking the provenance of xarray objects. Currently, using provenance with xarray objects results in an error:
```python
In [1]: import provenance as p
   ...:
   ...: p.load_config({'blobstores':
   ...:                {'disk': {'type': 'disk',
   ...:                          'cachedir': 'artifacts',
   ...:                          'read': True,
   ...:                          'write': True,
   ...:                          'read_through_write': False,
   ...:                          'delete': True}},
   ...:                'artifact_repos':
   ...:                {'local': {'type': 'postgres',
   ...:                           'db': 'postgresql://localhost/provenance-basic-example',
   ...:                           'store': 'disk',
   ...:                           'read': True,
   ...:                           'write': True,
   ...:                           'create_db': True,
   ...:                           'read_through_write': False,
   ...:                           'delete': True}},
   ...:                'default_repo': 'local'})
Out[1]: <provenance.repos.Config at 0x116021fd0>

In [2]: import xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm')

In [4]: ds
Out[4]:
<xarray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (y, x) float64 ...
    yc       (y, x) float64 ...
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 ...
Attributes:
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...
    institution:               U.W.
    source:                    RACM R1002RBRxaaa01a
    output_frequency:          daily
    output_mode:               averaged
    convention:                CF-1.4
    references:                Based on the initial model of Liang et al., 19...
    comment:                   Output from the Variable Infiltration Capacity...
    nco_openmp_thread_number:  1
    NCO:                       "4.6.0"
    history:                   Tue Dec 27 14:15:22 2016: ncatted -a dimension...

In [5]: @p.provenance
   ...: def anomaly(ds, groupby='time.year'):
   ...:     group = ds.groupby(groupby)
   ...:     clim = group.mean()
   ...:     return ds - clim
   ...:

In [6]: anom = anomaly(ds.Tair)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~/opt/miniconda3/envs/sandbox/lib/python3.8/site-packages/xarray/core/common.py in __setattr__(self, name, value)
    261         try:
--> 262             object.__setattr__(self, name, value)
    263         except AttributeError as e:

AttributeError: 'DataArray' object has no attribute '_provenance_metadata'

The above exception was the direct cause of the following exception:

AttributeError                            Traceback (most recent call last)
<ipython-input-6-a0084989764c> in <module>
----> 1 anom = anomaly(ds.Tair)

~/devel/ncar/provenance/provenance/core.py in wrapped(f)
    680     if tags:
    681         _custom_fields['tags'] = tags
--> 682     f._provenance_metadata = {
    683         'version': version,
    684         'name': name,

~/opt/miniconda3/envs/sandbox/lib/python3.8/site-packages/xarray/core/common.py in __setattr__(self, name, value)
    268         ):
    269             raise
--> 270         raise AttributeError(
    271             "cannot set attribute %r on a %r object. Use __setitem__ style"
    272             "assignment (e.g., `ds['name'] = ...`) instead of assigning variables."

AttributeError: cannot set attribute '_provenance_metadata' on a 'DataArray' object. Use __setitem__ styleassignment (e.g., `ds['name'] = ...`) instead of assigning variables.
```
I would like to help with this, but first I want to confirm whether this is something the provenance devs would be willing to support, and whether I may be missing something.
Hi @andersy005, in this case I think the problem is that you need to *call* the provenance decorator: instead of `@p.provenance` it should be `@p.provenance()`.
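For reference, the corrected decoration of your function from above should look like this (only the decorator line changes):

```python
import provenance as p

@p.provenance()  # note the parentheses: p.provenance must be called to build the decorator
def anomaly(ds, groupby='time.year'):
    group = ds.groupby(groupby)
    clim = group.mean()
    return ds - clim
```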
More generally, I haven't tested xarray with provenance. By default all objects will be pickled with joblib, which is probably not what you want for xarray. To avoid the default pickling logic you should register a serializer for the xarray types so that the IO functions provided by xarray are used instead. Take a look at this example of how pandas DataFrames are set up to use Parquet instead of pickle: https://github.com/bmabey/provenance/blob/d946b583e5cbe30c7a2cb2f6c74eaec2ef4d09ab/provenance/serializers.py#L78-L88
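Stripped down, that pandas setup amounts to something like the following sketch (an illustration, not the exact library code; it assumes pandas' `to_parquet`/`read_parquet` and the `register_serializer` call shown at the link):

```python
import pandas as pd
import provenance as p

# Illustrative sketch of a Parquet-backed serializer registration,
# modeled on the linked pandas example.
def pd_df_parquet_dump(df, filename, **kwargs):
    return df.to_parquet(filename, **kwargs)

def pd_df_parquet_load(filename, **kwargs):
    return pd.read_parquet(filename, **kwargs)

p.serializers.register_serializer('pd_df_parquet', pd_df_parquet_dump,
                                  pd_df_parquet_load, classes=[pd.DataFrame])
```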
Thank you for getting back to me @bmabey!
> To avoid the default pickling logic you should register a serializer for the xarray type so you can use the provided IO functions for xarray.
👍
I tried registering xarray serializers, but I ran into `TypeError: save_global() missing 1 required positional argument: 'obj'`. Do you happen to know whether this has something to do with cloudpickle, or is it something else I may be missing? I am on cloudpickle v1.4.1.
```python
import xarray as xr
import provenance as p

ds = xr.tutorial.open_dataset('rasm')

# Register xarray serializers via netCDF
def xr_dataset_netcdf_dump(ds, filename, **kwargs):
    return ds.to_netcdf(filename, **kwargs)

def xr_dataset_netcdf_load(filename, **kwargs):
    return xr.open_dataset(filename, **kwargs)

p.serializers.register_serializer('xr_dataset', xr_dataset_netcdf_dump, xr_dataset_netcdf_load,
                                  classes=[xr.Dataset])

@p.provenance()
def anomaly(ds, groupby='time.year'):
    """Compute annual anomalies"""
    group = ds.groupby(groupby)
    clim = group.mean()
    return ds - clim
```

```python
%%time
anom = anomaly(ds.Tair)
```
Stacktrace:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<timed exec> in <module>

<boltons.funcutils.FunctionBuilder-4> in anomaly(ds, groupby)

~/devel/ncar/provenance/provenance/core.py in _provenance_wrapper(*args, **kargs)
    274             inputs['filehash'] = value_id
    275
--> 276         input_hashes, input_artifact_ids = hash_inputs(inputs, repos.get_check_mutations(), func_info)
    277
    278         id = create_id(input_hashes, **func_info['identifiers'])

~/devel/ncar/provenance/provenance/core.py in hash_inputs(inputs, check_mutations, func_info)
    106
    107     for k, v in inputs['kargs'].items():
--> 108         h, artifacts = hash(v, hasher=ah.artifact_hasher())
    109         kargs[k] = h
    110     for a in artifacts:

~/devel/ncar/provenance/provenance/hashing.py in hash(obj, hasher, hash_name, coerce_mmap)
    279         hasher = Hasher(hash_name=hash_name)
    280
--> 281     return hasher.hash(obj)
    282
    283

~/devel/ncar/provenance/provenance/artifact_hasher.py in hash(self, obj)
     41
     42     def hash(self, obj):
---> 43         return (h.NumpyHasher.hash(self, obj), self.artifacts.values())
     44
     45

~/devel/ncar/provenance/provenance/hashing.py in hash(self, obj)
     79     def hash(self, obj):
     80         try:
---> 81             self.dump(obj)
     82         except pickle.PicklingError as e:
     83             e.args += ('PicklingError while hashing %r: %r' % (obj, e),)

~/opt/miniconda3/envs/sandbox/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py in dump(self, obj)
    546     def dump(self, obj):
    547         try:
--> 548             return Pickler.dump(self, obj)
    549         except RuntimeError as e:
    550             if "recursion" in e.args[0]:

TypeError: save_global() missing 1 required positional argument: 'obj'
```
@andersy005 The serializers that you wrote look good. For context, pickle (well, cloudpickle) is still being used to compute the hash of the object that is used as a key in the database. Do you know if xarray has a more custom and efficient way of computing hashes for a dataset? For example, zarr `Array`s have a `digest` method that returns a digest/hash of the data. If xarray has something similar we should use that, since it would be faster and more reliable than the default hasher we have taken from joblib. (To override provenance's default hashing behavior you would register a `value_repr` function for the `Dataset` type. Let me know if there is a digest available for xarray and I can provide an example of how to do this if it would be helpful.)
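In the meantime, here is a rough sketch of the kind of `value_repr` function I have in mind. Treat it as hypothetical: the digest scheme and the object-dtype fallback are assumptions, not tested against provenance.

```python
import hashlib

def xr_dataset_value_repr(ds):
    """Hypothetical value_repr: digest the raw bytes of each variable
    instead of pickling the whole Dataset."""
    digest = hashlib.sha256()
    for name in sorted(ds.variables):
        var = ds.variables[name]
        digest.update(str(name).encode())
        if var.dtype == object:
            # Object-dtype arrays (e.g. cftime coordinates) have no stable
            # raw bytes, so fall back to a string representation.
            digest.update(str(var.values.tolist()).encode())
        else:
            digest.update(var.values.tobytes())
    return digest.hexdigest()
```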
As a sanity test... can you save your xarray Dataset using joblib? If so, it may be that with newer versions this would all just work.
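Something along these lines should be enough to check (loading the data into memory first, since lazily-backed datasets may not pickle cleanly):

```python
import joblib
import xarray as xr

# Load into memory so no open file handles get in the way of pickling.
ds = xr.tutorial.open_dataset('rasm').load()

# Round-trip the Dataset through joblib and compare.
joblib.dump(ds, 'rasm.joblib')
ds2 = joblib.load('rasm.joblib')
assert ds2.identical(ds)
```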