Drop object-dtyped variables and coords before saving
Description
Variables or coordinates with object dtype often cause problems when saving.
In PyMC I'm refactoring the stats to include a "warning" stat that is a custom object containing detailed information about sampler divergences.
This is, of course, not ideal, but a compromise until we can settle on a data structure that is serializable.
For this to work, the "warning" stat needs to be dropped automatically when doing .to_netcdf(), which has to be implemented at the ArviZ level.
Therefore, this PR adds helper functions to drop object-dtyped variables/coords before saving.
I believe this should give additional robustness against issues such as #2079.
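For illustration, here is a minimal sketch of the dropping logic using plain numpy arrays (`drop_object_vars` is a hypothetical name; the actual helpers in this PR operate on the xarray Datasets inside InferenceData, not plain dicts):

```python
import numpy as np

def drop_object_vars(variables):
    """Return a copy of a name->array mapping without object-dtyped entries.

    Hypothetical helper; the real implementation works on xarray Datasets.
    """
    return {
        name: arr
        for name, arr in variables.items()
        if not np.asarray(arr).dtype.hasobject
    }

data = {
    "draws": np.random.default_rng(0).normal(size=10),
    # stand-in for the custom sampler-warning object mentioned above
    "warning": np.array([object()], dtype=object),
}
clean = drop_object_vars(data)  # "warning" is gone, numeric data survives
```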
Checklist
- [x] Follows official PR format
- [x] New features are properly documented (with an example if appropriate)?
- [x] Includes new or updated tests to cover the new feature
- [x] Code style correct (follows pylint and black guidelines)
- [x] Changes are listed in changelog
:books: Documentation preview :books:: https://arviz--2134.org.readthedocs.build/en/2134/
I think this will also drop all text variables?
With the last push I modified the coords dropping to have one str coordinate, and it's not dropped.
The CI should go green now too.
...ready to review =)
cc @OriolAbril
Codecov Report
Merging #2134 (a21c3a9) into main (b2503b5) will increase coverage by 0.01%. The diff coverage is 100.00%.
:exclamation: Current head a21c3a9 differs from pull request most recent head f6ac0a5. Consider uploading reports for the commit f6ac0a5 to get more accurate results.

```
@@            Coverage Diff             @@
##             main    #2134      +/-   ##
==========================================
+ Coverage   90.73%   90.74%   +0.01%
==========================================
  Files         118      118
  Lines       12566    12582      +16
==========================================
+ Hits        11402    11418      +16
  Misses       1164     1164
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| arviz/data/inference_data.py | 83.75% <100.00%> (+0.80%) | :arrow_up: |
| arviz/data/io_pystan.py | 96.61% <0.00%> (-0.28%) | :arrow_down: |
Was the string coordinate saved as object?
e.g. if one saves one of the example datasets, will it drop coords?
I mean, that object dtype is totally valid for strings, and there should not be any problems saving them to netcdf/zarr; it is just the compression that has problems with them, and current main does not try to compress them.
I think we probably should have some other way to handle these things.
Ideally we should not push invalid data into InferenceData and then try to fix it automatically on our end. What if the object dtype used can be saved to netcdf by xarray, but we just remove it? Should the user now have to save InferenceData objects manually?
I don't think object is valid for strings. See https://github.com/arviz-devs/arviz_example_data/pull/8, where the "school" coordinate is now of <U16 dtype, which has .hasobject == False.
And arbitrary objects will always cause problems, even if the underlying netcdf/zlib libraries can sometimes convert some of them.
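The distinction can be checked directly with numpy's dtype.hasobject flag; a small illustration (the school names here are just example values):

```python
import numpy as np

schools = np.array(["Choate", "Deerfield"])
print(schools.dtype)            # <U9: fixed-width unicode, sized to the longest entry
print(schools.dtype.hasobject)  # False

as_obj = schools.astype(object)
print(as_obj.dtype.hasobject)   # True
```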
They are
https://github.com/pydata/xarray/issues/2059
Also to enable adding strings to a particular array of strings basically need to use object type.
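A minimal numpy illustration of why object dtype ends up being used for mutable string data: fixed-width unicode arrays silently truncate on assignment, while object arrays keep the full strings (the values here are hypothetical):

```python
import numpy as np

fixed = np.array(["a", "b"])   # dtype <U1
fixed[0] = "longer"            # silently truncated to fit one character
print(fixed[0])                # "l"

flexible = np.array(["a", "b"], dtype=object)
flexible[0] = "longer"         # object arrays hold arbitrary-length strings
print(flexible[0])             # "longer"
```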
> Also to enable adding strings to a particular array of strings basically need to use object type.
Unfortunately, it appears you're right.
When I updated the (non-)center_eight InferenceData in https://github.com/arviz-devs/arviz_example_data/pull/8 the "school" coord had an <U16 dtype before saving.
Loading, however, results in a dtype('O').
With that it seems almost impossible to tell what constitutes an "invalid" data type before attempting to save it.
Also, it appears that some tests are now failing on the updated example data, so this update should probably be done in a separate PR.
Part of the problem comes from numpy's string support. I'm not sure if there are going to be any fixes.
I'm not sure if there is a way for us to test serializing the first item for each variable (with or without compression) and then dropping invalid items.
I don't think there is any reasonable way forward, given that there is no round-trip guarantee for dtypes. In my opinion, scanning variables with try/except to check which can be saved and which can't is way too much work, and not the right place for such code; we save datasets in bulk, so going variable by variable and coordinate by coordinate is much more work and might require interfacing with netcdf directly.
I think we should close this PR and instead try to get logging into xarray, so users know which variables and/or coordinates are breaking the write process.
Yes, I have made my PyMC changes independent of this, and more informative errors at the xarray level sound great