Drop object-dtyped variables and coords before saving
Description
Variables or coordinates with object dtype often cause problems when saving.
In PyMC I'm refactoring the stats to include a "warning" stat that is a custom object containing detailed information about sampler divergences.
This is, of course, not ideal, but a compromise until we can settle on a data structure that is serializable.
For this to work, the "warning" stat needs to be dropped automatically when doing .to_netcdf(), which has to be implemented at the ArviZ level.
Therefore, this PR adds helper functions to drop object-dtyped variables/coords before saving.
I believe this should give additional robustness against issues such as #2079.
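For illustration, here is a minimal sketch of the dropping logic using plain numpy arrays (`drop_object_vars` is a hypothetical name; the actual helpers in this PR operate on the xarray Datasets inside InferenceData, not plain dicts):

```python
import numpy as np

def drop_object_vars(variables):
    """Return a copy of a name->array mapping without object-dtyped entries.

    Hypothetical helper; the real implementation works on xarray Datasets.
    """
    return {
        name: arr
        for name, arr in variables.items()
        if not np.asarray(arr).dtype.hasobject
    }

data = {
    "draws": np.random.default_rng(0).normal(size=10),
    # stand-in for the custom sampler-warning object mentioned above
    "warning": np.array([object()], dtype=object),
}
clean = drop_object_vars(data)  # "warning" is gone, numeric data survives
```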
Checklist
- [x] Follows official PR format
- [x] New features are properly documented (with an example if appropriate)?
- [x] Includes new or updated tests to cover the new feature
- [x] Code style correct (follows pylint and black guidelines)
- [x] Changes are listed in changelog
:books: Documentation preview :books:: https://arviz--2134.org.readthedocs.build/en/2134/
I think this will also drop all text variables?
With the last push I modified the coords dropping to have one str coordinate, and it's not dropped.
The CI should go green now too.
...ready to review =)
cc @OriolAbril
Codecov Report
Merging #2134 (a21c3a9) into main (b2503b5) will increase coverage by 0.01%. The diff coverage is 100.00%.
:exclamation: Current head a21c3a9 differs from pull request most recent head f6ac0a5. Consider uploading reports for the commit f6ac0a5 to get more accurate results.

```
@@            Coverage Diff             @@
##             main    #2134      +/-   ##
==========================================
+ Coverage   90.73%   90.74%   +0.01%
==========================================
  Files         118      118
  Lines       12566    12582      +16
==========================================
+ Hits        11402    11418      +16
  Misses       1164     1164
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| arviz/data/inference_data.py | 83.75% <100.00%> (+0.80%) | :arrow_up: |
| arviz/data/io_pystan.py | 96.61% <0.00%> (-0.28%) | :arrow_down: |
Was the string coordinate saved as object?
e.g. if one saves one of the example datasets, will it drop coords?
I mean, that object dtype is totally valid for strings, and there should not be any problems saving them to netcdf/zarr; it is just the compression that has problems with them, and current main does not try to compress them.
I think we probably should have some other way to handle these things.
Ideally we should not push invalid data into InferenceData and then try to fix it automatically on our end. What if the object dtype used can be saved to netcdf by xarray, but we just remove it? Should the user now have to save InferenceData objects manually?
I don't think object is valid for strings. See https://github.com/arviz-devs/arviz_example_data/pull/8, where the "school" coordinate is now of <U16 dtype, which has .hasobject == False.
And arbitrary objects will always cause problems, even if the underlying netcdf/zlib libraries can sometimes convert some of them.
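The distinction can be checked directly with numpy's dtype.hasobject flag; a small illustration (the school names here are just example values):

```python
import numpy as np

schools = np.array(["Choate", "Deerfield"])
print(schools.dtype)            # <U9: fixed-width unicode, sized to the longest entry
print(schools.dtype.hasobject)  # False

as_obj = schools.astype(object)
print(as_obj.dtype.hasobject)   # True
```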
They are
https://github.com/pydata/xarray/issues/2059
Also to enable adding strings to a particular array of strings basically need to use object type.
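A minimal numpy illustration of why object dtype ends up being used for mutable string data: fixed-width unicode arrays silently truncate on assignment, while object arrays keep the full strings (the values here are hypothetical):

```python
import numpy as np

fixed = np.array(["a", "b"])   # dtype <U1
fixed[0] = "longer"            # silently truncated to fit one character
print(fixed[0])                # "l"

flexible = np.array(["a", "b"], dtype=object)
flexible[0] = "longer"         # object arrays hold arbitrary-length strings
print(flexible[0])             # "longer"
```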
> Also to enable adding strings to a particular array of strings basically need to use object type.
Unfortunately, it appears you're right.
When I updated the (non-)center_eight InferenceData in https://github.com/arviz-devs/arviz_example_data/pull/8 the "school" coord had an <U16 dtype before saving.
Loading, however, results in a dtype('O').
With that it seems almost impossible to tell what constitutes an "invalid" data type before attempting to save it.
Also, it appears that some tests are now failing on the updated example data, so this update should probably be done in a separate PR.
Part of the problem comes from numpy's string support. I'm not sure if there are going to be any fixes.
I'm not sure if there is a way for us to test serializing the first item for each variable (with or without compression) and then dropping invalid items.
I don't think there is any reasonable way forward, given that there is no round-trip guarantee for dtypes. In my opinion, scanning variables with try/except to check which can be saved and which can't is way too much work, and not the right place for such code; we save datasets in bulk, so going variable by variable and coordinate by coordinate is much more work and might require interfacing with netcdf directly.
I think we should close this PR and instead try to get logging into xarray, so users know which variables and/or coordinates are breaking the write process.
Yes, I have made my PyMC changes independent of this, and more informative errors at the xarray level sound great