Optimsed lazy indexing - h5netcdf backend - Active storage reductions
Fixes #501
This PR has three parts, which ideally would have been in three separate PRs, but they needed to be co-developed, and splitting them out retrospectively turned out to be to hard. The three parts are:
-
Optimised lazy indexing
- For data in files, a layer in the Dask graph which indexes the array object (such as a
NetCDF4Arrayobject) no longer triggers a read from disk. This has to be done within another layer by converting the indexed array object to anumpyarray (typically with the newcf_asnyarrayfunction). When an already-indexed array object is indexed again, the two indices are effectively combined to create a new index. E.g. If the array has shape(10, 10), the first index is[:, :], the second index is[::2, [1, 5, 3, 7]], and the third index is[::-2, 1::2], then this is equivalent to the single index[8::-4, 5:8:2]. Only the data for the combined index needs to retrieved from disk, which is a huge performance improvement, as the amount of IO is typically vastly reduced. The same happens for array objects in memory (such asnumpyarrays), but for these the performance improvements are negligible.
- For data in files, a layer in the Dask graph which indexes the array object (such as a
-
h5netcdf
- Introducing the ability to use
h5netcdfas a backend. Essentially making cf-python compatible with https://github.com/NCAS-CMS/cfdm/pull/307.
- Introducing the ability to use
-
Active storage reductions
- Implement the option to do active storage reductions - i.e. to move selected collapse calculations from the client's CPU to a remote server that is "close" (in a network sense) to the data being collapsed.
To help alleviate the difficulties of reviewing a 3-in-1 PR, I have marked up every code change with which of the three parts it relates to, and provided a checklist for all of the changes.
Optimised lazy indexing
- [ ] cf/docstring/docstring.py:637: # REVIEW: getitem:
_docstring_substitution_definitions: 'asanyarray'
- [ ] cf/test/test_NetCDF4Array.py:138: # REVIEW: getitem:
test_NetCDF4Array: testNetCDF4Array.shape - [ ] cf/test/test_NetCDF4Array.py:148: # REVIEW: getitem:
test_NetCDF4Array: testNetCDF4Array.index
- [ ] cf/test/test_Data.py:1482: # REVIEW: getitem:
test_Data__getitem__: Chained subspaces reading from disk - [ ] cf/test/test_Data.py:3293: # REVIEW: getitem:
test_Data_rechunk: rechunking after a getitem - [ ] cf/test/test_Data.py:4523: # # REVIEW: getitem:
test_Data_active_storage: testData.active_storage - [ ] cf/test/test_Data.py:4571: # REVIEW: getitem:
test_Data_cull_graph: prevent new asanyarray layer - [ ] cf/test/test_Data.py:4830: # REVIEW: getitem:
test_Data_is_masked: testData.is_masked
- [ ] cf/test/test_Field.py:1469: # REVIEW: getitem:
test_Field_indices: make sure works when 'g.array' is not masked - [ ] cf/test/test_Field.py:1488: # REVIEW: getitem:
test_Field_indices: make sure works when 'g.array' is not masked
- [ ] cf/test/test_FullArray.py:1:# REVIEW: getitem:
test_FullArray: new test module
- [ ] cf/mixin/propertiesdata.py:4695: # REVIEW: getitem:
to_dask_array: new keyword 'asanyarray'
- [ ] cf/read_write/netcdf/netcdfwrite.py:752: # REVIEW: getitem:
_cfa_write_non_standard_terms: set 'asanyarray' - [ ] cf/read_write/netcdf/netcdfwrite.py:813: # REVIEW: getitem:
_cfa_unique: convert a to a usable array - [ ] cf/read_write/netcdf/netcdfwrite.py:969: # REVIEW: getitem:
_cfa_aggregation_instructions: set 'asanyarray'
- [ ] cf/data/dask_regrid.py:5:# REVIEW: getitem:
regrid.py: importcf_asanyarray - [ ] cf/data/dask_regrid.py:180: # REVIEW: getitem:
regrid: convert a to a usable array
- [ ] cf/data/data.py:47:# REVIEW: getitem:
data.py: import cf_asanyarray, cf_filled, cf_is_masked - [ ] cf/data/data.py:378: # REVIEW: getitem:
__init__: set 'asanyarray' - [ ] cf/data/data.py:482: # REVIEW: getitem:
__init__: Set whether or not to callnp.asanyarrayon chunks to convert them to numpy arrays. - [ ] cf/data/data.py:516: # REVIEW: getitem:
__init__: set 'asanyarray' - [ ] cf/data/data.py:639: # REVIEW: getitem:
cf_contains: set 'asanyarray' - [ ] cf/data/data.py:792: # REVIEW: getitem:
__len__: set 'asanyarray' - [ ] cf/data/data.py:901: # REVIEW: getitem:
__getitem__: set 'asanyarray' - [ ] cf/data/data.py:961: # REVIEW: getitem:
__getitem__: set 'asanyarray=True' because subspaced chunks might not be in memory - [ ] cf/data/data.py:1184: # REVIEW: getitem:
__asanyarray__: new property__asanyarray__ - [ ] cf/data/data.py:1417: # REVIEW: getitem:
_set_dask: new keyword 'asanyarray' - [ ] cf/data/data.py:1476: # REVIEW: getitem:
_set_dask: set 'asanyarray' - [ ] cf/data/data.py:2552: # REVIEW: getitem:
percentile: set 'asanyarray' - [ ] cf/data/data.py:3052: # REVIEW: getitem:
percentile: rectify comment - [ ] cf/data/data.py:3240: # REVIEW: getitem:
rechunk: set 'asanyarray' - [ ] cf/data/data.py:3299: # REVIEW: getitem:
_asdatetime: set 'asanyarray' - [ ] cf/data/data.py:3357: # REVIEW: getitem:
_asreftime: set 'asanyarray' - [ ] cf/data/data.py:3970: # REVIEW: getitem:
_regrid: set 'asanyarray' - [ ] cf/data/data.py:4215: # REVIEW: getitem:
concatenate: set 'asanyarray' - [ ] cf/data/data.py:4248: # REVIEW: getitem:
concatenate: define the asanyarray status - [ ] cf/data/data.py:4259: # REVIEW: getitem:
concatenate: set 'asanyarray' - [ ] cf/data/data.py:4905: # REVIEW: getitem:
chunks: set 'asanyarray' - [ ] cf/data/data.py:4963: # REVIEW: getitem:
Units: set 'asanyarray' - [ ] cf/data/data.py:5033: # REVIEW: getitem:
dtype: set 'asanyarray' - [ ] cf/data/data.py:5149: # REVIEW: getitem:
is_masked: set 'asanyarray' - [ ] cf/data/data.py:5195: # REVIEW: getitem:
nbytes: set 'asanyarray' - [ ] cf/data/data.py:5232: # REVIEW: getitem:
ndim: set 'asanyarray' - [ ] cf/data/data.py:5257: # REVIEW: getitem:
npartitions: set 'asanyarray' - [ ] cf/data/data.py:5281: # REVIEW: getitem:
numblocksset 'asanyarray' - [ ] cf/data/data.py:5314: # REVIEW: getitem:
shape: set 'asanyarray' - [ ] cf/data/data.py:5356: # REVIEW: getitem:
sizeset 'asanyarray' - [ ] cf/data/data.py:6555: # REVIEW: getitem:
convert_reference_time: set 'asanyarray' - [ ] cf/data/data.py:6636: # REVIEW: getitem:
get_deterministic_name: set 'asanyarray' - [ ] cf/data/data.py:6706: # REVIEW: getitem:
get_filenames: set 'asanyarray' - [ ] cf/data/data.py:6843: # REVIEW: getitem:
add_file_location: set 'asanyarray' - [ ] cf/data/data.py:8287: # REVIEW: getitem:
unique: set 'asanyarray' - [ ] cf/data/data.py:9029: # REVIEW: getitem:
hardmask: set 'asanyarray' - [ ] cf/data/data.py:9152: # REVIEW: getitem:
soften_mask: set 'asanyarray' - [ ] cf/data/data.py:9184: # REVIEW: getitem:
file_locations: set 'asanyarray' - [ ] cf/data/data.py:9245: # REVIEW: getitem:
filled: set 'asanyarray' - [ ] cf/data/data.py:9877: # REVIEW: getitem:
to_dask_array: new keyword 'asanyarray' - [ ] cf/data/data.py:10209: # REVIEW: getitem:
del_file_location: set 'asanyarray' - [ ] cf/data/data.py:11394: # REVIEW: getitem:
where: set 'asanyarray' - [ ] cf/data/data.py:11412: # REVIEW: getitem:
where: set 'asanyarray' - [ ] cf/data/data.py:11458: # REVIEW: getitem:
where: 'da.asanyarray' is no longer required - [ ] cf/data/data.py:11685: # REVIEW: getitem:
cull_graph: set 'asanyarray' - [ ] cf/data/data.py:11957: # REVIEW: getitem:
todict: new keywords 'apply_mask_hardness', 'asanyarray'
- [ ] cf/data/array/h5netcdfarray.py:53: # REVIEW: getitem:
_get_array: new method to convert subspace to numpy array.
- [ ] cf/data/array/mixin/init.py:8:# REVIEW: getitem:
__init__.py: importIndexMixin
- [ ] cf/data/array/mixin/indexmixin.py:10:# REVIEW: getitem:
IndexMixin: new mixin classIndexMixin
- [ ] cf/data/array/netcdf4array.py:50: # REVIEW: getitem:
_get_array: Ignore this for h5 review - [ ] cf/data/array/netcdf4array.py:51: # REVIEW: getitem:
_get_array: new method to convert subspace to numpy array
- [ ] cf/data/array/umarray.py:174: # REVIEW: getitem:
_get_array: new method to convert subspace to numpy array - [ ] cf/data/array/umarray.py:275: # REVIEW: getitem:
_set_FillValue: record _FillValue in attributes - [ ] cf/data/array/umarray.py:370: # REVIEW: getitem:
_set_units: record units in attributes - [ ] cf/data/array/umarray.py:373: # REVIEW: getitem:
_set_unpack: record unpack in attributes
- [ ] cf/data/array/fullarray.py:124: # REVIEW: getitem:
_get_array: new method to convert subspace to numpy array - [ ] cf/data/array/fullarray.py:159: # REVIEW: getitem:
array: New property to convert subspace to numpy array
- [ ] cf/data/dask_utils.py:130: # REVIEW: getitem:
cf_contains: convert a to a usable array - [ ] cf/data/dask_utils.py:166: # REVIEW: getitem:
cf_convolve1d: convert a to a usable array - [ ] cf/data/dask_utils.py:210: # REVIEW: getitem:
cf_harden_mask: convert a to a usable array - [ ] cf/data/dask_utils.py:282: # REVIEW: getitem:
cf_percentile: convert a to a usable array - [ ] cf/data/dask_utils.py:378: # REVIEW: getitem:
cf_soften_mask: convert a to a usable array - [ ] cf/data/dask_utils.py:436: # REVIEW: getitem:
cf_where: convert array, condition, x, y to usable arrays - [ ] cf/data/dask_utils.py:573: # REVIEW: getitem:
cf_rt2dt: convert a to a usable array - [ ] cf/data/dask_utils.py:629: # REVIEW: getitem:
cf_dt2rt: convert a to a usable array - [ ] cf/data/dask_utils.py:671: # REVIEW: getitem:
cf_units: convert a to a usable array - [ ] cf/data/dask_utils.py:696: # REVIEW: getitem:
cf_is_masked: convert a to a usable array - [ ] cf/data/dask_utils.py:730: # REVIEW: getitem:
cf_filled: convert a to a usable array - [ ] cf/data/dask_utils.py:735:# REVIEW: getitem:
cf_asanyarray: convert a to a usable array - [ ] cf/data/dask_utils.py:755: # REVIEW: getitem:
cf_asanyarray: convert a to a usable array
- [ ] cf/data/fragment/mixin/fragmentarraymixin.py:15: # REVIEW: getitem:
_get_array: new method to convert subspace to numpy array - [ ] cf/data/fragment/mixin/fragmentarraymixin.py:169: # REVIEW: getitem:
_size_1_axis: refactor to useoriginal_shape
- [ ] cf/data/fragment/netcdffragmentarray.py:10:# REVIEW: getitem:
NetCDFFragmentArray: new inheritance to allow for different netCDF backends - [ ] cf/data/fragment/netcdffragmentarray.py:179: # REVIEW: getitem:
_get_array: new method to convert subspace to numpy array
- [ ] cf/data/creation.py:63: # REVIEW: getitem:
to_dask: set 'asanyarray' - [ ] cf/data/creation.py:87: # REVIEW: getitem:
to_dask: The file lock is now on theArrayobject (in its_get_arraymethod), rather than being set on the Dask array itself.
- [ ] cf/data/utils.py:877: # REVIEW: getitem:
collapse: set 'asanyarray'
- [ ] cf/functions.py:2449:# REVIEW: getitem:
get_subspace: remove deprecated function
h5netcdf
- [ ] cf/test/test_active_storage.py:1:# REVIEW: h5:
test_active_storage.py: new test module
- [ ] cf/test/test_NetCDF4Array.py:15:# REVIEW: h5:
test_NetCDF4Array.py: renamed 'NetCDFArray' to 'NetCDF4Array' - [ ] cf/test/test_NetCDF4Array.py:44: # REVIEW: h5:
test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array' - [ ] cf/test/test_NetCDF4Array.py:65: # REVIEW: h5:
test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array' - [ ] cf/test/test_NetCDF4Array.py:76: # REVIEW: h5:
test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array' - [ ] cf/test/test_NetCDF4Array.py:112: # REVIEW: h5:
test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array' - [ ] cf/test/test_NetCDF4Array.py:120: # REVIEW: h5:
test_NetCDF4Array: renamed 'NetCDFArray' to 'NetCDF4Array'
- [ ] cf/test/test_read_write.py:82: # REVIEW: h5:
test_read_mask: rename numpy to np - [ ] cf/test/test_read_write.py:564: # REVIEW: h5:
test_write_datatype: rename numpy to np
- [ ] cf/read_write/um/umread.py:1976: # REVIEW: h5:
create_data: replace units/calendar API with attributes
- [ ] cf/read_write/write.py:16:# REVIEW: h5:
write: docstring improvements
- [ ] cf/read_write/netcdf/netcdfread.py:212: # REVIEW: h5:
_create_data: control caching - [ ] cf/read_write/netcdf/netcdfread.py:257: # REVIEW: h5:
_create_data: replace units/calendar API with attributes - [ ] cf/read_write/netcdf/netcdfread.py:266: # REVIEW: h5:
_create_data: don't cache data from CFA variables - [ ] cf/read_write/netcdf/netcdfread.py:627: # REVIEW: h5:
_create_cfanetcdfarray: docstring/comment improvements - [ ] cf/read_write/netcdf/netcdfread.py:702: # REVIEW: h5:
_create_cfanetcdfarray: choose the correct netCDF backend - [ ] cf/read_write/netcdf/netcdfread.py:756: # REVIEW: h5:
_create_cfanetcdfarray_term: fix unknown fragment shape - [ ] cf/read_write/netcdf/netcdfread.py:775: # REVIEW: h5:
_create_cfanetcdfarray_term: choose the correct netCDF backend - [ ] cf/read_write/netcdf/netcdfread.py:965: # REVIEW: h5:
_cfa_parse_aggregated_data: usecfdm.netcdf_indexerto get data - [ ] cf/read_write/netcdf/netcdfwrite.py:582: # REVIEW: h5: Deleted function _convert_to_builtin_type was a CFA-0.4 thing
- [ ] cf/read_write/read.py:61: # REVIEW: h5:
read: new 'unpack' parameter to control auto-unpacking (previously always True) - [ ] cf/read_write/read.py:67: # REVIEW: h5:
read: new 'netcdf_backend' parameter to control how to read files - [ ] cf/read_write/read.py:69: # REVIEW: h5:
read: new 'storage_options' parameter to control access to S3 - [ ] cf/read_write/read.py:71: # REVIEW: h5:
read: 'cache' parameter to control whether or not to get to cache selected data elements
- [ ] cf/data/array/init.py:4:# REVIEW: h5:
__init__.py: importCFAH5netcdfArray - [ ] cf/data/array/init.py:7:# REVIEW: h5:
__init__.py: importCFAH5netcdfArray - [ ] cf/data/array/init.py:12:# REVIEW: h5:
__init__.py: importH5netcdfArray - [ ] cf/data/array/init.py:16:# REVIEW: h5:
__init__.py: importNetCDF4Array
- [ ] cf/data/array/h5netcdfarray.py:1:# REVIEW: h5:
H5netcdfArray: New class to access netCDF withh5netcdf - [ ] cf/data/array/h5netcdfarray.py:52: # REVIEW: h5:
_get_array: Ignore this for h5 review
- [ ] cf/data/array/mixin/cfamixin.py:39: # REVIEW: h5:
__init__: replace units/calendar API with attributes - [ ] cf/data/array/mixin/cfamixin.py:228: # REVIEW: h5:
_parse_cfa: Refactoring of code that used to be in__init__ - [ ] cf/data/array/mixin/cfamixin.py:469: # REVIEW: h5:
get_storage_options: new method to get file access options
- [ ] cf/data/array/netcdfarray.py:1:# REVIEW: h5:
NetCDFArray: Replaced byNetCDF4Array
- [ ] cf/data/array/cfah5netcdfarray.py:1:# REVIEW: h5:
CFAH5netcdfArray: New class for accessing CFA withh5netcdf
- [ ] cf/data/array/locks.py:1:# REVIEW: h5:
locks.py: New module to provide file locks
- [ ] cf/data/array/netcdf4array.py:1:# REVIEW: h5:
NetCDF4Array: New class to access netCDF withnetCDF4, replacesNetCDFArray
- [ ] cf/data/array/umarray.py:15: # REVIEW: h5:
__init__: replace units/calendar API with attributes
- [ ] cf/data/array/cfanetcdf4array.py:1:# REVIEW: h5:
CFAnetCDF4Array: New class for accessing CFA withnetCDF4
- [ ] cf/data/array/fullarray.py:19: # REVIEW: h5:
__init__: replace units/calendar API with attributes
- [ ] cf/data/fragment/init.py:3:# REVIEW: h5:
__init__.py: importH5netcdfFragmentArray - [ ] cf/data/fragment/init.py:7:# REVIEW: h5:
__init__.py: importNetCDF4FragmentArray
- [ ] cf/data/fragment/fullfragmentarray.py:12: # REVIEW: h5:
__init__: replace units/calendar API with attributes
- [ ] cf/data/fragment/netcdffragmentarray.py:27: # REVIEW: h5:
__init__: replace units/calendar API with attributes
- [ ] cf/data/fragment/h5netcdffragmentarray.py:5:# REVIEW: h5:
H5netcdfFragmentArray: New class to access netCDF fragment withh5netcdf
- [ ] cf/data/fragment/umfragmentarray.py:12: # REVIEW: h5:
__init__: replace units/calendar API with attributes
- [ ] cf/data/fragment/umfragmentarray.py:13: # REVIEW: h5:
__init__: new keyword 'storage_options'
- [ ] cf/data/fragment/netcdf4fragmentarray.py:5:# REVIEW: h5:
NetCDF4FragmentArray: New class to access netCDF fragment withnetCDF4
- [ ] cf/functions.py:2900: # REVIEW: h5:
relpath: minor refactor - [ ] cf/functions.py:2939: # REVIEW: h5:
relpath: minor refactor - [ ] cf/functions.py:2979: # REVIEW: h5:
relpath: minor refactor - [ ] cf/functions.py:3350:# REVIEW: h5:
environment: new dependencies
- [ ] cf/cfimplementation.py:30:# REVIEW: h5:
cfimplementation.py: importCFAH5netcdfArray,CFANetCDF4Array,H5netcdfArray,NetCDF4Array - [ ] cf/cfimplementation.py:119: # REVIEW: h5:
initialise_CFANetCDF4Array: new method to initialiseCFANetCDF4Array - [ ] cf/cfimplementation.py:136: # REVIEW: h5:
initialise_CFAH5netcdfArray: new method to initialiseCFAH5netcdfArray
- [ ] cf/regrid/regridoperator.py:730: # REVIEW: h5: new name and location of file lock
- [ ] cf/regrid/regrid.py:2469: # REVIEW: h5: new name and location of file lock
Active storage
- [ ] cf/test/test_functions.py:50: # REVIEW: active:
test_configuration: testcf.active_storage, cf.active_storage_url, cf.active_storage_max_requests
- [ ] cf/data/collapse/collapse_active.py:1:# REVIEW: active:
collapse_active.py: new module for active storage functionality
- [ ] cf/data/collapse/dask_collapse.py:1:# REVIEW: active:
dask_collapse.py: all unlabelled changes in this module are general tidying, and should be reviewed at the same time as active storage - [ ] cf/data/collapse/dask_collapse.py:235:# REVIEW: active:
cf_mean_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:382:# REVIEW: active:
cf_max_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:537:# REVIEW: active:
cf_min_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:644:# REVIEW: active:
cf_range_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:758:# REVIEW: active:
cf_rms_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:843:# REVIEW: active:
cf_sample_size_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:957:# REVIEW: active:
cf_sum_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:1093:# REVIEW: active:
cf_sum_of_weights_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:1137:# REVIEW: active:
cf_sum_of_weights2_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:1183:# REVIEW: active:
cf_unique_chunk: active storage decoration - [ ] cf/data/collapse/dask_collapse.py:1248:# REVIEW: active:
cf_var_chunk: active storage decoration
- [ ] cf/data/array/mixin/init.py:1:# REVIEW: active:
__init__.py: importActiveStorageMixin
- [ ] cf/data/array/mixin/arraymixin.py:21: # REVIEW: active:
_meta: Moved to here fromFileArrayMixin
- [ ] cf/data/array/mixin/activestoragemixin.py:1:# REVIEW: active:
ActiveStorageMixin: new mixin classActiveStorageMixin
- [ ] cf/data/creation.py:86: # REVIEW: active:
to_dask: '_dask_meta' renamed to '_meta' for consistency with Dask
- [ ] cf/data/utils.py:866: # REVIEW: active:
collapse: pass the active storage status onto the collapse functions - [ ] cf/data/utils.py:995: # REVIEW: active:
parse_weights: minor refactor
- [ ] cf/functions.py:166:# REVIEW: active:
configuration: new keywords 'active_storage', 'active_storage_url', 'active_storage_max_requests' - [ ] cf/functions.py:428:# REVIEW: active:
_configuration: new keywords 'active_storage', 'active_storage_url', 'active_storage_max_requests' - [ ] cf/functions.py:587:# REVIEW: active:
regrid_logging: new examples - [ ] cf/functions.py:719:# REVIEW: active:
relaxed_identities: new examples - [ ] cf/functions.py:853:# REVIEW: active:
tempdir: new examples - [ ] cf/functions.py:1205:# REVIEW: active:
active_storage: new function - [ ] cf/functions.py:1275:# REVIEW: active:
active_storage_url: new function - [ ] cf/functions.py:1340:# REVIEW: active:
active_storage_max_requests: new function
- [ ] cf/field.py:5211: # REVIEW: active: active storage docstring
- [ ] cf/field.py:7092: # REVIEW: active:
collapse: include size 1 axes in collapse
- [ ] cf/constants.py:66: # REVIEW: active:
CONSTANTS: new constants 'active_storage', 'active_storage_url', 'active_storage_max_requests'
From the review:
ERROR: test_configuration (test_functions.functionTest.test_configuration)
Traceback (most recent call last): File "/home/slb93/git-repos/cf-python/cf/functions.py", line 1265, in _parse from activestorage import Active # noqa: F401 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ModuleNotFoundError: No module named 'activestorage'
Now nicely handled: 050913581485d628a4bce914bd1da1b0c22e638f
From the https://github.com/NCAS-CMS/cf-python/pull/805#pullrequestreview-2373806503:
I get eight test_Field_regrid_* tests failing on these assertions:
These test all pass for me ... I wonder what's going on, here?
From the https://github.com/NCAS-CMS/cf-python/pull/805#pullrequestreview-2373806503:
====================================================================== ERROR: test_GENERAL (test_general.generalTest.test_GENERAL)
File "/home/slb93/git-repos/cfdm/cfdm/read_write/netcdf/netcdfwrite.py", line 2476, in _createVariable g["nc"][ncvar] = g["netcdf"].createVariable(**kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "src/netCDF4/_netCDF4.pyx", line 2967, in netCDF4._netCDF4.Dataset.createVariable File "src/netCDF4/_netCDF4.pyx", line 4305, in netCDF4._netCDF4.Variable.init ValueError: chunksize cannot exceed dimension size
This, it turns out, is a bug in cfdm relating to the setting of HDF5 chunksizes for scalar arrays, which shall be fixed over there ...
EDIT: No, it's a bug in cf-python, afterall! Fixed in b72c17bde48bb5f9f7f5e0d60ebbca2a0f29195e. (Note that all will get refactored (for the better) when we daskify cfdm.)
From the https://github.com/NCAS-CMS/cf-python/pull/805#pullrequestreview-2373806503:
Consistency
This wasn't introduced in this PR, but when checking the inheritance structure I noticed that one of the collapse methods has an inconsistent argument ordering relative to the others namely means_abs having weights before axis unlike all others which accept weights (see below), so it would be good to make that consistent here.
I agree. That's one for 4.0.0 perhaps (as it would be a breaking API change). Do you agree?
From the https://github.com/NCAS-CMS/cf-python/pull/805#pullrequestreview-2373806503:
Names wise, I would prefer we rename the classes with H5netcdf in to H5NetCDF which follows the others in using 'NetCDF' capitalisation and follow Python Pascal Case style:
I would agree with you in general! Here, h5netcdf and netCDF4 are proper nouns, being the names of libraries, so I in this case I would say that they camel case to H5netcdf and NetCDF4 respectively.
From https://github.com/NCAS-CMS/cf-python/pull/805#pullrequestreview-2382057240:
More generally, there are a lot of methods missing from the new Array classes API docs:
Fixed: 885a67d44b658b071e971beea288a0935b782275
./check_docs_api_coverage now only flags up the 8 methods that we expect (relating to UGRID, and we are still deciding what do about!)
Hi Sadie - I think I've responding to everything - over to you :)
Thanks for addressing my feedback. I'll reply to your response comments here - though in general if I have put the 'thumbs up' emoji on a comment I am happy and won't comment further!
These test all pass for me ... I wonder what's going on, here?
This time round (after fetching your new commits) they all pass, so either something fixed them or we can treat them as flaky and ignore those failures. Flaky tests are of course a problem in themselves, but not to concern us for this PR. :)
EDIT: No, it's a bug in cf-python, afterall! Fixed in https://github.com/NCAS-CMS/cf-python/commit/b72c17bde48bb5f9f7f5e0d60ebbca2a0f29195e. (Note that all will get refactored (for the better) when we daskify cfdm.)
Great, always good to fix a bug! test_general now all passes for me locally.
I agree. That's one for 4.0.0 perhaps (as it would be a breaking API change). Do you agree?
Sure, shall I open an Issue for it and add it to #776?
./check_docs_api_coverage now only flags up the 8 methods that we expect (relating to UGRID, and we are still deciding what do about!)
Nice, thanks for sorting the rest. Do you know what the '6 missing .rst files' warning is about?:
$ ./check_docs_api_coverage ─╯
Method cf.Field.del_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.get_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.has_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Field.set_mesh_id not in docs/source/class/cf.Field.rst
Method cf.Domain.del_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.get_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.has_mesh_id not in docs/source/class/cf.Domain.rst
Method cf.Domain.set_mesh_id not in docs/source/class/cf.Domain.rst
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
File {rst_file} does not exist
Traceback (most recent call last):
File "/home/slb93/git-repos/cf-python/docs/source/check_docs_api_coverage.py", line 78, in <module>
raise ValueError(
ValueError: Found 8 undocumented methods and 6 missing .rst files
It may not concern additions from this PR, and can always be sorted at release-time, but in case it is to do with the PR it would be good to do a little investigation.
I would agree with you in general! Here, h5netcdf and netCDF4 are proper nouns, being the names of libraries, so I in this case I would say that they camel case to H5netcdf and NetCDF4 respectively.
Fair point. In which case, they can stay that way, no problem. I respect a grammatical justification!
As for the in-line comments, I have a small number left to respond to, and I will mark a small number you migjt have missed as well. Then we should be good to merge, though I'll approve the PR again to confirm :)
I have responded to everything now, back to you @davidhassell...
Hi Sadie - all done, now, I think!
I'm not sure about:
/home/slb93/git-repos/cf-python/cf/docstring/docstring.py:306: SyntaxWarning: invalid escape sequence '\_'
"{{split_every: `int` or `dict`, optional}}": """split_every: `int` or `dict`, optional
as I don't see it when I run pre-commit, I'm using Python 3.12.2 - what about you?
Excellent, thanks @davidhassell.
I don't see it when I run pre-commit, I'm using Python 3.12.2 - what about you?
I am using 3.12.0. No worries, we can investigate separately since it points to a line you didn't touch so is very likely unrelated to this PR. Everything is ready I think so please merge!
A great many thanks to @sadielbartholomew for her customary careful and insightful review of this the huge PR. Much appreciated. Merging now.