Saving Parcels output directly in zarr format
With this PR, we implement direct saving of Parcels trajectory output data in zarr format. With the merging of #1165 in v2.3.1, it was already possible to store the output by converting it to a zarr file after the run; after this PR, no conversion is needed at all.
Advantages are:
- No need to convert the pickled dumps to a file at the end of a simulation (with all the associated memory issues, like #1091)
- No need for an explicit `ParticleFile.export()` call
- Directly analyse the Parcels output during a simulation
- Much simpler (hence easier-to-maintain) code
The only major disadvantage so far is that zarr folders are not as 'transportable' as netCDF files. But it is fairly trivial to convert zarr data to a netCDF file using `xarray.open_zarr()` and `xarray.Dataset.to_netcdf()`.
Developments to do before merging:
- [ ] Test in real-world examples how much faster/slower the new writing is
- [x] Implement support for control of `chunks` sizes in the `ParticleFile` object
- [x] Further simplify code of `ParticleSet.write()` (especially the `collection.toDictionary()` method)
- [x] Further clean up now-redundant methods and code
- [ ] Double check of tutorials (especially the three 'main ones' on the top of https://oceanparcels.org/#tutorials) to be in line with new output format
OK, I just did a first quick analysis of the timing of this new zarr-output. And the results look great!
I used the code below, which creates and deletes large numbers of particles (more than 5 million in total) on the fly, so it is quite a heavy test of file writing:
```python
from parcels import FieldSet, JITParticle, ParcelsRandom, ParticleSet
import time

npart = int(1e5)
fieldset = FieldSet.from_data({'U': 0., 'V': 0.}, {'lat': 0., 'lon': 0.})


def RandomDeath(fieldset, particle, time):
    # Delete each particle with 10% probability at every timestep
    if ParcelsRandom.random() > 0.9:
        particle.delete()


tic = time.perf_counter()
pset = ParticleSet(fieldset, pclass=JITParticle, lon=[0]*npart, lat=[0]*npart, repeatdt=2)
pset.execute(RandomDeath, runtime=100, dt=1,
             output_file=pset.ParticleFile('parcels_largeoutputfiles', outputdt=10))
toc = time.perf_counter()
print(f'Simulation took {toc - tic:.1f} seconds')
```
Results are in the table below
| | Old (netCDF) version | New (zarr) version |
|---|---|---|
| Timing | 152 seconds * | 59 seconds |
| Filesize | 1.5 GB | 76 MB |
(* of which ~100 seconds in `execute` and ~50 seconds in `parcels_convert_npydir_to_netcdf`)
So it seems from this small, simple example that the new zarr output can in fact be ~60% faster than the old netCDF output, and (as we already knew) that the output is ~95% smaller.
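For reference, the quoted percentages follow directly from the table above:

```python
# Verify the speed-up and size reduction quoted from the benchmark table
old_t, new_t = 152, 59          # seconds
old_s, new_s = 1.5e9, 76e6      # bytes (approximate)

print(f"{1 - new_t/old_t:.0%} faster")   # → 61% faster
print(f"{1 - new_s/old_s:.0%} smaller")  # → 95% smaller
```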
Comment: while working on #1247, I get errors that are related to the new `ParticleSet` writing.
```
____________________________________________________________________________ test_plotting[hist2d-aos] ____________________________________________________________________________
pset_mode = 'aos', mode = 'hist2d', tmpdir = local('/tmp/pytest-of-christian/pytest-8/test_plotting_hist2d_aos_0')

    @pytest.mark.parametrize('pset_mode', pset_modes)
    @pytest.mark.parametrize('mode', ['2d', '3d', 'movie2d', 'hist2d'])
    def test_plotting(pset_mode, mode, tmpdir):
        if mode == '3d' and sys.platform in ['linux', 'linux2']:
            logger.info('Skipping 3d test in linux Travis, since it fails to find display to connect')
            return
>       fp = create_outputfiles(tmpdir, pset_mode)

tests/test_scripts.py:49:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_scripts.py:38: in create_outputfiles
    output_file.close()
/var/scratch/test_env/lib/python3.8/site-packages/parcels-2.3.3.dev8-py3.8.egg/parcels/particlefile/baseparticlefile.py:217: in close
    self.export()
/var/scratch/test_env/lib/python3.8/site-packages/parcels-2.3.3.dev8-py3.8.egg/parcels/particlefile/particlefileaos.py:179: in export
    ds.to_netcdf(self.fname)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/core/dataset.py:1900: in to_netcdf
    return to_netcdf(
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/api.py:1077: in to_netcdf
    dump_to_store(
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/api.py:1124: in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:262: in store
    variables, attributes = self.encode(variables, attributes)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:352: in encode
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:352: in <dictcomp>
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/scipy_.py:203: in encode_variable
    variable = encode_nc3_variable(variable)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/netcdf3.py:95: in encode_nc3_variable
    data = coerce_nc3_dtype(var.data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = array([[ 0, 0, 0,
         0, ...807,
        9223372036854775807, 9223372036854775807, 9223372036854775807,
        9223372036854775807]], dtype=int64)

    def coerce_nc3_dtype(arr):
        """Coerce an array to a data type that can be stored in a netCDF-3 file

        This function performs the dtype conversions as specified by the
        ``_nc3_dtype_coercions`` mapping:
            int64 -> int32
            uint64 -> int32
            uint32 -> int32
            uint16 -> int16
            uint8 -> int8
            bool -> int8

        Data is checked for equality, or equivalence (non-NaN values) using the
        ``(cast_array == original_array).all()``.
        """
        dtype = str(arr.dtype)
        if dtype in _nc3_dtype_coercions:
            new_dtype = _nc3_dtype_coercions[dtype]
            # TODO: raise a warning whenever casting the data-type instead?
            cast_arr = arr.astype(new_dtype)
            if not (cast_arr == arr).all():
>               raise ValueError(
                    f"could not safely cast array from dtype {dtype} to {new_dtype}"
                )
E               ValueError: could not safely cast array from dtype int64 to int32

/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/netcdf3.py:66: ValueError
------------------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------------------
INFO: Compiled ObjectJITParticleAdvectionRK4 ==> /tmp/parcels-1000/lib91743b572285f1361b6ef8c2cdad600c_0.so
-------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------
INFO parcels.tools.loggers:basekernel.py:255 Compiled ObjectJITParticleAdvectionRK4 ==> /tmp/parcels-1000/lib91743b572285f1361b6ef8c2cdad600c_0.so
```
In short: when writing the particle set, there is an incorrect dtype conversion of some of its attributes when writing the file as netCDF. This is currently not picked up by any tests, because the tests set zarr as a dependency, auto-detect zarr, and only exercise the zarr path. I myself don't have zarr installed, which is why I spotted the error.
I can't fix it myself, due to time constraints and because I haven't looked at the zarr implementation. But it may be good to have a look at this (@erikvansebille).
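The failing cast can be reproduced with plain NumPy. The array in the traceback is full of `9223372036854775807`, which is `np.iinfo(np.int64).max` and is apparently used here as a fill value; netCDF-3 has no 64-bit integer type, so xarray coerces `int64` to `int32` and checks that the values survive, which this fill value cannot:

```python
import numpy as np

# int64 array with the max-int64 fill value seen in the traceback
arr = np.array([0, np.iinfo(np.int64).max], dtype=np.int64)

# xarray's netCDF-3 backend casts int64 -> int32 and verifies equality;
# max-int64 does not fit in int32, so the round-trip check fails
cast = arr.astype(np.int32)
print((cast == arr).all())  # False -> triggers the "could not safely cast" ValueError
```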