Saving Parcels output directly in zarr format
With this PR, we implement direct saving of Parcels trajectory output data in zarr format. With the merging of #1165 in v2.3.1, it was already possible to store the output by converting it to a zarr file after the run; after this PR, no conversion is needed at all.
Advantages are:
- No need to convert the pickled dumps to a file at the end of a simulation (with all the associated memory issues, like #1091)
- No need for an explicit `ParticleFile.export()` call
- Directly analyse the Parcels output during a simulation
- Much simpler (hence easier-to-maintain) code
The only major disadvantage so far is that zarr folders are not as 'transportable' as netCDF files. But it is fairly trivial to convert zarr data to a netCDF file using `xarray.open_zarr()` and `xarray.Dataset.to_netcdf()`.
Developments to do before merging:
- [ ] Test in real-world examples how much faster/slower the new writing is
- [x] Implement support for control of `chunks` sizes in the `ParticleFile` object
- [x] Further simplify code of `ParticleSet.write()` (especially the `collection.toDictionary()` method)
- [x] Further clean up now-redundant methods and code
- [ ] Double check of tutorials (especially the three 'main ones' on the top of https://oceanparcels.org/#tutorials) to be in line with new output format
OK, I just did a first quick analysis of the timing of this new zarr-output. And the results look great!
I used the code below, which creates and deletes large numbers of particles (more than 5 million in total) on the fly, so it is quite a heavy test of file writing:
```python
from parcels import FieldSet, JITParticle, ParcelsRandom, ParticleSet
import time

npart = int(1e5)
fieldset = FieldSet.from_data({'U': 0., 'V': 0.}, {'lat': 0., 'lon': 0.})


def RandomDeath(fieldset, particle, time):
    # Delete each particle with 10% probability at every timestep
    if ParcelsRandom.random() > 0.9:
        particle.delete()


tic = time.perf_counter()
pset = ParticleSet(fieldset, pclass=JITParticle, lon=[0]*npart, lat=[0]*npart, repeatdt=2)
pset.execute(RandomDeath, runtime=100, dt=1,
             output_file=pset.ParticleFile('parcels_largeoutputfiles', outputdt=10))
toc = time.perf_counter()
print(f'Simulation took {toc - tic:.1f} seconds')
```
Results are in the table below
| | Old (netCDF) version | New (zarr) version |
|---|---|---|
| Timing | 152 seconds * | 59 seconds |
| Filesize | 1.5 GB | 76 MB |
(* of which ~100 seconds in `execute` and ~50 seconds in `parcels_convert_npydir_to_netcdf`)
So it seems from this small, simple example that the new zarr output can in fact be ~60% faster than the old netCDF output, and (as we already knew) that the output is ~95% smaller.
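For reference, the quoted percentages follow directly from the table above:

```python
# Verify the speed-up and size reduction quoted from the benchmark table
old_t, new_t = 152, 59          # seconds
old_s, new_s = 1.5e9, 76e6      # bytes (approximate)

print(f"{1 - new_t/old_t:.0%} faster")   # → 61% faster
print(f"{1 - new_s/old_s:.0%} smaller")  # → 95% smaller
```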
Comment: while working on #1247, I get errors that are related to the new `ParticleSet` writing.
```
____________________________________________________________________________ test_plotting[hist2d-aos] ____________________________________________________________________________
pset_mode = 'aos', mode = 'hist2d', tmpdir = local('/tmp/pytest-of-christian/pytest-8/test_plotting_hist2d_aos_0')

    @pytest.mark.parametrize('pset_mode', pset_modes)
    @pytest.mark.parametrize('mode', ['2d', '3d', 'movie2d', 'hist2d'])
    def test_plotting(pset_mode, mode, tmpdir):
        if mode == '3d' and sys.platform in ['linux', 'linux2']:
            logger.info('Skipping 3d test in linux Travis, since it fails to find display to connect')
            return
>       fp = create_outputfiles(tmpdir, pset_mode)

tests/test_scripts.py:49:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_scripts.py:38: in create_outputfiles
    output_file.close()
/var/scratch/test_env/lib/python3.8/site-packages/parcels-2.3.3.dev8-py3.8.egg/parcels/particlefile/baseparticlefile.py:217: in close
    self.export()
/var/scratch/test_env/lib/python3.8/site-packages/parcels-2.3.3.dev8-py3.8.egg/parcels/particlefile/particlefileaos.py:179: in export
    ds.to_netcdf(self.fname)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/core/dataset.py:1900: in to_netcdf
    return to_netcdf(
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/api.py:1077: in to_netcdf
    dump_to_store(
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/api.py:1124: in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:262: in store
    variables, attributes = self.encode(variables, attributes)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:352: in encode
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:352: in <dictcomp>
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/scipy_.py:203: in encode_variable
    variable = encode_nc3_variable(variable)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/netcdf3.py:95: in encode_nc3_variable
    data = coerce_nc3_dtype(var.data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = array([[ 0, 0, 0,
         0, ...807,
        9223372036854775807, 9223372036854775807, 9223372036854775807,
        9223372036854775807]], dtype=int64)

    def coerce_nc3_dtype(arr):
        """Coerce an array to a data type that can be stored in a netCDF-3 file

        This function performs the dtype conversions as specified by the
        ``_nc3_dtype_coercions`` mapping:
            int64 -> int32
            uint64 -> int32
            uint32 -> int32
            uint16 -> int16
            uint8 -> int8
            bool -> int8

        Data is checked for equality, or equivalence (non-NaN values) using the
        ``(cast_array == original_array).all()``.
        """
        dtype = str(arr.dtype)
        if dtype in _nc3_dtype_coercions:
            new_dtype = _nc3_dtype_coercions[dtype]
            # TODO: raise a warning whenever casting the data-type instead?
            cast_arr = arr.astype(new_dtype)
            if not (cast_arr == arr).all():
>               raise ValueError(
                    f"could not safely cast array from dtype {dtype} to {new_dtype}"
                )
E               ValueError: could not safely cast array from dtype int64 to int32

/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/netcdf3.py:66: ValueError
------------------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------------------
INFO: Compiled ObjectJITParticleAdvectionRK4 ==> /tmp/parcels-1000/lib91743b572285f1361b6ef8c2cdad600c_0.so
-------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------
INFO parcels.tools.loggers:basekernel.py:255 Compiled ObjectJITParticleAdvectionRK4 ==> /tmp/parcels-1000/lib91743b572285f1361b6ef8c2cdad600c_0.so
```
In short: when writing the particle set, there is an incorrect dtype conversion of some of its attributes when writing the file as netCDF. This is currently not picked up by any tests, because the tests set zarr as a dependency, auto-detect zarr, and only exercise the zarr path. I myself don't have zarr installed, which is why I spotted the error.
I can't fix it myself, due to time constraints and because I haven't looked at the zarr implementation. But it may be good to have a look at this (@erikvansebille).
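The failing cast can be reproduced with plain NumPy. The array in the traceback is full of `9223372036854775807`, which is `np.iinfo(np.int64).max` and is apparently used here as a fill value; netCDF-3 has no 64-bit integer type, so xarray coerces `int64` to `int32` and checks that the values survive, which this fill value cannot:

```python
import numpy as np

# int64 array with the max-int64 fill value seen in the traceback
arr = np.array([0, np.iinfo(np.int64).max], dtype=np.int64)

# xarray's netCDF-3 backend casts int64 -> int32 and verifies equality;
# max-int64 does not fit in int32, so the round-trip check fails
cast = arr.astype(np.int32)
print((cast == arr).all())  # False -> triggers the "could not safely cast" ValueError
```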