
2stage_reader dimension error

Open FrankFrank9 opened this issue 1 year ago • 12 comments

Hello,

When I try to read the data with the 2stage_reader and MPI, I get the following error:

ValueError: could not broadcast input array from shape (79,319140) into shape (80,319140)
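(For context, this is a plain NumPy broadcast error: an off-by-one between the preallocated destination slice and the array that was actually read. A minimal sketch with shrunken shapes, illustrative only:)

```python
import numpy as np

# Destination buffer preallocated for 80 snapshots of 10 points each
# (shapes shrunk from the real case for illustration).
dest = np.empty((80, 10))

# The reader delivered one snapshot fewer than expected.
src = np.random.rand(79, 10)

try:
    dest[0:80, :] = src
except ValueError as e:
    print(e)  # a "could not broadcast input array" message with the mismatched shapes
```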

Any idea?

My data has shape (241, 319140, 5)

Best

FrankFrank9 avatar Oct 26 '24 07:10 FrankFrank9

Hello. Can you please provide more details (at least the line number where it fails) and, ideally, a reproducer?

mrogowski avatar Oct 26 '24 08:10 mrogowski

Hi @mrogowski , thanks for reaching out.

Here is a small reproducer:

import numpy as np
from mpi4py import MPI
from pyspod.spod.standard import Standard as spod_standard
from pyspod.spod.streaming import Streaming as spod_streaming
from scipy.io import netcdf_file

comm, rank, size = MPI.COMM_WORLD, MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()

rankfname = f"spod_rank{rank}.nc"
vars = ["rho", "u", "v", "w", "p"]
with netcdf_file(rankfname, "w") as f:
    # Create dimensions once
    f.createDimension("time", 80)
    f.createDimension("nspts", 319140)

    # Loop to create variables and store data
    for idx, var in enumerate(vars):
        _var = f.createVariable(var, "f4", ("time", "nspts"))  # Use 'f4' for clarity
        # Generate random data
        _var[:] = np.random.rand(80, 319140)


allfiles = comm.allgather(rankfname)

params = {
    "time_step": 0.1,
    "n_space_dims": 3,
    "n_variables": 5,
    "n_dft": 128,
    "overlap": 75,
    "n_modes_save": 5,
    "savedir": ".",
    "mean_type": "longtime",
    "normalize_weights": False,
    "normalize_data": False,
    "conf_level": 0.95,
    "reuse_blocks": False,
    "savefft": False,
    "dtype": "single",
    "fullspectrum": False,
    "savefreq_disk2": True,
    "savefreq_disk": False,
}

standard = spod_standard(params=params, comm=comm)
spod = standard.fit(data_list=allfiles, variables=vars)

I get the error:

 File ".../lib/python3.12/site-packages/pyspod/utils/reader.py", line 352, in get_data_for_time
    input_data[cum_read:cum_read+read_cnt,:,idx] = vals.reshape(vals.shape[0],-1)#.copy()
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (79,319140) into shape (80,319140)

Hope this helps

FrankFrank9 avatar Oct 26 '24 10:10 FrankFrank9

Can you try adding something like

    time = f.createVariable("time", "f4", ("time",))
    time[:] = np.arange(ntime)  # ntime = number of snapshots (80 in the reproducer)

to the function creating the NetCDF files? It seems that xarray does not behave the way I expect (.sel being inclusive of both ends of the range) when the time coordinate has no values set. Does it work for you then?
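As a minimal illustration of that inclusivity (using pandas, whose label-based .loc slicing behaves the same way as xarray's .sel with a slice):

```python
import numpy as np
import pandas as pd

# 80 snapshots indexed by their time values 0..79.
s = pd.Series(np.arange(80), index=np.arange(80))

# Positional slicing excludes the stop index: 79 elements.
print(len(s.iloc[0:79]))  # 79

# Label-based slicing includes both endpoints: 80 elements.
print(len(s.loc[0:79]))   # 80
```

When the coordinate has no values, a label-based selection can come up short or empty instead, which matches the shapes seen in the tracebacks in this thread.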

mrogowski avatar Oct 26 '24 14:10 mrogowski

I'm still getting this:

  File ".../python3.12/site-packages/pyspod/utils/reader.py", line 352, in get_data_for_time
    input_data[cum_read:cum_read+read_cnt,:,idx] = vals.reshape(vals.shape[0],-1)#.copy()
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (80,319140) into shape (0,319140)

FrankFrank9 avatar Oct 26 '24 15:10 FrankFrank9

Did you modify it as follows? It works for me:

import numpy as np
from mpi4py import MPI
from pyspod.spod.standard import Standard as spod_standard
from pyspod.spod.streaming import Streaming as spod_streaming
from scipy.io import netcdf_file

comm, rank, size = MPI.COMM_WORLD, MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()

rankfname = f"spod_rank{rank}.nc"
vars = ["rho", "u", "v", "w", "p"]
with netcdf_file(rankfname, "w") as f:
    # Create dimensions once
    f.createDimension("time", 80)
    f.createDimension("nspts", 319140)

    time = f.createVariable("time", "f4", ("time",))
    time[:] = np.arange(80)

    # Loop to create variables and store data
    for idx, var in enumerate(vars):
        _var = f.createVariable(var, "f4", ("time", "nspts"))  # Use 'f4' for clarity
        # Generate random data
        _var[:] = np.random.rand(80, 319140)


allfiles = comm.allgather(rankfname)

params = {
    "time_step": 0.1,
    "n_space_dims": 3,
    "n_variables": 5,
    "n_dft": 128,
    "overlap": 75,
    "n_modes_save": 5,
    "savedir": ".",
    "mean_type": "longtime",
    "normalize_weights": False,
    "normalize_data": False,
    "conf_level": 0.95,
    "reuse_blocks": False,
    "savefft": False,
    "dtype": "single",
    "fullspectrum": False,
    "savefreq_disk2": True,
    "savefreq_disk": False,
}

standard = spod_standard(params=params, comm=comm)
spod = standard.fit(data_list=allfiles, variables=vars)

mrogowski avatar Oct 26 '24 15:10 mrogowski

How did you install PySPOD?

FrankFrank9 avatar Oct 26 '24 15:10 FrankFrank9

pip install git+https://github.com/MathEXLab/PySPOD

mrogowski avatar Oct 26 '24 15:10 mrogowski

OK, now it seems to be working.

One last question: with this, can I fit data that spans more than the memory available on a single node?

FrankFrank9 avatar Oct 26 '24 16:10 FrankFrank9

The data will be distributed, so it will be split over all the nodes involved in the job.

mrogowski avatar Oct 26 '24 18:10 mrogowski

> The data will be distributed, so it will be split over all the nodes involved in the job.

Yes, but say the total data is 2 TB and each node has 512 GB of RAM. Can I do that over 5 nodes then?

FrankFrank9 avatar Oct 26 '24 18:10 FrankFrank9

Yes, it is the total memory available that counts (so 5 × 512 GB). You will need to leave some memory free for PySPOD's processing and computations, so I am not sure whether 5, 6, or 7 nodes would be enough here. Generally, you should be able to fit any dataset in memory if you have enough nodes; we've run PySPOD on tens of TB across thousands of nodes.
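A back-of-the-envelope sketch for those numbers (illustrative arithmetic only; the actual overhead depends on the analysis parameters and intermediate buffers):

```python
# Rough per-node memory for a 2 TB dataset split evenly across nodes
# (illustrative only; PySPOD needs extra room for FFT blocks and
# intermediate buffers on top of the raw data).
total_data_gb = 2.0 * 1024  # 2 TB
node_ram_gb = 512

for nodes in (5, 6, 7):
    per_node = total_data_gb / nodes
    print(f"{nodes} nodes: {per_node:.1f} GB of data per node, "
          f"{node_ram_gb - per_node:.1f} GB left for computation")
```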

mrogowski avatar Oct 26 '24 19:10 mrogowski

OK, I'm asking because I was memory-limited when using the 1-stage reader.

FrankFrank9 avatar Oct 26 '24 19:10 FrankFrank9