2stage_reader dimension error
Hello,
When I try to read the data with the 2stage_reader and MPI, I get the following error:
ValueError: could not broadcast input array from shape (79,319140) into shape (80,319140)
Any idea? My data has shape (241, 319140, 5).
Best
Hello. Can you please provide more details (at least the line number where it fails) and, ideally, a reproducer?
Hi @mrogowski , thanks for reaching out.
Here is a small reproducer:
import numpy as np
from mpi4py import MPI
from pyspod.spod.standard import Standard as spod_standard
from pyspod.spod.streaming import Streaming as spod_streaming
from scipy.io import netcdf_file

comm, rank, size = MPI.COMM_WORLD, MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()
rankfname = f"spod_rank{rank}.nc"
vars = ["rho", "u", "v", "w", "p"]

with netcdf_file(rankfname, "w") as f:
    # Create dimensions once
    f.createDimension("time", 80)
    f.createDimension("nspts", 319140)
    # Loop to create variables and store data
    for idx, var in enumerate(vars):
        _var = f.createVariable(var, "f4", ("time", "nspts"))  # Use 'f4' for clarity
        # Generate random data
        _var[:] = np.random.rand(80, 319140)

allfiles = comm.allgather(rankfname)

params = {
    "time_step": 0.1,
    "n_space_dims": 3,
    "n_variables": 5,
    "n_dft": 128,
    "overlap": 75,
    "n_modes_save": 5,
    "savedir": ".",
    "mean_type": "longtime",
    "normalize_weights": False,
    "normalize_data": False,
    "conf_level": 0.95,
    "reuse_blocks": False,
    "savefft": False,
    "dtype": "single",
    "fullspectrum": False,
    "savefreq_disk2": True,
    "savefreq_disk": False,
}

standard = spod_standard(params=params, comm=comm)
spod = standard.fit(data_list=allfiles, variables=vars)
I get the error:
File ".../lib/python3.12/site-packages/pyspod/utils/reader.py", line 352, in get_data_for_time
input_data[cum_read:cum_read+read_cnt,:,idx] = vals.reshape(vals.shape[0],-1)#.copy()
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (79,319140) into shape (80,319140)
Hope this helps
Can you try adding something like

time = f.createVariable("time", "f4", ("time",))
time[:] = np.arange(ntime)

to the function creating the NetCDF files? It seems that xarray does not behave the way I expect (.sel being inclusive of both ends of the range) when the time coordinate has no values set. Does it work for you then?
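For reference, a minimal standalone sketch of the .sel behavior in question (toy shapes, plain xarray, not PySPOD internals):

import numpy as np
import xarray as xr

data = np.zeros((80, 4))

# Without a "time" coordinate, a slice passed to .sel is treated
# positionally, so the stop value is excluded: 79 rows.
ds = xr.Dataset({"u": (("time", "nspts"), data)})
print(ds["u"].sel(time=slice(0, 79)).shape)  # (79, 4)

# With a "time" coordinate, .sel is label-based and inclusive of
# both ends of the slice: 80 rows.
ds = ds.assign_coords(time=np.arange(80))
print(ds["u"].sel(time=slice(0, 79)).shape)  # (80, 4)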
Still getting this:

  File ".../python3.12/site-packages/pyspod/utils/reader.py", line 352, in get_data_for_time
    input_data[cum_read:cum_read+read_cnt,:,idx] = vals.reshape(vals.shape[0],-1)#.copy()
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not broadcast input array from shape (80,319140) into shape (0,319140)
Did you modify it as follows? It works for me:
import numpy as np
from mpi4py import MPI
from pyspod.spod.standard import Standard as spod_standard
from pyspod.spod.streaming import Streaming as spod_streaming
from scipy.io import netcdf_file

comm, rank, size = MPI.COMM_WORLD, MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.Get_size()
rankfname = f"spod_rank{rank}.nc"
vars = ["rho", "u", "v", "w", "p"]

with netcdf_file(rankfname, "w") as f:
    # Create dimensions once
    f.createDimension("time", 80)
    f.createDimension("nspts", 319140)
    # Set time coordinate values so label-based selection works
    time = f.createVariable("time", "f4", ("time",))
    time[:] = np.arange(80)
    # Loop to create variables and store data
    for idx, var in enumerate(vars):
        _var = f.createVariable(var, "f4", ("time", "nspts"))  # Use 'f4' for clarity
        # Generate random data
        _var[:] = np.random.rand(80, 319140)

allfiles = comm.allgather(rankfname)

params = {
    "time_step": 0.1,
    "n_space_dims": 3,
    "n_variables": 5,
    "n_dft": 128,
    "overlap": 75,
    "n_modes_save": 5,
    "savedir": ".",
    "mean_type": "longtime",
    "normalize_weights": False,
    "normalize_data": False,
    "conf_level": 0.95,
    "reuse_blocks": False,
    "savefft": False,
    "dtype": "single",
    "fullspectrum": False,
    "savefreq_disk2": True,
    "savefreq_disk": False,
}

standard = spod_standard(params=params, comm=comm)
spod = standard.fit(data_list=allfiles, variables=vars)
How did you install PySPOD?
pip install git+https://github.com/MathEXLab/PySPOD
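If the error persists after reinstalling, it may be worth confirming which version pip actually picked up, using a standard pip command (nothing PySPOD-specific):

pip show pyspod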
OK, now it seems to be working.
One last question: with this, can I fit in memory a dataset that spans more than the memory available on a single node?
The data will be distributed, so it will be split over all the nodes involved in the job.
Yes, but say the total data is 2 TB and each node has 512 GB of RAM. Can I do that over 5 nodes then?
Yes, it is the total memory available that counts (so 5 × 512 GB). You will need to leave some memory free for PySPOD processing and computations, so I am not sure whether 5, 6, or 7 nodes would be enough here. Generally, you should be able to fit any dataset in memory if you have enough nodes. We've run PySPOD on tens of TB across thousands of nodes.
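As a rough back-of-the-envelope estimate (the 1.5x headroom factor below is an assumption for illustration, not a documented PySPOD requirement):

import math

total_data_gb = 2 * 1024   # 2 TB dataset
ram_per_node_gb = 512      # memory per node
headroom = 1.5             # assumed slack for PySPOD buffers and computations

nodes = math.ceil(total_data_gb * headroom / ram_per_node_gb)
print(nodes)  # 6 nodes under these assumptions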
OK, I'm asking because I was memory-limited when using the 1-stage reader.