
[Python] NetCDF4 file growing 50% in size with MPI enabled.

leuchthelp opened this issue 2 months ago

This is a cross-post from the netCDF4 Python repository; I will provide all available information again here.

Please consider me to be a novice when it comes to using NetCDF4 and all things related.

Versions (installed via Spack v0.23.1):

compiler: [email protected]

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]~cxx~fortran+hl~ipo~java~map+mpi+shared+subfiling~szip+threadsafe+tools
[email protected]

Both on:

  • Ubuntu 22.04 - 6.6.87.2-microsoft-standard-WSL2
  • Levante - 4.18.0-553.42.1.el8_10.x86_64
  • Additionally verified by a member of the DKRZ who was not running the exact environment used (i.e., different software versions); details available if needed

Any file created via the netCDF4 Python API grows by exactly 50% in size (e.g. 10 GB -> 15 GB, 20 GB -> 30 GB, 30 GB -> 45 GB, ...).

The code provided here (test.py) can be used to reproduce the issue. Simply enabling MPI via the netCDF4.Dataset(path, "w", format="NETCDF4", parallel=True) option results in a file 50% larger than intended; setting the flag to False creates the expected file size. mpiexec, mpirun, or -n N do not need to be supplied for the effect to show: running python test.py with the flag set to True is enough to reproduce the issue. One way to view this is with a tool such as binocle to inspect the raw binary data.

from mpi4py import MPI
import netCDF4
import numpy as np


def create(path, form, dtype="f8", parallel=False):
    root = netCDF4.Dataset(path, "w", format="NETCDF4", parallel=parallel)  # type: ignore

    root.createGroup("/")
    used = 0

    for variable, element in form.items():
        shape = element[0]
        chunks = element[1]
        dimensions = []

        # Dimensions are named "0", "1", ... ("dynamically") rather than after the variable.
        for size in shape:
            root.createDimension(f"{used}", size)
            dimensions.append(f"{used}")
            used += 1

        if len(chunks) != 0:
            x = root.createVariable(variable, dtype, dimensions, chunksizes=chunks)
        else:
            x = root.createVariable(variable, dtype, dimensions)

        if not parallel:
            print(len(np.random.random_sample(shape)))
            x[:] = np.random.random_sample(shape)
        else:
            # Each rank writes a contiguous slice along the first dimension.
            rank = MPI.COMM_WORLD.rank  # type: ignore
            rsize = MPI.COMM_WORLD.size  # type: ignore
            total_size = shape[0]
            size = int(total_size / rsize)

            rstart = rank * size
            rend = rstart + size

            print(f"shape: {shape}, chunks: {chunks}, dimensions: {dimensions}, "
                  f"total size: {total_size}, size per rank: {size}, rank: {rank}, "
                  f"rsize: {rsize}, rstart: {rstart}, rend: {rend}")

            print(len(np.random.random_sample(size)))
            x[rstart:rend] = np.random.random_sample(size)
            MPI.COMM_WORLD.Barrier()  # type: ignore
            print(f"var: {x}, ncattrs after fill: {x.ncattrs()}, as dict: {x.__dict__}")

    root.close()  # close the dataset so everything is flushed to disk


def main():
    create(form={"X": [[10 * 134217728], []]}, path="test.nc", parallel=True)


if __name__ == "__main__":
    main()
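
As noted above, the effect reproduces without an explicit MPI launcher; for completeness, both of the following invocations show it once parallel=True (the -n 4 rank count here is arbitrary):

python test.py
mpiexec -n 4 python test.py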

This is an image obtained from the broken, 50% larger file. It is zoomed out very far, but at the very beginning one can see the header. [image: binary view of the oversized file]

This is what the file should look like: much less empty space before the data. [image: binary view of the correctly sized file]

Additional output obtained by the aforementioned member of the DKRZ:

~/Git/Testprogramme/NetCDF/IO on master ● λ ncdump -h test_false.nc
netcdf test_false {
dimensions:
    \0 = 1342177280 ;
variables:
    double X(\0) ;
}
~/Git/Testprogramme/NetCDF/IO on master ● λ ncdump -h test_true.nc
netcdf test_true {
dimensions:
    \0 = 1342177280 ;
variables:
    double X(\0) ;
}
~/Git/Testprogramme/NetCDF/IO on master ● λ ls -lh test_*
-rw-r--r-- 1 user user 11G Sep 22 14:59 test_false.nc
-rw-r--r-- 1 user user 16G Sep 22 14:59 test_true.nc
~/Git/Testprogramme/NetCDF/IO on master ● λ du -shc test_*
11G    test_false.nc
11G    test_true.nc
21G    total

leuchthelp, Oct 10 '25 16:10

Additional comment from the original thread:

by @florianziemen

My rough memory from running into a similar problem years ago is that netCDF4 seems to allocate the space for the dimension variable (uncompressed). If you do X("X") instead of X("0"), this problem should vanish. If you create a multi-dimensional array, the empty space should be much smaller. Still worth looking into for a fix, but this might point towards the core of the issue. @leuchthelp can you check if this applies?
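
For context, a quick back-of-the-envelope check against the listing above (my own arithmetic; the 4-bytes-per-element figure for a hidden dimension dataset is purely a guess):

elements = 10 * 134217728      # 1342177280, the length of dimension "0"
print(elements * 8 / 2**30)    # 10.0 -> ~10 GiB of f8 payload, matching the ~11G listing
print(elements * 4 / 2**30)    # 5.0  -> ~5 GiB, matching the extra space in test_true.nc
# du reports 11G for both files, so the extra ~5G in test_true.nc appears
# to be a hole: space allocated in the file layout but never written.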


Additional testing:

Apologies for the delay; I had to make sure I fully understood the suggestion in order to verify it correctly.

For simplicity I reduced the expected file size from 10 GB to 1 GB. All changes mentioned can be found here: test-working-with-change.py

If you do X("X") instead of X("0"), this problem should vanish.

Correct, this does seem to be the case. Changing root.createDimension(f"{used}", size) to root.createDimension("X", size), and x = root.createVariable(variable, dtype, dimensions) to x = root.createVariable(variable, dtype, ("X",)), results in the expected file size (see the sketch below).
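
For reference, a minimal sketch of that change (hypothetical excerpt from the reproducer; it assumes a single 1-D variable named "X", so the dimension can share the variable's name):

# before: dimension named dynamically ("0", "1", ...)
root.createDimension(f"{used}", size)
x = root.createVariable(variable, dtype, dimensions)

# after: dimension named after the variable, as suggested
root.createDimension("X", size)
x = root.createVariable(variable, dtype, ("X",))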

Zooming in 4x shows the data right after the header without much empty space, as would be expected. [image: binary view after the dimension-name change]

If you create a multi-dimensional array, the empty space should be much smaller.

Correct as well: creating a file with form={"X": [[512, 512, 512], []]}, i.e. a 512x512x512 multi-dimensional array, using the original code provided above entirely unchanged, also creates the expected file size and does not exhibit the growing behavior (see the call below).
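
For reference, the multi-dimensional test is a one-line change to the call in main() (the path name here is illustrative; 512^3 f8 values are exactly 1 GiB):

create(form={"X": [[512, 512, 512], []]}, path="test_3d.nc", parallel=True)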

Zooming in 4x on the multi-dimensional file shows more empty space than the fixed 1-D version, but the reported file size matches 1 GB, so it fits your description as well. [image: binary view of the multi-dimensional file]

In short: creating a 1-D variable and writing to it with MPI results in a 50% larger file when the dimension is named "dynamically" via the X("0"), ..., X("N") pattern. Switching to the X("X") naming pattern seems to resolve the issue, but it does not support the use case I need.

leuchthelp, Oct 10 '25 16:10

I am unfamiliar with the Python API, but is there a way to run this test with fill values turned off? I'm curious whether that has an effect.

WardF, Oct 16 '25 22:10

but is there a way to run this test with fill values turned off?

Sorry for the delayed response; time zones are fun.

There seems to be a way: setting fill_value=False within createVariable(), e.g. root.createVariable(variable, dtype, dimensions, fill_value=False).

From my quick testing this does not change the behavior. The file still becomes 50% larger than intended.
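
For reference, a sketch of the two ways I understand fill can be disabled in the Python API (excerpts from the reproducer; I only tested the per-variable form, and mention the dataset-level set_fill_off() variant as an alternative):

# per-variable: do not pre-fill this variable
x = root.createVariable(variable, dtype, dimensions, fill_value=False)

# dataset-level: turn fill mode off for the whole file
root.set_fill_off()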

I will additionally tag @jswhit for more insight, as I'm not too familiar with the Python API myself beyond this simple example.

leuchthelp, Oct 17 '25 09:10