Memory leak in hdf5 1.14.2 that appears to be fixed in 1.14.6

Open roballsopp opened this issue 4 months ago • 5 comments

netCDF4 version: 1.7.2 (also seems to happen in 1.6.5)
Environment: ubuntu 24.04 docker image, python 3.12

I have several long-running processes that write many netcdf files over time. Given enough time, with the current 1.7.2 release of netCDF4, these processes eventually run out of memory and crash. This is fairly severe for one of my processes, in which netCDF4 gobbles up about 2GB of memory roughly every 4 hours. For others it can take several days, but eventually all the memory appears to go to netCDF4. I think I've tracked this to a memory leak in the underlying hdf5 library. I haven't determined all of the affected versions, but I know for sure that 1.14.2 (which seems to ship with the 1.7.2 wheel for netCDF4) is affected.

When I hold everything else constant and compile netCDF4 against the hdf5 1.14.6 release, the memory issues are gone and memory usage stays flat over time. I've constructed a basic reproduction using docker and memray:

# memtest.py
import shutil
import uuid

import netCDF4 as nc
import numpy as np


print("netcdf ver:", nc.getlibversion())
print("hdf5 ver:", nc.__hdf5libversion__)

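# Write many small files; each iteration opens and closes one dataset.
for i in range(15000):
    # Copy the empty template to a uniquely named file.
    file_id = str(uuid.uuid4())
    out_file = f"{file_id}.nc"
    shutil.copy2("memtest_template.nc", out_file)

    # Open the copy for in-place update and write 10000 values along the unlimited dimension.
    with nc.Dataset(out_file, "r+") as ds:
        ds.variables["my_var"][:] = np.random.random((10000,))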
# cdl to generate memtest_template.nc
netcdf memtest {
dimensions:
	my_dim = UNLIMITED ; // (0 currently)
variables: 
	double my_var (my_dim) ; 
		my_var:long_name = "my_dimmy_dim" ;
		my_var:_Storage = "chunked" ; 
		my_var:_ChunkSizes = 1000 ; 
}
# Dockerfile 1 - has the memory leak
ARG LIBNETCDF_VERSION=4.9.3
ARG LIBHDF5_VERSION=1_14_2
ARG LIBNETCDF_INSTALL_DIR=/opt/netcdf

FROM ubuntu:24.04
ARG LIBNETCDF_VERSION LIBHDF5_VERSION LIBNETCDF_INSTALL_DIR

RUN apt-get update -y \
    && apt-get install -y --no-install-recommends python3.12-dev python3-venv gcc g++ m4 zlib1g-dev libaec-dev libxml2-dev make pkg-config \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /netcdf

ADD ["https://downloads.unidata.ucar.edu/netcdf-c/$LIBNETCDF_VERSION/netcdf-c-$LIBNETCDF_VERSION.tar.gz", "netcdf-c.tar.gz"]
ADD ["https://github.com/HDFGroup/hdf5/releases/download/hdf5-$LIBHDF5_VERSION/hdf5-$LIBHDF5_VERSION.tar.gz", "hdf5.tar.gz"]

# Following instructions here: https://github.com/HDFGroup/hdf5/blob/hdf5_1.14.6/release_docs/INSTALL_Autotools.txt
RUN tar xf hdf5.tar.gz \
    && tar xf netcdf-c.tar.gz

WORKDIR /netcdf/hdf5_build

RUN /netcdf/hdfsrc/configure \
    --prefix="$LIBNETCDF_INSTALL_DIR" \
    --enable-build-mode=production \
    --disable-threadsafe \
    --enable-pkgconfig \
    --enable-hl \
    --with-szlib=/usr \
    && make -j$(nproc) \
    && make install 

WORKDIR /netcdf/nc_build

RUN /netcdf/netcdf-c-$LIBNETCDF_VERSION/configure \
    --prefix="$LIBNETCDF_INSTALL_DIR" \
    --enable-netcdf-4 \
    --disable-dap \
    CPPFLAGS="-I$LIBNETCDF_INSTALL_DIR/include" \
    LDFLAGS="-L$LIBNETCDF_INSTALL_DIR/lib" \
    && make -j$(nproc) \
    && make install

WORKDIR /test

ENV VIRTUAL_ENV=/test/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
ENV HDF5_DIR=$LIBNETCDF_INSTALL_DIR
ENV HDF5_LIBDIR=$LIBNETCDF_INSTALL_DIR/lib
ENV HDF5_INCDIR=$LIBNETCDF_INSTALL_DIR/include
ENV NETCDF4_DIR=$LIBNETCDF_INSTALL_DIR

RUN python3.12 -m venv --upgrade --upgrade-deps $VIRTUAL_ENV \
    && pip install --no-cache-dir --no-binary=netcdf4 numpy~=1.26.4 netcdf4~=1.7.2 memray

COPY memtest.py memtest_template.nc ./

RUN memray run --native --trace-python-allocators -o memray_data.bin memtest.py
RUN python -m memray flamegraph --leaks -o flamegraph.html memray_data.bin


ENTRYPOINT ["cat", "flamegraph.html"]
# Dockerfile 2 - does not have a leak
ARG LIBNETCDF_VERSION=4.9.3
ARG LIBHDF5_VERSION=1.14.6
ARG LIBNETCDF_INSTALL_DIR=/opt/netcdf

FROM ubuntu:24.04
ARG LIBNETCDF_VERSION LIBHDF5_VERSION LIBNETCDF_INSTALL_DIR

RUN apt-get update -y \
    && apt-get install -y --no-install-recommends python3.12-dev python3-venv gcc g++ m4 zlib1g-dev libaec-dev libxml2-dev make pkg-config \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /netcdf

ADD ["https://downloads.unidata.ucar.edu/netcdf-c/$LIBNETCDF_VERSION/netcdf-c-$LIBNETCDF_VERSION.tar.gz", "netcdf-c.tar.gz"]
ADD ["https://github.com/HDFGroup/hdf5/releases/download/hdf5_$LIBHDF5_VERSION/hdf5-$LIBHDF5_VERSION.tar.gz", "hdf5.tar.gz"]

# Following instructions here: https://github.com/HDFGroup/hdf5/blob/hdf5_1.14.6/release_docs/INSTALL_Autotools.txt
RUN tar xf hdf5.tar.gz \
    && tar xf netcdf-c.tar.gz

WORKDIR /netcdf/hdf5_build

RUN /netcdf/hdf5-$LIBHDF5_VERSION/configure \
    --prefix="$LIBNETCDF_INSTALL_DIR" \
    --enable-build-mode=production \
    --disable-threadsafe \
    --enable-pkgconfig \
    --enable-hl \
    --with-szlib=/usr \
    && make -j$(nproc) \
    && make install 

WORKDIR /netcdf/nc_build

RUN /netcdf/netcdf-c-$LIBNETCDF_VERSION/configure \
    --prefix="$LIBNETCDF_INSTALL_DIR" \
    --enable-netcdf-4 \
    --disable-dap \
    CPPFLAGS="-I$LIBNETCDF_INSTALL_DIR/include" \
    LDFLAGS="-L$LIBNETCDF_INSTALL_DIR/lib" \
    && make -j$(nproc) \
    && make install

WORKDIR /test

ENV VIRTUAL_ENV=/test/venv
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
ENV HDF5_DIR=$LIBNETCDF_INSTALL_DIR
ENV HDF5_LIBDIR=$LIBNETCDF_INSTALL_DIR/lib
ENV HDF5_INCDIR=$LIBNETCDF_INSTALL_DIR/include
ENV NETCDF4_DIR=$LIBNETCDF_INSTALL_DIR

RUN python3.12 -m venv --upgrade --upgrade-deps $VIRTUAL_ENV \
    && pip install --no-cache-dir --no-binary=netcdf4 numpy~=1.26.4 netcdf4~=1.7.2 memray

COPY memtest.py memtest_template.nc ./

RUN memray run --native --trace-python-allocators -o memray_data.bin memtest.py
RUN python -m memray flamegraph --leaks -o flamegraph.html memray_data.bin


ENTRYPOINT ["cat", "flamegraph.html"]

Build both Dockerfiles:

docker build -f Dockerfile-memtest1 -t memtest1 .
docker build -f Dockerfile-memtest2 -t memtest2 .

Then run:

docker run memtest1 > flamegraph1.html
docker run memtest2 > flamegraph2.html

Comparing the two flamegraphs, you'll see that memory grows with Dockerfile 1 (hdf5 1.14.2) but not with Dockerfile 2 (hdf5 1.14.6), and that the growth appears to occur entirely in the H5FL__malloc calls underneath the line with nc.Dataset(out_file, "r+") as ds.
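For a quick check without memray, the same growth can also be observed by sampling the process's resident set size (RSS) while the loop runs. The following is a minimal sketch and not part of the original reproduction: rss_bytes and sample_rss are hypothetical helper names, and the /proc/self/statm read assumes Linux (other Unix systems fall back to the peak RSS reported by resource.getrusage).

```python
import os
import resource
import sys


def rss_bytes():
    """Return the resident set size of this process in bytes.

    Assumption: Linux exposes /proc/self/statm. Elsewhere we fall back to
    ru_maxrss, which is the *peak* RSS rather than the current value.
    """
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except OSError:
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
        return peak if sys.platform == "darwin" else peak * 1024


def sample_rss(step, iterations, sample_every=1000):
    """Call step(i) `iterations` times and record RSS every `sample_every` calls.

    Returns a list of (iterations_completed, rss_in_bytes) tuples, starting
    with a baseline sample taken before the first call.
    """
    samples = [(0, rss_bytes())]
    for done in range(1, iterations + 1):
        step(done - 1)
        if done % sample_every == 0:
            samples.append((done, rss_bytes()))
    return samples


if __name__ == "__main__":
    # Stand-in workload; replace with the real per-file write.
    for done, rss in sample_rss(lambda i: None, iterations=3000):
        print(f"{done:6d} iterations: {rss / 2**20:.1f} MiB")
```

With memtest.py, step would be the body of the for loop (copy the template, open it with nc.Dataset, write my_var). A steadily rising RSS column would be consistent with the leak, while a flat one would match the 1.14.6 behaviour described here. RSS is noisier than memray's allocation tracking, so this only shows the trend, not where the memory goes.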


Is it possible to cut another release of netCDF4 that is built against hdf5 1.14.6?

roballsopp avatar Aug 18 '25 03:08 roballsopp

Checked a few more hdf5 versions, and the leak is still present in 1.12.3 and 1.14.3, but appears to be fixed as of 1.14.4.2.
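To flag an installation that may be affected, the version string printed by memtest.py (nc.__hdf5libversion__) can be compared against the first release where the leak appeared to be fixed. This is a rough sketch only: parse_hdf5_version and possibly_affected are hypothetical helpers, and treating every version below 1.14.4.2 as possibly affected is an assumption, since only 1.12.3, 1.14.2 and 1.14.3 were actually tested.

```python
import re

# First release where the leak appeared to be fixed, per the testing above.
FIXED_AS_OF = (1, 14, 4, 2)


def parse_hdf5_version(text):
    """Extract the numeric version components from a version string.

    Accepts '.', '-' or '_' as separators, e.g. '1.14.2', '1_14_2', '1.14.4-3'.
    """
    match = re.search(r"\d+(?:[.\-_]\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in re.split(r"[.\-_]", match.group()))


def possibly_affected(version_text):
    """Return True if the version sorts before FIXED_AS_OF.

    Assumption: all earlier versions are treated as possibly affected, even
    though only a few were tested. A bare '1.14.4' (no sub-release) also
    counts as possibly affected, which errs on the side of caution.
    """
    return parse_hdf5_version(version_text) < FIXED_AS_OF


if __name__ == "__main__":
    for version in ("1.12.3", "1.14.2", "1.14.3", "1.14.4.2", "1.14.6"):
        print(version, "possibly affected:", possibly_affected(version))
```

In a real process the argument would be nc.__hdf5libversion__ after import netCDF4 as nc.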

roballsopp avatar Aug 18 '25 14:08 roballsopp

Thanks for the report. It sounds like we should use HDF5 > 1.14.4.1 when we make binary wheels for the next release. @ocefpaf can you update netcdf-manylinux?

jswhit avatar Aug 18 '25 22:08 jswhit

> @ocefpaf can you update netcdf-manylinux?

Latest images are using 1.14.6:

https://github.com/ocefpaf/netcdf-manylinux/blob/fc3edeac483c9a5469891fe774396ec3bab8ece9/Dockerfile_x86_64#L5

We can do a post release or mint new ones.

ocefpaf avatar Aug 19 '25 09:08 ocefpaf

@ocefpaf what version of the hdf5 lib do the current mac and windows wheels use?

jswhit avatar Aug 19 '25 16:08 jswhit

Both Windows and macOS would get hdf5 1.14.6 if built today. For the former we are pinning netcdf-c to 4.9.2, but the latter will get whatever is latest in homebrew. (I am planning to move those to a more controlled build at some point, I just need to find some time for it.)

ocefpaf avatar Aug 19 '25 17:08 ocefpaf