rasterio icon indicating copy to clipboard operation
rasterio copied to clipboard

Memory leak with .read()

Open fratambot opened this issue 1 year ago • 17 comments

Discussed in https://github.com/rasterio/rasterio/discussions/3033

Originally posted by fratambot February 23, 2024 I really like rasterio and I would like to use it over gdal python bindings in our deep learning projects but we noticed a memory leak and I have no idea how to avoid it.

When loading our .tif files into numpy arrays in the __getitem__ of our torch.utils.data.Dataset class, if we use:

from osgeo import gdal

img_gdal = gdal.Open(str(self.imgs[idx]))
img = img_gdal.ReadAsArray(interleave="pixel")

we have a constant RAM usage. On the other hand, if we use:

import rasterio

with rasterio.open(str(self.imgs[idx]), "r") as img_ds:
    img = np.transpose(img_ds.read(), (1, 2, 0))

the RAM usage keeps growing until the end of the process which consist in loading batches of training images.

Any idea why this is happening ? Many thanks in advance !

(I'm using rasterio 1.3.9 from conda-forge)

fratambot avatar Feb 23 '24 09:02 fratambot

Can you provide the following info?

  1. What is the format of the file? Are you able to publicly share one of the files?
  2. Do you still see the memory leak if you remove the np.transpose call?
import rasterio

with rasterio.open(str(self.imgs[idx]), "r") as img_ds:
    img = img_ds.read()

groutr avatar Feb 23 '24 18:02 groutr

@fratambot what are you using to measure memory usage? Do you see entire images worth of memory not being deallocated or smaller amounts? Can you provide a small program and graphs?

I will delete the duplicate #3033.

sgillies avatar Feb 23 '24 20:02 sgillies

Hello Ryan and Sean, The data comes from a public dataset: https://ignf.github.io/FLAIR/#FLAIR1 and they are in .tif format; size of every image is ~1 MB. I was seeing the memory leak though the app we use to monitor our servers (netdata cloud).

I will work on a minimum reproducible example and properly measure the memory consumption and I'll report back to you !

fratambot avatar Feb 26 '24 08:02 fratambot

@fratambot another useful piece of info when you report back would be the output of

import rasterio
rasterio.show_versions()

groutr avatar Feb 26 '24 16:02 groutr

I used the toy dataset and opened each file 500 times using my snippet above (https://github.com/rasterio/rasterio/issues/3034#issuecomment-1961780372). I measured memory usage with memory_profiler and came up with this plot. 500test

groutr avatar Feb 27 '24 00:02 groutr

Hello Ryan, here's the output of rasterio.show_versions() :

rasterio info:
  rasterio: 1.3.9
      GDAL: 3.7.3
      PROJ: 9.3.0
      GEOS: 3.12.0
 PROJ DATA: /opt/miniconda3/envs/flair-new/share/proj
 GDAL DATA: /opt/miniconda3/envs/flair-new/share/gdal

System:
    python: 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
executable: /opt/miniconda3/envs/flair-new/bin/python
   machine: Linux-5.15.0-69-generic-x86_64-with-glibc2.31

Python deps:
    affine: 2.4.0
     attrs: 23.1.0
   certifi: 2023.11.17
     click: 8.1.7
     cligj: 0.7.2
    cython: None
     numpy: 1.26.0
    snuggs: 1.4.7
click-plugins: None
setuptools: 68.0.0

I wrote a small script in which I read 64 times each one of the 200 images of the toy dataset and each as if it was a batch. I optionally perform some transformations to the numpy array before passing to the next batch. I compared GDAL vs rasterio in both scenario and I couldn't see any memory leak. I used tracemalloc and store the current memory usage and the peak memory usage while looping over the batches.

Here's the script :

import tracemalloc
import pickle
from pathlib import Path
import numpy as np
import rasterio
from osgeo import gdal


def test_memory_leak(data_path):
    data = sorted(list(Path(data_path).glob("*/*/*/*.tif")))
    print(len(data))

    memory_usage = {
        "current" : [],
        "peak" : []
    }

    tracemalloc.start()
    for i in range(len(data)):
        batch = []
        current, peak = tracemalloc.get_traced_memory()
        print("current / peak = ", current/1.e6, peak/1.e6)
        memory_usage["current"].append(current/1.e6)
        memory_usage["peak"].append(peak/1.e6)
        for j in range(64):
            img = load_img(data, i)
            #img = transform(img)
            batch.append(img)
    
    current, peak = tracemalloc.get_traced_memory()
    print("final current / peak = ", current/1.e6, peak/1.e6)
    tracemalloc.stop()

    with open("notebooks/gdal.pkl", "wb") as f:
        pickle.dump(memory_usage, f)

    
    print("Done")
    return

def load_img(data, idx):
    # with rasterio.open(str(data[idx]), "r") as img_ds:
    #     img = img_ds.read()
    img_gdal = gdal.Open(str(data[idx]))
    img = img_gdal.ReadAsArray(interleave="pixel")

    return img

def transform(img):
    img = img[:,:,:2]
    img = np.transpose(img, (2, 0, 1))

    return img

if __name__ == "__main__":
    data_path = "path_to_my_dataset_folder"
    test_memory_leak(data_path)

Here are the plots :

mem_leak_transf mem_leak_no_transf

I think the issue might be related to pytorch. I'll perform more test using the pytorch Dataloader and the T.Compose transformations...

Here's a screenshot of the app we use to monitor the RAM usage of our servers just to show you I'm not crazy ;)

netdata_memory

The only difference is that in the first run I load the .tif with gdal and in the second run with rasterio and it seems that something is not freed but, as I was saying, a lot of other things happen after that such as transformations of numpy arrays into tensors which are moved to cuda device and further transformed.

It's still not clear to me what the issue really is... I'll keep investigating !

fratambot avatar Feb 27 '24 11:02 fratambot

@fratambot thank you, we take this seriously! We've fixed one leak since 1.3.9, but it's not related https://github.com/rasterio/rasterio/blob/441d8c85ba1b7170a0b19313a4d654c5fab6b77a/CHANGES.txt#L35.

sgillies avatar Feb 27 '24 17:02 sgillies

I was inspecting my rasterio environment I used to do my plot and realized it was an older version of rasterio (1.3.6 I think). I updated to 1.3.9 and the memory usage with rasterio is basically a flat line now (but does rise 1-2Mb over the duration of the test).

Can you clarify something about your screenshot of RAM usage on the servers. Is that running your full processing pipeline, or just the loading script in your comment?

groutr avatar Feb 29 '24 05:02 groutr

Hello everybody,

First of all I realized that tracemalloc is not the best tool to measure the overall memory usage of a script so I moved to memory-profile.

Secondly and most importantly, I managed to reproduce the memory leak with a simple script and it's a weird one !

TL;DR The leak happens when we load both the images and the masks and we transform the images using .to_tensor() from torchvision and the masks using from_numpy() from pytorch. The bad news is that's typically what we do when we want to train a model for deep learning 😓

Here's the minimal script I used :

from pathlib import Path
import rasterio
from osgeo import gdal
import torch
import torchvision.transforms as T


#rasterio.show_versions()


def test_memory_leak(img_path, msk_path):
    img_paths = None
    msk_paths = None
    img_paths = sorted(list(Path(img_path).glob("*/*/*/*.tif")))
    msk_paths = sorted(list(Path(msk_path).glob("*/*/*/*.tif")))
    if img_paths:
        data_size = len(img_paths)
    else:
        data_size = len(msk_paths)

    nb_epochs = 10
    batch_size = 32
    nb_batches = int(data_size/batch_size)
    print("nb_batches = ", nb_batches)

    for epoch in range(nb_epochs):
        print("epoch = ", epoch+1)
        for batch in range(nb_batches):
            img_list = []
            msk_list = []
            for idx in range(batch*batch_size, (batch+1)*batch_size):
                # DATA LOADER
                sample = dataset_rio(img_paths, msk_paths, idx)
                #sample = dataset_gdal(img_paths, msk_paths, idx)
                
                # TRANSFORM
                sample = transform_sample_tensor(sample)
                #sample = transform_sample_numpy(sample) # IT NEVER LEAKS

                # DO SOMETHING WITH THE SAMPLE
                if "img" in sample:
                    img_list.append(sample["img"])
                if "msk" in sample:
                    msk_list.append(sample["msk"])

 
    print("Done")
    return


# DATA LOADERS
def dataset_rio(img_paths, msk_paths, idx):
    if img_paths:
        with rasterio.open(str(img_paths[idx]), "r") as img_ds:
            img = img_ds.read()
    if msk_paths:
        with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
            msk = msk_ds.read()

    if img_paths and msk_paths:
        sample = {"img": img, "msk" : msk}
    elif not img_paths:
        sample = {"msk" : msk}
    elif not msk_paths:
        sample = {"img" : img}
    else:
        raise ValueError("you need to pass at least 1 path")
    
    return sample


def dataset_gdal(img_paths, msk_paths, idx):
    img_gdal = gdal.Open(str(img_paths[idx]))
    img = img_gdal.ReadAsArray(interleave="pixel")
    
    msk_gdal = gdal.Open(str(msk_paths[idx]))
    msk = msk_gdal.ReadAsArray(interleave="pixel")

    sample = {"img": img, "msk" : msk}

    return sample


# TRANSFORMATIONS
def transform_sample_tensor(sample):

    if "img" in sample:
        img = sample["img"]
        img = T.functional.to_tensor(img)
        #img = torch.from_numpy(img)

    if "msk" in sample:
        msk = sample["msk"]
        #msk = T.functional.to_tensor(msk)
        msk = torch.from_numpy(msk)

    if ("img" in sample) and ("msk" in sample):
        sample = {"img" : img, "msk" : msk}
    elif "img" not in sample:
        print("only mask")
        sample = {"msk" : msk}
    elif "msk" not in sample:
        print("only img")
        sample = {"img" : img}
    else:
        raise ValueError("I cannot find any sample to transform")

    return sample


def transform_sample_numpy(sample):

    img, msk = sample["img"], sample["msk"]
    img = img[:,:,:3]
    msk[msk > 13] = 13

    return {"img" : img, "msk" : msk}


if __name__ == "__main__":
    # >>  mprof run python test_memory_leak.py 
    # >>  mprof plot

    img_path = "path_to/flair_1_toy_aerial_train"
    msk_path = "path_to/flair_1_toy_labels_train"
    test_memory_leak(img_path, msk_path)

(all the if-elif-else logic is just to use the same function with both the images and masks datasets or only one of them)

Here's the memory profile when the leaks happens (notice the order of magnitude of the y-axis) :

rio_IMG-to_tensor_MSK-from_np_LEAK

If the sample is loaded using dataset_gdal(), there's no leak :

gdal_IMG-to_tensor_MSK-from_np

Coming back to using dataset_rio(), if we do nothing with the sample, i.e. everything below the comment # DO SOMETHING WITH THE SAMPLE is commented, the leak disappears (but IRL this never happens) :

rio_IMG-to_tensor_MSK-from_np_do-nothing

Weirdly enough, if we transform the image using .from_numpy() and the mask using .to_tensor(), there's no leak (but IRL this never happens) :

rio_IMG-from_np_MSK-to_tensor

The data type is the same for both the images and the masks (uint8) and the only difference is in the number of channels. Here's a profile of the rasterio dataset for the first image :

driver :  GTiff
dtype :  uint8
nodata :  None
width :  512
height :  512
count :  5
crs :  EPSG:2154
transform :  | 0.20, 0.00, 922524.80|
| 0.00,-0.20, 6301512.00|
| 0.00, 0.00, 1.00|


Band 1 [uint8] [color gray] :  Statistics(min=32.0, max=219.0, mean=82.958557128906, std=43.971926219213)
Band 2 [uint8] [color undefined] :  Statistics(min=46.0, max=224.0, mean=88.913074493408, std=33.015264288545)
Band 3 [uint8] [color undefined] :  Statistics(min=46.0, max=218.0, mean=82.606864929199, std=24.936723073525)
Band 4 [uint8] [color undefined] :  Statistics(min=28.0, max=224.0, mean=92.435501098633, std=26.890214699652)
Band 5 [uint8] [color undefined] :  Statistics(min=0.0, max=100.0, mean=26.231979370117, std=28.680426421257)

And for one of the first mask :

driver :  GTiff
dtype :  uint8
nodata :  None
width :  512
height :  512
count :  1
crs :  LOCAL_CS["RGF93 v1 / Lambert-93",UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]
transform :  | 0.20, 0.00, 922524.80|
| 0.00,-0.20, 6301512.00|
| 0.00, 0.00, 1.00|


Band 1 [uint8] [color gray] :  Statistics(min=1.0, max=10.0, mean=6.9922790527344, std=1.9747478592662)

If we transform both the images and the masks using .to_tensor() or .from_numpy(), no leak is observed. rio_to_tensor rio_from_numpy If we transform only the numpy array and never use tensors, no leak is observed. rio_numpy_transformations If we use only images or only masks, no leak is observed. rio_only_img rio_only_msk

Unfortunately, none of this scenarios happen IRL.

As I was saying at the beginning, for training a model we always want to use .to_tensor() for images because it automagically reshape any numpy array of shape [h,w,c] into [c,h,w] and it normalizes the pixel values between 0.0 and 1.0 which is what a neural network expects. On the other hand, we never want to normalize the pixel value of the masks because the integer value represents a class (in semantic segmentation) and we just want to transform it to a tensor using .from_numpy() which is exactly the scenario for which the leak manifests itself 💔.

The issue is most likely related with the fact that the tensor returned by .from_numpy() shares the same memory as the numpy array. However, the fact that it manifests only when we mix the methods and that if we use .from_numpy() on the image instead of the mask and .to_tensor() on the mask instead of the images the leak disappears, throws me off, as if it depends also on the number of raster's bands 😕

Hopefully you will find the reason and fix it because I would really like to replace every python code line using gdal (whose documentations is almost inexistent) with rasterio in all of our projects !

fratambot avatar Feb 29 '24 16:02 fratambot

@fratambot One thing to note is that the shapes of the arrays returned from gdal and rasterio are different. gdal returns an img array of shape (512, 512, 5), but rasterio returns (5, 512, 512). Depending on what torchvision is expecting, that might lead to some weird processing.

groutr avatar Feb 29 '24 18:02 groutr

@fratambot One thing to note is that the shapes of the arrays returned from gdal and rasterio are different. gdal returns an img array of shape (512, 512, 5), but rasterio returns (5, 512, 512). Depending on what torchvision is expecting, that might lead to some weird processing.

Yes, I'm aware ! And that's why in the first script I was performing the np.translate() on reading in order to respect the (internal ?) convention the numpy array have shape [h,w,c] so that you can plot them easily. I removed it in the last script in order to not introduce any aleatory effect and it shouldn't be related to the leak but I can take a look at it tomorrow !

Actually that could explain the different behaviour with respect to masks which are 1D 🤔

fratambot avatar Feb 29 '24 18:02 fratambot

@fratambot I ran your code sample with an environment that tried to match versions as best I could. I could not reproduce your issue on my machine.

fratambot

I sourced all my packages from conda-forge. I used the latest compatible pytorch (2.1.2) and torchvision (0.16.1)

rasterio info:
  rasterio: 1.3.9
      GDAL: 3.7.3
      PROJ: 9.3.0
      GEOS: 3.12.0
 PROJ DATA: /home/grout/mambaforge/envs/rasterioleak2/share/proj
 GDAL DATA: /home/grout/mambaforge/envs/rasterioleak2/share/gdal

System:
    python: 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
executable: /home/grout/mambaforge/envs/rasterioleak2/bin/python3.11
   machine: Linux-6.7.6-arch1-1-x86_64-with-glibc2.39

Python deps:
    affine: 2.4.0
     attrs: 23.1.0
   certifi: 2024.02.02
     click: 8.1.7
     cligj: 0.7.2
    cython: None
     numpy: 1.26.0
    snuggs: 1.4.7
click-plugins: None
setuptools: 69.1.1

Could you post the output of conda list of your environment. Perhaps there is another package that is causing a strange interaction.

groutr avatar Feb 29 '24 18:02 groutr

@fratambot I ran your code sample with an environment that tried to match versions as best I could. I could not reproduce your issue on my machine.

fratambot

I sourced all my packages from conda-forge. I used the latest compatible pytorch (2.1.2) and torchvision (0.16.1)

rasterio info:
  rasterio: 1.3.9
      GDAL: 3.7.3
      PROJ: 9.3.0
      GEOS: 3.12.0
 PROJ DATA: /home/grout/mambaforge/envs/rasterioleak2/share/proj
 GDAL DATA: /home/grout/mambaforge/envs/rasterioleak2/share/gdal

System:
    python: 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
executable: /home/grout/mambaforge/envs/rasterioleak2/bin/python3.11
   machine: Linux-6.7.6-arch1-1-x86_64-with-glibc2.39

Python deps:
    affine: 2.4.0
     attrs: 23.1.0
   certifi: 2024.02.02
     click: 8.1.7
     cligj: 0.7.2
    cython: None
     numpy: 1.26.0
    snuggs: 1.4.7
click-plugins: None
setuptools: 69.1.1

Could you post the output of conda list of your environment. Perhaps there is another package that is causing a strange interaction.

I'll send it to you tomorrow (here in EU it's evening) but that's good news !! 🤩 I know I have both rasterio and gdal which is not a good idea but I need both of them for testing !

fratambot avatar Feb 29 '24 18:02 fratambot

@fratambot I have no idea what your production environment is like, but here is my environment spec if you want to recreate it and test. rasterioleak2.json (the json extension is only for upload. It is not a json file and you'll need to change the extension to .yml).

You can recreate the environment with

conda env create --file rasterioleak2.yml -n <name_of_new_environment>

groutr avatar Feb 29 '24 19:02 groutr

Hello Ryan,

I recreated your environment but I still observe the leak 😕

still_leaking

Your memory profile looks like the "inverted scenario" in which the leak is not observed, though. Just to be sure, this script will always produce the leak for me :

from pathlib import Path
import rasterio
import torch
import torchvision.transforms as T


rasterio.show_versions()


def test_memory_leak(img_path, msk_path):
    img_paths = sorted(list(Path(img_path).glob("*/*/*/*.tif")))
    msk_paths = sorted(list(Path(msk_path).glob("*/*/*/*.tif")))


    nb_epochs = 10
    batch_size = 32
    nb_batches = int(len(img_paths)/batch_size)
    print("nb_batches = ", nb_batches)


    for epoch in range(nb_epochs):
        print("epoch = ", epoch+1)
        for batch in range(nb_batches):
            img_list = []
            msk_list = []
            for idx in range(batch*batch_size, (batch+1)*batch_size):
                # DATA LOADER
                sample = dataset_rio(img_paths, msk_paths, idx)
                
                # TRANSFORM
                sample = transform_sample_tensor(sample)

                # DO SOMETHING WITH THE SAMPLE
                img_list.append(sample["img"])
                msk_list.append(sample["msk"])

 
    print("Done")
    return


# DATA LOADERS
def dataset_rio(img_paths, msk_paths, idx):
    with rasterio.open(str(img_paths[idx]), "r") as img_ds:
        img = img_ds.read()
    with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
        msk = msk_ds.read()

    sample = {"img": img, "msk" : msk}

    return sample


# TRANSFORMATIONS
def transform_sample_tensor(sample):

    img = sample["img"]
    img = T.functional.to_tensor(img)

    msk = sample["msk"]
    msk = torch.from_numpy(msk)

    sample = {"img" : img, "msk" : msk}

    return sample


if __name__ == "__main__":
    # >>  mprof run python test_memory_reproduce_leak.py 
    # >>  mprof plot

    img_path = "/home/ftamborra/Data/img/flair_samples/flair_1_toy_dataset/flair_1_toy_aerial_train"
    msk_path = "/home/ftamborra/Data/img/flair_samples/flair_1_toy_dataset/flair_1_toy_labels_train"
    test_memory_leak(img_path, msk_path)

My rasterio version is identical to yours :

rasterio info:
  rasterio: 1.3.9
      GDAL: 3.7.3
      PROJ: 9.3.0
      GEOS: 3.12.0
 PROJ DATA: /home/ftamborra/miniconda3/envs/rasterioleak/share/proj
 GDAL DATA: /home/ftamborra/miniconda3/envs/rasterioleak/share/gdal

System:
    python: 3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]
executable: /home/ftamborra/miniconda3/envs/rasterioleak/bin/python
   machine: Linux-6.5.0-21-generic-x86_64-with-glibc2.35

Python deps:
    affine: 2.4.0
     attrs: 23.1.0
   certifi: 2024.02.02
     click: 8.1.7
     cligj: 0.7.2
    cython: None
     numpy: 1.26.0
    snuggs: 1.4.7
click-plugins: None
setuptools: 69.1.1

And also the environment : my_rasterioleak.json

The only difference I notice is that I have a slightly older Linux kernel (I'm on Ubuntu 22.04) and I use miniconda while you seem to use miniforge (?), but still...the packages are exactly the same...

fratambot avatar Mar 01 '24 09:03 fratambot

In case this helps, here's what goes on in read().

If you don't pass an array in (the out parameter), a new empty or zeroed out array is created in Python using numpy.array(). In Cython/C, we call PyArray_DATA() to get a pointer to the underlying buffer of that array.

https://github.com/rasterio/rasterio/blob/5db194fe99c566479858f71baa0b7d22e560dd22/rasterio/_io.pyx#L86

The pointer is passed to GDALRasterIO(), which fills it with pixel values, and then the array is returned to the caller of read(). At that point, whether the array's allocated memory is released or not is up to the calling application.

Now, there is another pool of memory in play with GDAL: its raster block cache, which defaults to 5% of your computer's RAM. Block keys are partially formed from dataset names, so as you read from more files, the cache will grow in size over time, approaching the maximum configured value.

Here's a rough diagram of the I/O:

array << read(out=array) << array <- GDAL block reader << block cache << GDAL driver implementation << dataset

I don't use Tensorflow, so I don't know if it has its own caching system. I wouldn't be surprised if it did.

sgillies avatar Mar 01 '24 21:03 sgillies

@fratambot I recreated your environment and used your latest version of the script and I still cannot replicate your memory usage on my machine. I was hoping it could be a related to using an older/newer version of pytorch/torchvision.

If gdal does not result in a memory usage issue, verify that the rasterio is providing the exact same array (shape, dtype, and flags) that gdal is providing to pytorch. If rasterio and gdal loading results in the exact same array, it's weird to me that rasterio would have provoke such strange memory problems while gdal does not.

sample = dataset_rio(img_paths, msk_paths, idx)
samplegdal = dataset_gdal(img_paths, msk_paths, idx)
assert sample.dtype == samplegdal.dtype
assert sample.shape == samplegdal.shape
assert sample.flags == samplegdal.flags
assert np.allclose(sample, samplegdal)

groutr avatar Mar 02 '24 22:03 groutr

Hello @groutr ,

It is indeed really weird that we don't have the same behavior since the environment is the same (even the packages build are specified in the yaml !). Does it depend on the different Linux kernel then ? 🤷

Rasterio and gdal cannot provide the exact same array as far as I understand. Rasterio shape is different from gdal by design and if I transpose it, then the flags, identical before the transposition, differ, since the OWNDATA field switches from False to True.

If I ignore this difference and perform the following transformation in order to have numpy arrays of shape identical to gdal's :

def dataset_rio(img_paths, msk_paths, idx):
    with rasterio.open(str(img_paths[idx]), "r") as img_ds:
        img = np.transpose(img_ds.read(), (1,2,0))

    with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
        msk = np.squeeze(msk_ds.read())

    sample = {"img": img, "msk" : msk}

    return sample

I still observe the memory leak: rio_leak_with_transpose_and_squeeze

However, following @sgillies explanation, if I pass an empty array to the rasterio .read() method :

def dataset_rio(img_paths, msk_paths, idx):
    img_raw = np.empty(shape=(5,512,512))
    msk_raw = np.empty(shape=(1,512,512))
    with rasterio.open(str(img_paths[idx]), "r") as img_ds:
        img = np.transpose(img_ds.read(out=img_raw), (1,2,0))
    with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
        msk = np.squeeze(msk_ds.read(out=msk_raw))

    sample = {"img": img, "msk" : msk}

    return sample

There's no leak :

rio_pass_empty

As far as I'm concerned, this solves the problem and the issue can be closed ! 🥳

But maybe it would be nice if rasterio could perform internally and implicitly a similar operation in terms of memory allocation without forcing the user to know in advance the shape of the numpy arrays (which might not be known). Or, if it's torchvision and its cache to blame, at least specify in the doc that's good practice to pass an empty numpy array of the appropriate shape to the out parameter of the .read() method in order to not incur in possible memory leaks when using torch ecosystem transformations.

In any case, infinite thanks for your precious help and time !! 🙏

fratambot avatar Mar 04 '24 10:03 fratambot

if I pass an empty array to the rasterio .read() method :

this smells a bit like a reference counting issue. Maybe related to https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#python-objects-as-parameters-and-return-values and calls like https://github.com/rasterio/rasterio/blob/5db194fe99c566479858f71baa0b7d22e560dd22/rasterio/_io.pyx#L653, where there would be something wrong going with the refcount when the array is created at https://github.com/rasterio/rasterio/blob/5db194fe99c566479858f71baa0b7d22e560dd22/rasterio/_io.pyx#L626 ?

rouault avatar Mar 05 '24 18:03 rouault

@fratambot @groutr @rouault the program below sort of reproduces the situation. Note that it accumulates arrays in a list exactly as the code above does.

import numpy
import rasterio

array_list = []

for i in range(1000):
    with rasterio.open("/Users/sean.gillies/projects/rasterio/tests/data/RGB2.byte.tif") as dataset:
        data = numpy.transpose(dataset.read(), (1, 2, 0))
        array_list.append(data)

Figure_1

But this isn't a leak. It's a consequence of storing image arrays in a list and should be expected. They aren't deallocated until the list itself goes out of scope and gets garbage collected. If I don't append them to the list, I see this

Figure_1

sgillies avatar Mar 05 '24 23:03 sgillies

@sgillies the script I 'm using is not appending indefinitely but only in batches of 32 images and 32 masks and then the lists are emptied and the garbage collector should kick in and free the memory, right ?

In the meanwhile I recreated an environment without gdal in my real project and I actually noticed that there's no memory leak even without passing the empty numpy arrays !!

It's definitely an issue related to a conflict between packages (and Linux kernel / Cpython version ? Otherwise I cannot explain why I observe the leak and @groutr doesn't, even while using the exact same environment).

To test it, I conda remove gdal from the environment I was using obtaining this one, nogdal_environment.json, but I still observed the leak.

I then recreated from scratch a new environment running the following commands (therefore releasing the requirements on the package versions) :

$ conda install pytorch torchvision cpuonly -c pytorch
$ pip install rasterio
$ pip install -U memory_profiler
$ conda install matplotlib

And with this new environment, new_environment.json, no leak is observed :

new_env

I cannot say exactly how and when the leak arise but it's for sure related to a package (and/or Linux kernel / Cpython) conflict. On this topic, it's unfortunate that rasterio can be installed only using pip. I prefer using conda or mamba exactly for this reason : without a solver, I often find myself pip-installing a package and pray there are no conflicts 😅

Will rasterio be on conda someday ? That should solve (pun intended) every weird behavior looking like a bug but actually entirely due to some packages conflict...

fratambot avatar Mar 06 '24 14:03 fratambot

@fratambot a for loop doesn't create a new scope. Inside a batch you do reassign the name to a new list, but the previous value will stick around, anonymously, until the function returns.

That said, there certainly could be something else going on. I could be using Cython or gcc/clang wrong. It wouldn't be the first time 😅

Rasterio is in the conda-forge channel https://anaconda.org/conda-forge/rasterio. I recommend getting everything you can from it because it's more current than the default conda channel. mamba install -c conda-forge gdal rasterio should get you a consistent set, all sharing the same library versions.

sgillies avatar Mar 06 '24 15:03 sgillies

Rasterio is in the conda-forge channel https://anaconda.org/conda-forge/rasterio. I recommend getting everything you can from it because it's more current than the default conda channel. mamba install -c conda-forge gdal rasterio should get you a consistent set, all sharing the same library versions.

Wow ! I should have searched better... my bad 😑

fratambot avatar Mar 06 '24 16:03 fratambot

@fratambot it's interesting that you installed rasterio via pip. Does that also include building and installing the dependencies of rasterio via pip? When you installed rasterio, was that using a binary wheel or did you build from source?

@sgillies I think your statement about when a list is garbage collected is inaccurate. The list should be freed immediately when the reference count reaches 0. In @fratambot's code above, the list should be garbage collected every new batch except the last.

groutr avatar Mar 07 '24 23:03 groutr

@groutr I used pip install rasterio so I guess I was using the binary wheel and the dependencies, such as affine were also installed via pip. At least that seems to be the case by looking at the environment files I attached in the previous messages

fratambot avatar Mar 08 '24 10:03 fratambot

@groutr you're right. I've been sick this week and must have been delirious there 😆

sgillies avatar Mar 08 '24 23:03 sgillies

Hello again !

I was going back to my original project on our remote machines hoping to have solved any problem but the memory leak with rasterio persists 😭

I updated conda in order to use libmamba solver. At first I thought that the leak was related to another conflict due to the fact that I had to install pytorch-cuda in order to use the GPUs but actually I can reproduce the leak even using the just the cpu. Here's the script (a new version allowing to easily switch between rasterio and gdal):

import numpy as np
import rasterio
import torch
import torchvision.transforms as T

from osgeo import gdal
from pathlib import Path


rasterio.show_versions()


def test_memory_leak(img_path, msk_path):
    img_paths = sorted(list(Path(img_path).glob("*/*/*/*.tif")))
    msk_paths = sorted(list(Path(msk_path).glob("*/*/*/*.tif")))


    nb_epochs = 10
    batch_size = 32
    nb_batches = int(len(img_paths)/batch_size)
    print("nb_batches = ", nb_batches)


    for epoch in range(nb_epochs):
        print("epoch = ", epoch+1)
        for batch in range(nb_batches):
            img_list = []
            msk_list = []
            for idx in range(batch*batch_size, (batch+1)*batch_size):
                # DATA LOADER
                # use="gdal" or use="rio"
                sample = dataset(img_paths, msk_paths, idx, use="rio")
                
                # TRANSFORM
                sample = transform_sample_tensor(sample)

                # DO SOMETHING WITH THE SAMPLE
                img_list.append(sample["img"])
                msk_list.append(sample["msk"])

 
    print("Done")
    return


# DATALOADER
def dataset(img_paths, msk_paths, idx, use):
    if use == "rio":
        with rasterio.open(str(img_paths[idx]), "r") as img_ds:
            img = np.transpose(img_ds.read(), (1,2,0))
        with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
            msk = np.squeeze(msk_ds.read())

    if use == "gdal":
        img_gdal = gdal.Open(str(img_paths[idx]))
        img = img_gdal.ReadAsArray(interleave="pixel")
        
        msk_gdal = gdal.Open(str(msk_paths[idx]))
        msk = msk_gdal.ReadAsArray(interleave="pixel")
    


    sample = {"img": img, "msk" : msk}

    return sample


# TRANSFORMATIONS
def transform_sample_tensor(sample):

    img = sample["img"]
    img = T.functional.to_tensor(img)

    msk = sample["msk"]
    msk = torch.from_numpy(msk)

    sample = {"img" : img, "msk" : msk}

    return sample


if __name__ == "__main__":
    # >>  mprof run python test_memory_leak.py 
    # >>  mprof plot

    img_path = "path_to/flair_1_toy_dataset/flair_1_toy_aerial_train"
    msk_path = "path_to/flair_1_toy_dataset/flair_1_toy_labels_train"
    test_memory_leak(img_path, msk_path)

Here's the environment without the builds : cpu_minimal.json and the one with the builds : cpu_minimal_builds.json

Use : conda env create --file=cpu_minimal.yml to recreate it.

When I pass use="rio to the function dataset() I observe the leak : cpu_rio

When I pass use="gdal", no leak is observed : cpu_gdal

Here's the output of my rasterio.show_versions() :

rasterio info:
  rasterio: 1.3.9
      GDAL: 3.8.4
      PROJ: 9.3.1
      GEOS: 3.12.1
 PROJ DATA: /opt/miniconda3/envs/cpu-minimal/share/proj
 GDAL DATA: /opt/miniconda3/envs/cpu-minimal/share/gdal

System:
    python: 3.9.18 (main, Sep 11 2023, 13:41:44)  [GCC 11.2.0]
executable: /opt/miniconda3/envs/cpu-minimal/bin/python
   machine: Linux-5.15.0-69-generic-x86_64-with-glibc2.31

Python deps:
    affine: 2.3.0
     attrs: 23.1.0
   certifi: 2024.02.02
     click: 8.1.7
     cligj: 0.7.2
    cython: None
     numpy: 1.26.4
    snuggs: 1.4.7
click-plugins: None
setuptools: 68.2.2

I already tried different version of python and different combination of pytorch and torchvision but it doesn't seem to matter.

Please tell me that you can reproduce the leak, at least ! This issue is driving me a little bit crazy...

fratambot avatar Mar 12 '24 10:03 fratambot

Perhaps try adding sys.getrefcount(img) and sys.getrefcount(mask) to compare if there are differences between the rio and gdal paths? In a non-leaking scenario, I would anticipate the counter to be just 2 immediately after the img = np.transpose(img_ds.read(), (1,2,0)) or img = img_gdal.ReadAsArray(interleave="pixel") calls (the reference of the img variable, and the one acquired temporarily by sys.getrefcount())

>>> from osgeo import gdal
>>> import sys
>>> ds= gdal.Open('byte.tif')
>>> img = ds.ReadAsArray()
>>> print(sys.getrefcount(img))
2

rouault avatar Mar 12 '24 10:03 rouault

Perhaps try adding sys.getrefcount(img) and sys.getrefcount(mask) to compare if there are differences between the rio and gdal paths? In a non-leaking scenario, I would anticipate the counter to be just 2 immediately after the img = np.transpose(img_ds.read(), (1,2,0)) or img = img_gdal.ReadAsArray(interleave="pixel") calls (the reference of the img variable, and the one acquired temporarily by sys.getrefcount())

>>> from osgeo import gdal
>>> import sys
>>> ds= gdal.Open('byte.tif')
>>> img = ds.ReadAsArray()
>>> print(sys.getrefcount(img))
2

I obtain a refcount 2 for the img (or the msk) for both rasterio and gdal calls 😕

fratambot avatar Mar 12 '24 13:03 fratambot

I'm still unable to reproduce this behavior on my machine. @fratambot Have you worked with objgraph? https://mg.pov.lt/objgraph/ It could give some insight.

groutr avatar Mar 12 '24 20:03 groutr