Memory leak with .read()
Discussed in https://github.com/rasterio/rasterio/discussions/3033
Originally posted by fratambot, February 23, 2024: I really like rasterio and I would like to use it instead of the GDAL Python bindings in our deep learning projects, but we noticed a memory leak and I have no idea how to avoid it.
When loading our .tif files into numpy arrays in the __getitem__ of our torch.utils.data.Dataset class, if we use:
from osgeo import gdal
img_gdal = gdal.Open(str(self.imgs[idx]))
img = img_gdal.ReadAsArray(interleave="pixel")
we have a constant RAM usage. On the other hand, if we use:
import numpy as np
import rasterio

with rasterio.open(str(self.imgs[idx]), "r") as img_ds:
    img = np.transpose(img_ds.read(), (1, 2, 0))
the RAM usage keeps growing until the end of the process, which consists of loading batches of training images.
Any idea why this is happening? Many thanks in advance!
(I'm using rasterio 1.3.9 from conda-forge)
Can you provide the following info?
- What is the format of the file? Are you able to publicly share one of the files?
- Do you still see the memory leak if you remove the np.transpose call?
import rasterio
with rasterio.open(str(self.imgs[idx]), "r") as img_ds:
    img = img_ds.read()
@fratambot what are you using to measure memory usage? Do you see entire images' worth of memory not being deallocated, or smaller amounts? Can you provide a small program and graphs?
I will delete the duplicate #3033.
Hello Ryan and Sean,
The data comes from a public dataset, https://ignf.github.io/FLAIR/#FLAIR1, and is in .tif format; every image is ~1 MB.
I was seeing the memory leak through the app we use to monitor our servers (netdata cloud).
I will work on a minimal reproducible example, properly measure the memory consumption, and report back to you!
@fratambot another useful piece of info when you report back would be the output of
import rasterio
rasterio.show_versions()
I used the toy dataset and opened each file 500 times using my snippet above (https://github.com/rasterio/rasterio/issues/3034#issuecomment-1961780372).
I measured memory usage with memory_profiler and came up with this plot.
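For anyone who wants to reproduce such a plot, the stock memory_profiler workflow should be all that's needed (the script name is a placeholder):

$ pip install memory_profiler
$ mprof run python your_script.py
$ mprof plot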
Hello Ryan, here's the output of rasterio.show_versions():
rasterio info:
rasterio: 1.3.9
GDAL: 3.7.3
PROJ: 9.3.0
GEOS: 3.12.0
PROJ DATA: /opt/miniconda3/envs/flair-new/share/proj
GDAL DATA: /opt/miniconda3/envs/flair-new/share/gdal
System:
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
executable: /opt/miniconda3/envs/flair-new/bin/python
machine: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
Python deps:
affine: 2.4.0
attrs: 23.1.0
certifi: 2023.11.17
click: 8.1.7
cligj: 0.7.2
cython: None
numpy: 1.26.0
snuggs: 1.4.7
click-plugins: None
setuptools: 68.0.0
I wrote a small script in which I read each of the 200 images of the toy dataset 64 times, treating each group of 64 reads as a batch. I optionally perform some transformations on the numpy array before moving to the next batch. I compared GDAL vs rasterio in both scenarios and I couldn't see any memory leak. I used tracemalloc and stored the current and peak memory usage while looping over the batches.
Here's the script:
import tracemalloc
import pickle
from pathlib import Path

import numpy as np
import rasterio
from osgeo import gdal


def test_memory_leak(data_path):
    data = sorted(list(Path(data_path).glob("*/*/*/*.tif")))
    print(len(data))
    memory_usage = {
        "current": [],
        "peak": []
    }
    tracemalloc.start()
    for i in range(len(data)):
        batch = []
        current, peak = tracemalloc.get_traced_memory()
        print("current / peak = ", current / 1.e6, peak / 1.e6)
        memory_usage["current"].append(current / 1.e6)
        memory_usage["peak"].append(peak / 1.e6)
        for j in range(64):
            img = load_img(data, i)
            #img = transform(img)
            batch.append(img)
    current, peak = tracemalloc.get_traced_memory()
    print("final current / peak = ", current / 1.e6, peak / 1.e6)
    tracemalloc.stop()
    with open("notebooks/gdal.pkl", "wb") as f:
        pickle.dump(memory_usage, f)
    print("Done")
    return


def load_img(data, idx):
    # with rasterio.open(str(data[idx]), "r") as img_ds:
    #     img = img_ds.read()
    img_gdal = gdal.Open(str(data[idx]))
    img = img_gdal.ReadAsArray(interleave="pixel")
    return img


def transform(img):
    img = img[:, :, :2]
    img = np.transpose(img, (2, 0, 1))
    return img


if __name__ == "__main__":
    data_path = "path_to_my_dataset_folder"
    test_memory_leak(data_path)
Here are the plots:
I think the issue might be related to pytorch. I'll perform more tests using the pytorch Dataloader and the T.Compose transformations...
Here's a screenshot of the app we use to monitor the RAM usage of our servers just to show you I'm not crazy ;)
The only difference is that in the first run I load the .tif with gdal and in the second run with rasterio, and it seems that something is not freed. But, as I was saying, a lot of other things happen after that, such as transformations of numpy arrays into tensors which are moved to the cuda device and further transformed.
It's still not clear to me what the issue really is... I'll keep investigating !
@fratambot thank you, we take this seriously! We've fixed one leak since 1.3.9, but it's not related: https://github.com/rasterio/rasterio/blob/441d8c85ba1b7170a0b19313a4d654c5fab6b77a/CHANGES.txt#L35.
I was inspecting the rasterio environment I used to produce my plot and realized it was an older version of rasterio (1.3.6, I think). I updated to 1.3.9 and the memory usage with rasterio is basically a flat line now (though it does rise 1-2 MB over the duration of the test).
Can you clarify something about your screenshot of RAM usage on the servers: is that running your full processing pipeline, or just the loading script in your comment?
Hello everybody,
First of all, I realized that tracemalloc is not the best tool to measure the overall memory usage of a script, so I moved to memory-profiler.
Secondly, and most importantly, I managed to reproduce the memory leak with a simple script, and it's a weird one!
TL;DR: The leak happens when we load both the images and the masks and we transform the images using .to_tensor() from torchvision and the masks using from_numpy() from pytorch.
The bad news is that's typically what we do when we want to train a model for deep learning 😓
Here's the minimal script I used:
from pathlib import Path

import rasterio
from osgeo import gdal
import torch
import torchvision.transforms as T

#rasterio.show_versions()


def test_memory_leak(img_path, msk_path):
    img_paths = None
    msk_paths = None
    img_paths = sorted(list(Path(img_path).glob("*/*/*/*.tif")))
    msk_paths = sorted(list(Path(msk_path).glob("*/*/*/*.tif")))
    if img_paths:
        data_size = len(img_paths)
    else:
        data_size = len(msk_paths)
    nb_epochs = 10
    batch_size = 32
    nb_batches = int(data_size / batch_size)
    print("nb_batches = ", nb_batches)
    for epoch in range(nb_epochs):
        print("epoch = ", epoch + 1)
        for batch in range(nb_batches):
            img_list = []
            msk_list = []
            for idx in range(batch * batch_size, (batch + 1) * batch_size):
                # DATA LOADER
                sample = dataset_rio(img_paths, msk_paths, idx)
                #sample = dataset_gdal(img_paths, msk_paths, idx)
                # TRANSFORM
                sample = transform_sample_tensor(sample)
                #sample = transform_sample_numpy(sample)  # IT NEVER LEAKS
                # DO SOMETHING WITH THE SAMPLE
                if "img" in sample:
                    img_list.append(sample["img"])
                if "msk" in sample:
                    msk_list.append(sample["msk"])
    print("Done")
    return


# DATA LOADERS
def dataset_rio(img_paths, msk_paths, idx):
    if img_paths:
        with rasterio.open(str(img_paths[idx]), "r") as img_ds:
            img = img_ds.read()
    if msk_paths:
        with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
            msk = msk_ds.read()
    if img_paths and msk_paths:
        sample = {"img": img, "msk": msk}
    elif not img_paths:
        sample = {"msk": msk}
    elif not msk_paths:
        sample = {"img": img}
    else:
        raise ValueError("you need to pass at least 1 path")
    return sample


def dataset_gdal(img_paths, msk_paths, idx):
    img_gdal = gdal.Open(str(img_paths[idx]))
    img = img_gdal.ReadAsArray(interleave="pixel")
    msk_gdal = gdal.Open(str(msk_paths[idx]))
    msk = msk_gdal.ReadAsArray(interleave="pixel")
    sample = {"img": img, "msk": msk}
    return sample


# TRANSFORMATIONS
def transform_sample_tensor(sample):
    if "img" in sample:
        img = sample["img"]
        img = T.functional.to_tensor(img)
        #img = torch.from_numpy(img)
    if "msk" in sample:
        msk = sample["msk"]
        #msk = T.functional.to_tensor(msk)
        msk = torch.from_numpy(msk)
    if ("img" in sample) and ("msk" in sample):
        sample = {"img": img, "msk": msk}
    elif "img" not in sample:
        print("only mask")
        sample = {"msk": msk}
    elif "msk" not in sample:
        print("only img")
        sample = {"img": img}
    else:
        raise ValueError("I cannot find any sample to transform")
    return sample


def transform_sample_numpy(sample):
    img, msk = sample["img"], sample["msk"]
    img = img[:, :, :3]
    msk[msk > 13] = 13
    return {"img": img, "msk": msk}


if __name__ == "__main__":
    # >> mprof run python test_memory_leak.py
    # >> mprof plot
    img_path = "path_to/flair_1_toy_aerial_train"
    msk_path = "path_to/flair_1_toy_labels_train"
    test_memory_leak(img_path, msk_path)
(all the if-elif-else logic is just to use the same function with both the images and masks datasets, or only one of them)
Here's the memory profile when the leak happens (notice the order of magnitude of the y-axis):
If the sample is loaded using dataset_gdal(), there's no leak:
Coming back to dataset_rio(): if we do nothing with the sample, i.e. everything below the comment # DO SOMETHING WITH THE SAMPLE is commented out, the leak disappears (but IRL this never happens):
Weirdly enough, if we transform the image using .from_numpy() and the mask using .to_tensor(), there's no leak (but IRL this never happens):
The data type is the same for both the images and the masks (uint8) and the only difference is in the number of channels. Here's a profile of the rasterio dataset for the first image:
driver : GTiff
dtype : uint8
nodata : None
width : 512
height : 512
count : 5
crs : EPSG:2154
transform : | 0.20, 0.00, 922524.80|
| 0.00,-0.20, 6301512.00|
| 0.00, 0.00, 1.00|
Band 1 [uint8] [color gray] : Statistics(min=32.0, max=219.0, mean=82.958557128906, std=43.971926219213)
Band 2 [uint8] [color undefined] : Statistics(min=46.0, max=224.0, mean=88.913074493408, std=33.015264288545)
Band 3 [uint8] [color undefined] : Statistics(min=46.0, max=218.0, mean=82.606864929199, std=24.936723073525)
Band 4 [uint8] [color undefined] : Statistics(min=28.0, max=224.0, mean=92.435501098633, std=26.890214699652)
Band 5 [uint8] [color undefined] : Statistics(min=0.0, max=100.0, mean=26.231979370117, std=28.680426421257)
And for one of the first masks:
driver : GTiff
dtype : uint8
nodata : None
width : 512
height : 512
count : 1
crs : LOCAL_CS["RGF93 v1 / Lambert-93",UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH]]
transform : | 0.20, 0.00, 922524.80|
| 0.00,-0.20, 6301512.00|
| 0.00, 0.00, 1.00|
Band 1 [uint8] [color gray] : Statistics(min=1.0, max=10.0, mean=6.9922790527344, std=1.9747478592662)
If we transform both the images and the masks using .to_tensor() or .from_numpy(), no leak is observed.
If we transform only the numpy array and never use tensors, no leak is observed.
If we use only images or only masks, no leak is observed.
Unfortunately, none of these scenarios happen IRL.
As I was saying at the beginning, for training a model we always want to use .to_tensor() for images because it automagically reshapes any numpy array of shape [h,w,c] into [c,h,w] and normalizes the pixel values between 0.0 and 1.0, which is what a neural network expects. On the other hand, we never want to normalize the pixel values of the masks, because the integer value represents a class (in semantic segmentation), so we just want to transform them to a tensor using .from_numpy(), which is exactly the scenario in which the leak manifests itself 💔.
The issue is most likely related to the fact that the tensor returned by .from_numpy() shares the same memory as the numpy array. However, the fact that it manifests only when we mix the methods, and that the leak disappears if we use .from_numpy() on the image instead of the mask and .to_tensor() on the mask instead of the image, throws me off, as if it also depends on the number of raster bands 😕
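For what it's worth, the memory-sharing difference between the two transforms can be seen directly with a dummy array (a minimal sketch, not the FLAIR data; the shapes are arbitrary):

import numpy as np
import torch
import torchvision.transforms as T

arr = np.zeros((4, 4, 3), dtype=np.uint8)   # tiny stand-in for an [h,w,c] image

t_copy = T.functional.to_tensor(arr)  # copies, permutes to [c,h,w], scales uint8 to float32 in [0.0, 1.0]
t_view = torch.from_numpy(arr)        # zero-copy: shares the numpy buffer and keeps it alive

arr[0, 0, 0] = 255
print(t_copy[0, 0, 0].item())  # 0.0 -- the copy is unaffected
print(t_view[0, 0, 0].item())  # 255 -- the view sees the change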
Hopefully you will find the reason and fix it, because I would really like to replace every Python code line using gdal (whose documentation is almost nonexistent) with rasterio in all of our projects!
@fratambot One thing to note is that the shapes of the arrays returned from gdal and rasterio are different. gdal returns an img array of shape (512, 512, 5), but rasterio returns (5, 512, 512). Depending on what torchvision is expecting, that might lead to some weird processing.
Yes, I'm aware! That's why in the first script I was performing the np.transpose() on reading, in order to respect the (internal?) convention that numpy arrays have shape [h,w,c] so that you can plot them easily. I removed it in the last script in order to not introduce any random effect; it shouldn't be related to the leak, but I can take a look at it tomorrow!
Actually, that could explain the different behaviour with respect to masks, which have a single band 🤔
@fratambot I ran your code sample with an environment that tried to match versions as best I could. I could not reproduce your issue on my machine.
I sourced all my packages from conda-forge. I used the latest compatible pytorch (2.1.2) and torchvision (0.16.1).
rasterio info:
rasterio: 1.3.9
GDAL: 3.7.3
PROJ: 9.3.0
GEOS: 3.12.0
PROJ DATA: /home/grout/mambaforge/envs/rasterioleak2/share/proj
GDAL DATA: /home/grout/mambaforge/envs/rasterioleak2/share/gdal
System:
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
executable: /home/grout/mambaforge/envs/rasterioleak2/bin/python3.11
machine: Linux-6.7.6-arch1-1-x86_64-with-glibc2.39
Python deps:
affine: 2.4.0
attrs: 23.1.0
certifi: 2024.02.02
click: 8.1.7
cligj: 0.7.2
cython: None
numpy: 1.26.0
snuggs: 1.4.7
click-plugins: None
setuptools: 69.1.1
Could you post the output of conda list for your environment? Perhaps there is another package that is causing a strange interaction.
I'll send it to you tomorrow (here in the EU it's evening), but that's good news!! 🤩 I know having both rasterio and gdal is not a good idea, but I need both of them for testing!
@fratambot I have no idea what your production environment is like, but here is my environment spec if you want to recreate it and test. rasterioleak2.json (the json extension is only for upload. It is not a json file and you'll need to change the extension to .yml).
You can recreate the environment with conda env create --file rasterioleak2.yml -n <name_of_new_environment>.
Hello Ryan,
I recreated your environment but I still observe the leak 😕
Your memory profile looks like the "inverted scenario" in which the leak is not observed, though. Just to be sure, this script will always produce the leak for me:
from pathlib import Path

import rasterio
import torch
import torchvision.transforms as T

rasterio.show_versions()


def test_memory_leak(img_path, msk_path):
    img_paths = sorted(list(Path(img_path).glob("*/*/*/*.tif")))
    msk_paths = sorted(list(Path(msk_path).glob("*/*/*/*.tif")))
    nb_epochs = 10
    batch_size = 32
    nb_batches = int(len(img_paths) / batch_size)
    print("nb_batches = ", nb_batches)
    for epoch in range(nb_epochs):
        print("epoch = ", epoch + 1)
        for batch in range(nb_batches):
            img_list = []
            msk_list = []
            for idx in range(batch * batch_size, (batch + 1) * batch_size):
                # DATA LOADER
                sample = dataset_rio(img_paths, msk_paths, idx)
                # TRANSFORM
                sample = transform_sample_tensor(sample)
                # DO SOMETHING WITH THE SAMPLE
                img_list.append(sample["img"])
                msk_list.append(sample["msk"])
    print("Done")
    return


# DATA LOADERS
def dataset_rio(img_paths, msk_paths, idx):
    with rasterio.open(str(img_paths[idx]), "r") as img_ds:
        img = img_ds.read()
    with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
        msk = msk_ds.read()
    sample = {"img": img, "msk": msk}
    return sample


# TRANSFORMATIONS
def transform_sample_tensor(sample):
    img = sample["img"]
    img = T.functional.to_tensor(img)
    msk = sample["msk"]
    msk = torch.from_numpy(msk)
    sample = {"img": img, "msk": msk}
    return sample


if __name__ == "__main__":
    # >> mprof run python test_memory_reproduce_leak.py
    # >> mprof plot
    img_path = "/home/ftamborra/Data/img/flair_samples/flair_1_toy_dataset/flair_1_toy_aerial_train"
    msk_path = "/home/ftamborra/Data/img/flair_samples/flair_1_toy_dataset/flair_1_toy_labels_train"
    test_memory_leak(img_path, msk_path)
My rasterio version is identical to yours:
rasterio info:
rasterio: 1.3.9
GDAL: 3.7.3
PROJ: 9.3.0
GEOS: 3.12.0
PROJ DATA: /home/ftamborra/miniconda3/envs/rasterioleak/share/proj
GDAL DATA: /home/ftamborra/miniconda3/envs/rasterioleak/share/gdal
System:
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
executable: /home/ftamborra/miniconda3/envs/rasterioleak/bin/python
machine: Linux-6.5.0-21-generic-x86_64-with-glibc2.35
Python deps:
affine: 2.4.0
attrs: 23.1.0
certifi: 2024.02.02
click: 8.1.7
cligj: 0.7.2
cython: None
numpy: 1.26.0
snuggs: 1.4.7
click-plugins: None
setuptools: 69.1.1
And also the environment : my_rasterioleak.json
The only difference I notice is that I have a slightly older Linux kernel (I'm on Ubuntu 22.04) and I use miniconda while you seem to use mambaforge (?), but still... the packages are exactly the same...
In case this helps, here's what goes on in read(). If you don't pass an array in (the out parameter), a new empty or zeroed-out array is created in Python using numpy.array(). In Cython/C, we call PyArray_DATA() to get a pointer to the underlying buffer of that array.
https://github.com/rasterio/rasterio/blob/5db194fe99c566479858f71baa0b7d22e560dd22/rasterio/_io.pyx#L86
The pointer is passed to GDALRasterIO(), which fills it with pixel values, and then the array is returned to the caller of read(). At that point, whether the array's allocated memory is released or not is up to the calling application.
Now, there is another pool of memory in play with GDAL: its raster block cache, which defaults to 5% of your computer's RAM. Block keys are partially formed from dataset names, so as you read from more files, the cache will grow in size over time, approaching the maximum configured value.
Here's a rough diagram of the I/O:
array << read(out=array) << array <- GDAL block reader << block cache << GDAL driver implementation << dataset
I don't use PyTorch, so I don't know if it has its own caching system. I wouldn't be surprised if it did.
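If the block cache is a suspect, its ceiling can be lowered with GDAL's GDAL_CACHEMAX config option, for example through rasterio.Env (a minimal sketch; the 64 MB cap and file name are arbitrary):

import rasterio

# Cap GDAL's raster block cache at 64 MB instead of the default 5% of RAM.
with rasterio.Env(GDAL_CACHEMAX=64):
    with rasterio.open("example.tif") as dataset:
        data = dataset.read()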
@fratambot I recreated your environment and used your latest version of the script, and I still cannot replicate your memory usage on my machine. I was hoping it could be related to using an older/newer version of pytorch/torchvision.
If gdal does not result in a memory usage issue, verify that rasterio is providing the exact same array (shape, dtype, and flags) that gdal is providing to pytorch. If rasterio and gdal loading result in the exact same array, it's weird to me that rasterio would provoke such strange memory problems while gdal does not.
sample = dataset_rio(img_paths, msk_paths, idx)
samplegdal = dataset_gdal(img_paths, msk_paths, idx)
# compare the "img" arrays (and likewise for "msk")
assert sample["img"].dtype == samplegdal["img"].dtype
assert sample["img"].shape == samplegdal["img"].shape
assert sample["img"].flags == samplegdal["img"].flags
assert np.allclose(sample["img"], samplegdal["img"])
Hello @groutr,
It is indeed really weird that we don't have the same behavior, since the environment is the same (even the package builds are specified in the yaml!). Does it depend on the different Linux kernel, then? 🤷
Rasterio and gdal cannot provide the exact same array, as far as I understand. Rasterio's shape is different from gdal's by design, and if I transpose it, then the flags, identical before the transposition, differ, since the OWNDATA flag switches from False to True.
If I ignore this difference and perform the following transformation in order to have numpy arrays of shape identical to gdal's:
def dataset_rio(img_paths, msk_paths, idx):
    with rasterio.open(str(img_paths[idx]), "r") as img_ds:
        img = np.transpose(img_ds.read(), (1, 2, 0))
    with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
        msk = np.squeeze(msk_ds.read())
    sample = {"img": img, "msk": msk}
    return sample
I still observe the memory leak:
However, following @sgillies' explanation, if I pass an empty array to the rasterio .read() method:
def dataset_rio(img_paths, msk_paths, idx):
    img_raw = np.empty(shape=(5, 512, 512))
    msk_raw = np.empty(shape=(1, 512, 512))
    with rasterio.open(str(img_paths[idx]), "r") as img_ds:
        img = np.transpose(img_ds.read(out=img_raw), (1, 2, 0))
    with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
        msk = np.squeeze(msk_ds.read(out=msk_raw))
    sample = {"img": img, "msk": msk}
    return sample
There's no leak:
As far as I'm concerned, this solves the problem and the issue can be closed! 🥳
But maybe it would be nice if rasterio could internally and implicitly perform a similar memory allocation, without forcing the user to know the shape of the numpy arrays in advance (which might not be known).
Or, if torchvision and its cache are to blame, at least note in the docs that it's good practice to pass an empty numpy array of the appropriate shape to the out parameter of the .read() method, in order to avoid possible memory leaks when using torch ecosystem transformations.
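If the shape and dtype aren't known in advance, they can be read from the dataset itself before allocating; a minimal sketch along the lines of the workaround above (load_array is a hypothetical helper, not rasterio API):

import numpy as np
import rasterio

def load_array(path):
    with rasterio.open(path) as ds:
        # Allocate with the dataset's own band count, size, and native dtype,
        # so nothing about the file needs to be known up front.
        out = np.empty((ds.count, ds.height, ds.width), dtype=ds.dtypes[0])
        return ds.read(out=out)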
In any case, infinite thanks for your precious help and time!! 🙏
"if I pass an empty array to the rasterio .read() method"
This smells a bit like a reference counting issue. Maybe related to https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#python-objects-as-parameters-and-return-values and calls like https://github.com/rasterio/rasterio/blob/5db194fe99c566479858f71baa0b7d22e560dd22/rasterio/_io.pyx#L653, where something could be going wrong with the refcount when the array is created at https://github.com/rasterio/rasterio/blob/5db194fe99c566479858f71baa0b7d22e560dd22/rasterio/_io.pyx#L626?
@fratambot @groutr @rouault the program below sort of reproduces the situation. Note that it accumulates arrays in a list exactly as the code above does.
import numpy
import rasterio

array_list = []

for i in range(1000):
    with rasterio.open("/Users/sean.gillies/projects/rasterio/tests/data/RGB2.byte.tif") as dataset:
        data = numpy.transpose(dataset.read(), (1, 2, 0))
        array_list.append(data)
But this isn't a leak. It's a consequence of storing image arrays in a list and should be expected. They aren't deallocated until the list itself goes out of scope and gets garbage collected. If I don't append them to the list, I see this
@sgillies the script I'm using is not appending indefinitely but only in batches of 32 images and 32 masks; then the lists are emptied and the garbage collector should kick in and free the memory, right?
In the meanwhile, I recreated an environment without gdal in my real project and I actually noticed that there's no memory leak even without passing the empty numpy arrays!!
It's definitely an issue related to a conflict between packages (and Linux kernel / CPython version? Otherwise I cannot explain why I observe the leak and @groutr doesn't, even while using the exact same environment).
To test it, I ran conda remove gdal on the environment I was using, obtaining this one, nogdal_environment.json, but I still observed the leak.
I then recreated a new environment from scratch running the following commands (thereby relaxing the package version requirements):
$ conda install pytorch torchvision cpuonly -c pytorch
$ pip install rasterio
$ pip install -U memory_profiler
$ conda install matplotlib
And with this new environment, new_environment.json, no leak is observed :
I cannot say exactly how and when the leak arises, but it's surely related to a package (and/or Linux kernel / CPython) conflict.
On this topic, it's unfortunate that rasterio can only be installed using pip. I prefer using conda or mamba exactly for this reason: without a solver, I often find myself pip-installing a package and praying there are no conflicts 😅
Will rasterio be on conda someday? That should solve (pun intended) every weird behavior that looks like a bug but is actually entirely due to some package conflict...
@fratambot a for loop doesn't create a new scope. Inside a batch you do reassign the name to a new list, but the previous value will stick around, anonymously, until the function returns.
That said, there certainly could be something else going on. I could be using Cython or gcc/clang wrong. It wouldn't be the first time 😅
Rasterio is in the conda-forge channel https://anaconda.org/conda-forge/rasterio. I recommend getting everything you can from it because it's more current than the default conda channel. mamba install -c conda-forge gdal rasterio should get you a consistent set, all sharing the same library versions.
Wow! I should have searched better... my bad 😑
@fratambot it's interesting that you installed rasterio via pip. Does that also include building and installing the dependencies of rasterio via pip? When you installed rasterio, was that using a binary wheel or did you build from source?
@sgillies I think your statement about when a list is garbage collected is inaccurate. The list should be freed immediately when the reference count reaches 0. In @fratambot's code above, the list should be garbage collected every new batch except the last.
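A quick way to check this (a sketch; numpy arrays, unlike plain lists, can be weak-referenced):

import weakref
import numpy as np

batch = [np.zeros((512, 512, 5), dtype=np.uint8)]
ref = weakref.ref(batch[0])  # observe the array without keeping it alive
batch = []                   # rebinding drops the old list and its contents
print(ref() is None)         # True: CPython frees the array immediately via refcounting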
@groutr I used pip install rasterio, so I guess I was using the binary wheel, and the dependencies, such as affine, were also installed via pip. At least, that seems to be the case from the environment files I attached in the previous messages.
@groutr you're right. I've been sick this week and must have been delirious there 😆
Hello again!
I was going back to my original project on our remote machines, hoping to have solved the problem, but the memory leak with rasterio persists 😭
I updated conda in order to use the libmamba solver. At first I thought that the leak was related to another conflict due to the fact that I had to install pytorch-cuda in order to use the GPUs, but actually I can reproduce the leak even using just the CPU.
Here's the script (a new version that makes it easy to switch between rasterio and gdal):
import numpy as np
import rasterio
import torch
import torchvision.transforms as T
from osgeo import gdal
from pathlib import Path

rasterio.show_versions()


def test_memory_leak(img_path, msk_path):
    img_paths = sorted(list(Path(img_path).glob("*/*/*/*.tif")))
    msk_paths = sorted(list(Path(msk_path).glob("*/*/*/*.tif")))
    nb_epochs = 10
    batch_size = 32
    nb_batches = int(len(img_paths) / batch_size)
    print("nb_batches = ", nb_batches)
    for epoch in range(nb_epochs):
        print("epoch = ", epoch + 1)
        for batch in range(nb_batches):
            img_list = []
            msk_list = []
            for idx in range(batch * batch_size, (batch + 1) * batch_size):
                # DATA LOADER
                # use="gdal" or use="rio"
                sample = dataset(img_paths, msk_paths, idx, use="rio")
                # TRANSFORM
                sample = transform_sample_tensor(sample)
                # DO SOMETHING WITH THE SAMPLE
                img_list.append(sample["img"])
                msk_list.append(sample["msk"])
    print("Done")
    return


# DATALOADER
def dataset(img_paths, msk_paths, idx, use):
    if use == "rio":
        with rasterio.open(str(img_paths[idx]), "r") as img_ds:
            img = np.transpose(img_ds.read(), (1, 2, 0))
        with rasterio.open(str(msk_paths[idx]), "r") as msk_ds:
            msk = np.squeeze(msk_ds.read())
    if use == "gdal":
        img_gdal = gdal.Open(str(img_paths[idx]))
        img = img_gdal.ReadAsArray(interleave="pixel")
        msk_gdal = gdal.Open(str(msk_paths[idx]))
        msk = msk_gdal.ReadAsArray(interleave="pixel")
    sample = {"img": img, "msk": msk}
    return sample


# TRANSFORMATIONS
def transform_sample_tensor(sample):
    img = sample["img"]
    img = T.functional.to_tensor(img)
    msk = sample["msk"]
    msk = torch.from_numpy(msk)
    sample = {"img": img, "msk": msk}
    return sample


if __name__ == "__main__":
    # >> mprof run python test_memory_leak.py
    # >> mprof plot
    img_path = "path_to/flair_1_toy_dataset/flair_1_toy_aerial_train"
    msk_path = "path_to/flair_1_toy_dataset/flair_1_toy_labels_train"
    test_memory_leak(img_path, msk_path)
Here's the environment without the builds, cpu_minimal.json, and the one with the builds, cpu_minimal_builds.json. Use conda env create --file=cpu_minimal.yml to recreate it.
When I pass use="rio" to the function dataset(), I observe the leak:
When I pass use="gdal", no leak is observed:
Here's the output of my rasterio.show_versions():
rasterio info:
rasterio: 1.3.9
GDAL: 3.8.4
PROJ: 9.3.1
GEOS: 3.12.1
PROJ DATA: /opt/miniconda3/envs/cpu-minimal/share/proj
GDAL DATA: /opt/miniconda3/envs/cpu-minimal/share/gdal
System:
python: 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0]
executable: /opt/miniconda3/envs/cpu-minimal/bin/python
machine: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
Python deps:
affine: 2.3.0
attrs: 23.1.0
certifi: 2024.02.02
click: 8.1.7
cligj: 0.7.2
cython: None
numpy: 1.26.4
snuggs: 1.4.7
click-plugins: None
setuptools: 68.2.2
I already tried different versions of python and different combinations of pytorch and torchvision, but it doesn't seem to matter.
Please tell me that you can reproduce the leak, at least! This issue is driving me a little bit crazy...
Perhaps try adding sys.getrefcount(img) and sys.getrefcount(msk) to compare whether there are differences between the rio and gdal paths? In a non-leaking scenario, I would anticipate the count to be just 2 immediately after the img = np.transpose(img_ds.read(), (1,2,0)) or img = img_gdal.ReadAsArray(interleave="pixel") calls (the reference held by the img variable, and the one acquired temporarily by sys.getrefcount()):
>>> from osgeo import gdal
>>> import sys
>>> ds= gdal.Open('byte.tif')
>>> img = ds.ReadAsArray()
>>> print(sys.getrefcount(img))
2
I obtain a refcount of 2 for the img (and the msk) for both the rasterio and gdal calls 😕
I'm still unable to reproduce this behavior on my machine. @fratambot Have you worked with objgraph? https://mg.pov.lt/objgraph/ It could give some insight.
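For example, something like this around one batch of your loading loop (a sketch; the bytearrays are a stand-in for a real batch):

import objgraph

objgraph.show_growth()  # establish a baseline of per-type object counts
batch = [bytearray(1000) for _ in range(100)]  # stand-in for loading one batch
objgraph.show_growth(limit=5)  # report the types whose counts grew since the baseline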