OSError: [Errno 28] No space left on device
General
- Operating System: Docker (python:3.12-slim)
- Python version: 3.12.5
- Pandas version: 2.2.2
- Pandarallel version: 1.6.5
Acknowledgement
- [x] My issue is NOT present when using `pandas` alone (without `pandarallel`)
Bug description
Observed behavior
When I execute the program, I get "OSError: [Errno 28] No space left on device".
I referred to https://github.com/nalepae/pandarallel/issues/127 and set MEMORY_FS_ROOT and JOBLIB_TEMP_FOLDER, but it doesn't help.
This is my code:
```python
import pandas as pd
from pandarallel import pandarallel
import os

os.environ['MEMORY_FS_ROOT'] = "/app/tmp"
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

data = {'url': ['https://example.com/1', 'https://example.com/2'],
        'label': [0, 1]}
table = pd.DataFrame(data)

pandarallel.initialize(progress_bar=False, use_memory_fs=False)
table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))  # parallel_apply instead of apply
table
```
`df -h` output inside my Docker container:
```
Filesystem      Size  Used  Avail  Use%  Mounted on
overlay         1.8T  260G   1.5T   15%  /
tmpfs            64M     0    64M    0%  /dev
shm              64M   64M      0  100%  /dev/shm
/dev/nvme1n1    1.8T  260G   1.5T   15%  /app
tmpfs            63G     0    63G    0%  /proc/asound
tmpfs            63G     0    63G    0%  /proc/acpi
tmpfs            63G     0    63G    0%  /proc/scsi
tmpfs            63G     0    63G    0%  /sys/firmware
tmpfs            63G     0    63G    0%  /sys/devices/virtual/powercap
```
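The same numbers can also be checked from inside Python with the standard library; a minimal sketch (the paths are just the ones from the mounts above):
```python
import shutil

# Compare free space on the default memory FS root and on disk-backed paths.
for path in ("/dev/shm", "/app/tmp", "/tmp"):
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 2**20:.0f} MiB free of {usage.total / 2**20:.0f} MiB")
```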
I also tried os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'.
Can anyone help me?
Did you try setting the `MEMORY_FS_ROOT` env variable before importing pandarallel?
You can check the current location with `pandarallel.core.MEMORY_FS_ROOT`.
Hi @highvight,
I tried setting the `MEMORY_FS_ROOT` env variable before importing pandarallel and checked the current location.
The following is my code:
```python
import pandas as pd
import os

os.environ['MEMORY_FS_ROOT'] = "/app/tmp"
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

import pandarallel

print(pandarallel.core.MEMORY_FS_ROOT)  # /app/tmp
pandarallel.pandarallel.initialize(progress_bar=False)  # I have also tried (progress_bar=False, use_memory_fs=False)

data = {'url': ['https://example.com/1', 'https://example.com/2'], 'label': [0, 1]}
table = pd.DataFrame(data)
table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))
```
`MEMORY_FS_ROOT` is /app/tmp.
`ls -al` for /app/tmp:
```
drwxrwxrwx 2 root root 4096 Oct 23 17:32 tmp
```
`ls -al` for the contents of /app/tmp:
```
-rw------- 1 root root 637 Oct 24 09:42 pandarallel_input_g3g0gh6k.pickle
-rw------- 1 root root 637 Oct 24 09:42 pandarallel_input_k5bgg22r.pickle
```
I am still getting "No space left on device", and I don't know why.
I just confirmed that temporarily clearing /dev/shm allows small amounts of data to pass through the program, so it seems my modification is not taking effect?
I also tried modifying core.py directly, but it still doesn't work.
core.py (around lines 33-34):
```python
# MEMORY_FS_ROOT = os.environ.get("MEMORY_FS_ROOT", "/dev/shm")
MEMORY_FS_ROOT = "/app/tmp"
```
@DTDwind Try setting `use_memory_fs=False`.
Note: `MEMORY_FS_ROOT` is only applied when `use_memory_fs` is set to True.
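A minimal sketch of the two options (the /app/tmp path is just the example used earlier in this thread):
```python
import os

# Option 1: keep the memory file system but point it at a disk-backed path.
# The variable has to be set before pandarallel is imported.
os.environ['MEMORY_FS_ROOT'] = '/app/tmp'

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)

# Option 2: skip the memory file system entirely, so nothing is written
# to /dev/shm when transferring data to the workers.
# pandarallel.initialize(progress_bar=False, use_memory_fs=False)
```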
I'm running into the same issue, with no remedy from changing the settings or os.environ. I'm also inside Docker, my dataset is ~3 GB, and my RAM is 512 GB.
Looking at possible solutions:
- Docker has a laughably small 64 MB as the default `/dev/shm` (`df -h /dev/shm`), but it can be changed using `--shm-size=1gb` when starting the container.
- Looking at DTDwind's re-implementation, maybe `SpooledTemporaryFile` or MemoryFS could help ease the pain (docs quoted below):
`class tempfile.SpooledTemporaryFile(max_size=0, mode='w+b', buffering=-1, encoding=None, newline=None, suffix=None, prefix=None, dir=None, *, errors=None)`
This class operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file's fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
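A tiny illustrative sketch of the spooling idea (the /app/tmp directory is just the disk-backed path used earlier in this thread):
```python
import tempfile

# Payloads stay in memory until max_size is exceeded; after that the data
# rolls over to a real temporary file in `dir` (disk-backed, not /dev/shm).
with tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024, dir="/app/tmp") as spool:
    spool.write(b"small payloads stay in RAM")
    spool.seek(0)
    print(spool.read())
```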
Update:
I can confirm that running pandarallel outside of Docker on the same machine does not error. It wants to consume huge amounts of RAM, though. I'm using 36 workers (auto-selected; I also want maximum speed), my dataset is 3 GB, and RAM consumption rises to more than 260 GB (more than twice the total even if each of the 36 workers held a full copy of the dataset).
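If RAM is the bottleneck, capping the worker count at initialization is one knob to try; a small sketch (8 workers is an arbitrary example):
```python
from pandarallel import pandarallel

# Fewer workers means fewer chunks of the DataFrame being processed (and
# pickled) at the same time, at the cost of some speed.
pandarallel.initialize(nb_workers=8, progress_bar=False, use_memory_fs=False)
```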
I used parallel_pandas, and it works well in the Docker environment.
Here is a simple example:
```
pip install --upgrade parallel-pandas
```
```python
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)

data = {'url': ['https://example.com/1', 'https://example.com/2'],
        'label': [0, 1]}
table = pd.DataFrame(data)
table['count.'] = table['url'].p_apply(lambda x: x.count('.'))
```