OSError: [Errno 28] No space left on device
General
- Operating System: Docker (python:3.12-slim)
- Python version: 3.12.5
- Pandas version: 2.2.2
- Pandarallel version: 1.6.5
Acknowledgement
- [x] My issue is NOT present when using `pandas` alone (without `pandarallel`)
Bug description
Observed behavior
When I execute the program, I get "OSError: [Errno 28] No space left on device".
I referred to https://github.com/nalepae/pandarallel/issues/127 and set MEMORY_FS_ROOT and JOBLIB_TEMP_FOLDER, but it doesn't help.
This is my code:
```python
import pandas as pd
from pandarallel import pandarallel
import os

os.environ['MEMORY_FS_ROOT'] = "/app/tmp"
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

data = {'url': ['https://example.com/1', 'https://example.com/2'],
        'label': [0, 1]}
table = pd.DataFrame(data)

pandarallel.initialize(progress_bar=False, use_memory_fs=False)
table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))  # parallel_apply instead of apply
table
```
`df -h` output inside my Docker container:
```
Filesystem      Size  Used  Avail  Use%  Mounted on
overlay         1.8T  260G   1.5T   15%  /
tmpfs            64M     0    64M    0%  /dev
shm              64M   64M      0  100%  /dev/shm
/dev/nvme1n1    1.8T  260G   1.5T   15%  /app
tmpfs            63G     0    63G    0%  /proc/asound
tmpfs            63G     0    63G    0%  /proc/acpi
tmpfs            63G     0    63G    0%  /proc/scsi
tmpfs            63G     0    63G    0%  /sys/firmware
tmpfs            63G     0    63G    0%  /sys/devices/virtual/powercap
```
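The same numbers can also be checked from inside Python with the standard library; a minimal sketch (the paths are just the ones from the mounts above):
```python
import shutil

# Compare free space on the default memory FS root and on disk-backed paths.
for path in ("/dev/shm", "/app/tmp", "/tmp"):
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 2**20:.0f} MiB free of {usage.total / 2**20:.0f} MiB")
```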
I also tried os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'.
Can anyone help me?
Did you try setting the `MEMORY_FS_ROOT` env variable before importing pandarallel?
You can check the current location with `pandarallel.core.MEMORY_FS_ROOT`.
Hi @highvight,
I tried setting the `MEMORY_FS_ROOT` env variable before importing pandarallel and checked the current location.
The following is my code:
```python
import pandas as pd
import os

os.environ['MEMORY_FS_ROOT'] = "/app/tmp"
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

import pandarallel

print(pandarallel.core.MEMORY_FS_ROOT)  # /app/tmp
pandarallel.pandarallel.initialize(progress_bar=False)  # I have also tried (progress_bar=False, use_memory_fs=False)

data = {'url': ['https://example.com/1', 'https://example.com/2'], 'label': [0, 1]}
table = pd.DataFrame(data)
table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))
```
`MEMORY_FS_ROOT` is /app/tmp.
`ls -al` for /app/tmp:
```
drwxrwxrwx 2 root root 4096 Oct 23 17:32 tmp
```
`ls -al` for the contents of /app/tmp:
```
-rw------- 1 root root 637 Oct 24 09:42 pandarallel_input_g3g0gh6k.pickle
-rw------- 1 root root 637 Oct 24 09:42 pandarallel_input_k5bgg22r.pickle
```
I am still getting "No space left on device", and I don't know why.
I just confirmed that temporarily clearing /dev/shm allows small amounts of data to pass through the program, so it seems my modification is not taking effect?
I also tried modifying core.py directly, but it still doesn't work.
core.py (around lines 33-34):
```python
# MEMORY_FS_ROOT = os.environ.get("MEMORY_FS_ROOT", "/dev/shm")
MEMORY_FS_ROOT = "/app/tmp"
```
@DTDwind Try setting `use_memory_fs=False`.
Note: `MEMORY_FS_ROOT` is only applied when `use_memory_fs` is set to True.
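A minimal sketch of the two options (the /app/tmp path is just the example used earlier in this thread):
```python
import os

# Option 1: keep the memory file system but point it at a disk-backed path.
# The variable has to be set before pandarallel is imported.
os.environ['MEMORY_FS_ROOT'] = '/app/tmp'

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)

# Option 2: skip the memory file system entirely, so nothing is written
# to /dev/shm when transferring data to the workers.
# pandarallel.initialize(progress_bar=False, use_memory_fs=False)
```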
I'm running into the same issue, with no remedy from changing the settings or os.environ. I'm also inside Docker, my dataset is ~3 GB, and my RAM is 512 GB.
Looking at possible solutions:
- Docker has a laughably small 64 MB as the default `/dev/shm` (`df -h /dev/shm`), but it can be changed using `--shm-size=1gb` when starting the container.
- Looking at DTDwind's re-implementation, maybe `SpooledTemporaryFile` or MemoryFS could help ease the pain (docs quoted below):
`class tempfile.SpooledTemporaryFile(max_size=0, mode='w+b', buffering=-1, encoding=None, newline=None, suffix=None, prefix=None, dir=None, *, errors=None)`
This class operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file's fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
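A tiny illustrative sketch of the spooling idea (the /app/tmp directory is just the disk-backed path used earlier in this thread):
```python
import tempfile

# Payloads stay in memory until max_size is exceeded; after that the data
# rolls over to a real temporary file in `dir` (disk-backed, not /dev/shm).
with tempfile.SpooledTemporaryFile(max_size=64 * 1024 * 1024, dir="/app/tmp") as spool:
    spool.write(b"small payloads stay in RAM")
    spool.seek(0)
    print(spool.read())
```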
Update:
I can confirm that running pandarallel outside of Docker on the same machine does not error. It wants to consume huge amounts of RAM, though. I'm using 36 workers (auto-selected; I also want maximum speed), my dataset is 3 GB, and RAM consumption rises to more than 260 GB (more than twice the total even if each of the 36 workers held a full copy of the dataset).
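If RAM is the bottleneck, capping the worker count at initialization is one knob to try; a small sketch (8 workers is an arbitrary example):
```python
from pandarallel import pandarallel

# Fewer workers means fewer chunks of the DataFrame being processed (and
# pickled) at the same time, at the cost of some speed.
pandarallel.initialize(nb_workers=8, progress_bar=False, use_memory_fs=False)
```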
I used parallel_pandas, and it works well in the Docker environment.
Here is a simple example:
```
pip install --upgrade parallel-pandas
```
```python
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)

data = {'url': ['https://example.com/1', 'https://example.com/2'],
        'label': [0, 1]}
table = pd.DataFrame(data)
table['count.'] = table['url'].p_apply(lambda x: x.count('.'))
```