
FileCache no space left on device race condition

Open albertz opened this issue 1 month ago • 3 comments

Another crash. After retrying a few times, it ultimately failed with:

OSError: [Errno 28] No space left on device

Log:

FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00405-of-00848.arrow to cache
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00637-of-00848.arrow, size 288.4MB. Still need more space, file is 18.0 hours old. After deletion, have 555.7MB free space.
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00834-of-00848.arrow, size 287.5MB. Still need more space, file is 18.0 hours old. After deletion, have 0.8GB free space.
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00038-of-00848.arrow to cache
FileCache: using existing file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00707-of-00848.arrow
FileCache: using existing file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00692-of-00848.arrow
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00761-of-00848.arrow to cache
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00549-of-00848.arrow, size 288.9MB. Still need more space, file is 18.0 hours old. After deletion, have 556.3MB free space.
FileCache: using existing file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00759-of-00848.arrow
...
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00512-of-00848.arrow to cache
FileCache: using existing file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00615-of-00848.arrow
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00206-of-00848.arrow to cache
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00330-of-00848.arrow, size 286.8MB. Still need more space, file is 18.0 hours old. After deletion, have 0.8GB free space.
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00290-of-00848.arrow to cache
...
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00722-of-00848.arrow to cache
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00453-of-00848.arrow to cache
Process DistributeFilesDataset train ep 387 worker ep 388:
EXCEPTION
Traceback (most recent call last):
  File "/usr/lib64/python3.12/multiprocessing/process.py", line 314, in BaseProcess._bootstrap
...
  File "/home/az668407/work/py-envs/py3.12-torch2.7/lib64/python3.12/site-packages/tree/__init__.py", line 428, in map_structure
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-10-21-chunked-ctc/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00155-of-00848.arrow, size 720.5MB. Still need more space, file is 17.4 hours old. After deletion, have 0.8GB free space.
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00748-of-00848.arrow, size 286.6MB. Still need more space, file is 18.0 hours old. After deletion, have 0.8GB free space.
FileCache: Ignoring error while copying /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00038-of-00848.arrow: OSError: [Errno 28] No space left on device
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00038-of-00848.arrow to cache
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00323-of-00848.arrow, size 286.0MB. Still need more space, file is 18.0 hours old. After deletion, have 695.9MB free space.
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-10-21-chunked-ctc/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00220-of-00848.arrow, size 720.8MB. Still need more space, file is 17.4 hours old. After deletion, have 1.4GB free space.
...
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/file_cache.py", line 343, in FileCache.handle_cached_files_in_config.<locals>._handle_value
    line: res = self.get_file(value.filename)
    locals:
      self = <local> <returnn.util.file_cache.FileCache object at 0x1548d1c914f0>
      value = <local> CachedFile(filename='/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00038-of-00848.arrow')
      value.filename = <local> '/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00038-of-00848.arrow', len = 188
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/file_cache.py", line 129, in FileCache.get_file
    line: raise last_error
    locals:
      last_error = <local> OSError(28, 'No space left on device')
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/file_cache.py", line 121, in FileCache.get_file
    line: self._copy_file_if_needed(src_filename, dst_filename)
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/file_cache.py", line 430, in FileCache._copy_file_if_needed
    line: _copy_with_prealloc(src_filename, dst_tmp_filename)
    locals:
      src_filename = <local> '/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00038-of-00848.arrow', len = 188
      dst_tmp_filename = <local> '/var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00038-of-00848.arrow.copy', len = 230
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/util/file_cache.py", line 517, in _copy_with_prealloc
    line: os.posix_fallocate(dst_file.fileno(), 0, file_size + 1)
    locals:
      os.posix_fallocate = <global> <built-in function posix_fallocate>
      dst_file = <local> <_io.BufferedWriter name='/var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00038-of-00848.arrow.copy'>
      file_size = <local> 752307496
OSError: [Errno 28] No space left on device
...
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00175-of-00848.arrow to cache
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-10-21-chunked-ctc/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00191-of-00848.arrow, size 720.4MB. Still need more space, file is 17.4 hours old. After deletion, have 0.9GB free space.
FileCache: using existing file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00752-of-00848.arrow
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00688-of-00848.arrow to cache
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-10-21-chunked-ctc/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00112-of-00848.arrow, size 720.0MB. Still need more space, file is 17.4 hours old. After deletion, have 0.9GB free space.
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00640-of-00848.arrow to cache
FileCache: using existing file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00729-of-00848.arrow
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00247-of-00848.arrow to cache
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-10-21-chunked-ctc/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00065-of-00848.arrow, size 719.2MB. Still need more space, file is 17.4 hours old. After deletion, have 1.0GB free space.
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00719-of-00848.arrow to cache
FileCache: using existing file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00827-of-00848.arrow
FileCache: Copy file /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00831-of-00848.arrow to cache
FileCache: Delete file /var/tmp//az668407/returnn/file_cache/rwthfs/rz/cluster/home/az668407/setups/2025-10-21-chunked-ctc/work/i6_core/datasets/huggingface/TransformAndMapHuggingFaceDatasetJob.FxPUVJtw1EeN/output/dataset/train/data-00244-of-00848.arrow, size 718.1MB. Still need more space, file is 17.4 hours old. After deletion, have 725.6MB free space.

The log contains some other errors as well; I'm not sure whether they are related.

Before this FileCache error:

...
Epoch 386 evaluation: dev: ctc_4 0.526 ctc_10 0.347 ctc_16 0.296 ce 0.276 fer 0.044 ce_3 0.271 fer_3 0.045 devtrain: ctc_4 0.487 ctc_10 0.322 ctc_16 0.261 ce 0.238 fer 0.033 ce_3 0.235 fer_3 0.034
Memory usage (cuda): alloc cur 8.5GB alloc peak 11.8GB reserved cur 74.6GB reserved peak 74.6GB
We have stored models for epochs [20, 40, 80, ..., 384, 385, 386] and keep epochs [20, 40, 80, 160, 270, 273, 279, 285, 287, 291, 300, 304, 313, 320, 339, 342, 349, 360, 361, 363, 370, 372, 373, 375, 376, 381, 382, 383, 384, 385, 386].
We will delete the models of epochs [367].
Deleted 2.1GB.
start epoch 387 global train step 1343275 with effective learning rate 0.00021730000000000002 ...
Memory usage (cuda): alloc cur 8.4GB alloc peak 8.4GB reserved cur 74.6GB reserved peak 74.6GB
Exception ignored in atexit callback: <function dump_compile_times at 0x1543caec5440>
Traceback (most recent call last):
  File "/home/az668407/work/py-envs/py3.12-torch2.7/lib64/python3.12/site-packages/torch/_dynamo/utils.py", line 765, in dump_compile_times
Exception ignored in atexit callback: <function dump_compile_times at 0x14f9aae8d1c0>
Traceback (most recent call last):
  File "/home/az668407/work/py-envs/py3.12-torch2.7/lib64/python3.12/site-packages/torch/_dynamo/utils.py", line 765, in dump_compile_times
    log.info(compile_times(repr="str", aggregate=True))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/az668407/work/py-envs/py3.12-torch2.7/lib64/python3.12/site-packages/torch/_dynamo/utils.py", line 751, in compile_times
    log.info(compile_times(repr="str", aggregate=True))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/az668407/work/py-envs/py3.12-torch2.7/lib64/python3.12/site-packages/torch/_dynamo/utils.py", line 751, in compile_times
    out += tabulate(rows, headers=("Function", "Runtimes (s)"))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/az668407/work/py-envs/py3.12-torch2.7/lib64/python3.12/site-packages/torch/_dynamo/utils.py", line 217, in tabulate
    out += tabulate(rows, headers=("Function", "Runtimes (s)"))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/az668407/work/py-envs/py3.12-torch2.7/lib64/python3.12/site-packages/torch/_dynamo/utils.py", line 217, in tabulate
    import tabulate
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1322, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1262, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1532, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1506, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1609, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1652, in _fill_cache
OSError: [Errno 107] Transport endpoint is not connected: '/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/Python/3.12.3-GCCcore-13.3.0/easybuild/python'
    import tabulate
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1322, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1262, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1532, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1506, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1609, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1652, in _fill_cache
OSError: [Errno 107] Transport endpoint is not connected: '/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/Python/3.12.3-GCCcore-13.3.0/easybuild/python'

Then it restarted automatically via the auto-restart logic:

...
RETURNN runtime: 26:22:15
RETURNN return code: 1
Most recent trained model epoch: 386 file: /rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/returnn/training/ReturnnTrainingJob.7HRPf52gKZ8L/output/models/epoch.386
Most recent trained model epoch before RETURNN run: 348
-> trained successfully 38 epoch(s)
Try again, restart RETURNN...
WARNING:root:Settings file 'settings.py' does not exist, ignoring it ([Errno 2] No such file or directory: 'settings.py').
Running in managed mode.
RETURNN starting up, version 1.20251107.004249+git.9678032a, date/time 2025-11-07-14-55-34 (UTC+0100), pid 121023, cwd /rwthfs/rz/cluster/hpcwork/p0023999/az668407/setups/2025-08-aed-large/work/i6_core/returnn/training/ReturnnTrainingJob.7HRPf52gKZ8L/work, Python /home/az668407/work/py-envs/py3.12-torch2.7/bin/python
RETURNN command line options: ['/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/work/i6_core/returnn/training/ReturnnTrainingJob.7HRPf52gKZ8L/output/returnn.config']
Hostname: n23g0003.hpc.itc.rwth-aachen.de
...

There was then another problem after the restart:

RETURNN frontend _native backend: Error while getting module:
/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /var/tmp//az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/bdd43d0b4e/_returnn_frontend_native.so)
This is optional (although very recommended), so we continue without it.

And that caused the somewhat unrelated error:

...
Time to get first batch data: 0:00:51
ValueError: Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float.
Unhandled exception <class 'ValueError'> in thread <_MainThread(MainThread, started 22368398583616)>, proc 121023.
...
...
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 543, in execute_main_task
...
  File "/rwthfs/rz/cluster/home/az668407/setups/2025-08-aed-large/recipe/i6_experiments/users/zeyer/experiments/exp2024_04_23_baselines/aed.py", line 645, in aed_training
    line: enc, enc_spatial_dim = model.encode(data, in_spatial_dim=data_spatial_dim, collected_outputs=collected_outputs)
...
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/normalization.py", line 36, in moments
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/reduce.py", line 95, in reduce_mean
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/reduce.py", line 47, in reduce
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/frontend/_backend.py", line 1492, in TorchBackend.reduce
    line: correction_factor = rf.masked_fraction_of_shape(axis, inverse=True)
    locals:
      correction_factor = <local> None
      axis = <local> [Dim{B}, Dim{'⌈((-199+time)+-200)/160⌉'[B]}]
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/dims.py", line 286, in masked_fraction_of_shape
    line: return (num_elems_masked / num_elems_total) if not inverse else (num_elems_total / num_elems_masked)
    locals:
      num_elems_masked = <local> Tensor{'reduce_sum', [], dtype='int64'}
      num_elems_total = <local> Tensor{'mul', [], dtype='int32'}
      inverse = <local> True
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/tensor/_tensor_op_overloads.py", line 84, in _TensorOpOverloadsMixin.__truediv__
    line: return _rf().combine(self, "/", other)
    locals:
      self = <local> Tensor{'mul', [], dtype='int32'}
      other = <local> Tensor{'reduce_sum', [], dtype='int64'}
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/frontend/math_.py", line 211, in combine
    line: raise ValueError(
              "Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float."
          )
ValueError: Dividing a Tensor of type int by an integer is disallowed. Please convert the Tensor to float.

albertz · Nov 07 '25 14:11

I actually think this may also have been a hiccup in the cluster: some filesystem might have been temporarily unavailable. That would explain this:

OSError: [Errno 107] Transport endpoint is not connected: '/cvmfs/software.hpc.rwth.de/Linux/RH9/x86_64/intel/sapphirerapids/software/Python/3.12.3-GCCcore-13.3.0/easybuild/python'

That might also explain this:

RETURNN frontend _native backend: Error while getting module:
/lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /var/tmp//az668407/returnn_py_ext_mod_cache/_returnn_frontend_native/bdd43d0b4e/_returnn_frontend_native.so)

I assume there is usually another libstdc++ that it uses instead, which was temporarily unavailable, so it fell back to /lib64/libstdc++.so.6. Or maybe some other library mixup caused by the FS issue results in this error.

albertz · Nov 07 '25 14:11

Note that the second issue (ValueError: Dividing a Tensor of type int by an integer is disallowed) is somewhat unrelated and was already reported: #1749

I just reported it here again because the /lib64/libstdc++.so.6 / GLIBCXX_3.4.30 error is probably related to FS issues, which might be the underlying cause here.

albertz · Nov 07 '25 14:11

What I think is going on is still a possible race condition (similar to #1785):

In _copy_file_if_needed, it calls:

# Make sure we have enough disk space, st_size +1 due to _copy_with_prealloc
self.cleanup(need_at_least_free_space_size=os.stat(src_filename).st_size + 1)

And then a bit later:

_copy_with_prealloc(src_filename, dst_tmp_filename)

Where it fails with OSError: [Errno 28] No space left on device.
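To illustrate the time-of-check-to-time-of-use window, here is a simplified sketch of the cleanup-then-copy pattern (not the actual FileCache code; `copy_with_headroom` and the injected `cleanup` callable are hypothetical names):

```python
import os
import shutil


def copy_with_headroom(src: str, dst: str, cleanup) -> None:
    """Simplified sketch of cleanup-then-copy and its race window."""
    size = os.stat(src).st_size
    # Step 1 (the "check"): ask the cache to free at least size + 1 bytes.
    cleanup(need_at_least_free_space_size=size + 1)
    # ... race window: concurrent processes can consume the freed space here ...
    # Step 2 (the "use"): preallocate and copy. This can still raise
    # OSError(ENOSPC) even though step 1 succeeded moments before.
    with open(src, "rb") as src_f, open(dst + ".copy", "wb") as dst_f:
        os.posix_fallocate(dst_f.fileno(), 0, size + 1)
        shutil.copyfileobj(src_f, dst_f)
        dst_f.truncate(size)  # drop the extra preallocated byte
    os.replace(dst + ".copy", dst)  # atomic rename into place
```

Nothing in step 1 reserves the space for this process; only the `posix_fallocate` in step 2 actually claims it, which is exactly where the ENOSPC appears in the traceback above.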

Inside cleanup, we have:

want_free_space_size = max(
    int(self._cleanup_disk_usage_wanted_multiplier * need_at_least_free_space_size),
    int(self._cleanup_disk_usage_wanted_free_ratio * disk_usage.total),
)

The logic treats want_free_space_size and need_at_least_free_space_size differently. For want_free_space_size, we have the logic:

if not delete_reason and want_free_space_size > cur_expected_free:
    if cur_time - mtime > self._cleanup_files_wanted_older_than_days * 60 * 60 * 24:

And we have the default cleanup_files_wanted_older_than_days: float = 1.0.

If we are close to the space limit, and most of the cached files are not older than 1 day, want_free_space_size will never be acted on. Only need_at_least_free_space_size will be handled.
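The per-file decision can be modeled like this (a hypothetical simplification; the real cleanup iterates over cached files and tracks the expected free space, but the two-tier branch structure is the relevant part):

```python
def should_delete(file_age_secs: float, cur_expected_free: int,
                  need_free: int, want_free: int,
                  wanted_older_than_days: float = 1.0) -> bool:
    """Simplified per-file deletion decision mirroring the two tiers."""
    if cur_expected_free < need_free:
        # Strict tier: we must free this space, delete regardless of age.
        return True
    if cur_expected_free < want_free:
        # "Wanted" tier: only delete sufficiently old files.
        return file_age_secs > wanted_older_than_days * 60 * 60 * 24
    return False
```

With files around 18 hours old, as in the log above, and the default of 1.0 days, the wanted tier deletes nothing, so the cache hovers just above the strict minimum.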

However, need_at_least_free_space_size=os.stat(src_filename).st_size + 1 is extremely tight. With concurrent processes that also copy data, the freed space can easily be consumed again before the copy happens.

This would explain why it still fails with OSError: [Errno 28] No space left on device.

I think we can improve our current logic. Right now we have these two cases, need_at_least_free_space_size and want_free_space_size, where one is strict (need_at_least_free_space_size) and the other applies only when _cleanup_files_wanted_older_than_days is satisfied. Maybe we should handle intermediate cases, e.g. linearly interpolate the required file age between self._lock_timeout * 0.5 (the limit for need_at_least_free_space_size) and _cleanup_files_wanted_older_than_days (the limit for want_free_space_size).
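One possible shape for that interpolation, as a sketch under assumed names (`min_age_secs` standing in for self._lock_timeout * 0.5, `max_age_secs` for _cleanup_files_wanted_older_than_days converted to seconds):

```python
def age_threshold_secs(cur_expected_free: int, need_free: int, want_free: int,
                       min_age_secs: float, max_age_secs: float) -> float:
    """Minimum age a cached file must have before we delete it,
    interpolated between the strict and the "wanted" regime."""
    if cur_expected_free <= need_free:
        return min_age_secs  # strict regime: delete even recent files
    if cur_expected_free >= want_free:
        return float("inf")  # enough free space: delete nothing
    # In between: the closer we are to the hard limit, the younger
    # the files we are willing to delete.
    frac = (cur_expected_free - need_free) / (want_free - need_free)
    return min_age_secs + frac * (max_age_secs - min_age_secs)
```

This removes the cliff at need_at_least_free_space_size: instead of switching abruptly from "delete nothing younger than a day" to "delete anything", the deletion pressure rises gradually as free space shrinks.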

Maybe we should also split up want_free_space_size. It currently combines self._cleanup_disk_usage_wanted_multiplier * need_at_least_free_space_size and self._cleanup_disk_usage_wanted_free_ratio * disk_usage.total. The first component (_cleanup_disk_usage_wanted_multiplier * need_at_least_free_space_size) is much more important. For the other, we can keep the _cleanup_files_wanted_older_than_days logic.
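Sketched, with a hypothetical helper name and made-up default values for the multiplier and ratio (the actual defaults live in file_cache.py):

```python
def split_cleanup_targets(need_free: int, disk_total: int,
                          wanted_multiplier: float = 2.0,
                          wanted_free_ratio: float = 0.1):
    """Split want_free_space_size into its two components instead of a max().

    The first component is tied to the file currently being copied and
    would be enforced (almost) as strictly as need_free itself; the second
    is general headroom, where the older-than-N-days restriction can stay.
    """
    strict_want = int(wanted_multiplier * need_free)  # per-copy headroom
    aged_want = int(wanted_free_ratio * disk_total)   # general headroom
    return strict_want, aged_want
```

The cleanup loop would then treat `strict_want` like an extended need_at_least_free_space_size and apply the age threshold only while working toward `aged_want`.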

(cc @NeoLegends)

albertz · Nov 07 '25 14:11