FileNotFoundError: Race Condition when TabPFN is loaded for the first time in multiple jobs

Open chillerb opened this issue 6 months ago • 1 comments

Describe the bug

If TabPFN's checkpoints are loaded for the first time by multiple parallel jobs, the jobs overwrite each other's checkpoint files, causing TabPFN to fail in all jobs except one.

Steps/Code to Reproduce

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)

model = TabPFNClassifier()

# TabPFN has never been loaded before
# now, 5 parallel jobs are looking for the checkpoints
print(cross_val_score(model, X, y, cv=5, n_jobs=5))

Expected Results

Checkpoints are loaded once, and script prints out cross-validation score over 5 different folds.

Actual Results

For context: the script was executed in a (docker-rootless) dev container, where I am root. It seems that each job individually tries to download the TabPFN checkpoints. I assume that only the job finishing last succeeds on the fit, because it overwrites the checkpoints of all other jobs:

(dev) root@39f78d7b4313:/workspace# python tabpfn_bug.py 
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
  model, _, config_ = load_model_criterion_config(
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
  model, _, config_ = load_model_criterion_config(
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
  model, _, config_ = load_model_criterion_config(
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
  model, _, config_ = load_model_criterion_config(
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
  model, _, config_ = load_model_criterion_config(
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 190MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 350kB/s]
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 198MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 466kB/s]
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 191MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 473kB/s]
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 161MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 411kB/s]
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 157MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 403kB/s]
/opt/micromamba/envs/dev/lib/python3.12/site-packages/sklearn/model_selection/_validation.py:516: FitFailedWarning: 
4 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/micromamba/envs/dev/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/classifier.py", line 392, in fit
    self.model_, self.config_, _ = initialize_tabpfn_model(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py", line 89, in initialize_tabpfn_model
    model, _, config_ = load_model_criterion_config(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/model/loading.py", line 467, in load_model_criterion_config
    loaded_model, criterion, config = load_model(path=model_path, model_seed=model_seed)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/model/loading.py", line 651, in load_model
    checkpoint = torch.load(path, map_location="cpu", weights_only=None)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/torch/serialization.py", line 1425, in load
    with _open_file_like(f, "rb") as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/torch/serialization.py", line 751, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/torch/serialization.py", line 732, in __init__
    super().__init__(open(name, mode))
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/tabpfn/tabpfn-v2-classifier.ckpt'

  warnings.warn(some_fits_failed_message, FitFailedWarning)
[       nan        nan 0.98245614        nan        nan]

Note: this only reproduces (i.e. fails) if TabPFN's checkpoints have never been loaded before!

On subsequent runs, the script works as expected:

(dev) root@39f78d7b4313:/workspace# python tabpfn_bug.py 
[0.97368421 0.97368421 0.98245614 0.98245614 0.99115044]
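
Since the failure depends on whether the checkpoints are already cached, it can help to check the cache before trying to reproduce. A minimal sketch, assuming the cache directory is the one shown in the log above (`~/.cache/tabpfn`); it may be located elsewhere on other systems, and `cached_checkpoints` is a hypothetical helper, not part of TabPFN:

```python
from pathlib import Path

# Cache directory as it appears in the log above; it may differ on other systems.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "tabpfn"


def cached_checkpoints(cache_dir: Path = DEFAULT_CACHE_DIR) -> list[Path]:
    """Return the .ckpt files currently present in cache_dir (empty if none)."""
    if not cache_dir.is_dir():
        return []
    return sorted(cache_dir.glob("*.ckpt"))


if __name__ == "__main__":
    found = cached_checkpoints()
    if found:
        print("Cached checkpoints found (the first-load race will not trigger):")
        for path in found:
            print(" ", path)
    else:
        print("No cached checkpoints; the next run performs the first download.")
```

If the list is empty, the next parallel run is the first load and should hit the error described above.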

Versions

Collecting system and dependency information...
PyTorch version: 2.6.0
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14+deb12u1) 12.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.36

Python version: 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-142-generic-x86_64-with-glibc2.36
Is CUDA available: True
CUDA runtime version: 12.9.86
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40
Nvidia driver version: 535.230.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        40 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               64
On-line CPU(s) list:                  0-63
Vendor ID:                            AuthenticAMD
Model name:                           QEMU Virtual CPU version 2.5+
CPU family:                           15
Model:                                107
Thread(s) per core:                   1
Core(s) per socket:                   32
Socket(s):                            2
Stepping:                             1
BogoMIPS:                             6199.99
Flags:                                fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl cpuid extd_apicid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            4 MiB (64 instances)
L1i cache:                            4 MiB (64 instances)
L2 cache:                             32 MiB (64 instances)
L3 cache:                             1 GiB (64 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-63
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Not affected
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Dependency Versions:
--------------------
tabpfn: 2.0.9
torch: 2.6.0
numpy: 2.2.6
scipy: 1.16.0
pandas: 2.3.0
scikit-learn: 1.7.0
typing_extensions: 4.14.0
einops: 0.8.1
huggingface-hub: 0.33.1

Jul 04 '25 13:07 chillerb

Hi @chillerb, thank you so much for raising this issue, and sorry for the slow reply. I was able to reproduce your bug. I will raise this issue internally and hopefully come back to you with a fix soon.

As an interim workaround for anyone encountering this issue at the moment, simply rerunning the code might suffice. Alternatively, fitting the model once in a single process before the parallel call also works, so for your example this worked for me:

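from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)

model = TabPFNClassifier()

# Fit once in a single process so the checkpoints are downloaded and cached
model.fit(X, y)

# The checkpoints are now cached, so the 5 parallel jobs
# only read them instead of downloading them concurrently
print(cross_val_score(model, X, y, cv=5, n_jobs=5))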
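
For illustration, one common way to avoid this kind of first-download race is to guard the download with a lock file and to write to a temporary file that is renamed into place only when complete. The following is a minimal standard-library sketch of that general pattern, not TabPFN's actual loading code or the planned fix; `ensure_file` and the `fetch` callback are hypothetical names standing in for the real download:

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Callable


def ensure_file(target: Path, fetch: Callable[[Path], None],
                timeout: float = 60.0) -> Path:
    """Create `target` at most once, even if many workers call this at once.

    `fetch(tmp_path)` is a hypothetical callback that writes the file
    content to `tmp_path`; it stands in for the real download.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    lock = target.with_name(target.name + ".lock")
    deadline = time.monotonic() + timeout
    while not target.exists():
        try:
            # O_EXCL makes creation atomic: exactly one caller gets the lock.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"gave up waiting for {lock}")
            time.sleep(0.05)  # another caller is fetching; poll again
            continue
        os.close(fd)
        try:
            if not target.exists():
                # Write to a temp file in the same directory, then rename,
                # so readers never see a missing or half-written target.
                tmp_fd, tmp_name = tempfile.mkstemp(dir=target.parent,
                                                    suffix=".part")
                os.close(tmp_fd)
                try:
                    fetch(Path(tmp_name))
                    os.replace(tmp_name, target)  # atomic rename
                finally:
                    if os.path.exists(tmp_name):
                        os.remove(tmp_name)  # clean up after a failed fetch
        finally:
            os.remove(lock)
    return target
```

With this pattern, only the worker that acquires the lock downloads; the others wait until the file exists and then read it. One limitation of the sketch: a worker that crashes while holding the lock leaves the lock file behind, so later callers time out until it is removed by hand.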

Aug 27 '25 09:08 klemens-floege