FileNotFoundError: Race Condition when TabPFN is loaded for the first time in multiple jobs
Describe the bug
If TabPFN's checkpoints are loaded for the first time by multiple parallel jobs, the jobs overwrite each other's checkpoint files, causing TabPFN to fail in all jobs except one.
Steps/Code to Reproduce
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
from tabpfn import TabPFNClassifier
X, y = load_breast_cancer(return_X_y=True)
model = TabPFNClassifier()
# TabPFN has never been loaded before
# now, 5 parallel jobs are looking for the checkpoints
print(cross_val_score(model, X, y, cv=5, n_jobs=5))
Expected Results
The checkpoints are downloaded once, and the script prints the cross-validation scores for the 5 folds.
Actual Results
For context: the script was executed in a (rootless Docker) dev container, where I am ~~g~~root. Each job appears to download the TabPFN checkpoints on its own. I assume that only the job finishing last succeeds on the fit, because its download overwrites the checkpoints of all the other jobs:
(dev) root@39f78d7b4313:/workspace# python tabpfn_bug.py
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
model, _, config_ = load_model_criterion_config(
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
model, _, config_ = load_model_criterion_config(
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
model, _, config_ = load_model_criterion_config(
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
model, _, config_ = load_model_criterion_config(
/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py:89: UserWarning: Downloading model to /root/.cache/tabpfn/tabpfn-v2-classifier.ckpt.
model, _, config_ = load_model_criterion_config(
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 190MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 350kB/s]
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 198MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 466kB/s]
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 191MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 473kB/s]
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 161MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 411kB/s]
tabpfn-v2-classifier.ckpt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 29.0M/29.0M [00:00<00:00, 157MB/s]
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37.0/37.0 [00:00<00:00, 403kB/s]
/opt/micromamba/envs/dev/lib/python3.12/site-packages/sklearn/model_selection/_validation.py:516: FitFailedWarning:
4 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 859, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/opt/micromamba/envs/dev/lib/python3.12/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/classifier.py", line 392, in fit
self.model_, self.config_, _ = initialize_tabpfn_model(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/base.py", line 89, in initialize_tabpfn_model
model, _, config_ = load_model_criterion_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/model/loading.py", line 467, in load_model_criterion_config
loaded_model, criterion, config = load_model(path=model_path, model_seed=model_seed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/tabpfn/model/loading.py", line 651, in load_model
checkpoint = torch.load(path, map_location="cpu", weights_only=None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/torch/serialization.py", line 1425, in load
with _open_file_like(f, "rb") as opened_file:
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/torch/serialization.py", line 751, in _open_file_like
return _open_file(name_or_buffer, mode)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/micromamba/envs/dev/lib/python3.12/site-packages/torch/serialization.py", line 732, in __init__
super().__init__(open(name, mode))
^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/tabpfn/tabpfn-v2-classifier.ckpt'
warnings.warn(some_fits_failed_message, FitFailedWarning)
[ nan nan 0.98245614 nan nan]
NOTE: THIS ONLY REPRODUCES (I.E. FAILS) IF TABPFN'S CHECKPOINTS HAVE NEVER BEEN DOWNLOADED BEFORE!
On subsequent runs, the script works as expected:
(dev) root@39f78d7b4313:/workspace# python tabpfn_bug.py
[0.97368421 0.97368421 0.98245614 0.98245614 0.99115044]
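If the cause is indeed several processes writing to the same final cache path, as assumed above, one common way to make a shared cache robust is to write to a per-process temporary file and publish it with an atomic rename. The following is only a minimal stdlib sketch of that pattern; `ensure_cached` and `fetch_bytes` are hypothetical names, and this is not how TabPFN or huggingface-hub actually implement their download.

```python
import os
import tempfile
from pathlib import Path


def ensure_cached(target, fetch_bytes):
    """Return `target`, creating it atomically if it does not exist yet.

    `fetch_bytes` is a hypothetical callable that returns the file content.
    """
    target = Path(target)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a unique temp file in the same directory (same filesystem),
    # so concurrent processes never share a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(fetch_bytes())
        # os.replace is atomic on POSIX: readers see either no file
        # or a complete one, never a half-written or missing one.
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file if the download or the rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

With this pattern, several jobs may still download redundantly, but each one ends up with a complete file at the final path.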
Versions
Collecting system and dependency information...
PyTorch version: 2.6.0
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14+deb12u1) 12.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.36
Python version: 3.12.11 | packaged by conda-forge | (main, Jun 4 2025, 14:45:31) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.0-142-generic-x86_64-with-glibc2.36
Is CUDA available: True
CUDA runtime version: 12.9.86
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L40
Nvidia driver version: 535.230.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: AuthenticAMD
Model name: QEMU Virtual CPU version 2.5+
CPU family: 15
Model: 107
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
Stepping: 1
BogoMIPS: 6199.99
Flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm rep_good nopl cpuid extd_apicid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm cmp_legacy 3dnowprefetch vmmcall
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 4 MiB (64 instances)
L1i cache: 4 MiB (64 instances)
L2 cache: 32 MiB (64 instances)
L3 cache: 1 GiB (64 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Dependency Versions:
--------------------
tabpfn: 2.0.9
torch: 2.6.0
numpy: 2.2.6
scipy: 1.16.0
pandas: 2.3.0
scikit-learn: 1.7.0
typing_extensions: 4.14.0
einops: 0.8.1
huggingface-hub: 0.33.1
Hi @chillerb, thank you so much for raising this issue, and sorry for the slow reply. I was able to reproduce your bug. I will raise it internally and hopefully come back to you with a fix soon.
As an interim workaround for anyone encountering this issue at the moment, simply rerunning the code might suffice. Alternatively, fitting the model once beforehand also works, so that the checkpoints are already cached when the parallel jobs start. For your example, this worked for me:
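from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
model = TabPFNClassifier()
# Fit once in the parent process so the checkpoints are downloaded and cached
model.fit(X, y)
# Now the 5 parallel jobs find the cached checkpoints instead of downloading them
print(cross_val_score(model, X, y, cv=5, n_jobs=5))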
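If the jobs are launched as independent processes, so that a single warm-up fit in a parent script is not possible, wrapping the first load in a cross-process file lock might help as well. This is just a minimal sketch under the assumption of a POSIX system (it uses `fcntl.flock`); `run_locked` and the lock path are hypothetical names, not part of TabPFN.

```python
import fcntl
from pathlib import Path


def run_locked(lock_path, action):
    """Run `action()` while holding an exclusive inter-process file lock."""
    lock_path = Path(lock_path)
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with open(lock_path, "w") as lock_file:
        # Blocks until every other process holding the lock has released it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return action()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Hypothetical usage in each job, before any parallel fitting:
# run_locked("/tmp/tabpfn-warmup.lock", lambda: TabPFNClassifier().fit(X, y))
```

The first job to take the lock triggers the download. The others wait and should then find the cached checkpoints when their own call runs.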