
Improve inference speed by keeping model on GPU/MPS between fit/predict calls

Open DanieleMorotti opened this issue 3 months ago • 3 comments

I tested the example code for the classifier:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier


# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize a classifier
clf = TabPFNClassifier(device="mps")
clf.fit(X_train, y_train)
print("Device: ", next(clf.executor_.model.parameters()).device)
# It prints "cpu"

# Predict probabilities
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))
# Predict labels
predictions = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, predictions))

clf.executor_.model = clf.executor_.model.to("mps")
print("New device: ", next(clf.executor_.model.parameters()).device)
# Now it prints "mps:0"

When I print the device of the model parameters, they appear to be on the CPU even though MPS is available. Is this the expected behavior, or is it unintended? The same issue occurs when I set the device to "auto".

Debug info

PyTorch version: 2.8.0
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.6.1 (arm64)
GCC version: Could not collect
Clang version: 17.0.0 (clang-1700.3.19.1)
CMake version: version 3.31.5
Libc version: N/A

Python version: 3.12.1 (main, Jun 3 2024, 17:33:54) [Clang 15.0.0 (clang-1500.3.9.4)] (64-bit runtime)
Python platform: macOS-15.6.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU: Apple M2 Pro

Dependency Versions:

tabpfn: 2.2.1
torch: 2.8.0
numpy: 2.3.3
scipy: 1.16.2
pandas: 2.3.2
scikit-learn: 1.6.1
typing_extensions: 4.15.0
einops: 0.8.1
huggingface-hub: 0.35.0

DanieleMorotti avatar Sep 22 '25 12:09 DanieleMorotti

Hi @DanieleMorotti, thank you for flagging this. We are looking into it and will keep you posted.

TL;DR: We are unnecessarily moving just the InferenceEngine back to CPU after fitting, which results in a decrease in prediction speed.

When fitting and predicting with TabPFN on "cpu" and then on "mps", we can see that the MPS device is indeed used and is noticeably faster:

UserWarning: Running on CPU with more than 200 samples may be slow.
Consider using a GPU or the tabpfn-client API: https://github.com/PriorLabs/tabpfn-client
  check_cpu_warning(
ROC AUC: 0.9984721161191749
Time:  6.548170804977417
MPS
ROC AUC: 0.9984721161191749
Time:  1.9219889640808105

However, I think you are right about the inference engine, because when I time the predictions for your code I get:

(device(type='mps'),)
Device:  cpu
ROC AUC: 0.9984721161191749
Time:  2.296971082687378
Accuracy 0.9824561403508771
New device:  mps:0
ROC AUC: 0.9984721161191749
Time:  1.1160151958465576

For the sake of completeness, here is my code:

import time

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier


# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Initialize a classifier
clf = TabPFNClassifier(device="mps")
clf.fit(X_train, y_train)
print(clf.devices_)
print("Device: ", next(clf.executor_.model.parameters()).device)
# It prints "cpu"

# Predict probabilities
start = time.time()
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))
print("Time: ", time.time() - start)
predictions = clf.predict(X_test)
print("Accuracy", accuracy_score(y_test, predictions))

clf.executor_.model = clf.executor_.model.to("mps")
print("New device: ", next(clf.executor_.model.parameters()).device)
# Now it prints "mps:0"


start = time.time()
prediction_probabilities = clf.predict_proba(X_test)
print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))
print("Time: ", time.time() - start)

klemens-floege avatar Sep 30 '25 16:09 klemens-floege

You can see what's going on in the default inference engine here: https://github.com/PriorLabs/TabPFN/blob/3dd61b0bc35faae52041cd2b611490db2178ffec/src/tabpfn/inference.py#L511 It moves the model back to CPU after inference.

I guess this is done to save GPU memory, but it's probably unnecessary and might be confusing (users expect the model to be on the device they specify). In the case of MPS, the CPU and the GPU share the same physical memory, so I doubt moving back to the CPU is beneficial, and it definitely takes time to move it.
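For illustration, here is a minimal sketch of the difference (the function names and structure are my own, not the actual TabPFN inference engine code): the current behavior pays the host-to-device transfer on every predict call, while keeping the model on the device pays it once.

import torch

# Hypothetical sketch, not the actual TabPFN inference engine code:
# the current behavior moves the model to the target device for every
# predict call and back to CPU afterwards, so each call pays the
# transfer cost again.
def predict_move_back(model: torch.nn.Module, X: torch.Tensor, device: str) -> torch.Tensor:
    model.to(device)                 # host -> device transfer on every call
    with torch.no_grad():
        out = model(X.to(device)).cpu()
    model.to("cpu")                  # moved back, as in inference.py linked above
    return out

# Keeping the model on the device avoids the repeated transfer; on MPS the
# CPU and GPU share memory anyway, so moving back saves little.
def predict_stay_on_device(model: torch.nn.Module, X: torch.Tensor, device: str) -> torch.Tensor:
    model.to(device)                 # no-op after the first call
    with torch.no_grad():
        return model(X.to(device)).cpu()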

A few things off the top of my head to consider when fixing:

  1. We might want to provide a .to() function on the model first, so users can free GPU memory if they want. This is tracked in https://github.com/PriorLabs/TabPFN/issues/502
  2. To support multi-device inference, we'll need a mapping from device to model instance in each inference engine (a sketch follows after this list)
  3. The CacheKV inference engine will be a special case, as in this case we have a copy of the model for each estimator with that estimator's KV cache attached. It's likely that the GPU won't have enough memory to fit all of these models at once (by default there are 8 estimators), so the easiest option might be to continue to copy between the CPU/target device each time.
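To make point 2 concrete, here is a rough sketch of what a device-to-model mapping could look like; the class and method names are hypothetical, not an existing TabPFN API.

import copy
import torch

# Hypothetical sketch of point 2 above; names are assumptions, not TabPFN API.
class DeviceModelCache:
    """Keeps one copy of the model per device so repeated predictions on
    the same device don't pay the transfer cost again."""

    def __init__(self, model: torch.nn.Module):
        self._cpu_model = model
        self._per_device: dict[str, torch.nn.Module] = {"cpu": model}

    def get(self, device: str) -> torch.nn.Module:
        # Create and cache a copy on first use of a device.
        if device not in self._per_device:
            self._per_device[device] = copy.deepcopy(self._cpu_model).to(device)
        return self._per_device[device]

    def free(self, device: str) -> None:
        # Point 1 above: let users free accelerator memory explicitly.
        if device != "cpu":
            self._per_device.pop(device, None)

For the CacheKV engine (point 3), a cache like this would hold one copy per estimator per device, which likely won't fit in GPU memory, so falling back to copying per call is probably the simpler option there.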

oscarkey avatar Sep 30 '25 16:09 oscarkey

Okay, thank you for your response.

DanieleMorotti avatar Oct 01 '25 14:10 DanieleMorotti