
TabPFNRegressor preprocessing fails on bigger datasets

Open LeoGrin opened this issue 10 months ago • 5 comments

See https://huggingface.co/Prior-Labs/TabPFN-v2-reg/discussions/2. It seems that QuantileTransformer fails on big datasets with the message "The number of quantiles cannot be greater than the number of samples used", which means TabPFN is unusable for these bigger datasets even with ignore_pretraining_limits=True. Seems to only happen on regression? (not sure)

LeoGrin · Feb 04 '25 17:02

In the preprocessing QuantileTransformers, we set the number of quantiles to num_examples // 10 or num_examples // 5, so it should always be lower than the number of samples, but the subsample parameter is left at its default of 10K, which can be lower than the number of quantiles when the number of samples is large (a minimal repro is sketched below the two options). We can either:

  • limit the number of quantiles to 10K, or
  • set the subsample really high.
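
For concreteness, a minimal sketch (not TabPFN code) of the mismatch: n_quantiles is derived from the full dataset size, but QuantileTransformer only ever uses subsample rows at fit time, so fit() raises the error quoted above. The subsample=10_000 is passed explicitly here for reproducibility; it matches the default in recent scikit-learn versions.

import numpy as np
from sklearn.preprocessing import QuantileTransformer

n_samples = 200_000
X = np.random.rand(n_samples, 1)

# n_quantiles derived from the dataset size exceeds the subsample limit
qt = QuantileTransformer(n_quantiles=n_samples // 10, subsample=10_000)
try:
    qt.fit(X)
except ValueError as err:
    # e.g. "The number of quantiles cannot be greater than the number of samples used. ..."
    print(err)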

I quickly checked the time cost of the second option with this test:

import numpy as np
import time
from sklearn.preprocessing import QuantileTransformer

def test_quantile_transformer_speed():
    # Use a dataset with many samples so that default subsampling is active.
    n_samples = 200_000  # more than the default subsample limit (10_000 in recent scikit-learn versions)
    n_features = 100
    n_quantiles = 10_000
    X = np.random.rand(n_samples, n_features)

    n_runs = 5
    default_times = []
    large_times = []

    for run in range(n_runs):
        print(f"\nRun {run + 1}/{n_runs}")
        
        # Test with default settings
        print("Testing QuantileTransformer with default subsample parameter")
        qt_default = QuantileTransformer(random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_default = qt_default.fit_transform(X)
        X_trans_default_2 = qt_default.transform(X)
        t1 = time.perf_counter()
        default_time = t1 - t0
        default_times.append(default_time)
        print(f"Default QuantileTransformer fit_transform time: {default_time:.6f} sec")
        print("Transformed shape:", X_trans_default.shape)

        # Test with subsample explicitly set
        print("\nTesting QuantileTransformer with subsample=100_000")
        qt_large = QuantileTransformer(subsample=100_000, random_state=42, n_quantiles=n_quantiles)
        t0 = time.perf_counter()
        X_trans_large = qt_large.fit_transform(X)
        X_trans_large_2 = qt_large.transform(X)
        t1 = time.perf_counter()
        large_time = t1 - t0
        large_times.append(large_time)
        print(f"QuantileTransformer (subsample=100_000) fit_transform time: {large_time:.6f} sec")
        print("Transformed shape:", X_trans_large.shape)

    # Print summary statistics
    print("\nSummary Statistics:")
    print(f"Default QuantileTransformer:")
    print(f"  Average time: {np.mean(default_times):.6f} sec")
    print(f"  Std dev: {np.std(default_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in default_times]}")
    
    print(f"\nQuantileTransformer (subsample=100_000):")
    print(f"  Average time: {np.mean(large_times):.6f} sec")
    print(f"  Std dev: {np.std(large_times):.6f} sec")
    print(f"  Times: {[f'{t:.6f}' for t in large_times]}")

if __name__ == '__main__':
    test_quantile_transformer_speed() 

And got:

Summary Statistics:
Default QuantileTransformer:
  Average time: 7.082734 sec
  Std dev: 0.044457 sec
  Times: ['7.070789', '7.033962', '7.103202', '7.047488', '7.158230']

QuantileTransformer (subsample=100_000):
  Average time: 5.735545 sec
  Std dev: 0.040030 sec
  Times: ['5.678141', '5.717424', '5.721183', '5.772044', '5.788931']

So, surprisingly, increasing the subsample seems to be a bit faster 🤔

For comparison, with 1K quantiles I get:

Summary Statistics:
Default QuantileTransformer:
  Average time: 4.122261 sec
  Std dev: 0.044430 sec
  Times: ['4.209649', '4.111263', '4.086600', '4.104299', '4.099495']

QuantileTransformer (subsample=100_000):
  Average time: 4.494478 sec
  Std dev: 0.045392 sec
  Times: ['4.579021', '4.465794', '4.501846', '4.474663', '4.451065']

@noahho would you have an opinion on which of the two options to go with, and on whether changing this parameter after training might be an issue?

LeoGrin · Feb 04 '25 20:02

When using ignore_pretraining_limits=True in TabPFN, the training data is subsampled (typically to 10,000 samples) before fitting the preprocessing pipeline. Currently, quantile transformers in our pipeline—configured in ReshapeFeatureDistributionsStep.get_all_preprocessors—use the original dataset size (e.g. num_examples // 5 or num_examples) to set parameters like n_quantiles. This mismatch leads to requesting more quantiles than the available number of samples (for example, 1,748,982 quantiles for only 10,000 samples), resulting in a ValueError.

Potential solution:

  • Location 1: Update call site. In the _set_transformer_and_cat_ix method of ReshapeFeatureDistributionsStep (in tabpfn/model/preprocessing.py), compute the effective sample count based on the subsampled data (e.g. 10,000 if subsampling is applied).
  • Location 2: Update quantile transformer setup. In ReshapeFeatureDistributionsStep.get_all_preprocessors, replace usage of the original num_examples in the quantile transformer calculations with the effective sample count. This ensures that the n_quantiles parameter is dynamically set to a value that does not exceed the number of training samples available for fitting (see the sketch below).
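
A minimal sketch of that idea, assuming a helper along these lines (the names make_quantile_transformer, num_examples and max_samples_used_for_fitting are illustrative, not the actual TabPFN code):

from sklearn.preprocessing import QuantileTransformer

def make_quantile_transformer(num_examples: int, max_samples_used_for_fitting: int = 10_000) -> QuantileTransformer:
    # Size the transformer from the samples actually seen at fit time
    # (after any subsampling), not from the raw dataset size.
    effective_examples = min(num_examples, max_samples_used_for_fitting)
    n_quantiles = max(effective_examples // 10, 2)
    return QuantileTransformer(n_quantiles=n_quantiles, subsample=max_samples_used_for_fitting)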

Additional Discussion: @LeoGrin Changing this parameter post-hoc is not an issue because it is applied during preprocessing. This means that the quantile transformation is determined based on the actual data used for training, ensuring consistency during inference. For datasets with fewer than 10k samples, the effective sample count remains unchanged, preserving the original quantile configuration. Importantly, we need to maintain multiple values per quantile bucket; otherwise, the quantile transformer would degenerate to merely ranking the values. This issue is more prominent in regression tasks since we also apply quantile transformation to the target (y) values.
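
To illustrate the "degenerates to ranking" point, a small standalone demo (not TabPFN code): with one quantile per sample and a uniform output distribution, the transform just reproduces the empirical ranks rescaled to [0, 1], whereas fewer quantiles force several samples to share each bucket.

import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
y = rng.exponential(size=(100, 1))  # continuous target, so no ties

qt = QuantileTransformer(n_quantiles=100, subsample=100)  # one quantile per sample
ranks = y.ravel().argsort().argsort() / (len(y) - 1)      # empirical ranks rescaled to [0, 1]
print(np.allclose(qt.fit_transform(y).ravel(), ranks))    # expected: True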

Relevant code:

https://github.com/PriorLabs/TabPFN/blob/9f208b70768bda1fe08b136842f3989b35b25081/src/tabpfn/model/preprocessing.py#L727

https://github.com/PriorLabs/TabPFN/blob/9f208b70768bda1fe08b136842f3989b35b25081/src/tabpfn/model/preprocessing.py#L867

We subsample afterwards:

https://github.com/PriorLabs/TabPFN/blob/9f208b70768bda1fe08b136842f3989b35b25081/src/tabpfn/preprocessing.py#L563

noahho · Feb 13 '25 16:02

Changing this parameter post-hoc is not an issue because it is applied during preprocessing. This means that the quantile transformation is determined based on the actual data used for training, ensuring consistency during inference.

I meant changing after pretraining.

LeoGrin · Feb 13 '25 17:02

A simple fix could be to replace the lines below with n_quantiles=min(max(num_examples // 10, 2), 10_000). The quantiles are then estimated from at most 10_000 subsampled samples, which might lead to less accurate quantiles than using more samples in some cases, though (a small illustration follows the links below).

Relevant lines should be:

https://github.com/PriorLabs/TabPFN/blob/9f208b70768bda1fe08b136842f3989b35b25081/src/tabpfn/model/preprocessing.py#L722

https://github.com/PriorLabs/TabPFN/blob/9f208b70768bda1fe08b136842f3989b35b25081/src/tabpfn/model/preprocessing.py#L727

https://github.com/PriorLabs/TabPFN/blob/9f208b70768bda1fe08b136842f3989b35b25081/src/tabpfn/model/preprocessing.py#L737

https://github.com/PriorLabs/TabPFN/blob/9f208b70768bda1fe08b136842f3989b35b25081/src/tabpfn/model/preprocessing.py#L742

https://github.com/PriorLabs/TabPFN/blob/9f208b70768bda1fe08b136842f3989b35b25081/src/tabpfn/model/preprocessing.py#L747
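
As a quick illustration of the proposed clamp (a sketch, not the actual patch): n_quantiles grows with the dataset but is capped at 10_000, so it can never exceed QuantileTransformer's subsample.

for num_examples in (50, 5_000, 2_000_000):
    n_quantiles = min(max(num_examples // 10, 2), 10_000)
    print(num_examples, n_quantiles)  # -> 50 5, 5000 500, 2000000 10000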

dgedon · Feb 18 '25 09:02

@noahho I have made the changes as suggested by @dgedon. Please review!

Krishnadubey1008 · Mar 26 '25 14:03