TabPFN [Blocked] Use skrub for data cleaning

Fix #138: NA handling in text columns
Fix #163
Partially fixed by https://github.com/PriorLabs/TabPFN/pull/242

Summary

Add skrub>=0.3.0 dependency to handle mixed string/NA data
Integrate TableVectorizer in TabPFNClassifier to properly process text columns with NA values
Add test to verify the solution works as expected

Test plan

Added test_classifier_with_text_and_na that verifies we can fit and predict on a DataFrame with text columns containing NA values
Manually verified with additional use cases not in tests

Feb 28 '25 08:02 noahho

Okay, we encountered a problem: skrub 0.3.0 requires scipy 1.9.3, which isn't compatible with TabPFN.

Feb 28 '25 13:02 noahho

Does it fail without _handle_string_na_values? I'm surprised you need it.

Mar 03 '25 12:03 LeoGrin

I've simplified the implementation to rely only on TableVectorizer, without the extra function. I also bumped the scikit-learn minimum version to 1.2.1 for compatibility with skrub. Note that scikit-learn 1.2.1 was released in January 2023, so it is more than 2 years old and should be a reasonable minimum. The same goes for pandas 1.5.3. I also removed the drop_null_fraction parameter, since it doesn't exist in all skrub versions. The default behavior is reasonable: it only removes columns that are all NaN, which is appropriate for our use case.
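
To check a local environment against the version floors mentioned in this thread, something like the following sketch could help. The `MINIMUMS` mapping is my reading of the comments above (skrub 0.3.0, scikit-learn 1.2.1, pandas 1.5.3), not a copy of the project's packaging config. The helper names are hypothetical, and the version parsing is deliberately simplified: it compares numeric release components only and ignores pre-release tags.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed minimums, taken from the discussion in this thread (illustrative only).
MINIMUMS = {"skrub": "0.3.0", "scikit-learn": "1.2.1", "pandas": "1.5.3"}


def release_tuple(v: str) -> tuple:
    """Parse leading numeric components, e.g. '1.2.1' -> (1, 2, 1)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_minimums(minimums=MINIMUMS) -> dict:
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = None
    return report


print(check_minimums())
```

For anything beyond a quick local check, a proper version parser such as the one in the `packaging` library would be the safer choice.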

Mar 03 '25 17:03 LeoGrin

Could we use AutoGluon's AutoMLPipelineFeatureGenerator instead?

Mar 04 '25 10:03 noahho