[Blocked] Use skrub for data cleaning

Open noahho opened this issue 10 months ago • 4 comments

Fix #138: NA handling in text columns
Fix #163
Partially fixed by https://github.com/PriorLabs/TabPFN/pull/242

Summary

  • Add skrub>=0.3.0 dependency to handle mixed string/NA data
  • Integrate TableVectorizer in TabPFNClassifier to properly process text columns with NA values
  • Add test to verify the solution works as expected

Test plan

  • Added test_classifier_with_text_and_na that verifies we can fit and predict on a DataFrame with text columns containing NA values
  • Manually verified with additional use cases not in tests

noahho • Feb 28 '25 08:02

Okay, we ran into a problem: skrub 0.3.0 requires scipy 1.9.3, which isn't compatible with TabPFN.

noahho • Feb 28 '25 13:02

Does it fail without _handle_string_na_values? I'm surprised you need it.

LeoGrin • Mar 03 '25 12:03

I've simplified the implementation to rely only on TableVectorizer, without the extra function.

I also bumped the minimum scikit-learn version to 1.2.1 for compatibility with skrub. scikit-learn 1.2.1 was released in January 2023, so it is already more than 2 years old and should be a reasonable minimum. The same goes for pandas 1.5.3.

Finally, I removed the drop_null_fraction parameter, since it doesn't exist in all skrub versions. The default behavior is reasonable: it only removes columns that are entirely NaN, which is appropriate for our use case.
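For reference, an alternative to dropping a parameter that only some library versions accept is to check the constructor signature before passing it. The sketch below is not what this PR does; filter_supported_kwargs, OldVectorizer and NewVectorizer are hypothetical names, and the two classes are stand-ins for an older and a newer release so that the example runs without skrub installed.

```python
import inspect


def filter_supported_kwargs(cls, **kwargs):
    """Return only the keyword arguments that cls's constructor accepts."""
    params = inspect.signature(cls).parameters
    # A constructor that takes **kwargs accepts any keyword argument.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {name: value for name, value in kwargs.items() if name in params}


# Hypothetical stand-ins for two versions of a vectorizer class.
class OldVectorizer:
    def __init__(self, cardinality_threshold=40):
        self.cardinality_threshold = cardinality_threshold


class NewVectorizer:
    def __init__(self, cardinality_threshold=40, drop_null_fraction=1.0):
        self.cardinality_threshold = cardinality_threshold
        self.drop_null_fraction = drop_null_fraction


wanted = {"cardinality_threshold": 30, "drop_null_fraction": 1.0}

# The unsupported keyword is silently dropped for the older class.
old_kwargs = filter_supported_kwargs(OldVectorizer, **wanted)
new_kwargs = filter_supported_kwargs(NewVectorizer, **wanted)

old_vectorizer = OldVectorizer(**old_kwargs)
new_vectorizer = NewVectorizer(**new_kwargs)
```

The trade-off is that behavior then differs silently across versions, which is one reason simply relying on the default, as done here, can be the safer choice.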

LeoGrin • Mar 03 '25 17:03

Could we use AutoGluon's AutoMLPipelineFeatureGenerator instead?

noahho • Mar 04 '25 10:03