[Blocked] Use skrub for data cleaning
Fix #138: NA handling in text columns
Fix #163
Partially fixed by https://github.com/PriorLabs/TabPFN/pull/242
Summary
- Add skrub>=0.3.0 dependency to handle mixed string/NA data
- Integrate TableVectorizer in TabPFNClassifier to properly process text columns with NA values
- Add test to verify the solution works as expected
Test plan
- Added test_classifier_with_text_and_na that verifies we can fit and predict on a DataFrame with text columns containing NA values
- Manually verified with additional use cases not in tests
OK, we ran into a problem: skrub 0.3.0 requires scipy 1.9.3, which isn't compatible with TabPFN.
Does it fail without _handle_string_na_values? I'm surprised you need it.
I've simplified the implementation to rely only on TableVectorizer, without the extra function. I also bumped the minimum scikit-learn version to 1.2.1 for compatibility with skrub. Note that scikit-learn 1.2.1 was released in January 2023, so it is already more than two years old and should be a reasonable minimum. The same goes for pandas 1.5.3.
I also removed the drop_null_fraction parameter, since it doesn't exist in all skrub versions. The default behavior is reasonable: it only removes columns that are entirely NaN, which is appropriate for our use case.
Could we use AutoGluon's AutoMLPipelineFeatureGenerator instead?