
Automatically cleaning unicode text

Open dimenwarper opened this issue 8 years ago • 2 comments

Thanks for this awesome tool! I was wondering if we could include some sanity checking/cleanup for badly behaved text (e.g. invalid or mis-encoded Unicode characters). It could be as simple as running ftfy on all text columns. I'd volunteer to integrate this into datacleaner.
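As a rough illustration of "running ftfy on all text columns", here is a minimal sketch. It is not datacleaner code: the function name `fix_text_columns` and its `fixer` parameter are hypothetical, and it assumes the data lives in a pandas DataFrame. The default fixer is `ftfy.fix_text`, imported lazily so a different cleanup function can be swapped in.

```python
import pandas as pd


def fix_text_columns(df, fixer=None):
    """Return a copy of df with `fixer` applied to every string value.

    Hypothetical helper, not part of datacleaner. Numeric columns are
    skipped, and non-string values (e.g. missing values) are left as-is.
    """
    if fixer is None:
        # ftfy is a third-party package; it is only imported when needed.
        from ftfy import fix_text
        fixer = fix_text

    cleaned = df.copy()
    for col in cleaned.columns:
        if pd.api.types.is_numeric_dtype(cleaned[col]):
            continue  # nothing to repair in numeric columns
        cleaned[col] = cleaned[col].map(
            lambda v: fixer(v) if isinstance(v, str) else v
        )
    return cleaned
```

Calling `fix_text_columns(df)` would leave the input untouched and return the cleaned copy, so the step could sit in front of any later encoding.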

dimenwarper, May 25 '17 05:05

Sounds promising. Please submit a PR with the new functionality along with unit tests to demonstrate how it works.

rhiever, May 25 '17 13:05

I've implemented a draft of this but realized it may clash with the functionality that converts all text to numerical values. I wonder how to proceed. As I see it, there are two options:

  1. Fix the text before applying the encoding: This is what I'm doing right now, so strings like >=50 and >=50' get encoded to the same label.
  2. Make encoding optional: This is tricky, because there will be some text-based columns where you want to preserve the text to featurize later (e.g. with a sklearn.feature_extraction.text.TfidfVectorizer) rather than convert it to a label with an encoder. The hard part is how to specify which columns should be encoded and which should not.

One way to proceed would be to go with 1 and then tackle 2 in a later issue.
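To make option 1 concrete, here is a small standard-library sketch of why the order matters. Everything in it is an assumption for illustration: `fix_value` is a stand-in for `ftfy.fix_text` (it only does NFKC normalization and whitespace trimming), and `encode_labels` is a toy encoder that assigns integer labels in sorted order, not datacleaner's actual encoding step.

```python
import unicodedata


def fix_value(text):
    """Stand-in for ftfy.fix_text: NFKC-normalize and trim whitespace."""
    return unicodedata.normalize("NFKC", text).strip()


def encode_labels(values, fix=None):
    """Map each distinct string to an integer label.

    If `fix` is given, it is applied to every value before the labels
    are assigned, so variants of the same text collapse into one class.
    """
    if fix is not None:
        values = [fix(v) for v in values]
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


# The second entry uses a fullwidth greater-than sign (U+FF1E).
raw = [">=50", "\uff1e=50", "<50"]

labels_raw, classes_raw = encode_labels(raw)
labels_fixed, classes_fixed = encode_labels(raw, fix=fix_value)

print(len(classes_raw))   # 3 classes without fixing
print(classes_fixed)      # ['<50', '>=50']
print(labels_fixed)       # [1, 1, 0]
```

Without the fix, the two spellings of ">=50" end up as separate labels; with the fix applied first, they share one. Option 2 would need some extra way to exempt columns from `encode_labels` entirely, which this sketch does not attempt.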

dimenwarper, May 31 '17 17:05