
added TfidfTransformer and TfidfVectorizer to feature_extraction.text

Open ParticularMiner opened this pull request 4 years ago • 6 comments

Hi,

Thanks to all dask developers for your outstanding work!

In this PR, I have attempted to apply my rudimentary knowledge of dask to add dask implementations of TfidfTransformer and TfidfVectorizer (found in the sklearn.feature_extraction.text module) to the dask_ml.feature_extraction.text module. For now, only minimal working code (no unit tests yet) is available, though the examples hard-coded into the docstrings should run without incident.
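
A minimal sketch of the intended usage, assuming the new classes mirror their sklearn counterparts the way dask-ml's existing CountVectorizer does:

```python
# Sketch only: assumes TfidfVectorizer (added in this PR) follows the
# same dask.bag-based API as dask_ml's CountVectorizer.
import dask.bag as db
from dask_ml.feature_extraction.text import TfidfVectorizer

corpus = db.from_sequence(
    [
        "this is the first document",
        "this document is the second document",
        "and this is the third one",
    ],
    npartitions=2,
)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # a lazy dask.array of sparse chunks
result = X.compute()                  # the user triggers the computation
```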

I think it needs someone with proper dask expertise to inspect it and give me some pointers. In the meantime, I'll draw up the tests.

Hopefully, this will prove to be a useful extension to dask-ml. It certainly would be for me, if it is eventually merged upstream.

ParticularMiner avatar Oct 23 '21 08:10 ParticularMiner

Tests have now been added.

ParticularMiner avatar Oct 23 '21 15:10 ParticularMiner

I have also just now added support for dask.dataframe.Series to CountVectorizer and TfidfVectorizer.
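
A sketch of what that input looks like (assuming the same fit/transform API as above):

```python
# Sketch only: a dask.dataframe.Series as input to the vectorizer.
import pandas as pd
import dask.dataframe as dd
from dask_ml.feature_extraction.text import TfidfVectorizer

docs = dd.from_pandas(
    pd.Series(["the first document", "the second document"]),
    npartitions=2,
)
X = TfidfVectorizer().fit_transform(docs)  # accepts a dask Series now
```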

ParticularMiner avatar Oct 24 '21 09:10 ParticularMiner

Hi @TomAugspurger

Thanks for your review.

From your comments I gather that the dask programming paradigm is to delay all computations until the user calls compute() outside of the class, right? I guess that's the challenge for me right now. I'll see what I can do to achieve this.

ParticularMiner avatar Oct 24 '21 14:10 ParticularMiner

> …the dask programming paradigm is to delay all computations until the user calls compute() outside of the class, right? I guess that's the challenge for me right now. I'll see what I can do to achieve this.

If possible, yes. But sometimes intermediate computation is inevitable. .fit will often require a compute to learn the parameters (e.g. StandardScaler); we just want to make sure it's actually required.
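
An illustrative sketch of the two styles (toy code, not dask-ml's actual estimators):

```python
import dask.array as da

class EagerScaler:
    def fit(self, X):
        # Computes inside fit, as an estimator must when the learned
        # parameters are needed as concrete values during fitting.
        self.mean_ = X.mean(axis=0).compute()
        return self

class LazyScaler:
    def fit(self, X):
        # Stores only the lazy expression; nothing is evaluated here.
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        # The fit graph fuses into the transform graph.
        return X - self.mean_

X = da.random.random((1000, 4), chunks=(250, 4))
Xt = LazyScaler().fit(X).transform(X)  # still lazy
Xt.compute()                           # the single, user-issued compute
```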

TomAugspurger avatar Oct 24 '21 14:10 TomAugspurger

Hi @TomAugspurger

> If possible, yes. But sometimes intermediate computation is inevitable. .fit will often require a compute to learn the parameters (e.g. StandardScaler); we just want to make sure it's actually required.

True.

I've now cleaned things up: all unnecessary calls to compute() have been removed, and TfidfTransformer's fit() is now lazy; that is, it does not learn the parameters until transform() is first called. Thereafter, the learned data remains in memory and does not need to be computed again.

Also, TfidfTransformer's transform() and TfidfVectorizer's fit(), fit_transform(), and transform() are all lazy.
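
For concreteness, a sketch of that lazy-fit-with-caching pattern (attribute names are illustrative, not the PR's actual code, and normalization is omitted):

```python
import dask.array as da

class LazyIdfTransformer:
    def fit(self, X):
        # Build only the lazy expression for the idf vector;
        # no compute() is issued here.
        n_samples = X.shape[0]
        df = (X > 0).sum(axis=0)            # document frequencies
        self._idf_lazy = da.log((1 + n_samples) / (1 + df)) + 1
        self._idf = None
        return self

    def transform(self, X):
        if self._idf is None:
            # First call: materialize the learned idf into memory so
            # subsequent transforms reuse it without recomputation.
            self._idf = self._idf_lazy.persist()
        return X * self._idf  # still lazy until the user computes it
```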

Currently all tests are passing.

The one outstanding issue concerns the 'worrying' side effect of using self.get_params() in CountVectorizer. I'm not sure why the original developers chose to use this function, since sklearn.feature_extraction.text.CountVectorizer's own fit_transform() and transform() do not use it. So there might be a way to circumvent it; I'll take a closer look.

ParticularMiner avatar Oct 25 '21 14:10 ParticularMiner

@TomAugspurger

By the way, I do not have access to a cluster, so I'm not sure how the code scales with cluster size. I merely presumed that if I wrote code similar to that of dask-ml's existing CountVectorizer, things would be fine.

If you know of any way I can test the code in a truly distributed environment, kindly let me know.
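
In the meantime, one way to approximate a distributed run without real hardware is a local multi-process cluster via dask.distributed, which catches most serialization and scheduling issues (a sketch):

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# ... run the vectorizer tests while this client is active ...

client.close()
cluster.close()
```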

ParticularMiner avatar Oct 25 '21 16:10 ParticularMiner