dask-ml
added TfidfTransformer and TfidfVectorizer to feature_extraction.text
Hi,
Thanks to all dask-developers for your outstanding work!
In this PR, I have attempted to apply my rudimentary knowledge of dask to add dask implementations of TfidfTransformer and TfidfVectorizer (from the sklearn.feature_extraction.text module) to the dask_ml.feature_extraction.text module. For now, only minimal working code is available (no unit tests yet), though the examples I've hard-coded into the docstrings should run without incident.
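For reference, a minimal usage sketch of what I have in mind, assuming the new TfidfVectorizer mirrors the dask.bag-based API of the existing CountVectorizer (the corpus here is illustrative):

```python
import dask.bag as db
from dask_ml.feature_extraction.text import TfidfVectorizer

# a tiny illustrative corpus, partitioned as a dask bag of documents
corpus = db.from_sequence(
    ["the quick brown fox", "jumped over the lazy dog", "the dog barked"],
    npartitions=2,
)

vectorizer = TfidfVectorizer()
result = vectorizer.fit_transform(corpus)  # lazy dask array of sparse blocks
print(result.compute())                    # materialize the tf-idf matrix
```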
I think it needs someone with proper dask expertise to inspect it and give me some pointers. In the meantime, I'll draw up the tests.
Hopefully, this will prove to be a useful extension to dask-ml. At least it would be for me, if it is eventually merged upstream.
Tests have now been added.
I have also just now added support for dask.dataframe.Series in CountVectorizer and TfidfVectorizer.
Hi @TomAugspurger
Thanks for your review.
From your comments I realize that the dask programming paradigm is to delay all computations until the user calls compute() outside of the class, right? I guess that's the challenge for me right now. I'll see what I can do to achieve this.
> ... is to delay all computations until the user calls compute() outside of the class, right? I guess that's the challenge for me right now. I'll see what I can do to achieve this.
If possible, yes. But sometimes intermediate computation is inevitable. .fit will often require a compute to learn the parameters (e.g. StandardScaler); we just want to make sure it's actually required.
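For concreteness, a StandardScaler-flavoured sketch of that distinction (illustrative only, not dask-ml's actual implementation):

```python
import dask
import dask.array as da

X = da.random.random((1000, 3), chunks=(250, 3))

# fit: learning the parameters genuinely requires a compute,
# since mean_ and std_ must end up as concrete numbers
mean_, std_ = dask.compute(X.mean(axis=0), X.std(axis=0))

# transform: no compute needed, this stays a lazy dask array
X_scaled = (X - mean_) / std_
```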
Hi @TomAugspurger
> If possible, yes. But sometimes intermediate computation is inevitable. .fit will often require a compute to learn the parameters (e.g. StandardScaler); we just want to make sure it's actually required.
True.
I've cleaned things up a bit now: all unnecessary calls to compute() have been removed, and TfidfTransformer's fit() is now lazy. That is, it does not learn the parameters until the first call to transform(); thereafter, all learned data remains in memory and does not need to be computed again.
Also, TfidfTransformer's transform(), and TfidfVectorizer's fit(), fit_transform(), and transform(), are all lazy.
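To sketch the pattern I mean (illustrative names only; this is not the actual PR code, and it skips sparse handling and normalization):

```python
import dask.array as da

class LazyIdfSketch:
    """Lazy fit: record a task graph; compute it once, on first transform."""

    def fit(self, X):
        n_samples = X.shape[0]
        df = (X > 0).sum(axis=0)  # document frequencies, still lazy
        # smoothed idf, as in sklearn: log((1 + n) / (1 + df)) + 1
        self._idf_graph = da.log((1 + n_samples) / (1 + df)) + 1
        self._idf = None          # concrete values, filled on first use
        return self

    def transform(self, X):
        if self._idf is None:
            # the first call triggers the deferred computation; the result
            # is kept in memory so later calls reuse it
            self._idf = self._idf_graph.compute()
        return X * self._idf      # elementwise scaling, itself lazy
```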
Currently all tests are passing.
The one outstanding issue concerns the 'worrying' side effect of using self.get_params() in CountVectorizer. I'm not sure why the original developers chose to use this function, since sklearn.feature_extraction.text.CountVectorizer's own fit_transform() and transform() do not use it. So there might be a way to circumvent its use; I'll take a closer look at it.
@TomAugspurger
By the way, I do not have access to a cluster, so I'm not sure how the code scales with cluster size. I merely presumed that if I wrote code similar to dask-ml's existing CountVectorizer, things would be fine.
If you know of any way I can test the code in a truly distributed environment, kindly let me know.
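The closest approximation I can run myself is dask.distributed's LocalCluster, which at least exercises multi-process scheduling and serialization on one machine (a sketch, not a substitute for a real cluster):

```python
from dask.distributed import Client, LocalCluster

# simulate a small cluster locally: separate worker processes
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# any subsequent .compute() / fit / transform runs on these workers,
# which should surface serialization and scheduling problems early
print(client)
```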