dask-ml
Distributed TFIDF
Greetings!
I recently used Dask to implement a distributed version of TF-IDF, and I'd like to contribute it to the Dask project by putting it somewhere. Would this be the correct repo?
I thought maybe a feature_extraction directory would be appropriate.
Things are still a bit in flux, but I think this would be a good home. This would be a great addition!
I thought maybe a feature_extraction directory would be appropriate.
Perfect. I'm trying to follow scikit-learn's layout as closely as possible.
I see.
What's the difference between this repo and dask-glm? Can't the latter be a subset of the former?
What's the difference between this repo and dask-glm?
I'm probably going to just import the dask-glm estimators into the dask-ml namespace (likewise with dask-searchcv and dask-patternsearch). For the user, it'd be nice to have a single place to go for all Dask-related ML things.
Development will probably still continue in those other repositories.
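Roughly, the idea is that dask-ml would just re-export those estimators, something like the sketch below (the module paths are illustrative, not a commitment to the final package layout):

# Illustrative sketch only -- module paths are not final.
# dask_ml/linear_model.py
from dask_glm.estimators import LinearRegression, LogisticRegression  # noqa: F401

# dask_ml/model_selection.py
from dask_searchcv import GridSearchCV, RandomizedSearchCV  # noqa: F401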
I'd be interested in seeing this code. How was this achieved?
The way I first implemented it was something like this. (I changed it around to remove company-private code, and this was well before I started contributing to Dask, so my knowledge was far less than it is today.)
import numpy as np
import dask.bag as db
from toolz.curried import frequencies, merge_with, unique, valmap


def normalize(_dict):
    """Divide each value by the sum of all values."""
    return {k: v / sum(_dict.values()) for k, v in _dict.items()}


corpus = 2 * [
    "a a b b b",
    "a a a a c",
    "c b a d e f g",
    "a b b b c c c d d d a b",
    "c b a d e f g",
    "a b b b c c c d d d a b",
]

# Tokenize each document.
base = (
    db
    .from_sequence(corpus, npartitions=4)
    .map(str.split)
)

# Per-document normalized term frequencies.
tf = (
    base
    .map(frequencies)
    .map(normalize)
)

# Global, negated, log-scaled document frequencies.
idf = (
    base
    .map(unique)
    .map(frequencies)
    .reduction(merge_with(sum), merge_with(sum), split_every=2)
    .apply(normalize)
    .apply(valmap(np.log10))
    .apply(valmap(np.negative))
)
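For completeness, a small usage sketch (not part of the original snippet) showing how the two pieces might be combined into per-document TF-IDF weights:

# Combine the per-document term frequencies with the global IDF weights.
# Computing `idf` here is exactly the extra pass over the data mentioned below.
idf_weights = idf.compute()

tfidf = tf.map(
    lambda doc: {word: freq * idf_weights[word] for word, freq in doc.items()}
)
print(tfidf.compute()[:2])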
This has a drawback, since it must pass over the entire dataset once just to calculate the idf step.
To me, a better solution (which is what I want to implement here) would be to first implement the hashing trick and then compute the TF-IDF on top of it.
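A minimal sketch of what that hashing-trick step could look like on top of the pipeline above, assuming scikit-learn's FeatureHasher (an illustration, not the implementation being proposed):

import dask.bag as db
from sklearn.feature_extraction import FeatureHasher
from toolz.curried import frequencies

# Each document becomes a fixed-width sparse row, so no global vocabulary
# is needed to assign feature indices (`corpus` is the toy corpus above).
hasher = FeatureHasher(n_features=2**18, input_type="dict")

hashed_rows = (
    db.from_sequence(corpus, npartitions=4)
    .map(str.split)
    .map(frequencies)
    .map(lambda counts: hasher.transform([counts]))
)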
Thanks Daniel,
Looks good. My end goal is to map multiple text columns in a Dask DataFrame to n-gram vectors suitable for supervised machine learning, in combination with other non-text columns. I'll have a play around with what you've done here and post back my results. I don't think the hashing trick will work for me, as I'll require feature mapping at the end of all this.
Cool.
The problem I faced when I decided to implement this was similar to yours. Our implementation originally used the get_dummies function in pandas on the text columns.
Yep. I've used the get_dummies function to process my categorical columns via map_partitions on a Dask DataFrame, since marking the input columns as categorical handles the schema merging per partition. What strategy does your original implementation use to form partitions with consistent schemas?
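For concreteness, a minimal sketch of that pattern (column names here are made up for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                    "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Making the column categorical first means every partition knows the full
# set of categories, so get_dummies produces the same columns in each one.
ddf = ddf.categorize(columns=["color"])
dummies = ddf.map_partitions(pd.get_dummies, columns=["color"])
print(dummies.compute())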
An alternative approach to using dask.bag could be to apply scikit-learn's CountVectorizer or HashingVectorizer on chunks of the dataset, merge the results, and then apply IDF weighting (see https://github.com/FreeDiscovery/FreeDiscovery/issues/152). It might require somewhat less work, since different vectorization options are already implemented in scikit-learn, and it should be possible to keep a fairly compatible API.
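A rough sketch of that chunk-wise approach (assuming scikit-learn's HashingVectorizer; the merge and IDF-weighting steps here are simplified and collect the per-chunk matrices onto one machine):

import numpy as np
import dask.bag as db
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ["a a b b b", "a a a a c", "c b a d e f g"] * 4
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

# Vectorize each chunk of documents independently.
chunks = (
    db.from_sequence(corpus, npartitions=4)
    .map_partitions(lambda docs: [vectorizer.transform(list(docs))])
)

X = sp.vstack(chunks.compute())  # merge the per-chunk sparse matrices

# Apply IDF weighting based on the merged document frequencies
# (smoothed, in the style of scikit-learn's TfidfTransformer).
df = np.asarray((X > 0).sum(axis=0)).ravel()
idf = np.log((1 + X.shape[0]) / (1 + df)) + 1
X_tfidf = X.multiply(idf).tocsr()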
+1 on avoiding bag in performance sensitive code :)
RAPIDS already has an implementation of distributed TFIDF: https://github.com/rapidsai/cuml/blob/147f795e03a1b3f3b53aa545385cc5025f86a2f0/python/cuml/dask/feature_extraction/text/tfidf_transformer.py#L31.
They explicitly support Dask Arrays: https://github.com/rapidsai/cuml/blob/147f795e03a1b3f3b53aa545385cc5025f86a2f0/python/cuml/dask/feature_extraction/text/tfidf_transformer.py#L140
Beware that cuml typically does relatively GPU-centric things. Their algorithms are rarely generalizable to non-RAPIDS use cases. I would be curious to see what a CPU implementation would look like with more traditional dataframe/array operations.
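As a starting point, here is a hypothetical sketch of what such a CPU-side IDF weighting might look like with plain dask.array operations, given an (n_docs, n_features) array of term counts (how that array is produced, e.g. via per-partition hashing, is left aside):

import numpy as np
import dask.array as da

def tfidf_transform(X):
    # X: (n_docs, n_features) dask array of term counts.
    n_docs = X.shape[0]
    df = (X > 0).sum(axis=0)                   # document frequency per term
    idf = da.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
    return X * idf                             # broadcasts across rows

counts = da.from_array(
    np.array([[1, 0, 2],
              [0, 3, 1],
              [2, 1, 0]]),
    chunks=(2, 3),
)
print(tfidf_transform(counts).compute())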
Sorry to revive an old thread, but was this implemented anywhere?