dask-ml
Distributed TFIDF
Greetings!
I recently used Dask to implement a distributed version of TF-IDF, and I'd like to contribute it to the Dask project by putting it somewhere. Would this be the correct repo?
I thought maybe a feature_extraction directory would be appropriate.
Things are still a bit in flux, but I think this would be a good home. This would be a great addition!
I thought maybe a feature_extraction directory would be appropriate.
Perfect. I'm trying to follow scikit-learn's layout as closely as possible.
I see.
What's the difference between this repo and dask-glm? Can't the latter be a subset of the former?
What's the difference between this repo and dask-glm?
I'm probably going to just import the dask-glm estimators into the dask-ml namespace (likewise with dask-searchcv and dask-patternsearch). For the user, it'd be nice to have a single place to go for all Dask-related ML things.
Development will probably still continue in those other repositories.
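Roughly, the idea is that dask-ml would just re-export those estimators, something like the sketch below (the module paths are illustrative, not a commitment to the final package layout):

# Illustrative sketch only -- module paths are not final.
# dask_ml/linear_model.py
from dask_glm.estimators import LinearRegression, LogisticRegression  # noqa: F401

# dask_ml/model_selection.py
from dask_searchcv import GridSearchCV, RandomizedSearchCV  # noqa: F401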
I'd be interested in seeing this code. How was this achieved?
The way I first implemented it was something like this. (I changed it around to remove company-private code, and this was well before I started contributing to Dask, so my knowledge was far less than it is today.)
import numpy as np
import dask.bag as db
from toolz.curried import frequencies, merge_with, unique, valmap


def normalize(_dict):
    """Divide each value by the sum of all values."""
    return {k: v / sum(_dict.values()) for k, v in _dict.items()}


corpus = 2 * [
    "a a b b b",
    "a a a a c",
    "c b a d e f g",
    "a b b b c c c d d d a b",
    "c b a d e f g",
    "a b b b c c c d d d a b",
]

# Tokenize each document.
base = (
    db
    .from_sequence(corpus, npartitions=4)
    .map(str.split)
)

# Per-document normalized term frequencies.
tf = (
    base
    .map(frequencies)
    .map(normalize)
)

# Global, negated, log-scaled document frequencies.
idf = (
    base
    .map(unique)
    .map(frequencies)
    .reduction(merge_with(sum), merge_with(sum), split_every=2)
    .apply(normalize)
    .apply(valmap(np.log10))
    .apply(valmap(np.negative))
)
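For completeness, a small usage sketch (not part of the original snippet) showing how the two pieces might be combined into per-document TF-IDF weights:

# Combine the per-document term frequencies with the global IDF weights.
# Computing `idf` here is exactly the extra pass over the data mentioned below.
idf_weights = idf.compute()

tfidf = tf.map(
    lambda doc: {word: freq * idf_weights[word] for word, freq in doc.items()}
)
print(tfidf.compute()[:2])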
This has a drawback, since it must pass over the entire dataset once just to calculate the idf step.
To me, a better solution (which is what I want to implement here) would be to first implement the hashing trick and then compute the TF-IDF on top of it.
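A minimal sketch of what that hashing-trick step could look like on top of the pipeline above, assuming scikit-learn's FeatureHasher (an illustration, not the implementation being proposed):

import dask.bag as db
from sklearn.feature_extraction import FeatureHasher
from toolz.curried import frequencies

# Each document becomes a fixed-width sparse row, so no global vocabulary
# is needed to assign feature indices (`corpus` is the toy corpus above).
hasher = FeatureHasher(n_features=2**18, input_type="dict")

hashed_rows = (
    db.from_sequence(corpus, npartitions=4)
    .map(str.split)
    .map(frequencies)
    .map(lambda counts: hasher.transform([counts]))
)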
Thanks Daniel,
Looks good. My end goal is to map multiple text columns in a Dask DataFrame to n-gram vectors suitable for supervised machine learning, in combination with other non-text columns. I'll have a play around with what you've done here and post back my results. I don't think the hashing trick will work for me, as I'll require feature mapping at the end of all this.
Cool.
The problem I faced when I decided to implement this was similar to yours. Our implementation originally used the get_dummies function in pandas on the text columns.
Yep. I've used the get_dummies function to process my categorical columns via map_partitions on a Dask DataFrame, since marking the input columns as categorical handles the schema merging per partition. What strategy does your original implementation use to form partitions with consistent schemas?
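For concreteness, a minimal sketch of that pattern (column names here are made up for illustration):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                    "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Making the column categorical first means every partition knows the full
# set of categories, so get_dummies produces the same columns in each one.
ddf = ddf.categorize(columns=["color"])
dummies = ddf.map_partitions(pd.get_dummies, columns=["color"])
print(dummies.compute())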
An alternative approach to using dask.bag could be to apply scikit-learn's CountVectorizer or HashingVectorizer on chunks of the dataset, merge the results, and then apply IDF weighting (see https://github.com/FreeDiscovery/FreeDiscovery/issues/152). It might require somewhat less work, since different vectorization options are already implemented in scikit-learn, and it should be possible to keep a fairly compatible API.
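A rough sketch of that chunk-wise approach (assuming scikit-learn's HashingVectorizer; the merge and IDF-weighting steps here are simplified and collect the per-chunk matrices onto one machine):

import numpy as np
import dask.bag as db
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ["a a b b b", "a a a a c", "c b a d e f g"] * 4
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

# Vectorize each chunk of documents independently.
chunks = (
    db.from_sequence(corpus, npartitions=4)
    .map_partitions(lambda docs: [vectorizer.transform(list(docs))])
)

X = sp.vstack(chunks.compute())  # merge the per-chunk sparse matrices

# Apply IDF weighting based on the merged document frequencies
# (smoothed, in the style of scikit-learn's TfidfTransformer).
df = np.asarray((X > 0).sum(axis=0)).ravel()
idf = np.log((1 + X.shape[0]) / (1 + df)) + 1
X_tfidf = X.multiply(idf).tocsr()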
+1 on avoiding bag in performance sensitive code :)
RAPIDS already has an implementation of distributed TFIDF: https://github.com/rapidsai/cuml/blob/147f795e03a1b3f3b53aa545385cc5025f86a2f0/python/cuml/dask/feature_extraction/text/tfidf_transformer.py#L31.
They explicitly support Dask Arrays: https://github.com/rapidsai/cuml/blob/147f795e03a1b3f3b53aa545385cc5025f86a2f0/python/cuml/dask/feature_extraction/text/tfidf_transformer.py#L140
Beware that cuml typically does relatively GPU-centric things. Their algorithms are rarely generalizable to non-RAPIDS use cases. I would be curious to see what a CPU implementation would look like with more traditional dataframe/array operations.
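As a starting point, here is a hypothetical sketch of what such a CPU-side IDF weighting might look like with plain dask.array operations, given an (n_docs, n_features) array of term counts (how that array is produced, e.g. via per-partition hashing, is left aside):

import numpy as np
import dask.array as da

def tfidf_transform(X):
    # X: (n_docs, n_features) dask array of term counts.
    n_docs = X.shape[0]
    df = (X > 0).sum(axis=0)                   # document frequency per term
    idf = da.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
    return X * idf                             # broadcasts across rows

counts = da.from_array(
    np.array([[1, 0, 2],
              [0, 3, 1],
              [2, 1, 0]]),
    chunks=(2, 3),
)
print(tfidf_transform(counts).compute())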
Sorry to revive an old thread, but was this implemented anywhere?