TFIDF clustering pipeline

Open nlyf opened this issue 3 years ago • 13 comments

Versions

river version: 0.7.1 Python version: 3.7 Operating system: ubuntu 20.04

Describe the bug

I'm trying to do online TF-IDF clustering with DBSTREAM and DenStream. Can I use those algorithms with the TFIDF feature extraction method? Or am I supposed to build the whole feature matrix first and then perform clustering? That would not be streaming clustering, though...

Steps/code to reproduce

from river import cluster, compose, feature_extraction

tfidf = compose.Pipeline(
    ("tfidf", feature_extraction.TFIDF()),
    ("clusterer", cluster.DBSTREAM()),
)

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
for sentence in corpus:
    tfidf.transform_one(sentence)

I get the following traceback:

    201                     + self._gaussian_neighborhood(x, self.micro_clusters[i].center)
    202                     * (x[j] - self.micro_clusters[i].center[j])
--> 203                     for j in self.micro_clusters[i].center.keys()
    204                 }
    205                 self.micro_clusters[i].last_update = self.time_stamp

KeyError: 'first'

And the corresponding traceback for DenStream:

    386             self.initial_weight = 2 ** (self.decaying_factor * self.timestamp)
    387             self.LS = x
--> 388             self.SS = {i: (x[i] * x[i]) for i in range(self.dim)}
    389             self.IWLS = {
    390                 i: (2 ** (self.decaying_factor * self.timestamp)) * x[i]

KeyError: 0

nlyf avatar Aug 06 '21 07:08 nlyf

@hoanganhngo610 could you take a look at this? I've tried with DBSTREAM and DenStream and neither works. Some initialization stuff is missing somewhere.

MaxHalford avatar Aug 13 '21 07:08 MaxHalford

Hi, I'm still very curious. Have you got any ideas? @MaxHalford @hoanganhngo610

nlyf avatar Aug 30 '21 13:08 nlyf

Sorry @nlyf I closed this by mistake. I would like @hoanganhngo610 to take a look at what's going on.

MaxHalford avatar Aug 30 '21 17:08 MaxHalford

@MaxHalford @nlyf @hoanganhngo610

This is actually something I am very interested in implementing.

I think the issue has something to do with TFIDF outputting data in the format of dictionaries whereas dbstream and denstream expect the input in the form of a 2D array.

I've only just discovered RiverML, so my analysis of this issue might be incorrect if I'm missing something.

ribab avatar Sep 01 '21 19:09 ribab

You could be right... To be honest I haven't supervised this part of the codebase so I'm not too sure. I'll take a look.

MaxHalford avatar Sep 01 '21 21:09 MaxHalford

I was poking through the DenStream code, and I'm a little confused.

This method in denstream.py seems to expect dict input:

    @staticmethod
    def _distance(point_a, point_b):
        return math.sqrt(utils.math.minkowski_distance(point_a, point_b, 2))

as you can see in the minkowski_distance function, which calls .keys() on both a and b:

def minkowski_distance(a: dict, b: dict, p: int):
    return sum(
        (abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in set([*a.keys(), *b.keys()])
    )

Whereas the failure point in the code expects x to be a list indexable by the integer i:

self.SS = {i: (x[i] * x[i]) for i in range(self.dim)}
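To make the mismatch concrete, here is a minimal pure-Python sketch. It is not the actual DenStream code: the two helper functions are hypothetical names, and the feature values in the string-keyed dict are made up. It only shows why indexing with range(dim) breaks on string keys while iterating over the dict's own keys does not.

```python
# Hypothetical helpers for illustration only; not part of river.

def squared_sum_positional(x, dim):
    # Mirrors the failing pattern: assumes the keys are exactly 0..dim-1.
    return {i: x[i] * x[i] for i in range(dim)}


def squared_sum_by_key(x):
    # Iterates over whatever keys the feature dict actually has.
    return {k: v * v for k, v in x.items()}


# Integer-keyed features, the shape used in the DenStream example.
dense = {0: -1.0, 1: -0.5}
# String-keyed features, the shape a bag-of-words transformer emits
# (the weights here are invented for the example).
sparse = {"first": 0.5, "document": 0.25}

print(squared_sum_by_key(dense))
print(squared_sum_by_key(sparse))

try:
    squared_sum_positional(sparse, dim=2)
except KeyError as err:
    # Same failure mode as the traceback: there is no key 0 in the dict.
    print("KeyError:", err)
```

The key-based version works for both shapes, so it would not need to know the dimensionality up front.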

ribab avatar Sep 01 '21 22:09 ribab

The example for DenStream uses the output from river.stream.iter_array, which looks like {0: -1, 1: -0.5}, a dict with integer keys. When using TFIDF online, however, we will keep finding new, previously unseen words as time goes on, so maybe the DenStream code needs to change to take this into account and also allow non-integer keys?
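As a rough sketch of what "taking this into account" could look like, the snippet below treats any missing key as 0.0 and works over the union of keys, the same way the minkowski_distance function quoted earlier does. The move_center function and its rate parameter are hypothetical, chosen for illustration; this is not how DBSTREAM or DenStream actually update their micro-clusters.

```python
import math


def minkowski_distance(a, b, p):
    # Same idea as the river helper quoted earlier: missing keys count as 0.0.
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) ** p for k in {*a, *b})


def move_center(center, x, rate):
    # Hypothetical update rule: shift the center toward x by a fraction `rate`.
    # Working over the union of keys means a word seen for the first time
    # (in x but not in center, or the reverse) does not raise a KeyError.
    keys = {*center, *x}
    return {
        k: center.get(k, 0.0) + rate * (x.get(k, 0.0) - center.get(k, 0.0))
        for k in keys
    }


center = {"first": 1.0, "document": 1.0}
x = {"document": 1.0, "second": 1.0}  # "second" is a new word

before = math.sqrt(minkowski_distance(center, x, 2))
new_center = move_center(center, x, rate=0.5)
after = math.sqrt(minkowski_distance(new_center, x, 2))

print(new_center)
print(before, after)
```

After the update the center has an entry for "second", and its distance to x is smaller than before.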

ribab avatar Sep 01 '21 22:09 ribab

This is indeed a bug. As @ribab indicates, the indexing must be changed to account for non-integer keys. I will coordinate with @hoanganhngo610 to fix this.

jacobmontiel avatar Sep 02 '21 00:09 jacobmontiel

Hey @nlyf, can you please confirm if the changes introduced in #711 by @jacobmontiel fix the problems you faced with the clusterers?

smastelini avatar Oct 08 '21 17:10 smastelini

@smastelini @jacobmontiel Thanks for your work. There is no error with DenStream, but DBSTREAM produces the same bug.

nlyf avatar Oct 11 '21 08:10 nlyf

Thanks for reporting, @nlyf. We are going to check it :)

smastelini avatar Oct 11 '21 11:10 smastelini

@smastelini @jacobmontiel Any news on DBSTREAM?

Dhul-Husni avatar Jan 24 '22 18:01 Dhul-Husni

I am not very familiar with the clustering module. I am pinging @jacobmontiel and @hoanganhngo610. If they are unavailable, I can take a look.

smastelini avatar Jan 27 '22 12:01 smastelini

@hoanganhngo610 can I let you look into this so we can possibly close the issue?

MaxHalford avatar Sep 23 '22 17:09 MaxHalford

@MaxHalford I will have a look at this real soon!

hoanganhngo610 avatar Sep 23 '22 17:09 hoanganhngo610

@MaxHalford I have just tried this again with River version 0.13.0 and it does not return any errors. Would you mind checking it again so that I do not miss anything?

hoanganhngo610 avatar Sep 24 '22 21:09 hoanganhngo610

Yep it looks good. Thanks for checking @hoanganhngo610, you rock :)

MaxHalford avatar Sep 24 '22 21:09 MaxHalford