TFIDF clustering pipeline
Versions
river version: 0.7.1
Python version: 3.7
Operating system: Ubuntu 20.04
Describe the bug
I'm trying to do online TF-IDF clustering using DBSTREAM and DenStream. Can I use those algorithms with the TFIDF feature extraction method? Or am I supposed to build the whole feature matrix first and then perform clustering? That would not be streaming clustering, though...
Steps/code to reproduce
from river import cluster, compose, feature_extraction

tfidf = compose.Pipeline(
    ("tfidf", feature_extraction.TFIDF()),
    ("clusterer", cluster.DBSTREAM()),
)

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

for sentence in corpus:
    tfidf.transform_one(sentence)
This gives the following traceback:
201 + self._gaussian_neighborhood(x, self.micro_clusters[i].center)
202 * (x[j] - self.micro_clusters[i].center[j])
--> 203 for j in self.micro_clusters[i].center.keys()
204 }
205 self.micro_clusters[i].last_update = self.time_stamp
KeyError: 'first'
And the corresponding traceback for DenStream:
386 self.initial_weight = 2 ** (self.decaying_factor * self.timestamp)
387 self.LS = x
--> 388 self.SS = {i: (x[i] * x[i]) for i in range(self.dim)}
389 self.IWLS = {
390 i: (2 ** (self.decaying_factor * self.timestamp)) * x[i]
KeyError: 0
@hoanganhngo610 could you take a look at this? I've tried with DBStream and DenStream and both don't work. Some initialization stuff is missing somewhere.
Hi, I'm still very curious. Have you got any ideas? @MaxHalford @hoanganhngo610
Sorry @nlyf I closed this by mistake. I would like @hoanganhngo610 to take a look at what's going on.
@MaxHalford @nlyf @hoanganhngo610
This is actually something I am very interested in implementing.
I think the issue is that TFIDF outputs its data as dictionaries, whereas DBSTREAM and DenStream expect the input in the form of a 2D array.
I've only just discovered RiverML, so my analysis of this issue might be incorrect if I'm missing something.
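To make the dictionary point concrete, here is a minimal sketch of an online TF-IDF transformer. `TinyTFIDF` is a hypothetical, simplified stand-in written for illustration (whitespace tokenization, no normalization); it is not river's implementation. It only shows that the output is a dict keyed by tokens, and that the set of keys keeps growing as new documents arrive.

```python
import math
from collections import Counter


class TinyTFIDF:
    """Hypothetical, simplified online TF-IDF (not river's implementation)."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()  # token -> number of documents containing it

    def learn_one(self, text):
        # Update document frequencies with the distinct tokens of this document.
        self.n_docs += 1
        self.doc_freq.update(set(text.lower().split()))

    def transform_one(self, text):
        # Return a sparse feature dict keyed by token (string), not by position.
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {
            token: (count / total)
            * (math.log((1 + self.n_docs) / (1 + self.doc_freq[token])) + 1)
            for token, count in counts.items()
        }


tfidf = TinyTFIDF()
seen_keys = set()
for sentence in ["this is the first document", "and this is the third one"]:
    tfidf.learn_one(sentence)
    x = tfidf.transform_one(sentence)
    seen_keys |= x.keys()
    # The vocabulary grows with the stream, and every key is a string.
    print(len(seen_keys), sorted(x))
```

A clusterer consuming `x` therefore cannot assume a fixed dimension or integer positions.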
You could be right... To be honest I haven't supervised this part of the codebase so I'm not too sure. I'll take a look.
I was poking through the DenStream code, and I'm a little confused.
This method in denstream.py seems to be expecting dict format:
@staticmethod
def _distance(point_a, point_b):
    return math.sqrt(utils.math.minkowski_distance(point_a, point_b, 2))
as you can see in the minkowski_distance function, where a and b both call .keys():
def minkowski_distance(a: dict, b: dict, p: int):
    return sum(
        (abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in set([*a.keys(), *b.keys()])
    )
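As a quick check of that reading, the sketch below copies the quoted function body into a standalone script and wraps it the same way `_distance` does. The free function `distance` and the sample feature dicts are made up for illustration. It suggests the distance itself copes with string keys and with keys present in only one of the two points.

```python
import math


def minkowski_distance(a: dict, b: dict, p: int):
    # Same body as quoted above: iterate over the union of keys,
    # treating a key missing from one dict as 0.0.
    return sum(
        (abs(a.get(k, 0.0) - b.get(k, 0.0))) ** p for k in set([*a.keys(), *b.keys()])
    )


def distance(point_a, point_b):
    # Stand-in for the quoted _distance static method (Euclidean distance).
    return math.sqrt(minkowski_distance(point_a, point_b, 2))


# Illustrative sparse feature dicts with string keys and different key sets.
doc_a = {"first": 3.0, "document": 1.0}
doc_b = {"second": 4.0, "document": 1.0}

d = distance(doc_a, doc_b)
print(d)
```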
By contrast, the failure point in the code expects x to be indexable by the integers 0 to self.dim - 1:
self.SS = {i: (x[i] * x[i]) for i in range(self.dim)}
The example for DenStream uses the output from river.stream.iter_array, which looks like {0: -1, 1: -0.5}, with integers as keys for the dict. When using TFIDF online, however, we will find new unknown words as time goes on, so maybe the DenStream code needs to change to take this into account and also enable non-integer keys?
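A minimal sketch of the difference, with hypothetical helper names (`squared_sum_by_position`, `squared_sum_by_key`) that do not exist in river: the first mirrors the failing line and assumes keys 0..dim-1, the second iterates over whatever keys the point has. This only illustrates the idea and is not necessarily how the library should be, or was, changed.

```python
def squared_sum_by_position(x, dim):
    # Mirrors the failing line: assumes x has the integer keys 0..dim-1.
    return {i: x[i] * x[i] for i in range(dim)}


def squared_sum_by_key(x):
    # Key-agnostic variant: works for any hashable keys, including tokens.
    return {k: v * v for k, v in x.items()}


dense = {0: -1, 1: -0.5}                    # shape of the iter_array output quoted above
sparse = {"first": 0.5, "document": 0.25}   # illustrative TF-IDF-like dict

# Both variants agree when the keys happen to be 0..dim-1.
same_on_dense = squared_sum_by_position(dense, 2) == squared_sum_by_key(dense)

# Positional indexing fails on string keys, as in the DenStream traceback.
try:
    squared_sum_by_position(sparse, len(sparse))
    failed_key = None
except KeyError as err:
    failed_key = err.args[0]

by_key = squared_sum_by_key(sparse)
print(same_on_dense, failed_key, by_key)
```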
This is indeed a bug. As @ribab indicates, the indexing must be changed to account for non-integer keys. I will coordinate with @hoanganhngo610 to fix this.
Hey @nlyf, can you please confirm if the changes introduced in #711 by @jacobmontiel fix the problems you faced with the clusterers?
@smastelini @jacobmontiel Thanks for your work. There is no error with DenStream, but DBSTREAM produces the same bug.
Thanks for reporting, @nlyf. We are going to check it :)
@smastelini @jacobmontiel Any news on DBSTREAM?
I am not very familiar with the clustering module. I am pinging @jacobmontiel and @hoanganhngo610. If they are unavailable, I can take a look.
@hoanganhngo610 can I let you look into this so we can possibly close the issue?
@MaxHalford I will have a look at this real soon!
@MaxHalford I have just tried this again with River version 0.13.0 and it does not return any errors. Would you mind checking it again so that I do not miss anything?
Yep it looks good. Thanks for checking @hoanganhngo610, you rock :)