river
river copied to clipboard
Returning 0 clusters regardless of data (even using examples from documentation)
Versions
river version: installed from github Python version: 3.11 Operating system: Mac OS
Describe the bug
With openai embeddings, Denstream is returning 0 clusters regardless of the set parameters.
Steps/code to reproduce
from schema import db, Post, TiktokVideo, Topic, Subtopic, ResponseCampaign, TiktokVideoHashtag, TiktokQuery
import time
from river import cluster, stream
import requests
from qdrant_client import QdrantClient
client = QdrantClient("localhost", port=6333)
initial_sample = 1000
denstream = cluster.DenStream(decaying_factor=0.01,
beta=0.5,
mu=2.5,
epsilon=0.5,
n_samples_init=10)
# dbstream = cluster.DBSTREAM(
# clustering_threshold=5,
# fading_factor=0.05,
# cleanup_interval=40,
# intersection_factor=0.5,
# minimum_weight=5
# )
# clustream = cluster.CluStream(
# n_macro_clusters=30,
# max_micro_clusters=300,
# time_gap= 100,
# time_window = 10000,
# seed = 42
# )
def get_initial_tiktok_videos():
try:
videos = TiktokVideo.select().limit(initial_sample)
ids = [video.id for video in videos]
# print("IDs: ", ids)
tiktok_ids = [video.tiktok_id for video in videos]
# print("Tiktok IDs: ", tiktok_ids)
return ids, tiktok_ids
except TiktokVideo.DoesNotExist:
print("No Tiktok videos found.")
return None, None
def get_stream_videos(limit=100, page=10):
try:
videos = TiktokVideo.select().limit(limit).offset(page*limit)
ids = [video.id for video in videos]
# print("IDs: ", ids)
tiktok_ids = [video.tiktok_id for video in videos]
# print("Tiktok IDs: ", tiktok_ids)
return ids, tiktok_ids
except TiktokVideo.DoesNotExist:
print("No Tiktok videos found.")
return None, None
def fetch_embeddings_from_qdrant(ids):
try:
embeddings = []
for id in ids:
url = f"http://localhost:6333/collections/raqia_test_collection/points/{id}"
response = requests.get(url)
if response.status_code == 200:
embedding = response.json()['result']['vector']
# print(embedding)
embeddings.append(embedding)
return embeddings
except Exception as e:
print(f"Error: {e}")
return None
if __name__ == "__main__":
# get_initial_tiktok_videos()
# get_stream_videos()
# fetch_embeddings_from_qdrant([0,1])
initial_set = get_initial_tiktok_videos()[0]
initial_embeddings = fetch_embeddings_from_qdrant(initial_set)
for x, _ in stream.iter_array(initial_embeddings):
# print(x)
denstream = denstream.learn_one(x)
print(denstream.n_clusters)
print(denstream.clusters)
# dbstream = dbstream.learn_one(x)
# clustream = clustream.learn_one(x)
# print(dbstream.n_clusters)
# print(clustream.centers)
print(denstream.n_clusters)
print(denstream.clusters)
# print(dbstream.n_clusters)
# print(dbstream.clusters)
# print(dbstream.centers)
# print(clustream.centers)
time.sleep(60)
while True:
stream_set = get_stream_videos()[0]
embeddings = fetch_embeddings_from_qdrant(stream_set)
for embedding, _ in stream.iter_array(embeddings):
denstream = denstream.learn_one(embedding)
# dbstream = dbstream.learn_one(embedding)
# clustream = clustream.learn_one(embedding)
# print(denstream.n_clusters)
# print(denstream.clusters)
time.sleep(60*5)
Thank you so much for raising the issue @shiraeisenberg. I will have a look into this as soon as possible.
@shiraeisenberg Would you mind giving me the value of your denstream._n_samples_seen
? Usually, from my personal experience, DenStream
would not return any clusters for approximately 100 first observations. As such, we would usually use 100-1000 first samples as a burn-in, i.e let the model learn without actually requiring any predictions.
@hoanganhngo610 I see this too.
I took the example in the documentation, and re-ran that.
If we do not add something like denstream.predict_one({0: -1, 1: -2})
, then n_clusters is always 0.
Dear @nipunagarwala. Thank you very much for your response.
If you look closely into the source code of DenStream
and the example within the documentation, you can notice that the learn_one
only creates p-micro-clusters and o-micro-clusters, but only when predict_one
is called for the first time, the final clusters will be generated and the number of clusters for the solution will be known.
As such, in the example within the documentation, if you retrieve the number of o_micro_clusters
and p_micro_clusters
by the following commands, the results will be 0 and 2 respectively:
>>> len(denstream.o_micro_clusters
0
>>> len(denstream.p_micro_clusters
2
This philosophy is adopted since we believe that the cluster solutions, or final cluster centers, should only be generated if the command is called since it's extremely computationally expensive; else,only o-micro-clusters and p-micro-clusters will be updated.
Hope that this answer is clear and helpful to you.