
Use K-means clustering instead of what you have currently

Open ecoates-bc opened this issue 1 year ago • 6 comments

So I've had a look around this repo and it looks like there are a LOT of files in it.

That's not necessarily a bad thing! But maybe if you trimmed your recommender system down a little, it would reduce your servers' overhead and save you folks a lot of money.

So, I think you should just use k-means clustering instead of whatever spaghetti you have in here currently.

Why K-means clustering?

There are a lot of reasons why you'd use k-means clustering in this case.

  1. It's unsupervised. Don't have to worry about feature engineering when there are no features!
  2. It's fast. There's a lot to be said about a mean-and-lean approach to data science, especially on the web.
  3. You can choose your own k. That gives you a lot of flexibility in training the best algorithm for your users' needs.
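For anyone who hasn't seen it written out: vanilla K-means (Lloyd's algorithm) really is only a few lines. This is just a rough sketch with NumPy, not anything from this repo:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain Lloyd's-algorithm K-means on an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # initialize centroids by sampling k distinct points
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = None
    for _ in range(n_iters):
        # assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stable -> converged
        labels = new_labels
        # update step: move each centroid to the mean of its cluster
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

In practice you'd run this over user embedding vectors and recommend within each user's cluster, but that's the whole core loop.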

Conclusion

I think you should use K-means clustering for your project "The Algorithm." Let me know how it goes!

ecoates-bc avatar Apr 04 '23 03:04 ecoates-bc

It's possible but what if what they got is better than vanilla K-means in terms of the recommendation outcome? I like the idea of choosing your own recommendation algorithm/outcome.

dclipca avatar Apr 04 '23 03:04 dclipca

Adopting pure, vanilla K-means clusterning means elon musk's tweets are not prioritized, so I don't think they can do it.

amicus-veritatis avatar Apr 04 '23 04:04 amicus-veritatis

> Adopting pure, vanilla K-means clustering means Elon Musk's tweets are not prioritized, so I don't think they can do it.

That's a good point! Maybe he could get his own little cluster

ecoates-bc avatar Apr 04 '23 05:04 ecoates-bc

> It's possible but what if what they got is better than vanilla K-means in terms of the recommendation outcome? I like the idea of choosing your own recommendation algorithm/outcome.

hmmm that's very true, you would not get away with an out-of-the-box, everyday K-means implementation while keeping the same degree of per-user, on-the-fly tuning. I'm sure there could be an implementation that preserves this, maybe client-side post-processing of clusters. But yeah, either way I definitely would not want to sacrifice tunability just to get a simpler implementation.
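To make the client-side post-processing idea concrete, here's a toy sketch (all names hypothetical): the server ships generic cluster scores from K-means, and each client re-weights them with its own tuning knobs before ranking:

```python
def rerank(cluster_scores, user_weights):
    """Re-rank (cluster_name, base_score) pairs from a generic server-side
    K-means pass, scaling each cluster's score by a per-user weight
    (defaulting to 1.0 for clusters the user hasn't tuned)."""
    return sorted(
        cluster_scores,
        key=lambda item: item[1] * user_weights.get(item[0], 1.0),
        reverse=True,  # highest adjusted score first
    )
```

So the clustering stays vanilla and shared, and the tuning lives entirely in the client's weight table.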

stealthpaladin avatar Apr 04 '23 18:04 stealthpaladin

As it is right now, there is no point talking about K-means or similar. That problem comes later.

A cook can only do so much with the given ingredients, no matter the process the cook chooses.

A like is an input. A like plus others who liked the same thing is another. The time the like happened is another. The difference in like times for the same thing across users matters. The time a user spends on the liked thing matters. This may sound silly, but it goes deeper. The main problem at the moment is data. There are tons of possibilities to get data even with the current architecture. Once the data's value is satisfying, the algorithmic process will become more exciting (and more difficult than it is now). Not even the current version of transformers will be able to give accurate outputs (in relation to users' well-being and the market wanting to invest in advertisements).
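The signals listed above could be collapsed into one numeric feature row per like; a minimal sketch, with all names and fields hypothetical:

```python
from dataclasses import dataclass

@dataclass
class LikeEvent:
    user_id: str
    item_id: str
    timestamp: float       # when the like happened
    dwell_seconds: float   # time the user spent on the liked thing

def feature_row(event, co_likers, median_like_time):
    """Turn one like plus its context into a numeric feature vector."""
    return [
        float(co_likers),                    # others who liked the same thing
        event.timestamp - median_like_time,  # timing relative to other users
        event.dwell_seconds,                 # engagement depth
    ]
```

Only once rows like this exist in volume does the choice of clustering algorithm start to matter.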

Cap-ten avatar Apr 04 '23 18:04 Cap-ten

I agree with the sentiments expressed here, but K-means presents major drawbacks:

  • Strong sensitivity to outliers like Republicans.
  • Computationally expensive for large datasets like Twitter's as k becomes large.
  • No guaranteed convergence to a global minimum; it is sensitive to centroid initialization (different setups may lead to different results).
  • Works on numerical data only; it doesn't support categorical data.
  • Fails to give good-quality clusters for groups of points with non-convex shapes.
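The initialization sensitivity is the easiest drawback to mitigate: k-means++ seeding spreads the initial centroids out by sampling each new centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A rough NumPy sketch (not from any particular library):

```python
import numpy as np

def kmeanspp_init(points, k, seed=0):
    """k-means++ seeding for an (n, d) array of points: far-apart
    initial centroids make the 'different setups, different results'
    problem much less severe in practice."""
    rng = np.random.default_rng(seed)
    # first centroid: uniformly random point
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            [np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0
        )
        # sample the next centroid proportionally to that distance
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)
```

The other drawbacks (categorical data, non-convex clusters) need different algorithms entirely, so this only patches one bullet on the list.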

wiseaidev avatar Apr 05 '23 10:04 wiseaidev