tomotopy
Is there a way to 'weight' docs?
I'm working with tweets and want to weight them by likes; I couldn't find an obvious way to do this going over the docs.
Is this possible?
Hi @batmanscode,
Unfortunately there is no way to weight docs currently.
Actually, I ran several experiments with different document weightings in the past, but none of them showed any improvement over the unweighted model, so I dropped the document weighting feature from tomotopy.
However, you can approximate document weighting by adding the same document multiple times. I recommend running that experiment first: divide the documents into several bins by their number of likes, and insert each document a number of times that depends on its bin, e.g. 10 times for documents in the highest bin and once for documents in the lowest bin. If this experiment shows some improvement, I think it would be worth implementing a document weighting feature.
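A minimal sketch of this duplicate-by-bin idea, assuming tweets are available as (tokens, likes) pairs. The bin edges, repeat counts, and helper names are hypothetical and only for illustration; `LDAModel`, `add_doc`, and `train` are the usual tomotopy calls, and the import is kept inside `build_model` so the binning helpers can be tried without tomotopy installed.

```python
from bisect import bisect_right

# Hypothetical bin edges (in likes) and repeat counts; tune these for your data.
BIN_EDGES = [50, 1000, 10000]   # bins: <50, 50-999, 1000-9999, >=10000
REPEATS = [1, 2, 5, 10]         # one repeat count per bin

def repeats_for_likes(likes, edges=BIN_EDGES, repeats=REPEATS):
    """Return how many times a document with this many likes is inserted."""
    return repeats[bisect_right(edges, likes)]

def weighted_docs(tweets):
    """Yield each token list once per repeat; tweets are (tokens, likes) pairs."""
    for tokens, likes in tweets:
        for _ in range(repeats_for_likes(likes)):
            yield tokens

def build_model(tweets, k=20, iterations=100):
    """Train an LDA model on the duplicated corpus (requires tomotopy)."""
    import tomotopy as tp
    mdl = tp.LDAModel(k=k)
    for tokens in weighted_docs(tweets):
        mdl.add_doc(tokens)
    mdl.train(iterations)
    return mdl
```

Using bins instead of raw like counts keeps the corpus from being dominated by a handful of viral tweets, and it makes the weighting scheme easy to change between runs.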
Very interesting @bab2min, thanks for sharing about your experiments and thanks for the suggestion!
At the moment I've simply repeated each tweet once per like, and so far this seems to give better results.
There are some considerations however:
- Weighting is most effective when the range of likes is large, e.g. 0-50k likes
- Weighting is less effective (similar to not weighting) when the range is smaller, e.g. 2.5k-50k likes
- The most accurate results seem to come from taking specific "bins" of tweets instead of multiplying by likes, e.g. 50-100 likes, 1-2k likes, etc.
Also, I noticed that when I both multiply tweets by likes and use min_df=1000, min_cf=10, I get a much better log likelihood: around -4.5 compared to -6.5. I would have thought that weighting and using min_df together might be a little redundant.
I will reply here after experimenting more to report whether weighting (or some other method) delivers better results overall. Thanks
@batmanscode
Thank you for sharing your detailed experience! Most of what you describe sounds reasonable.
However, there is a pitfall in improving the log likelihood by adjusting min_df and min_cf. If you set min_df and min_cf larger, more of the uncommon words are excluded from the vocabulary. Since rare words are the hardest to predict, removing them naturally increases the log likelihood, so values obtained with different min_df and min_cf settings are not directly comparable.
Aside from that, I'll consider implementing doc weighting into tomotopy.
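To illustrate the min_df / min_cf pitfall, here is a toy sketch using a maximum-likelihood unigram model rather than tomotopy's LDA likelihood, so it only shows the direction of the effect. The function name and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter

def per_word_ll(docs, min_cf=0):
    """Per-word log likelihood of a unigram model fitted on the corpus
    after dropping words whose collection frequency is below min_cf."""
    counts = Counter(w for doc in docs for w in doc)
    kept = {w: c for w, c in counts.items() if c >= min_cf}
    total = sum(kept.values())
    return sum(c * math.log(c / total) for c in kept.values()) / total

docs = [["cat", "dog", "cat"], ["cat", "dog", "emu"], ["cat", "yak", "dog"]]
full = per_word_ll(docs)              # all four words kept
pruned = per_word_ll(docs, min_cf=2)  # "emu" and "yak" removed
print(full, pruned)
```

In this toy corpus the pruned score is higher only because the two rare words are no longer being scored, not because the model fits the data any better.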
Right that makes sense! I hadn't considered that, thank you