tomotopy

Is there a way to 'weight' docs?

Open batmanscode opened this issue 2 years ago • 4 comments

I'm working with tweets and want to weight them by likes; I couldn't find an obvious way to do this going over the docs.

Is this possible?

batmanscode avatar Feb 25 '22 00:02 batmanscode

Hi @batmanscode, unfortunately there is currently no way to weight docs. I ran several experiments with different doc weightings in the past, but they didn't show any improvement over the unweighted model, so I dropped the document weighting feature from tomotopy.

However, you can simulate doc weighting by adding the same document multiple times. I recommend trying this simulated weighting first. Divide the documents into several sections by their number of likes, and insert each document a different number of times depending on its section, e.g. 10 times each for documents in the highest section and once for documents in the lowest section. If this experiment shows some improvement, I think it would be worth implementing a document weighting feature.
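The suggestion above could be sketched roughly as follows. This is a minimal illustration, not code from tomotopy itself: the bin edges, the repeat counts, the helper names (`repeats_for`, `replicate`, `train_lda`) and the topic count `k=20` are all assumptions chosen for the example and would need tuning on real data.

```python
from bisect import bisect_right

# Hypothetical section boundaries and repeat counts (assumptions, tune for your data).
# Sections: <10 likes, 10-99, 100-999, >=1000.
LIKE_EDGES = [10, 100, 1000]
REPEATS = [1, 3, 6, 10]  # copies inserted per section, lowest to highest


def repeats_for(likes, edges=LIKE_EDGES, repeats=REPEATS):
    """Return how many copies of a document to insert for a given like count."""
    return repeats[bisect_right(edges, likes)]


def replicate(docs, edges=LIKE_EDGES, repeats=REPEATS):
    """Yield each token list once per copy; docs is an iterable of (tokens, likes)."""
    for tokens, likes in docs:
        for _ in range(repeats_for(likes, edges, repeats)):
            yield tokens


def train_lda(token_lists, k=20, iterations=1000):
    """Train a plain tomotopy LDA model on the (already replicated) documents."""
    import tomotopy as tp  # imported here so the helpers above work without it

    mdl = tp.LDAModel(k=k)
    for tokens in token_lists:
        mdl.add_doc(tokens)
    mdl.train(iterations)
    return mdl


# Toy data: (tokenized tweet, number of likes)
tweets = [
    (["topic", "models", "are", "fun"], 3),
    (["tomotopy", "is", "fast"], 250),
    (["weighting", "by", "likes"], 50000),
]
weighted = list(replicate(tweets))
```

Passing `weighted` to `train_lda` and the unreplicated token lists to a second call gives two models to compare. Capping the repeat count per section, rather than repeating once per like, keeps a single viral tweet from dominating the corpus.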

bab2min avatar Feb 26 '22 16:02 bab2min

Very interesting @bab2min, thanks for sharing about your experiments and thanks for the suggestion!

At the moment I've simply multiplied each tweet by its number of likes, and so far this seems to give better results.

There are some considerations, however:

  • Weighting is most effective when there is a large range of likes, e.g. 0-50k
  • Weighting is less effective (similar to not weighting) when the range is smaller, e.g. 2.5k-50k likes
  • The most accurate results seem to come from taking specific "bins" of tweets instead of multiplying by likes, e.g. 50-100 likes, 1-2k likes etc.

Also, I noticed that when I both multiply tweets by likes and use min_df=1000, min_cf=10, I get a much better log likelihood: around -4.5 compared to -6.5. I would have thought that weighting and using min_df together might be a little redundant.

I will reply here after experimenting more to report whether weighting (or some other method) delivers better results overall. Thanks

batmanscode avatar Feb 28 '22 06:02 batmanscode

@batmanscode Thank you for sharing your detailed experience! Most of what you wrote sounds reasonable. However, there is a pitfall in improving log likelihood by adjusting min_df and min_cf. If you set min_df and min_cf larger, more uncommon words are excluded from the vocabulary, which naturally increases the log likelihood, so values obtained with different thresholds are not directly comparable. Aside from that, I'll consider implementing doc weighting in tomotopy.
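The pitfall can be seen in a toy setting. The sketch below deliberately swaps in a maximum-likelihood unigram model as a stand-in, so it is not tomotopy's LDA likelihood; the function name `unigram_ll_per_word` and the toy corpus are made up for illustration. It only shows that dropping rare words raises per-word log likelihood even though nothing about the model has improved.

```python
import math
from collections import Counter


def unigram_ll_per_word(docs, min_cf=0):
    """Per-word log likelihood of a maximum-likelihood unigram model.

    Words whose collection frequency is below min_cf are removed from the
    corpus before the model is fitted and evaluated.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    kept = {w: c for w, c in counts.items() if c >= min_cf}
    total = sum(kept.values())
    return sum(c * math.log(c / total) for c in kept.values()) / total


# Toy corpus: three frequent words and two rare ones.
docs = [
    ["a"] * 50 + ["b"] * 30,
    ["c"] * 15 + ["d"] * 3 + ["e"] * 2,
]

full = unigram_ll_per_word(docs)               # all five words kept
pruned = unigram_ll_per_word(docs, min_cf=10)  # rare words "d" and "e" dropped

print(f"full vocabulary:   {full:.3f}")
print(f"pruned vocabulary: {pruned:.3f}")
```

The pruned score is higher simply because the hardest-to-predict tokens are no longer counted, which is why runs with different min_df or min_cf settings should not be ranked by log likelihood alone.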

bab2min avatar Mar 02 '22 16:03 bab2min

Right, that makes sense! I hadn't considered that, thank you.

batmanscode avatar Mar 08 '22 12:03 batmanscode