can't reproduce the preprocessed data

Open quynhneo opened this issue 4 years ago • 10 comments

Hi there, I ran https://github.com/adjidieng/DETM/blob/master/scripts/data_undebates.py on the Kaggle data for the UN debates (as linked in your paper: https://www.kaggle.com/unitednations/un-general-debates), but I am unable to reproduce the preprocessed data you linked here: https://bitbucket.org/franrruiz/data_undebates_largev/src/master/ (the variables in my .mat files are different from yours). Any idea? There are not many settings besides min_df and max_df. I used the defaults; perhaps you used something else?
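To illustrate why the min_df and max_df settings matter here, below is a minimal sketch of document-frequency filtering. It is not the code from data_undebates.py: the function `build_vocab`, the toy documents, and the threshold values are all hypothetical, and it assumes min_df is an absolute document count and max_df is a fraction of documents.

```python
from collections import Counter


def build_vocab(docs, min_df=1, max_df=1.0):
    """Keep words that appear in at least `min_df` documents (a count)
    and in at most a `max_df` fraction of documents.

    Hypothetical helper for illustration only.
    """
    num_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document.
        doc_freq.update(set(doc))
    return sorted(
        word
        for word, count in doc_freq.items()
        if count >= min_df and count / num_docs <= max_df
    )


# Toy corpus (made up), already tokenized.
docs = [
    ["peace", "security", "council"],
    ["peace", "development"],
    ["peace", "security"],
]

vocab_a = build_vocab(docs, min_df=1, max_df=1.0)  # no filtering
vocab_b = build_vocab(docs, min_df=2, max_df=0.7)  # drops rare and very common words
print(vocab_a)
print(vocab_b)
```

Even on this toy corpus the two settings give different vocabularies, so any mismatch in these thresholds would change the word indices stored in the .mat files.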

quynhneo avatar Nov 19 '20 00:11 quynhneo

Might be too obvious, but could it just be because of the random permutation with no seed? Apart from that, I've observed a lot of things I had to change in the code to get it to run and to implement the model as described in the paper. I was never able to reproduce the results using the original code.
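A minimal sketch of what fixing the seed would look like, assuming the split is driven by a random permutation of document indices. The function name `split_indices`, the split fractions, and the seed value are hypothetical and not taken from the original script.

```python
import random


def split_indices(num_docs, seed, test_frac=0.1, valid_frac=0.05):
    """Return (train, valid, test) index lists from a seeded shuffle.

    Hypothetical helper: the fractions are placeholders, not the
    values used in the original preprocessing.
    """
    rng = random.Random(seed)  # dedicated generator, so the result is repeatable
    indices = list(range(num_docs))
    rng.shuffle(indices)

    n_test = int(test_frac * num_docs)
    n_valid = int(valid_frac * num_docs)
    n_train = num_docs - n_test - n_valid

    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


first = split_indices(100, seed=2020)
second = split_indices(100, seed=2020)
print(first == second)
```

Without a fixed seed, each run assigns different documents to each split, so the saved files cannot match a previous run even if every other setting is identical.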

mona-timmermann avatar Nov 24 '20 12:11 mona-timmermann

hm...possibly. Same here on having to change a lot. Perhaps we should submit some PRs.

quynhneo avatar Nov 24 '20 16:11 quynhneo

Let's work on converting it to a python library @quynhneo @mona-timmermann

What do you think?

Although I notice a new error that occurs on a large dataset

Emekaborisama avatar Jan 05 '21 08:01 Emekaborisama

Not a bad idea... Ideally @adjidieng would support the idea.

quynhneo avatar Jan 05 '21 09:01 quynhneo

I can talk to @adjidieng tomorrow and I will keep you posted on her response

wyt? @mona-timmermann

Emekaborisama avatar Jan 05 '21 22:01 Emekaborisama

Adji said we can proceed, but we will upload the package as a branch on this repo. @quynhneo @mona-timmermann let's get this done

Emekaborisama avatar Jan 06 '21 10:01 Emekaborisama

@Emekaborisama Hi, any updates on the Python script to reproduce this study? Thank you very much.

yangyijane avatar Feb 03 '21 21:02 yangyijane

that's cool. thx.

On Wed, Feb 3, 2021 at 4:47 PM Quynh M. Nguyen wrote:

I have made it work, see my fork https://github.com/quynhneo/DETM

yangyijane avatar Feb 04 '21 01:02 yangyijane

Hi Mr Nguyen,

I have a follow-up question regarding the script that runs DETM after you preprocess all your data. I checked your script and you split the data into training and testing sets.

Why did you do that? I thought it was supposed to be unsupervised learning? Thank you very much.

yangyijane avatar Feb 04 '21 02:02 yangyijane

According to the paper, they calculate perplexity using test documents, so a held-out split is needed for evaluation even though training is unsupervised.
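As a rough sketch of why held-out documents are needed, perplexity is the exponential of the negative average log-probability per token on documents the model has not seen. The code below is a generic illustration, not the paper's exact evaluation procedure; `perplexity`, `uniform_prob`, and the toy data are hypothetical.

```python
import math


def perplexity(test_docs, word_prob):
    """exp(-(1/N) * sum of log p(w)) over all N tokens in the test documents.

    `word_prob` is any function returning the model's probability for a
    word; here it stands in for a trained topic model (an assumption).
    """
    total_log_prob = 0.0
    total_tokens = 0
    for doc in test_docs:
        for word in doc:
            total_log_prob += math.log(word_prob(word))
            total_tokens += 1
    return math.exp(-total_log_prob / total_tokens)


# Toy stand-in model: uniform over a 4-word vocabulary.
def uniform_prob(word):
    return 0.25


held_out = [["peace", "security"], ["council", "development", "peace"]]
score = perplexity(held_out, uniform_prob)
print(score)
```

A model that spreads probability uniformly over V words has perplexity V, so lower values on the test split indicate a model that predicts unseen text better.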

quynhneo avatar Feb 10 '21 05:02 quynhneo