stc_clustering

Problems with reproduction of results

Status: Open • gabrielsantosrv opened this issue 4 years ago • 5 comments

Hi @hadifar !

First of all, thanks for sharing your code, it's good work!

I can't reproduce the results for the Biomedical dataset published in the paper "A Self-Training Approach for Short Text Clustering" (ACC: 54.8±2.3, NMI: 47.1±0.8). I'm getting results close to ACC: 24.12 and NMI: 17.98.

I have gotten results close to those reported for the StackOverflow dataset, but not for the Biomedical dataset.

Could you please help me?

gabrielsantosrv • Mar 23 '20 19:03
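For readers comparing numbers in this thread, here is a minimal sketch of how ACC and NMI are commonly computed for clustering. It is not taken from the repository. The function names are hypothetical, the accuracy uses a brute-force search over label mappings that is only practical for a few clusters (real evaluations usually use the Hungarian algorithm), and the NMI normalization by the arithmetic mean of the two entropies is an assumption, since the thread does not say which variant the paper uses.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster ids to labels.

    Brute force over permutations: a toy stand-in for the Hungarian
    algorithm, usable only when the number of clusters is small.
    """
    labels = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    size = max(len(labels), len(clusters))
    # Pad the shorter side so every cluster can be paired with something.
    labels += [None] * (size - len(labels))
    clusters += [None] * (size - len(clusters))
    overlap = Counter(zip(y_pred, y_true))  # (cluster, label) -> count
    best_hits = 0
    for mapping in permutations(labels):
        hits = sum(overlap[(c, l)] for c, l in zip(clusters, mapping))
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)


def entropy(values):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def normalized_mutual_info(y_true, y_pred):
    """NMI = 2 * I(true; pred) / (H(true) + H(pred)); assumed normalization."""
    n = len(y_true)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mutual_info = sum(
        (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    denom = entropy(y_true) + entropy(y_pred)
    return 2 * mutual_info / denom if denom > 0 else 1.0


# Toy data: the clusters match the labels exactly, only the ids differ.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))
print(normalized_mutual_info(y_true, y_pred))
```

Both functions return fractions in the range 0 to 1. The figures quoted from the paper appear to be percentages, while later comments report fractions such as 0.38, so the two need rescaling before they are compared.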

Hi @gabrielsantosrv, thanks for your interest. Did you do the pretraining step for the Biomedical dataset? I guess you used the StackOverflow pre-trained model for the Biomedical experiments. SIF embedding alone gets roughly 0.38 ACC and 0.33 NMI on the Biomedical dataset. I'll try to add the Biomedical model to the repository soon.

hadifar • Mar 23 '20 21:03

Thanks for your fast reply!

How do I pretrain for the Biomedical dataset? Is it enough to set the --ae_weights parameter to a nonexistent path? I tried that, and it isn't working.

gabrielsantosrv • Mar 24 '20 13:03

Hello, @hadifar !

Thank you for your work, it was very helpful for me.

I have the same problem with Biomedical. I downloaded the Biomedical word2vec vectors from the link in your paper (https://github.com/jacoxu/STC2/tree/master/dataset), then ran clustering with

python STC.py --dataset=biomedical --maxiter 1500 --save_dir data/Biomedical/results

So the autoencoder was pretrained and achieved even better results than those in your paper for non-trained clustering: acc = 0.40439, nmi = 0.34360.

However, in the subsequent training steps the metrics steadily decline, so by the end of training they are: acc: 0.35644444444444445, nmi: 0.30346925987039597

Results on StackOverflow and SearchSnippets were reproduced successfully.

Please tell me what I am doing wrong.

Lesha17 • Jun 05 '20 09:06

Hello, I have the same problem. Have you managed to solve it? If possible, please give me some advice. Thanks.

geangelfirst • Oct 12 '20 00:10

Hi @hadifar and @Lesha17, thanks for the great code. I am trying to run the code in Google Colab. I have followed all the steps, but I get an error on this line:

python STC.py --maxiter 1500 --ae_weights data/stackoverflow/results/ae_weights.h5 --save_dir data/stackoverflow/results/

The interpreter points at .h5 and reports invalid syntax. Can anybody please tell me what I am doing wrong? Note that I copied the line above from README.md.

ans92 • Jan 17 '21 07:01
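One possible cause of the "invalid syntax" error, assuming the line was pasted into a Python cell in Colab: it is a shell command, not Python code, so the Python interpreter cannot parse it. In a notebook, prefixing the line with `!` hands it to the shell. The same can be done from Python with the standard `subprocess` module, as in the sketch below. The helper names `build_stc_command` and `run_stc` are hypothetical, and only the flags quoted in this thread are used.

```python
import shlex
import subprocess
import sys


def build_stc_command(maxiter, save_dir, dataset=None, ae_weights=None):
    """Assemble the STC.py command line from the flags quoted in this thread."""
    cmd = [sys.executable, "STC.py"]
    if dataset is not None:
        cmd.append(f"--dataset={dataset}")
    cmd += ["--maxiter", str(maxiter)]
    if ae_weights is not None:
        cmd += ["--ae_weights", ae_weights]
    cmd += ["--save_dir", save_dir]
    return cmd


def run_stc(cmd):
    """Run the command as a child process; raises CalledProcessError on failure."""
    return subprocess.run(cmd, check=True)


cmd = build_stc_command(
    maxiter=1500,
    ae_weights="data/stackoverflow/results/ae_weights.h5",
    save_dir="data/stackoverflow/results/",
)
# Show the command that would be executed, quoted for a shell.
print(" ".join(shlex.quote(part) for part in cmd))
# run_stc(cmd)  # uncomment when the working directory is a checkout of the repository
```

Passing the arguments as a list avoids any shell quoting issues, and `sys.executable` makes the child process use the same Python interpreter as the notebook.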