stc_clustering
Problems with reproduction of results
Hi @hadifar !
First of all, thanks for sharing your code; it's good work!
I can't reproduce the results for the Biomedical dataset published in the paper "A Self-Training Approach for Short Text Clustering", which are ACC: 54.8±2.3 and NMI: 47.1±0.8. I'm obtaining results close to ACC: 24.12 and NMI: 17.98.
I have gotten results close to those reported for the StackOverflow dataset, but not for the Biomedical dataset.
Could you please help me?
Hi @gabrielsantosrv, thanks for your interest. Did you do the pretraining step for the Biomedical dataset? My guess is that you used the StackOverflow pre-trained model for the Biomedical experiments. SIF embedding alone gets roughly 0.38 ACC and 0.33 NMI on the Biomedical dataset. I'll try to add the Biomedical model to the repository soon.
Thanks for your fast reply!
How do I pretrain for the Biomedical dataset? Is it enough to set the --ae_weights parameter to a nonexistent path? I tried that, and it isn't working.
Hello, @hadifar !
Thank you for your work, it was very helpful for me.
I have the same problem with Biomedical. I downloaded the Biomedical word2vec vectors from the link in your paper (https://github.com/jacoxu/STC2/tree/master/dataset), then ran clustering with
python STC.py --dataset=biomedical --maxiter 1500 --save_dir data/Biomedical/results
The autoencoder was pretrained and achieved even better results than those in your paper for non-trained clustering: acc = 0.40439, nmi = 0.34360.
However, in the following training steps the metrics steadily decline, so by the end of training they are: acc: 0.35644444444444445, nmi: 0.30346925987039597
Results on StackOverflow and SearchSnippets were reproduced successfully.
Please tell me what I am doing wrong.
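For anyone who wants to check the reported numbers independently, here is a minimal sketch of clustering accuracy, assuming ACC in this thread is the usual best-mapping accuracy (cluster ids are matched to class labels with the Hungarian algorithm before counting hits). The function name `clustering_accuracy` is hypothetical and is not taken from the repository, and labels are assumed to be non-negative integers starting at 0.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(y_true, y_pred):
    """Best-mapping accuracy between cluster ids and class labels.

    Hypothetical helper for illustration; assumes integer labels >= 0.
    """
    y_true = np.asarray(y_true, dtype=np.int64)
    y_pred = np.asarray(y_pred, dtype=np.int64)
    n = int(max(y_true.max(), y_pred.max())) + 1

    # counts[p, t] = number of samples in cluster p that have true label t
    counts = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1

    # linear_sum_assignment minimizes cost, so negate to maximize matches
    rows, cols = linear_sum_assignment(-counts)
    return counts[rows, cols].sum() / y_true.size
```

Calling this on the predictions saved after pretraining and again after the final iteration would show whether the decline comes from the training itself or from how the metric is computed. NMI can be cross-checked in the same way with `sklearn.metrics.normalized_mutual_info_score`.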
Hello, I have the same problem. Have you been able to solve it? If possible, please give me some advice. Thanks.
Hi @hadifar and @Lesha17,
Thanks for the great code.
I am trying to run the code in Google Colab. I have completed all the setup steps, but I am getting an error on this line:
python STC.py --maxiter 1500 --ae_weights data/stackoverflow/results/ae_weights.h5 --save_dir data/stackoverflow/results/
The interpreter points at .h5 and reports invalid syntax. Can anybody please tell me what I am doing wrong? Please note that I copied the line above from README.md.
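A likely cause, assuming the line was pasted into a notebook cell unchanged: the cell is parsed as Python, and a shell command is not valid Python, which would explain a syntax error at `.h5`. In Colab, prefixing the line with `!` runs it in the shell instead. An equivalent approach that stays in Python is sketched below; `build_stc_command` and `run_stc` are hypothetical helper names, and the flags are only the ones quoted in the command above.

```python
import subprocess
import sys


def build_stc_command(ae_weights, save_dir, maxiter=1500):
    """Build the argument list for the STC.py call quoted above.

    Hypothetical helper; it only assembles flags that appear in this thread.
    """
    return [
        sys.executable, "STC.py",
        "--maxiter", str(maxiter),
        "--ae_weights", ae_weights,
        "--save_dir", save_dir,
    ]


def run_stc(ae_weights, save_dir, maxiter=1500):
    """Run STC.py as a separate process from the current working directory."""
    cmd = build_stc_command(ae_weights, save_dir, maxiter)
    # check=True raises CalledProcessError if the script exits with an error
    return subprocess.run(cmd, check=True)
```

From a notebook cell this would be called as `run_stc("data/stackoverflow/results/ae_weights.h5", "data/stackoverflow/results/")`, with the working directory set to the repository root so that `STC.py` and the `data/` paths resolve.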