compare biterm topic modelling to rainette, LDA, coclustering, structural topic model, embedding clustering, autoencoders

Open jwijffels opened this issue 5 years ago • 12 comments

Looking for some typical, interesting open data with short texts in order to compare clustering methods (BTM / LDA / stm / coclustering / reinert text clustering / embedding clustering / autoencoder). @datasculptor / @manuelbickel, do you know of any interesting open data?

jwijffels avatar Jun 26 '19 07:06 jwijffels

I have never used (or even taken a look at) this dataset before, but it may be interesting: https://registry.opendata.aws/amazon-reviews/

rdatasculptor avatar Jun 26 '19 08:06 rdatasculptor

Interesting and huge dataset, but unfortunately the license of that data is too restrictive.

jwijffels avatar Jun 26 '19 08:06 jwijffels

You are right. How about this list of tweet collections: https://www.docnow.io/catalog/

rdatasculptor avatar Jun 26 '19 09:06 rdatasculptor

I would prefer to use data that can be shared.

jwijffels avatar Jun 26 '19 21:06 jwijffels

Sorry for not checking before giving the link.

rdatasculptor avatar Jun 26 '19 21:06 rdatasculptor

I have not worked with short texts. Therefore, I have no good sources at hand, unfortunately. Maybe Japanese Haiku to make Text Mining more philosophical ;-)?

Side note: sorry for not having worked on the quality metrics yet; I have too many other non-R-related projects. I will keep it on my list. For the time being, text2vec::coherence might be used.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/bnosac/BTM/issues/9

-- sent via mobile - please excuse typos

manuelbickel avatar Jun 27 '19 06:06 manuelbickel

No problem. Japanese haiku, yes, why not :)

jwijffels avatar Jun 27 '19 07:06 jwijffels

Could this be interesting? https://www.linkedin.com/feed/update/urn:li:activity:6553904839447973888 Not that I am a fan or something :-)

rdatasculptor avatar Jul 08 '19 15:07 rdatasculptor

I'm sure you are a fan :)

jwijffels avatar Jul 22 '19 20:07 jwijffels

Also this one could be interesting: https://github.com/EmilHvitfeldt/textdata

rdatasculptor avatar Jul 23 '19 10:07 rdatasculptor

You could look at manifestos. manifestoR provides API access to coded political texts in several languages: https://github.com/ManifestoProject/manifestoR
While manifestos are (very) long texts, they are coded here as quasi-sentences: statements at the sentence or sub-sentence level. These make up short micro-texts on specific topics. While the coding is useful, it is far from perfect. It gives an idea of the number of topics in the text, but it is not conclusive, as the codes can be aggregated into higher-level categories like issues and domains. I am working with them right now, using BTM.

msaeltzer avatar Jun 11 '20 20:06 msaeltzer

Interesting. Didn't know these political party manifestos existed.

jwijffels avatar Jun 12 '20 08:06 jwijffels