BTM
compare biterm topic modelling to rainette, LDA, coclustering, structural topic model, embedding clustering, autoencoders
Looking for some typical, interesting open data with short texts in order to compare clustering methods (BTM / LDA / stm / coclustering / reinert text clustering / embedding clustering / autoencoder). @datasculptor / @manuelbickel, do you know of any interesting open data?
I have never used (or even taken a look at) this dataset before, but it may be interesting: https://registry.opendata.aws/amazon-reviews/
Interesting and huge dataset, but unfortunately the license of that data is too restrictive.
You are right. How about this list of tweet collections: https://www.docnow.io/catalog/
Would prefer to use data which can be shared
Sorry for not checking before giving the link.
I have not worked with short texts. Therefore, I have no good sources at hand, unfortunately. Maybe Japanese Haiku to make Text Mining more philosophical ;-)?
Side note: sorry for not having worked on the quality metrics yet; too many other non-R-related projects. I will keep it on my list. For the time being, text2vec::coherence might be used.
On 26 June 2019 at 09:46:52 CEST, jwijffels wrote:
Looking for some typical open data with short texts which are interesting, in order to compare clustering methods (BTM / LDA / stm / coclustering / reinert text clustering / embedding clustering / autoencoder) @datasculptor / @manuelbickel you know any interesting open data?
No problem. Japanese haiku, yes, why not :)
Could this be interesting? https://www.linkedin.com/feed/update/urn:li:activity:6553904839447973888 Not that I am a fan or something :-)
I'm sure you are a fan :)
Also this one could be interesting: https://github.com/EmilHvitfeldt/textdata
You could look at manifestos. manifestoR is an R package that gives API access to coded political texts in several languages.
https://github.com/ManifestoProject/manifestoR
While manifestos are (very) long texts, they are coded here as quasi-sentences: statements at the sentence or sub-sentence level. These make up short micro-texts on specific topics. While the coding is useful, it is far from perfect. The codes give an idea of the number of topics in the text, but they are not conclusive, as they can be aggregated into higher-level categories such as issues and domains.
I am working with them right now, using BTM.
Interesting. Didn't know these political party manifestos existed.