Request: using TopClus with different pretrained language models
Hi,
I've read your paper and I like this approach. Thank you for sharing the code. I have one question about the pretrained language models (PLMs) you use to get the contextualized word representations. I saw in the source code that the model is fixed to the classic 'bert-base-uncased':
https://github.com/yumeng5/TopClus/blob/01e22fb73262bc45d361ec9165bdadbd929ac9a5/src/trainer.py#L22
Suppose I'm interested in using this method on a corpus of Italian texts. In that case, would it be possible to change this model and use bert-base-multilingual-uncased instead?
If that's possible, could you make pretrained_lm a parameter of TopClusTrainer? Something like the sketch below.
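For illustration, here's a rough sketch of what I have in mind. The constructor signature and the AutoModel/AutoTokenizer loading are my assumptions about how trainer.py could be adapted, not the actual implementation:

```python
from transformers import AutoModel, AutoTokenizer

class TopClusTrainer:
    # Sketch only: the default keeps the current behavior, while callers
    # can pass e.g. "bert-base-multilingual-uncased" for other languages.
    def __init__(self, args, pretrained_lm="bert-base-uncased"):
        self.args = args
        self.pretrained_lm = pretrained_lm
        # AutoTokenizer/AutoModel resolve both monolingual and
        # multilingual BERT checkpoints from the Hugging Face hub.
        self.tokenizer = AutoTokenizer.from_pretrained(pretrained_lm)
        self.model = AutoModel.from_pretrained(pretrained_lm)

# Usage with a multilingual checkpoint:
# trainer = TopClusTrainer(args, pretrained_lm="bert-base-multilingual-uncased")
```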
Thank you.
Hi,
Yes, the method should be applicable to other languages. I haven't used the bert-base-multilingual-uncased model myself, so I can't say for sure what changes are needed to make it work on texts in other languages, but I'd imagine some vocabulary post-processing might be necessary (e.g., filtering out all non-Italian words from the results; otherwise, the resulting topics might consist of terms from several languages and be hard to interpret).
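As a minimal sketch of that kind of post-processing, assuming you have an Italian wordlist available (the it_IT.dic filename, the raw_topics variable, and the filter_topic_terms helper are all hypothetical, not part of TopClus):

```python
def filter_topic_terms(topics, italian_vocab):
    """Keep only terms found in an Italian wordlist.

    topics: dict mapping topic id -> list of top terms
    italian_vocab: set of lowercased Italian words, e.g. loaded from
        a spell-checker dictionary such as Hunspell's it_IT
    """
    return {
        topic_id: [w for w in terms if w.lower() in italian_vocab]
        for topic_id, terms in topics.items()
    }

# Usage sketch: load a one-word-per-line dictionary, then filter.
with open("it_IT.dic", encoding="utf-8") as f:
    next(f)  # the first line of a Hunspell .dic file is an entry count
    # entries may carry affix flags after "/", which we strip off
    italian_vocab = {line.split("/")[0].strip().lower() for line in f}

filtered = filter_topic_terms(raw_topics, italian_vocab)
```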
Thanks, Yu