
biterm topic model

Open cownr10r opened this issue 6 years ago • 14 comments

Requested feature

I'm working on a corpus of short documents. Recent developments in examining short texts, such as those from Twitter, have been documented. I'm including two files.

Cheng 2014.pdf, Schneider 2018.pdf

https://github.com/xiaohuiyan/BTM

Use case

Short, unstructured texts such as essays are where this would be helpful.

Additional context

The code is already implemented in Python. Can it be ported to R? https://github.com/xiaohuiyan/BTM

cownr10r avatar Jul 19 '18 04:07 cownr10r

@cownr10r Not sure if you meant to close this, but if not, then we would need some detail about what feature you actually want before we could interpret this issue.

kbenoit avatar Jul 19 '18 07:07 kbenoit

Howdy there, kbenoit. I wanted to see if you could pursue converting the code below from Python to quanteda: https://github.com/xiaohuiyan/BTM. I'm not a calculus reader, so I cannot explain in great detail what is happening in the code, but it appears that the author used a kind of collocation analysis to assist in performing topic modeling, to deal with the issue of sparsity in Twitter texts and short documents.

cownr10r avatar Jul 19 '18 08:07 cownr10r

FYI. R package BTM for biterm topic modelling is on CRAN and at https://github.com/bnosac/BTM

jwijffels avatar Jan 30 '19 23:01 jwijffels

Ah thanks @jwijffels - then I’ll reopen the issue and we will figure out how to feed a quanteda object to BTM.

kbenoit avatar Jan 30 '19 23:01 kbenoit

I was about to open up an issue but found this one. It would be nice to have an additional method on top of the current ones to convert a quanteda object to BTM.

The "issue" is that BTM() requires a data.frame-like object with two columns: the doc id and the co-occurring terms. I guess the question is: can convert() be extended with a specific method in this regard?
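For reference, a minimal sketch of the input format described above (the column names follow the BTM package examples; this toy data is only illustrative, a real corpus needs far more text):

```r
library("BTM")

# BTM() takes a data.frame-like object: document id in the first
# column and one token per row in the second column
dat <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d2"),
  token  = c("short", "text", "model", "topic", "model", "text"),
  stringsAsFactors = FALSE
)
model <- BTM(dat, k = 2, iter = 10)
```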

contefranz avatar Mar 11 '20 13:03 contefranz

Thanks all, I finally investigated this.

How to prepare inputs for BTM::BTM()

I haven't examined the implementation of BTM but did look at the example and interface. It seems to get the co-occurrences from a data.frame of tokens with a doc_id, which means we just need to feed this format into BTM::BTM(). There are several ways to do this, with or without quanteda.

  1. Use udpipe to produce the data.frame. This is part of the example at https://github.com/bnosac/BTM, for instance.

  2. Use spacyr in the same way. Example:

library("spacyr")

data("data_corpus_inaugural", package = "quanteda")

toks_sp <- spacy_parse(tail(data_corpus_inaugural, 20))
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.2.3, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
toks_sp <- subset(toks_sp, pos %in% c("NOUN", "PROPN"))

library("BTM")
model <- BTM(toks_sp[c("doc_id", "lemma")], k = 10, beta = 0.01, iter = 100, trace = 10)
## 2020-03-12 16:28:11 Start Gibbs sampling iteration 1/100
## 2020-03-12 16:28:12 Start Gibbs sampling iteration 11/100
## 2020-03-12 16:28:12 Start Gibbs sampling iteration 21/100
## 2020-03-12 16:28:13 Start Gibbs sampling iteration 31/100
## 2020-03-12 16:28:14 Start Gibbs sampling iteration 41/100
## 2020-03-12 16:28:14 Start Gibbs sampling iteration 51/100
## 2020-03-12 16:28:15 Start Gibbs sampling iteration 61/100
## 2020-03-12 16:28:15 Start Gibbs sampling iteration 71/100
## 2020-03-12 16:28:16 Start Gibbs sampling iteration 81/100
## 2020-03-12 16:28:17 Start Gibbs sampling iteration 91/100
model$theta
##  [1] 0.07992100 0.09716183 0.09102999 0.05673945 0.07218055 0.12848997
##  [7] 0.06949687 0.12128307 0.11703060 0.16666667
topicterms <- terms(model, top_n = 5)
topicterms
## [[1]]
##        token probability
## 1     people  0.02250962
## 2     nation  0.02036826
## 3        man  0.01718113
## 4      earth  0.01543817
## 5 Government  0.01444219
## 
## [[2]]
##     token probability
## 1   world  0.05681964
## 2 America  0.03600914
## 3   peace  0.03551755
## 4  nation  0.03183065
## 5    time  0.02478457
## 
## [[3]]
##     token probability
## 1     God  0.02689068
## 2     man  0.02597248
## 3 America  0.02160008
## 4    life  0.01700906
## 5   heart  0.01486659
## 
## [[4]]
##     token probability
## 1 America  0.02328465
## 2    time  0.02300412
## 3     man  0.02223266
## 4    life  0.02118068
## 5 history  0.02040922
## 
## [[5]]
##     token probability
## 1   child  0.01488745
## 2    life  0.01472204
## 3 America  0.01361931
## 4     job  0.01345390
## 5   world  0.01323335
## 
## [[6]]
##     token probability
## 1  people  0.02909066
## 2   world  0.02627146
## 3  nation  0.02469148
## 4 freedom  0.02400991
## 5   peace  0.01920799
## 
## [[7]]
##       token probability
## 1 President  0.04879014
## 2       Mr.  0.02788851
## 3     today  0.02456715
## 4   citizen  0.02078768
## 5      year  0.01947059
## 
## [[8]]
##     token probability
## 1  nation  0.03577469
## 2 freedom  0.02786494
## 3  people  0.02481264
## 4   world  0.02258084
## 5     man  0.02038187
## 
## [[9]]
##     token probability
## 1   world  0.03976123
## 2  nation  0.03095193
## 3  people  0.03054378
## 4 freedom  0.02326505
## 5   peace  0.02238072
## 
## [[10]]
##        token probability
## 1    America  0.03439507
## 2 government  0.02684732
## 3     people  0.02202249
## 4       time  0.02087599
## 5     nation  0.01946676
  3. Use quanteda to tokenize the text and then convert the tokens into a data.frame. Here's how, including a new function that will probably be added as a convert() method for tokens (noting that without POS annotations, we cannot select just the nouns as in 1 and 2 above):
library("quanteda")
## Package version: 2.0.0.9000

toks_q <- tokens(tail(data_corpus_inaugural, 20),
  remove_punct = TRUE, remove_numbers = TRUE
) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en"))

#' Coerce tokens to a data.frame
#'
#' Converts a tokens object into a data.frame with columns "doc_id" and
#' "tokens", one row per token.
#' @param x a [tokens] object
#' @return A data.frame with columns "doc_id" and "tokens", one row per token.
#' @export
#' @method as.data.frame tokens
#' @examples
#' toks <- tokens(c(d1 = "A b b c", d2 = "x y z"))
#' as.data.frame(toks)
as.data.frame.tokens <- function(x) {
  data.frame(
    doc_id = rep(names(x), lengths(x)),
    tokens = unlist(x, use.names = FALSE)
  )
}

# will not need the full name (with .tokens) if fn is in the package
model2 <- BTM(as.data.frame.tokens(toks_q), k = 10, beta = 0.01, iter = 100, trace = 10)
## 2020-03-12 16:28:19 Start Gibbs sampling iteration 1/100
## 2020-03-12 16:28:20 Start Gibbs sampling iteration 11/100
## 2020-03-12 16:28:21 Start Gibbs sampling iteration 21/100
## 2020-03-12 16:28:23 Start Gibbs sampling iteration 31/100
## 2020-03-12 16:28:24 Start Gibbs sampling iteration 41/100
## 2020-03-12 16:28:25 Start Gibbs sampling iteration 51/100
## 2020-03-12 16:28:26 Start Gibbs sampling iteration 61/100
## 2020-03-12 16:28:28 Start Gibbs sampling iteration 71/100
## 2020-03-12 16:28:29 Start Gibbs sampling iteration 81/100
## 2020-03-12 16:28:30 Start Gibbs sampling iteration 91/100
model2$theta
##  [1] 0.08564881 0.12982954 0.09124488 0.14058082 0.09187240 0.07941331
##  [7] 0.08349220 0.10432355 0.10445064 0.08914387
topicterms2 <- terms(model2, top_n = 5)
topicterms2
## [[1]]
##    token probability
## 1  world 0.014989804
## 2  peace 0.012626687
## 3    now 0.010819598
## 4   must 0.010263571
## 5 people 0.009753879
## 
## [[2]]
##     token probability
## 1 america 0.013636932
## 2    must 0.011221449
## 3  nation 0.010441767
## 4      us 0.010151298
## 5     can 0.009799677
## 
## [[3]]
##     token probability
## 1      us 0.018051018
## 2     can 0.010091267
## 3  people 0.010004276
## 4     one 0.009634560
## 5 america 0.009373585
## 
## [[4]]
##   token probability
## 1    us  0.03219201
## 2   new  0.01798805
## 3   let  0.01714090
## 4 world  0.01294748
## 5   can  0.01290513
## 
## [[5]]
##    token probability
## 1  world 0.012160742
## 2    new 0.011361560
## 3   know 0.010497579
## 4  every 0.009957592
## 5 people 0.009763196
## 
## [[6]]
##     token probability
## 1 freedom 0.013367450
## 2   world 0.012792785
## 3  people 0.010743981
## 4      us 0.010718995
## 5     new 0.009369783
## 
## [[7]]
##       token probability
## 1 president  0.02234005
## 2    fellow  0.01561434
## 3     today  0.01166923
## 4  citizens  0.01159793
## 5        mr  0.01157416
## 
## [[8]]
##     token probability
## 1   world  0.01928973
## 2 nations  0.01776788
## 3   peace  0.01634114
## 4     can  0.01236528
## 5 freedom  0.01101463
## 
## [[9]]
##    token probability
## 1   must 0.019418278
## 2     us 0.019285278
## 3    can 0.013718252
## 4 nation 0.011818244
## 5    let 0.009956235
## 
## [[10]]
##     token probability
## 1      us 0.020435034
## 2 freedom 0.011219335
## 3 liberty 0.010150848
## 4  nation 0.010106328
## 5 century 0.009905987

Finally - consider fcm()

The Python package at https://github.com/xiaohuiyan/BTM - like many similar implementations of methods - combines the textual data preparation with the estimation procedure. The whole idea of quanteda is that we should separate the preparation of textual data from its analysis. The bi-term co-occurrences could be computed from the feature co-occurrence matrix produced very nicely and very efficiently by fcm(), which can count co-occurrences within fixed windows or within entire documents. I suspect this would be a good and flexible input for BTM. This is exactly how we feed the GloVe model in text2vec, for instance (tutorial here).
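A minimal sketch of the fcm() usage described above (fcm() and its arguments are real quanteda functions; turning the resulting matrix into BTM input would still require a converter):

```r
library("quanteda")

toks <- tokens(c(d1 = "short texts need biterm topic models",
                 d2 = "biterm models suit short sparse texts")) %>%
  tokens_remove(stopwords("en"), padding = TRUE)

# co-occurrences within a 5-token window
fcmat_window <- fcm(toks, context = "window", window = 5)

# or co-occurrences within whole documents
fcmat_doc <- fcm(toks, context = "document")
```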

kbenoit avatar Mar 12 '20 05:03 kbenoit

@kbenoit thanks for the as.data.frame function (but don't forget to use stringsAsFactors = FALSE)

Regarding the remark:

The whole idea of quanteda is that we should separate the preparation of textual data from its analysis.

I was thinking about allowing biterms (e.g. computed with fcm) to be provided directly to BTM, but I haven't had time to do so yet. It would require changing https://github.com/bnosac/BTM/blob/master/src/rcpp_BTM.cpp#L27 and https://github.com/bnosac/BTM/blob/master/src/doc.h#L40 so that a user could feed in the biterms directly.

FYI use cases I had where BTM works better than for example LDA

  1. text from emails where the email subject is used in BTM
  2. short search queries in a search bar or at knowledge portals
  3. short twitter messages
  4. short answers on surveys
  • Generally for use cases 1/2, the texts are very short; simple word splitting/tokenisation as implemented in quanteda will work fine, and with basic removal of irrelevant words you can get far. For these use cases, allowing the user to feed in biterms directly is not really necessary.
  • Generally for use cases 3/4, I extract either only nouns/proper nouns (Twitter / surveys) or only adjectives (surveys) to do the BTM clustering upon (using the lemmas & POS tags from udpipe, but spaCy is equally fine if you are only working in English).
    • For this, it looks like fcm cannot find adjectives within, say, a 7-word range, as POS tagging is not 'integrated' in fcm.
    • In general I would rather use the results of dependency parsing (as provided by udpipe/spacyr) to construct co-occurrences and feed these in as biterms for use cases 3/4.
  • If you have other use cases, feel free to let me know. Regarding the way of providing biterms, I could provide an argument where you can give a data.frame with structure doc_id, word1, word2 to pass on the list of biterms you computed upfront.
  • Just a note: I generally always set background = TRUE when I call BTM, to have a topic which covers general word usage in the text.
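To illustrate, the proposed doc_id/word1/word2 structure for precomputed biterms could look like this (a sketch of the proposed interface, not a released API; the pairs here are made up):

```r
# hypothetical precomputed biterms, one co-occurring word pair per row,
# e.g. derived from dependency parses or window co-occurrences
biterms <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  word1  = c("topic", "topic", "short"),
  word2  = c("model", "biterm", "text"),
  stringsAsFactors = FALSE
)
```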

Feel free to create an issue at https://github.com/bnosac/BTM if you want to provide biterms directly to BTM.

jwijffels avatar Mar 12 '20 08:03 jwijffels

Sounds good, will take you up on that soon! This is a good test of how we should be able to implement keeping auxiliary information on POS for each word type, when tokenization and tagging is performed by an NLP tool such as udpipe or spacyr. I was very pleased that the spacy_parsed output worked out of the box with BTM. I like to think this is an example of a successful outcome of the text software developers workshops! We should have another one...

The way to compute co-occurrences with fcm() would be to tag the tokens, then select them, removing the unwanted types but keeping the placeholder (padding = TRUE - see the text2vec example), and then use fcm() to compute whatever proximities are needed. If I could get my head around the format needed as BTM input, I could devise a converter specifically for that format.
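The placeholder step looks like this (a small sketch using quanteda's tokens_remove() with its padding argument; the example sentence is arbitrary):

```r
library("quanteda")

toks <- tokens(c(d1 = "we hold these truths to be self-evident"))
toks_kept <- tokens_remove(toks, stopwords("en"), padding = TRUE)

# removed tokens become "" pads, so window-based co-occurrence
# distances computed by fcm() still reflect original token positions
as.list(toks_kept)
```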

But as stated this is a good test and rationale for ways to integrate pos tags with tokens. In quanteda v2 we made it much easier to feed externally tokenized and tagged texts into quanteda objects (like tokens) and this applies to udpipe output. Here the point about separation also applies - let taggers and segmenters do what they do best, and use that as input to all of the convenient manipulators offered in the middle by quanteda.

I missed the stringsAsFactors, thanks. This is moving to a default of FALSE in upcoming R 4.0, with plans to phase this out entirely in subsequent releases. Good riddance.

kbenoit avatar Mar 12 '20 10:03 kbenoit

I like to think this is an example of a successful outcome of the text software developers workshops! We should have another one...

There are certainly beneficial spillover effects

let taggers and segmenters do what they do best

Completely agree on this, as there are many taggers and segmenters already. In particular I wonder what quanteda will come up with to handle subword segmenters like https://github.com/bnosac/sentencepiece or https://github.com/bnosac/tokenizers.bpe in combination with the word-based segmentation tools which quanteda is focussing on, such that one will be able to align words with subwords. These could then be used efficiently alongside BERT-like models (e.g. https://github.com/bnosac/golgotha), which basically provide embeddings at different levels of granularity that can next be easily plugged into SVM models or NER models or the like. But this probably needs another issue...

jwijffels avatar Mar 12 '20 11:03 jwijffels

This works:

sentencepiece_encode(model, txt, type = "subwords") %>%
  quanteda::as.tokens()
# Tokens consisting of 2 documents.
# text1 :
#  [1] "▁give"   "▁me"     "▁back"   "▁my"     "▁money"  "▁or"     "▁i"      "'"       "ll"      "▁call"   "▁the"   
# [12] "▁police"
# [ ... and 1 more ]
# 
# text2 :
#  [1] "▁talk"    "▁to"      "▁the"     "▁hand"    "▁because" "▁the"     "▁face"    "▁don"     "'"        "t"       
# [11] "▁want"    "▁to"     
# [ ... and 5 more ]

Basically any list input can work, either as an input to as.tokens() or to tokens() for all of the remove_* options (updated a lot in v2).
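For instance, a minimal sketch with as.tokens(), which accepts any named list of character vectors (the subword strings here are made up to mimic the sentencepiece output above):

```r
library("quanteda")

# any externally produced segmentation, as a named list of characters
subword_list <- list(text1 = c("\u2581give", "\u2581me", "\u2581back"),
                     text2 = c("\u2581talk", "\u2581to", "\u2581the", "\u2581hand"))
toks <- as.tokens(subword_list)
ntoken(toks)  # token counts per document: 3 and 4
```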

golgotha looks really cool but the example code at https://github.com/bnosac/golgotha crashes my R on macOS.

kbenoit avatar Mar 13 '20 05:03 kbenoit

That's a nice improvement in v2. It doesn't surprise me that golgotha crashes R sessions, given the heavy models and that it interfaces to transformers; it would be nicer if there were a direct C++ wrapper to libtorch. I'm testing some changes to BTM for incorporating your own set of biterms and will let you know here when this is uploaded to CRAN.

jwijffels avatar Mar 13 '20 08:03 jwijffels

FYI. Changes have been made to the BTM package and the package has been pushed to CRAN. You can now also feed precomputed biterms into BTM. Examples with biterms of nouns/adjectives, and biterms based on dependency parsing, are shown in the package README at https://github.com/bnosac/BTM

jwijffels avatar Mar 15 '20 10:03 jwijffels

FYI. I put the functionality for plotting a BTM object in a separate textplot package, available at https://CRAN.R-project.org/package=textplot. Vignette at https://cran.r-project.org/web/packages/textplot/vignettes/textplot-examples.pdf

jwijffels avatar May 04 '20 12:05 jwijffels

I plotted the model and need to overcome a couple of things:

  1. I set k = 9, but the plot shows only 8 topics. How can I fix this?
  2. How can I adjust the size of the terms (or tokens) in the graph? Some terms are really too small to see.

DQKF avatar Apr 08 '23 17:04 DQKF