tidytext Example needed for tidy approach for stm modeling with covariates

The current tidytext documentation on the tidy approach to stm objects has no specific example of how to add covariates.

I tried this with the stm::gadarian data using the covariate formula prevalence = ~treatment + s(pid_rep), but I ran into an error.

Would you mind adding an example of how to fit this kind of model to the tidytext package documentation?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com
library(tidytext)

glimpse(gadarian)
#> Rows: 341
#> Columns: 4
#> $ MetaID              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ treatment           <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0,...
#> $ pid_rep             <dbl> 1.00000, 1.00000, 0.33300, 0.50000, 0.66667, 0....
#> $ open.ended.response <chr> "problems caused by the influx of illegal immig...

gadarian2 <- gadarian %>%
  mutate(document = row_number())

gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"

topic_model <- stm(gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = gadarian2,
  verbose = FALSE
)
#> Error in stm(gadarian_sparse, K = 3, init.type = "Spectral", prevalence = ~treatment + : number of observations in content covariate (335) prevalence covariate (341) and documents (335) are not all equal.

Created on 2020-05-03 by the reprex package (v0.3.0)

Session info

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2020-05-03                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                              
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                      
#>  backports     1.1.6      2020-04-05 [1] CRAN (R 4.0.0)                      
#>  callr         3.4.3      2020-03-28 [1] CRAN (R 4.0.0)                      
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)                      
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)                      
#>  data.table    1.12.8     2019-12-09 [1] CRAN (R 4.0.0)                      
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)                      
#>  devtools      2.2.2.9000 2020-05-01 [1] Github (r-lib/devtools@b166195)     
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)                      
#>  dplyr       * 0.8.5      2020-03-07 [1] CRAN (R 4.0.0)                      
#>  ellipsis      0.3.0      2019-09-20 [1] CRAN (R 4.0.0)                      
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                      
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)                      
#>  fs            1.4.1      2020-04-04 [1] CRAN (R 4.0.0)                      
#>  generics      0.0.2      2018-11-29 [1] CRAN (R 4.0.0)                      
#>  glue          1.4.0      2020-04-03 [1] CRAN (R 4.0.0)                      
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)                      
#>  htmltools     0.4.0      2019-10-04 [1] CRAN (R 4.0.0)                      
#>  janeaustenr   0.1.5      2017-06-10 [1] CRAN (R 4.0.0)                      
#>  knitr         1.28.5     2020-04-28 [1] Github (yihui/knitr@93b46ba)        
#>  lattice       0.20-41    2020-04-02 [1] CRAN (R 4.0.0)                      
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                      
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)                      
#>  Matrix        1.2-18     2019-11-27 [1] CRAN (R 4.0.0)                      
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.0)                      
#>  pillar        1.4.3      2019-12-20 [1] CRAN (R 4.0.0)                      
#>  pkgbuild      1.0.7      2020-04-25 [1] CRAN (R 4.0.0)                      
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                      
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 4.0.0)                      
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                      
#>  processx      3.4.2      2020-02-09 [1] CRAN (R 4.0.0)                      
#>  ps            1.3.2      2020-02-13 [1] CRAN (R 4.0.0)                      
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                      
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)                      
#>  Rcpp          1.0.4.6    2020-04-09 [1] CRAN (R 4.0.0)                      
#>  remotes       2.1.1      2020-02-15 [1] CRAN (R 4.0.0)                      
#>  rlang         0.4.6      2020-05-02 [1] CRAN (R 4.0.0)                      
#>  rmarkdown     2.1.3      2020-05-03 [1] Github (rstudio/rmarkdown@d7e1bda)  
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)                      
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                      
#>  SnowballC     0.7.0      2020-04-01 [1] CRAN (R 4.0.0)                      
#>  stm         * 1.3.5      2020-04-28 [1] Github (bstewart/stm@c95ef0b)       
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)                      
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                      
#>  testthat      2.3.2      2020-03-02 [1] CRAN (R 4.0.0)                      
#>  tibble        3.0.1      2020-04-20 [1] CRAN (R 4.0.0)                      
#>  tidyselect    1.0.0      2020-01-27 [1] CRAN (R 4.0.0)                      
#>  tidytext    * 0.2.4      2020-04-28 [1] Github (juliasilge/tidytext@a1c0220)
#>  tokenizers    0.2.1      2018-03-29 [1] CRAN (R 4.0.0)                      
#>  usethis       1.6.1.9000 2020-05-01 [1] Github (r-lib/usethis@4487260)      
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 4.0.0)                      
#>  vctrs         0.2.4      2020-03-10 [1] CRAN (R 4.0.0)                      
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)                      
#>  xfun          0.13.1     2020-04-30 [1] Github (yihui/xfun@bf8afdd)         
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                      
#> 
#> [1] C:/Program Files/R/R-4.0.0/library

May 03 '20 21:05 jooyoungseo

The main problem is that removing stop words removes some documents entirely. Then, when you use the data argument in stm() for the prevalence and/or content covariates, the number of observations doesn't line up; there are more observations in gadarian (341) than in gadarian_sparse (335). You can get this to work if you don't remove stop words:

library(tidytext)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com

gadarian_sparse <- gadarian %>%
  mutate(document = row_number()) %>%
  unnest_tokens(word, open.ended.response) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

topic_model <- stm(
  gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = gadarian,
  verbose = FALSE
)

summary(topic_model)
#> A topic model with 3 topics, 341 documents and a 1512 word dictionary.
#> Topic 1 Top Words:
#>       Highest Prob: the, to, of, people, is, in, country 
#>       FREX: from, come, coming, if, entering, illegally, united 
#>       Lift: afraid, if, mean, unsecured, been, entering, from 
#>       Score: the, to, from, coming, people, come, it 
#> Topic 2 Top Words:
#>       Highest Prob: that, and, a, i, they, not, our 
#>       FREX: that, they, we, have, pay, so, usa 
#>       Lift: asians, east, indians, usa, bums, contibution, goverment 
#>       Score: that, we, they, not, our, have, here 
#> Topic 3 Top Words:
#>       Highest Prob: for, immigrants, illegal, of, and, jobs, our 
#>       FREX: security, social, job, health, mexico, workers, loss 
#>       Lift: caused, ducation, hospitals, lowering, quality, bombings, killing 
#>       Score: illegal, for, security, jobs, immigrants, loss, our

Created on 2020-05-04 by the reprex package (v0.3.0)
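
To see how a document can vanish when stop words are removed, here is a minimal base R sketch. It uses made-up toy responses and a tiny stand-in stop word list (both are assumptions for illustration; tidytext::stop_words is much larger), and simply reports which document ids have no tokens left after filtering.

```r
# Toy data (hypothetical, not taken from gadarian): three short responses
docs <- data.frame(
  document = 1:3,
  text = c("the border is open", "it is what it is", "jobs and taxes"),
  stringsAsFactors = FALSE
)

# A tiny stand-in stop word list (assumption for illustration only)
toy_stop_words <- c("the", "is", "it", "what", "and")

# Split each response on whitespace: one row per (document, word)
tokens <- do.call(rbind, lapply(seq_len(nrow(docs)), function(i) {
  data.frame(
    document = docs$document[i],
    word = strsplit(docs$text[i], " +")[[1]],
    stringsAsFactors = FALSE
  )
}))

# Drop stop words, as anti_join(stop_words) would
kept <- tokens[!tokens$word %in% toy_stop_words, ]

# Documents that lost every token no longer appear at all
dropped <- setdiff(docs$document, unique(kept$document))
dropped
```

Running the same kind of setdiff() between the original document ids and the row names of the sparse matrix should show which responses were dropped in the real data.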

If removing stop words is important for your topic model, another option is to create a new data frame of covariates that contains only the observations in gadarian_sparse.
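
As a rough sketch of that idea in base R, the covariate rows can be matched to the row names of the document-term matrix. The objects below (dtm, covariates) are hypothetical toy stand-ins, not the real gadarian data, and a plain matrix is used in place of the sparse matrix that cast_sparse() returns.

```r
# Toy stand-in for the document-term matrix; row names are the surviving
# document ids (hypothetical: document "2" was dropped during preprocessing)
dtm <- matrix(
  c(1, 0, 0, 2),
  nrow = 2,
  dimnames = list(c("1", "3"), c("border", "jobs"))
)

# Toy covariates with one row per original document
covariates <- data.frame(
  document = c("1", "2", "3"),
  treatment = c(1, 0, 1),
  pid_rep = c(1, 0.5, 0),
  stringsAsFactors = FALSE
)

# Keep only the covariate rows whose document survived, in matrix row order
covariate_df <- covariates[match(rownames(dtm), covariates$document), ]

# Fail early if the rows do not line up one-to-one with the matrix
stopifnot(identical(covariate_df$document, rownames(dtm)))
nrow(covariate_df) == nrow(dtm)
```

The stopifnot() check is there because stm() only compares the number of rows; it is up to the caller to make sure the covariate rows are in the same order as the documents.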

I think a good option would be to rewrite / expand the topic modeling vignette to use stm throughout and add a section for document-level covariates. It needs some updating anyway.

May 04 '20 19:05 juliasilge

Thank you very much for your kind explanation, @juliasilge!

Following your advice, I got it to work. What do you think of my approach below?

library(tidytext)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com

gadarian2 <- gadarian %>%
  mutate(document = as.character(row_number()))

gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"

covariate_df <- tibble(document = rownames(gadarian_sparse)) %>%
  inner_join(gadarian2)
#> Joining, by = "document"

topic_model <- stm(gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = covariate_df,
  verbose = FALSE
)

summary(topic_model)
#> A topic model with 3 topics, 335 documents and a 1160 word dictionary.
#> Topic 1 Top Words:
#>       Highest Prob: taxes, security, illegals, immigrants, english, language, social 
#>       FREX: 1, law, taxes, terrorists, due, lost, 3 
#>       Lift: extent, fined, fullest, ileagles, on't, sneack, buttons 
#>       Score: 1, assimilate, security, english, law, 3, recieve 
#> Topic 2 Top Words:
#>       Highest Prob: jobs, illegal, immigration, welfare, country, care, americans 
#>       FREX: healthcare, cost, hospitals, strain, welfare, lack, im 
#>       Lift: crowding, hospitals, cheap, draining, allowing, immigrates, sealing 
#>       Score: jobs, im, cost, loss, welfare, capitalist, question 
#> Topic 3 Top Words:
#>       Highest Prob: people, immigrants, illegal, country, immigration, coming, border 
#>       FREX: people, coming, live, illegally, process, means, support 
#>       Lift: live, coming, term, false, process, required, people 
#>       Score: people, coming, process, illegally, stop, businesses, suffering

Created on 2020-05-04 by the reprex package (v0.3.0)

Session info

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2020-05-04                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                              
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                      
#>  backports     1.1.6      2020-04-05 [1] CRAN (R 4.0.0)                      
#>  callr         3.4.3      2020-03-28 [1] CRAN (R 4.0.0)                      
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)                      
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)                      
#>  data.table    1.12.8     2019-12-09 [1] CRAN (R 4.0.0)                      
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)                      
#>  devtools      2.2.2.9000 2020-05-01 [1] Github (r-lib/devtools@b166195)     
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)                      
#>  dplyr       * 0.8.5      2020-03-07 [1] CRAN (R 4.0.0)                      
#>  ellipsis      0.3.0      2019-09-20 [1] CRAN (R 4.0.0)                      
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                      
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)                      
#>  fs            1.4.1      2020-04-04 [1] CRAN (R 4.0.0)                      
#>  generics      0.0.2      2018-11-29 [1] CRAN (R 4.0.0)                      
#>  glue          1.4.0      2020-04-03 [1] CRAN (R 4.0.0)                      
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)                      
#>  htmltools     0.4.0      2019-10-04 [1] CRAN (R 4.0.0)                      
#>  janeaustenr   0.1.5      2017-06-10 [1] CRAN (R 4.0.0)                      
#>  knitr         1.28.5     2020-04-28 [1] Github (yihui/knitr@93b46ba)        
#>  lattice       0.20-41    2020-04-02 [1] CRAN (R 4.0.0)                      
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                      
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)                      
#>  Matrix        1.2-18     2019-11-27 [1] CRAN (R 4.0.0)                      
#>  matrixStats   0.56.0     2020-03-13 [1] CRAN (R 4.0.0)                      
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.0)                      
#>  pillar        1.4.3      2019-12-20 [1] CRAN (R 4.0.0)                      
#>  pkgbuild      1.0.7      2020-04-25 [1] CRAN (R 4.0.0)                      
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                      
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 4.0.0)                      
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                      
#>  processx      3.4.2      2020-02-09 [1] CRAN (R 4.0.0)                      
#>  ps            1.3.2      2020-02-13 [1] CRAN (R 4.0.0)                      
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                      
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)                      
#>  Rcpp          1.0.4.6    2020-04-09 [1] CRAN (R 4.0.0)                      
#>  remotes       2.1.1      2020-02-15 [1] CRAN (R 4.0.0)                      
#>  rlang         0.4.6      2020-05-02 [1] CRAN (R 4.0.0)                      
#>  rmarkdown     2.1.3      2020-05-03 [1] Github (rstudio/rmarkdown@d7e1bda)  
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)                      
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                      
#>  SnowballC     0.7.0      2020-04-01 [1] CRAN (R 4.0.0)                      
#>  stm         * 1.3.5      2020-04-28 [1] Github (bstewart/stm@c95ef0b)       
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)                      
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                      
#>  testthat      2.3.2      2020-03-02 [1] CRAN (R 4.0.0)                      
#>  tibble        3.0.1      2020-04-20 [1] CRAN (R 4.0.0)                      
#>  tidyselect    1.0.0      2020-01-27 [1] CRAN (R 4.0.0)                      
#>  tidytext    * 0.2.4      2020-04-28 [1] Github (juliasilge/tidytext@a1c0220)
#>  tokenizers    0.2.1      2018-03-29 [1] CRAN (R 4.0.0)                      
#>  usethis       1.6.1.9000 2020-05-01 [1] Github (r-lib/usethis@4487260)      
#>  vctrs         0.2.4      2020-03-10 [1] CRAN (R 4.0.0)                      
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)                      
#>  xfun          0.13.1     2020-04-30 [1] Github (yihui/xfun@bf8afdd)         
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                      
#> 
#> [1] C:/Program Files/R/R-4.0.0/library

May 04 '20 23:05 jooyoungseo

Yep, that is what I would do! 🙌

May 04 '20 23:05 juliasilge
