tidytext
Example needed for tidy approach for stm modeling with covariates
The current tidytext documentation on the tidy approach to stm objects has no specific example of how to add covariates.
I tried this with the stm::gadarian data, using the covariate formula prevalence = ~ treatment + s(pid_rep), but I ran into an error.
Would you mind adding an example of how to fit this kind of model to the tidytext package documentation?
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
library(tidytext)
glimpse(gadarian)
#> Rows: 341
#> Columns: 4
#> $ MetaID <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ treatment <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0,...
#> $ pid_rep <dbl> 1.00000, 1.00000, 0.33300, 0.50000, 0.66667, 0....
#> $ open.ended.response <chr> "problems caused by the influx of illegal immig...
gadarian2 <- gadarian %>%
  mutate(document = row_number())
gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"
topic_model <- stm(gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = gadarian2,
  verbose = FALSE
)
#> Error in stm(gadarian_sparse, K = 3, init.type = "Spectral", prevalence = ~treatment + : number of observations in content covariate (335) prevalence covariate (341) and documents (335) are not all equal.
Created on 2020-05-03 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.0 (2020-04-24)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2020-05-03
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.1.6 2020-04-05 [1] CRAN (R 4.0.0)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
#> data.table 1.12.8 2019-12-09 [1] CRAN (R 4.0.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
#> devtools 2.2.2.9000 2020-05-01 [1] Github (r-lib/devtools@b166195)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
#> dplyr * 0.8.5 2020-03-07 [1] CRAN (R 4.0.0)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
#> fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.0)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.0)
#> glue 1.4.0 2020-04-03 [1] CRAN (R 4.0.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 4.0.0)
#> janeaustenr 0.1.5 2017-06-10 [1] CRAN (R 4.0.0)
#> knitr 1.28.5 2020-04-28 [1] Github (yihui/knitr@93b46ba)
#> lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.0)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
#> Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0)
#> pillar 1.4.3 2019-12-20 [1] CRAN (R 4.0.0)
#> pkgbuild 1.0.7 2020-04-25 [1] CRAN (R 4.0.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 4.0.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 4.0.0)
#> ps 1.3.2 2020-02-13 [1] CRAN (R 4.0.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
#> Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 4.0.0)
#> rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0)
#> rmarkdown 2.1.3 2020-05-03 [1] Github (rstudio/rmarkdown@d7e1bda)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> SnowballC 0.7.0 2020-04-01 [1] CRAN (R 4.0.0)
#> stm * 1.3.5 2020-04-28 [1] Github (bstewart/stm@c95ef0b)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0)
#> tibble 3.0.1 2020-04-20 [1] CRAN (R 4.0.0)
#> tidyselect 1.0.0 2020-01-27 [1] CRAN (R 4.0.0)
#> tidytext * 0.2.4 2020-04-28 [1] Github (juliasilge/tidytext@a1c0220)
#> tokenizers 0.2.1 2018-03-29 [1] CRAN (R 4.0.0)
#> usethis 1.6.1.9000 2020-05-01 [1] Github (r-lib/usethis@4487260)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.0)
#> vctrs 0.2.4 2020-03-10 [1] CRAN (R 4.0.0)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0)
#> xfun 0.13.1 2020-04-30 [1] Github (yihui/xfun@bf8afdd)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Program Files/R/R-4.0.0/library
The main problem you are having is that removing stop words removes some documents entirely. Then, when you pass the data argument to stm() for the prevalence and/or content covariates, the numbers of observations don't line up: gadarian has more observations (341) than gadarian_sparse has documents (335). You can get this to work if you don't remove stop words:
library(tidytext)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
gadarian_sparse <- gadarian %>%
  mutate(document = row_number()) %>%
  unnest_tokens(word, open.ended.response) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
topic_model <- stm(
  gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = gadarian,
  verbose = FALSE
)
summary(topic_model)
#> A topic model with 3 topics, 341 documents and a 1512 word dictionary.
#> Topic 1 Top Words:
#> Highest Prob: the, to, of, people, is, in, country
#> FREX: from, come, coming, if, entering, illegally, united
#> Lift: afraid, if, mean, unsecured, been, entering, from
#> Score: the, to, from, coming, people, come, it
#> Topic 2 Top Words:
#> Highest Prob: that, and, a, i, they, not, our
#> FREX: that, they, we, have, pay, so, usa
#> Lift: asians, east, indians, usa, bums, contibution, goverment
#> Score: that, we, they, not, our, have, here
#> Topic 3 Top Words:
#> Highest Prob: for, immigrants, illegal, of, and, jobs, our
#> FREX: security, social, job, health, mexico, workers, loss
#> Lift: caused, ducation, hospitals, lowering, quality, bombings, killing
#> Score: illegal, for, security, jobs, immigrants, loss, our
Created on 2020-05-04 by the reprex package (v0.3.0)
If removing stop words is important for your topic model, another option is to create a new data frame of covariates that contains only the observations in gadarian_sparse.
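To make that option concrete, here is a minimal sketch on a toy data frame rather than gadarian itself. The names toy, toy_sparse, dropped, and toy_covariates are made up for illustration, and the toy text is chosen so that the second response contains only stop words. It shows how to see which documents disappear and how to keep only the matching covariate rows.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for gadarian (hypothetical data): document 2 is all stop words
toy <- tibble(
  document = as.character(1:3),
  treatment = c(1, 0, 1),
  text = c(
    "illegal immigration jobs welfare",
    "they are not for that",
    "border security taxes"
  )
)

toy_sparse <- toy %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

# Documents that lost every token and so have no row in the sparse matrix
dropped <- setdiff(toy$document, rownames(toy_sparse))

# Covariates for the surviving documents only, in the matrix's row order
toy_covariates <- tibble(document = rownames(toy_sparse)) %>%
  inner_join(toy, by = "document")
```

Starting the join from rownames(toy_sparse) keeps the covariate rows in the same order as the matrix rows, which matters because stm() matches documents and covariates by position.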
I think a good option would be to rewrite / expand the topic modeling vignette to use stm throughout and add a section for document-level covariates. It needs some updating anyway.
Thank you very much for your kind explanation, @juliasilge!
Following your advice, I got it to work. What do you think about my approach below?
library(tidytext)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
gadarian2 <- gadarian %>%
  mutate(document = as.character(row_number()))
gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"
covariate_df <- tibble(document = rownames(gadarian_sparse)) %>%
  inner_join(gadarian2)
#> Joining, by = "document"
topic_model <- stm(gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = covariate_df,
  verbose = FALSE
)
summary(topic_model)
#> A topic model with 3 topics, 335 documents and a 1160 word dictionary.
#> Topic 1 Top Words:
#> Highest Prob: taxes, security, illegals, immigrants, english, language, social
#> FREX: 1, law, taxes, terrorists, due, lost, 3
#> Lift: extent, fined, fullest, ileagles, on't, sneack, buttons
#> Score: 1, assimilate, security, english, law, 3, recieve
#> Topic 2 Top Words:
#> Highest Prob: jobs, illegal, immigration, welfare, country, care, americans
#> FREX: healthcare, cost, hospitals, strain, welfare, lack, im
#> Lift: crowding, hospitals, cheap, draining, allowing, immigrates, sealing
#> Score: jobs, im, cost, loss, welfare, capitalist, question
#> Topic 3 Top Words:
#> Highest Prob: people, immigrants, illegal, country, immigration, coming, border
#> FREX: people, coming, live, illegally, process, means, support
#> Lift: live, coming, term, false, process, required, people
#> Score: people, coming, process, illegally, stop, businesses, suffering
Created on 2020-05-04 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.0 (2020-04-24)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2020-05-04
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.1.6 2020-04-05 [1] CRAN (R 4.0.0)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0)
#> data.table 1.12.8 2019-12-09 [1] CRAN (R 4.0.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
#> devtools 2.2.2.9000 2020-05-01 [1] Github (r-lib/devtools@b166195)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0)
#> dplyr * 0.8.5 2020-03-07 [1] CRAN (R 4.0.0)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0)
#> fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.0)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.0)
#> glue 1.4.0 2020-04-03 [1] CRAN (R 4.0.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.0)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 4.0.0)
#> janeaustenr 0.1.5 2017-06-10 [1] CRAN (R 4.0.0)
#> knitr 1.28.5 2020-04-28 [1] Github (yihui/knitr@93b46ba)
#> lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.0)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0)
#> Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.0)
#> matrixStats 0.56.0 2020-03-13 [1] CRAN (R 4.0.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0)
#> pillar 1.4.3 2019-12-20 [1] CRAN (R 4.0.0)
#> pkgbuild 1.0.7 2020-04-25 [1] CRAN (R 4.0.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 4.0.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
#> processx 3.4.2 2020-02-09 [1] CRAN (R 4.0.0)
#> ps 1.3.2 2020-02-13 [1] CRAN (R 4.0.0)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0)
#> Rcpp 1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 4.0.0)
#> rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0)
#> rmarkdown 2.1.3 2020-05-03 [1] Github (rstudio/rmarkdown@d7e1bda)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> SnowballC 0.7.0 2020-04-01 [1] CRAN (R 4.0.0)
#> stm * 1.3.5 2020-04-28 [1] Github (bstewart/stm@c95ef0b)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0)
#> tibble 3.0.1 2020-04-20 [1] CRAN (R 4.0.0)
#> tidyselect 1.0.0 2020-01-27 [1] CRAN (R 4.0.0)
#> tidytext * 0.2.4 2020-04-28 [1] Github (juliasilge/tidytext@a1c0220)
#> tokenizers 0.2.1 2018-03-29 [1] CRAN (R 4.0.0)
#> usethis 1.6.1.9000 2020-05-01 [1] Github (r-lib/usethis@4487260)
#> vctrs 0.2.4 2020-03-10 [1] CRAN (R 4.0.0)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0)
#> xfun 0.13.1 2020-04-30 [1] Github (yihui/xfun@bf8afdd)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Program Files/R/R-4.0.0/library
Yep, that is what I would do! 🙌
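For anyone reusing this pattern, a small guard before the stm() call can catch a mismatch earlier and with a clearer message. This is only a sketch: check_covariates() is a hypothetical helper, not part of stm or tidytext, and it assumes the covariate data frame has an id column (here called document) that should match the row names of the document-term matrix.

```r
# Hypothetical helper (not from stm or tidytext): stop early if the covariate
# rows do not line up one-to-one, in order, with the rows of the matrix
check_covariates <- function(dtm, covariates, id_col = "document") {
  if (nrow(dtm) != nrow(covariates)) {
    stop("dtm has ", nrow(dtm), " documents but covariates has ",
         nrow(covariates), " rows", call. = FALSE)
  }
  if (!identical(rownames(dtm), as.character(covariates[[id_col]]))) {
    stop("covariate rows are not in the same order as the dtm rows",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Toy usage with a plain matrix standing in for the sparse matrix
toy_dtm <- matrix(1, nrow = 2, ncol = 1, dimnames = list(c("1", "3"), "jobs"))
toy_cov <- data.frame(document = c("1", "3"), treatment = c(1, 1))
check_covariates(toy_dtm, toy_cov)
```

With the objects from this thread, the equivalent call would be check_covariates(gadarian_sparse, covariate_df), placed just before stm().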