spacy_parse() and spacy_tokenize() do not preserve docvars from corpus
Dear spacyr maintainers,
First, let me thank you sincerely for quanteda and spacyr. I have been using the former for months and the latter more recently, and I really appreciate the care and excellent work that went into crafting these packages!
I recently encountered the following issue: when trying to add spacyr to my quanteda workflow, I found that spacy_parse() and spacy_tokenize() produce great output but remove the document variables (docvars) in the process. This is especially unfortunate when the docvars are needed downstream, e.g. for structural topic models. I verified with str() that the docvars are present in the corpus object before processing and absent from the resulting parse and tokens objects. The standard quanteda tokenizer preserves all docvars downstream just fine.
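For reference, a minimal sketch of the behavior (the corpus contents and the "party" docvar are made up for illustration; this assumes a working spaCy installation initialized via spacy_initialize()):

```r
library(quanteda)
library(spacyr)
# spacy_initialize()  # requires a spaCy model, e.g. en_core_web_sm

corp <- corpus(c(doc1 = "Dublin is the capital of Ireland.",
                 doc2 = "The vote took place in 1991."),
               docvars = data.frame(party = c("A", "B")))

docvars(corp)              # docvars are present on the corpus
parsed <- spacy_parse(corp)
str(parsed)                # the parsed data.frame has no docvars columns
```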
Code
### Parse setup
parse.out <- spacy_parse(quanteda_corpus,
                         pos = FALSE,
                         tag = FALSE,
                         lemma = FALSE,
                         entity = TRUE,  # activate named entity recognition
                         dependency = FALSE,
                         nounphrase = FALSE,
                         multithread = FALSE)
### Tokenize setup
tokens.out <- spacy_tokenize(quanteda_corpus,
                             what = "word",
                             remove_punct = TRUE,
                             remove_url = TRUE,
                             remove_numbers = TRUE,
                             remove_separators = TRUE,
                             remove_symbols = TRUE,
                             padding = FALSE,
                             multithread = FALSE,
                             output = "data.frame")
Session Info
R version 3.6.3 (2020-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 31 (Workstation Edition)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tidystm_0.0.0.9000 spacyr_1.2.1 qgraph_1.6.5 igraph_1.2.5
[5] dplyr_0.8.5 stringr_1.4.0 lubridate_1.7.4 stm_1.3.5
[9] readtext_0.76 quanteda_2.0.1
loaded via a namespace (and not attached):
[1] httr_1.4.1 jsonlite_1.6.1 splines_3.6.3
[4] gtools_3.8.1 RcppParallel_5.0.1 Formula_1.2-3
[7] assertthat_0.2.1 BDgraph_2.62 stats4_3.6.3
[10] latticeExtra_0.6-29 d3Network_0.5.2.1 pbivnorm_0.6.0
[13] pillar_1.4.4 backports_1.1.7 lattice_0.20-38
[16] reticulate_1.16 glue_1.4.1 digest_0.6.25
[19] RColorBrewer_1.1-2 checkmate_2.0.0 colorspace_1.4-1
[22] htmltools_0.4.0 Matrix_1.2-18 plyr_1.8.6
[25] psych_1.9.12.31 pkgconfig_2.0.3 purrr_0.3.4
[28] corpcor_1.6.9 scales_1.1.1 whisker_0.4
[31] glasso_1.11 jpeg_0.1-8.1 fdrtool_1.2.15
[34] huge_1.3.4.1 htmlTable_1.13.3 tibble_3.0.1
[37] ggplot2_3.3.1 usethis_1.6.1 ellipsis_0.3.1
[40] pbapply_1.4-2 nnet_7.3-12 mnormt_1.5-7
[43] survival_3.1-8 magrittr_1.5 crayon_1.3.4
[46] stopwords_2.0 fs_1.4.1 nlme_3.1-144
[49] MASS_7.3-51.5 foreign_0.8-75 tools_3.6.3
[52] data.table_1.12.8 lifecycle_0.2.0 munsell_0.5.0
[55] cluster_2.1.0 compiler_3.6.3 rlang_0.4.6
[58] grid_3.6.3 rstudioapi_0.11 rjson_0.2.20
[61] htmlwidgets_1.5.1 lavaan_0.6-6 base64enc_0.1-3
[64] gtable_0.3.0 abind_1.4-5 reshape2_1.4.4
[67] R6_2.4.1 gridExtra_2.3 knitr_1.28
[70] fastmatch_1.1-0 Hmisc_4.4-0 stringi_1.4.6
[73] parallel_3.6.3 Rcpp_1.0.4.6 vctrs_0.3.1
[76] rpart_4.1-15 acepack_1.4.1 png_0.1-7
[79] tidyselect_1.0.0 xfun_0.14
You are absolutely right. We have not yet fully integrated spacyr tokenization into the quanteda workflow. A workaround for the moment is:
- convert the spacy_tokenize() output to a quanteda tokens object with as.tokens()
- assign the docvars with docvars()

Like this:
library(spacyr)
library(quanteda)
library(tidyverse)
toks <- data_corpus_dailnoconf1991 %>%
  texts() %>%
  spacy_tokenize() %>%
  as.tokens()
docvars(toks) <- docvars(data_corpus_dailnoconf1991)
This is a bit tedious, and we may change it in the future. What do you think @kbenoit?
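Until then, the two steps can be wrapped in a small pipeable helper. The function below is a hypothetical convenience wrapper, not part of spacyr, and simply combines the conversion and docvars reassignment shown above:

```r
library(spacyr)
library(quanteda)

# Hypothetical helper: tokenize a quanteda corpus with spaCy, then
# re-attach the corpus docvars to the resulting tokens object.
spacy_tokens_with_docvars <- function(corp, ...) {
  toks <- corp %>%
    texts() %>%
    spacy_tokenize(...) %>%
    as.tokens()
  docvars(toks) <- docvars(corp)
  toks
}

toks <- spacy_tokens_with_docvars(data_corpus_dailnoconf1991)
```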