
spacy_parse() and spacy_tokenize() do not preserve docvars from corpus

Open · Mstes opened this issue 5 years ago · 1 comment

Dear spacyr maintainers,

First, let me thank you sincerely for quanteda and spacyr. I have been using the former for months and the latter more recently, and I really appreciate the care and excellent work that went into crafting these packages!

I recently encountered the following issue: when adding spacyr to my quanteda workflow, I found that spacy_parse() and spacy_tokenize() produce excellent output but drop the document variables (docvars) in the process. This is especially unfortunate when the docvars are needed downstream, e.g. for structural topic models. I verified with str() that the docvars are present in the corpus object before processing but absent from the resulting parse and tokens objects. The standard quanteda tokenizer preserves all docvars downstream just fine.
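A minimal sketch of the behaviour described above, using a toy corpus (the "party" docvar is a hypothetical example; the spacy_parse() call is commented out because it requires a working spaCy installation):

```r
library(quanteda)

# Build a tiny corpus with one docvar attached
corp <- corpus(c(d1 = "Text one.", d2 = "Text two."),
               docvars = data.frame(party = c("A", "B")))

docvars(corp)         # "party" is present in the corpus
toks <- tokens(corp)  # quanteda's own tokenizer...
docvars(toks)         # ...carries the docvars through

# library(spacyr)
# parsed <- spacy_parse(corp)
# names(parsed)       # no "party" column: the docvars are gone
```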

Code

### Parse setup

parse.out <- spacy_parse(quanteda_corpus,
                         pos = FALSE,
                         tag = FALSE,
                         lemma = FALSE,
                         entity = TRUE, # activate named entity recognition
                         dependency = FALSE,
                         nounphrase = FALSE,
                         multithread = FALSE)


### Spacy_tokenize setup

tokens.out <- spacy_tokenize(quanteda_corpus,
                             what = "word",
                             remove_punct = TRUE,
                             remove_url = TRUE,
                             remove_numbers = TRUE,
                             remove_separators = TRUE,
                             remove_symbols = TRUE,
                             padding = FALSE,
                             multithread = FALSE,
                             output = "data.frame")


Session Info

R version 3.6.3 (2020-02-29)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 31 (Workstation Edition)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidystm_0.0.0.9000 spacyr_1.2.1       qgraph_1.6.5       igraph_1.2.5      
 [5] dplyr_0.8.5        stringr_1.4.0      lubridate_1.7.4    stm_1.3.5         
 [9] readtext_0.76      quanteda_2.0.1    

loaded via a namespace (and not attached):
 [1] httr_1.4.1          jsonlite_1.6.1      splines_3.6.3      
 [4] gtools_3.8.1        RcppParallel_5.0.1  Formula_1.2-3      
 [7] assertthat_0.2.1    BDgraph_2.62        stats4_3.6.3       
[10] latticeExtra_0.6-29 d3Network_0.5.2.1   pbivnorm_0.6.0     
[13] pillar_1.4.4        backports_1.1.7     lattice_0.20-38    
[16] reticulate_1.16     glue_1.4.1          digest_0.6.25      
[19] RColorBrewer_1.1-2  checkmate_2.0.0     colorspace_1.4-1   
[22] htmltools_0.4.0     Matrix_1.2-18       plyr_1.8.6         
[25] psych_1.9.12.31     pkgconfig_2.0.3     purrr_0.3.4        
[28] corpcor_1.6.9       scales_1.1.1        whisker_0.4        
[31] glasso_1.11         jpeg_0.1-8.1        fdrtool_1.2.15     
[34] huge_1.3.4.1        htmlTable_1.13.3    tibble_3.0.1       
[37] ggplot2_3.3.1       usethis_1.6.1       ellipsis_0.3.1     
[40] pbapply_1.4-2       nnet_7.3-12         mnormt_1.5-7       
[43] survival_3.1-8      magrittr_1.5        crayon_1.3.4       
[46] stopwords_2.0       fs_1.4.1            nlme_3.1-144       
[49] MASS_7.3-51.5       foreign_0.8-75      tools_3.6.3        
[52] data.table_1.12.8   lifecycle_0.2.0     munsell_0.5.0      
[55] cluster_2.1.0       compiler_3.6.3      rlang_0.4.6        
[58] grid_3.6.3          rstudioapi_0.11     rjson_0.2.20       
[61] htmlwidgets_1.5.1   lavaan_0.6-6        base64enc_0.1-3    
[64] gtable_0.3.0        abind_1.4-5         reshape2_1.4.4     
[67] R6_2.4.1            gridExtra_2.3       knitr_1.28         
[70] fastmatch_1.1-0     Hmisc_4.4-0         stringi_1.4.6      
[73] parallel_3.6.3      Rcpp_1.0.4.6        vctrs_0.3.1        
[76] rpart_4.1-15        acepack_1.4.1       png_0.1-7          
[79] tidyselect_1.0.0    xfun_0.14          

Mstes · Jun 11 '20 02:06

You are absolutely right. We have not yet fully integrated spacyr tokenization into the quanteda workflow. A workaround for the moment is to:

  1. convert the spacy_tokenize() output to a quanteda tokens object with as.tokens()
  2. assign the docvars with docvars()

Like this:

library(spacyr)
library(quanteda)
library(tidyverse) # for %>%

# 1. Tokenize with spaCy, then convert to a quanteda tokens object
toks <- data_corpus_dailnoconf1991 %>% 
  texts() %>%
  spacy_tokenize() %>%
  as.tokens()

# 2. Re-attach the docvars from the original corpus
docvars(toks) <- docvars(data_corpus_dailnoconf1991)
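If this pattern comes up often, the two steps could be wrapped in a small helper (a sketch only; spacy_tokens_with_docvars is a hypothetical name, not part of spacyr):

```r
# Hypothetical convenience wrapper around the two-step workaround above:
# tokenize a quanteda corpus with spaCy, then re-attach its docvars.
spacy_tokens_with_docvars <- function(corp, ...) {
  toks <- as.tokens(spacy_tokenize(texts(corp), ...))
  docvars(toks) <- docvars(corp)
  toks
}

# Usage (requires spacyr/quanteda and a spaCy installation):
# toks <- spacy_tokens_with_docvars(data_corpus_dailnoconf1991)
```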

This is a bit tedious, and we may change it in the future. What do you think @kbenoit?

amatsuo · Jun 11 '20 05:06