Turning dependency parsing off doesn't offer much of a speed increase
Dependency parsing is normally quite an intensive task, and I'm surprised that this isn't reflected in spacyr's timings.
library(spacyr)
library(modeldata)
data("okc_text")
spacy_initialize()
bench::mark(
  iterations = 10, check = FALSE,
  spacy_parse(okc_text$essay0, dependency = TRUE),
  spacy_parse(okc_text$essay0, dependency = FALSE)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                         min median `itr/sec`
#>   <bch:expr>                                       <bch> <bch:>     <dbl>
#> 1 spacy_parse(okc_text$essay0, dependency = TRUE)  13.8s  14.3s    0.0702
#> 2 spacy_parse(okc_text$essay0, dependency = FALSE) 13.3s  13.9s    0.0711
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
Created on 2020-05-19 by the reprex package (v0.3.0)
Looking at the code, it appears that setting dependency = FALSE doesn't stop the dependency parsing from occurring, but rather stops spacyr from extracting that information. Is that correct?
If the answer to the above question is yes, would it be possible to have a way to turn off dependency parsing from happening in the first place? :)
Thanks for pointing this out, it's time we revisited this.
@amatsuo It used to be true (in spaCy 1.0, I think) that we could not get entities or tags without also parsing dependencies, but that's no longer the case in >= v2.0. See https://spacy.io/usage/spacy-101#pipelines for instance, especially the part "Does the order of pipeline components matter?".
We don't do the extras for spacy_tokenize(), so let's work on turning off the heavy lifting for most spacy_parse() operations too.
@kbenoit is right about the pipeline sequence, but the workaround already exists in spacyr.
This is the spaCy pipeline sequence:
https://spacy.io/usage/processing-pipelines
As you can see, ner depends on parser, and without it we cannot conduct named entity recognition. By default, spacyr turns on all three pipeline components.
The pipeline sequence is determined when nlp = spacy.load() is called internally in Python, and even if we skip part of the pipeline later, the processing does not speed up (at least that was the case when we were working on this part of spacyr).
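The load-time vs. extract-time distinction can be sketched with a toy pipeline in pure Python (hypothetical component names; this is not spaCy's actual implementation): a component excluded when the pipeline is built never runs, whereas a component that runs but whose output is simply not extracted still costs the same.

```python
import time

def tagger(doc):
    # cheap component: attach a part-of-speech tag per token
    doc["tags"] = ["NN"] * len(doc["tokens"])
    return doc

def parser(doc):
    time.sleep(0.01)  # stand-in for the expensive dependency parse
    doc["deps"] = ["dep"] * len(doc["tokens"])
    return doc

def build_pipeline(disable=()):
    # components are fixed at build time, mirroring nlp = spacy.load()
    components = {"tagger": tagger, "parser": parser}
    return [fn for name, fn in components.items() if name not in disable]

def process(text, pipeline):
    doc = {"tokens": text.split()}
    for component in pipeline:
        doc = component(doc)
    return doc

full = process("a short sentence", build_pipeline())
light = process("a short sentence", build_pipeline(disable=("parser",)))
print("deps" in full, "deps" in light)  # parser output only in the full run
```

Dropping the parser's columns from the returned data frame (the dependency = FALSE route) corresponds to running `full` and discarding `doc["deps"]`; only the `disable=` route avoids the sleep.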
The workaround exists in spacy_initialize().
In spacyr, nlp = spacy.load() is called when spacy_initialize() is called. You can turn parser and ner off by re-initializing with entity = FALSE. See the comparison below: processing time drops (roughly from 3 seconds to 2 seconds) when the later elements of the pipeline are omitted.
library(spacyr)
suppressMessages(library(quanteda))
spacy_initialize(entity = T)
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 2.2.3, language model: en_core_web_sm)
#> (python options: type = "condaenv", value = "spacy_condaenv")
txt <- texts(data_corpus_irishbudget2010)
microbenchmark::microbenchmark(
  dep_on = spacy_parse(txt, dependency = T),
  dep_ent_off = spacy_parse(txt, dependency = F, entity = F),
  times = 10
)
#> Unit: seconds
#>         expr      min       lq     mean   median       uq      max neval
#>       dep_on 3.222558 3.227064 3.298940 3.238554 3.251128 3.855071    10
#>  dep_ent_off 3.148875 3.155166 3.159326 3.161741 3.163900 3.165395    10
spacy_finalize()
spacy_initialize(entity = F)
#> Python space is already attached. If you want to switch to a different Python, please restart R.
#> successfully initialized (spaCy Version: 2.2.3, language model: en_core_web_sm)
#> (python options: type = "condaenv", value = "spacy_condaenv")
microbenchmark::microbenchmark(
  dep_ent_off = spacy_parse(txt, dependency = F, entity = F),
  times = 10
)
#> Unit: seconds
#>         expr     min       lq     mean   median       uq      max neval
#>  dep_ent_off 2.10815 2.111668 2.169676 2.113744 2.123295 2.662206    10
Created on 2020-05-20 by the reprex package (v0.3.0)
We should add these examples as a vignette on how to improve or maximise performance.