Turning dependency parsing off doesn't offer much of a speed increase
Dependency parsing is normally quite an intensive task, and I'm surprised that this isn't reflected in spacyr's timings.
library(spacyr)
library(modeldata)
data("okc_text")
spacy_initialize()
bench::mark(
  iterations = 10, check = FALSE,
  spacy_parse(okc_text$essay0, dependency = TRUE),
  spacy_parse(okc_text$essay0, dependency = FALSE)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                                         min median `itr/sec`
#>   <bch:expr>                                       <bch> <bch:>     <dbl>
#> 1 spacy_parse(okc_text$essay0, dependency = TRUE)  13.8s  14.3s    0.0702
#> 2 spacy_parse(okc_text$essay0, dependency = FALSE) 13.3s  13.9s    0.0711
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
Created on 2020-05-19 by the reprex package (v0.3.0)
Looking at the code, it appears that setting dependency = FALSE doesn't stop the dependency parsing from occurring, but rather stops spacyr from extracting that information. Is that correct?
If the answer to the above question is yes, would it be possible to have a way to turn off dependency parsing from happening in the first place? :)
Thanks for pointing this out, it's time we revisited this.
@amatsuo It used to be true (in spaCy 1.0, I think) that we could not get entities or tags without also parsing dependencies, but that's no longer the case in >= v2.0. See https://spacy.io/usage/spacy-101#pipelines for instance, especially the part "Does the order of pipeline components matter?".
We don't do the extras for spacy_tokenize(), so let's work on turning off the heavy lifting for most spacy_parse() operations too.
@kbenoit is right about the pipeline sequence, but the workaround already exists in spacyr.
This is the spaCy pipeline sequence:
https://spacy.io/usage/processing-pipelines
As you can see, ner depends on parser, and without it we cannot conduct named entity recognition. By default, spacyr turns on all three pipeline components.
The pipeline sequence is determined when nlp = spacy.load() is called internally in Python, and even if we skip part of the pipeline later, the processing does not speed up (at least that was the case when we were working on this part of spacyr).
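The load-time vs. extract-time distinction can be sketched with a toy pipeline in pure Python (hypothetical component names; this is not spaCy's actual implementation): a component excluded when the pipeline is built never runs, whereas a component that runs but whose output is simply not extracted still costs the same.

```python
import time

def tagger(doc):
    # cheap component: attach a part-of-speech tag per token
    doc["tags"] = ["NN"] * len(doc["tokens"])
    return doc

def parser(doc):
    time.sleep(0.01)  # stand-in for the expensive dependency parse
    doc["deps"] = ["dep"] * len(doc["tokens"])
    return doc

def build_pipeline(disable=()):
    # components are fixed at build time, mirroring nlp = spacy.load()
    components = {"tagger": tagger, "parser": parser}
    return [fn for name, fn in components.items() if name not in disable]

def process(text, pipeline):
    doc = {"tokens": text.split()}
    for component in pipeline:
        doc = component(doc)
    return doc

full = process("a short sentence", build_pipeline())
light = process("a short sentence", build_pipeline(disable=("parser",)))
print("deps" in full, "deps" in light)  # parser output only in the full run
```

Dropping the parser's columns from the returned data frame (the dependency = FALSE route) corresponds to running `full` and discarding `doc["deps"]`; only the `disable=` route avoids the sleep.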
The workaround exists in spacy_initialize().
In spacyr, nlp = spacy.load() is called when spacy_initialize() is called. You can turn parser and ner off by re-initializing with entity = FALSE. See the comparison below: processing time drops (roughly from 3 seconds to 2 seconds) when the later elements of the pipeline are omitted.
library(spacyr)
suppressMessages(library(quanteda))
spacy_initialize(entity = T)
#> Found 'spacy_condaenv'. spacyr will use this environment
#> successfully initialized (spaCy Version: 2.2.3, language model: en_core_web_sm)
#> (python options: type = "condaenv", value = "spacy_condaenv")
txt <- texts(data_corpus_irishbudget2010)
microbenchmark::microbenchmark(
  dep_on = spacy_parse(txt, dependency = T),
  dep_ent_off = spacy_parse(txt, dependency = F, entity = F),
  times = 10
)
#> Unit: seconds
#>         expr      min       lq     mean   median       uq      max neval
#>       dep_on 3.222558 3.227064 3.298940 3.238554 3.251128 3.855071    10
#>  dep_ent_off 3.148875 3.155166 3.159326 3.161741 3.163900 3.165395    10
spacy_finalize()
spacy_initialize(entity = F)
#> Python space is already attached. If you want to switch to a different Python, please restart R.
#> successfully initialized (spaCy Version: 2.2.3, language model: en_core_web_sm)
#> (python options: type = "condaenv", value = "spacy_condaenv")
microbenchmark::microbenchmark(
  dep_ent_off = spacy_parse(txt, dependency = F, entity = F),
  times = 10
)
#> Unit: seconds
#>         expr     min       lq     mean   median       uq      max neval
#>  dep_ent_off 2.10815 2.111668 2.169676 2.113744 2.123295 2.662206    10
Created on 2020-05-20 by the reprex package (v0.3.0)
We should add these examples as a vignette on how to improve or maximise performance.