tidytext
Any chance we could get parallel processing for n-grams, for example?
Thanks for a great package, by the way.
I'm not sure that identifying n-grams is parallelizable in a straightforward way, since we need to slide along the text to find the overlapping tokens. You could do something like this using furrr, if you wanted to find n-grams for separate documents using parallel processing.
library(tidyverse)
library(tidytext)
library(furrr)
#> Loading required package: future
## nest by document
nested_austen <- janeaustenr::austen_books() %>%
  mutate(title = book) %>%
  nest(data = c(title, text))
nested_austen
#> # A tibble: 6 × 2
#> book data
#> <fct> <list>
#> 1 Sense & Sensibility <tibble [12,624 × 2]>
#> 2 Pride & Prejudice <tibble [13,030 × 2]>
#> 3 Mansfield Park <tibble [15,349 × 2]>
#> 4 Emma <tibble [16,235 × 2]>
#> 5 Northanger Abbey <tibble [7,856 × 2]>
#> 6 Persuasion <tibble [8,328 × 2]>
plan(multisession, workers = 2)
tokenized <-
  nested_austen %>%
  mutate(tokens = future_map(
    data,
    ~ unnest_tokens(., bigram, text, collapse = "title", token = "ngrams", n = 2)
  ))
tokenized %>%
  select(tokens) %>%
  unnest(tokens)
#> # A tibble: 725,049 × 2
#> title bigram
#> <fct> <chr>
#> 1 Sense & Sensibility sense and
#> 2 Sense & Sensibility and sensibility
#> 3 Sense & Sensibility sensibility by
#> 4 Sense & Sensibility by jane
#> 5 Sense & Sensibility jane austen
#> 6 Sense & Sensibility austen 1811
#> 7 Sense & Sensibility 1811 chapter
#> 8 Sense & Sensibility chapter 1
#> 9 Sense & Sensibility 1 the
#> 10 Sense & Sensibility the family
#> # … with 725,039 more rows
Created on 2021-11-29 by the reprex package (v2.0.1)
In the general case, there is a fair amount of complexity in specifying which chunks of text should go to which parallel workers.
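To illustrate one piece of that complexity: if you split a single long text into chunks for parallel workers, each chunk must overlap its neighbor by n - 1 tokens, or the n-grams that span a chunk boundary are silently lost. Here is a minimal base-R sketch of that idea; the `bigrams()` and `chunk_with_overlap()` helpers are hypothetical illustrations, not part of tidytext.

```r
library(parallel)

# Hypothetical helper: bigrams from a vector of words
bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Split `words` into chunks of `chunk_size` tokens, each extended by
# n - 1 extra tokens so n-grams crossing a chunk boundary are kept.
chunk_with_overlap <- function(words, chunk_size, n = 2) {
  starts <- seq(1, length(words), by = chunk_size)
  lapply(starts, function(s) {
    words[s:min(s + chunk_size + n - 2, length(words))]
  })
}

words <- c("it", "is", "a", "truth", "universally", "acknowledged")

# Sequential result, for comparison
seq_bigrams <- bigrams(words)

# Parallel result: each overlapping chunk is processed independently
# (mclapply forks workers; on Windows it falls back to sequential)
chunks <- chunk_with_overlap(words, chunk_size = 3, n = 2)
par_bigrams <- unlist(mclapply(chunks, bigrams, mc.cores = 2))

identical(seq_bigrams, par_bigrams)  # TRUE
```

Because the overlap is exactly n - 1 tokens, every boundary-spanning bigram is produced exactly once, so the parallel result matches the sequential one with no duplicates. Choosing those boundaries sensibly for arbitrary tokenizers and n is where the real-world complexity lives.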
Thanks, Julia. You're right; this may be more effort than it's worth. I'd appreciate others' thoughts before deciding whether or not to close the issue.
Let me know if you have further questions!
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.