Speed improvement for bind_tf_idf
Hey guys, I noticed that bind_tf_idf() doesn't really use dplyr, which has better performance than base R for this kind of grouped work. I saw roughly a 30% speed improvement computing tf-idf for a corpus of 100,000 tweets using this code:
corpus %>%
  group_by(TextID, word) %>%
  count() %>%                          # term counts per document
  group_by(TextID) %>%
  mutate(tf = n / sum(n)) %>%          # term frequency within each document
  group_by(word) %>%
  mutate(Documents = n()) %>%          # number of documents containing each word
  ungroup() %>%
  mutate(idf = log(length(unique(TextID)) / Documents),  # inverse document frequency
         tf_idf = tf * idf)
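For context, corpus here is assumed to be a one-token-per-row table with a document id column TextID, e.g. the output of unnest_tokens(); the toy data below is just illustrative, not from the original post:

library(tibble)

# Illustrative (made-up) input shape: one row per token, with a
# document id column TextID, as produced by tidytext::unnest_tokens()
corpus <- tibble(
  TextID = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  word   = c("good", "morning", "world", "good", "night")
)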
There have been some really big improvements in vctrs and dplyr since this code was originally written, so it would be a great idea for us to update it. 👍
I started working on this today, but I noticed that using dplyr more directly is slower in the cases I have tested:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)
book_words <- austen_books() |>
  unnest_tokens(word, text) |>
  count(book, word, sort = TRUE)
bench::mark(
  current_tidytext = bind_tf_idf(book_words, word, book, n),
  use_dplyr = book_words |>
    mutate(tf = n / sum(n), .by = "book") |>
    mutate(doc_total = n(), .by = "word") |>
    mutate(idf = log(n_distinct(book) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 current_tidytext 28.6ms 29.1ms 33.8 9.13MB 7.25
#> 2 use_dplyr 46.3ms 46.7ms 21.4 6.24MB 25.7
Created on 2023-07-03 with reprex v2.0.2
Let me find a convenient dataset with a lot more short texts to compare.
Hmmm, it still looks faster to keep it as is, even with shorter, more numerous documents:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidytext)
word_counts <- modeldata::tate_text |>
  unnest_tokens(word, title) |>
  count(id, word, sort = TRUE)
bench::mark(
  current_tidytext = bind_tf_idf(word_counts, word, id, n),
  use_dplyr = word_counts |>
    mutate(tf = n / sum(n), .by = "id") |>
    mutate(doc_total = n(), .by = "word") |>
    mutate(idf = log(n_distinct(id) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 current_tidytext 19.8ms 20.3ms 49.2 4.27MB 6.71
#> 2 use_dplyr 22.6ms 22.7ms 44.0 2.83MB 249.
Created on 2023-07-03 with reprex v2.0.2
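As a quick sanity check (not in the original reprex), the two versions should agree on the resulting values, since neither reorders rows:

# Verify the dplyr version matches bind_tf_idf() on the same input;
# both only add columns, so row order is preserved
res_current <- bind_tf_idf(word_counts, word, id, n)
res_dplyr <- word_counts |>
  mutate(tf = n / sum(n), .by = "id") |>
  mutate(doc_total = n(), .by = "word") |>
  mutate(idf = log(n_distinct(id) / doc_total),
         tf_idf = tf * idf) |>
  select(-doc_total)
all.equal(res_current$tf_idf, res_dplyr$tf_idf)  # should be TRUE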
@sometimesabird can you show me an example where this would be faster?
Hi, I came across this issue a bit randomly but thought I'd give it a try. I used the text preparation steps described here to get a large enough word count. I added the package collapse to the benchmark. It's a super-fast, dependency-free package that is built for speed and works well with dplyr syntax, so I'm just putting it here in case you want to consider it:
suppressPackageStartupMessages({
  library(dplyr)
  library(collapse)
  library(sotu)
  library(readtext)
  library(tidytext)
})

file_paths <- sotu_dir()
sotu_texts <- readtext(file_paths)

sotu_whole <-
  sotu_meta %>%
  arrange(president) %>%     # sort metadata
  bind_cols(sotu_texts) %>%  # combine with texts
  as_tibble()

tidy_sotu <- sotu_whole %>%
  unnest_tokens(word, text) %>%
  fcount(doc_id, word, sort = TRUE, name = "n")
bench::mark(
  current_tidytext = bind_tf_idf(tidy_sotu, word, doc_id, n),
  use_collapse = tidy_sotu |>
    fgroup_by(doc_id) |>
    fmutate(tf = n / sum(n)) |>
    fungroup() |>
    fcount(word, name = "doc_total", add = TRUE) |>
    fmutate(idf = log(n_distinct(doc_id) / doc_total),
            tf_idf = tf * idf) |>
    fselect(-doc_total),
  use_dplyr = tidy_sotu |>
    mutate(tf = n / sum(n), .by = "doc_id") |>
    mutate(doc_total = n(), .by = "word") |>
    mutate(idf = log(n_distinct(doc_id) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 current_tidytext 415.5ms 421.6ms 2.37 73.5MB 2.37
#> 2 use_collapse 29.2ms 35.8ms 22.7 27.5MB 11.3
#> 3 use_dplyr 331.5ms 351.5ms 2.84 46.4MB 2.84
Thanks @etiennebacher! I should also try out using vctrs directly, for comparison.
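For reference, a rough sketch of what that could look like using vctrs building blocks directly (this is not tidytext's implementation, just one possible shape of it; the function name and column names are hypothetical, following the dplyr version above):

library(vctrs)

# Hypothetical vctrs-based version, assuming `tbl` has columns
# `document`, `term`, and `n` (one row per term per document)
tf_idf_vctrs <- function(tbl) {
  # Token totals per document -> term frequency
  doc_id <- vec_group_id(tbl$document)       # group id in first-appearance order
  doc_totals <- vapply(
    vec_group_loc(tbl$document)$loc,         # row locations per document
    function(loc) sum(tbl$n[loc]),
    numeric(1)
  )
  tbl$tf <- tbl$n / doc_totals[doc_id]

  # Documents containing each term -> inverse document frequency
  term_id <- vec_group_id(tbl$term)
  doc_freq <- vec_count(tbl$term, sort = "location")$count  # rows (= documents) per term
  n_docs <- vec_unique_count(tbl$document)
  tbl$idf <- log(n_docs / doc_freq)[term_id]

  tbl$tf_idf <- tbl$tf * tbl$idf
  tbl
}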