
Speed improvement for bind_tf_idf

Open sometimesabird opened this issue 1 year ago • 5 comments

Hey guys, I noticed that bind_tf_idf() doesn't really use dplyr, which has better performance than base R. I got a 30% speed improvement when computing tf-idf for a corpus of 100,000 tweets with this code:

corpus %>%
  group_by(TextID, word) %>% 
  count() %>%                    # term counts per document
  group_by(TextID) %>% 
  mutate(tf = n / sum(n)) %>%    # term frequency within each document
  group_by(word) %>% 
  mutate(Documents = n()) %>%    # document frequency of each term
  ungroup() %>% 
  mutate(idf = log(length(unique(TextID)) / Documents),
         tf_idf = tf * idf)
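
To make this reproducible without the original tweets, here is a sketch on synthetic data (the corpus below is a made-up stand-in, so the exact speedup will vary):

library(dplyr)
library(tidytext)

set.seed(123)
# made-up stand-in corpus: 100,000 short "documents" of random words
corpus <- tibble(
  TextID = rep(seq_len(100000), each = 10),
  word   = sample(paste0("w", 1:5000), 1e6, replace = TRUE)
)

word_counts <- count(corpus, TextID, word)

bench::mark(
  current_tidytext = bind_tf_idf(word_counts, word, TextID, n),
  proposed_dplyr = word_counts %>%
    group_by(TextID) %>%
    mutate(tf = n / sum(n)) %>%
    group_by(word) %>%
    mutate(Documents = n()) %>%
    ungroup() %>%
    mutate(idf = log(length(unique(TextID)) / Documents),
           tf_idf = tf * idf),
  check = FALSE  # the two pipelines return different extra columns
)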

sometimesabird avatar Jun 26 '23 05:06 sometimesabird

There have been some really big improvements in vctrs and dplyr since this code was originally written, so it would be a great idea for us to update it. 👍

juliasilge avatar Jul 02 '23 21:07 juliasilge

I started working on this today, but I noticed that using dplyr more directly is slower in the cases I have tested:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(janeaustenr)
library(tidytext)

book_words <- austen_books() |> 
  unnest_tokens(word, text) |> 
  count(book, word, sort = TRUE)

bench::mark(
  current_tidytext = bind_tf_idf(book_words, word, book, n),
  use_dplyr = book_words |> 
    mutate(tf = n / sum(n), .by = "book") %>% 
    mutate(doc_total = n(), .by = "word") %>% 
    mutate(idf = log(n_distinct(book) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> # A tibble: 2 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current_tidytext   28.6ms   29.1ms      33.8    9.13MB     7.25
#> 2 use_dplyr          46.3ms   46.7ms      21.4    6.24MB    25.7

Created on 2023-07-03 with reprex v2.0.2

Let me find a convenient dataset with a lot more short texts to compare.

juliasilge avatar Jul 03 '23 23:07 juliasilge

Hmmm, it still looks faster to keep things as they are, even with shorter and more numerous documents:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidytext)

word_counts <- modeldata::tate_text |> 
  unnest_tokens(word, title) |> 
  count(id, word, sort = TRUE)

bench::mark(
  current_tidytext = bind_tf_idf(word_counts, word, id, n),
  use_dplyr = word_counts |> 
    mutate(tf = n / sum(n), .by = "id") %>% 
    mutate(doc_total = n(), .by = "word") %>% 
    mutate(idf = log(n_distinct(id) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> # A tibble: 2 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current_tidytext   19.8ms   20.3ms      49.2    4.27MB     6.71
#> 2 use_dplyr          22.6ms   22.7ms      44.0    2.83MB   249.

Created on 2023-07-03 with reprex v2.0.2

@sometimesabird can you show me an example where this would be faster?

juliasilge avatar Jul 03 '23 23:07 juliasilge

Hi, I came across this issue a bit randomly but thought I'd give it a try. I used the text preparation steps described here to get a large enough word count. I added the collapse package to the benchmark. It's a fast, dependency-free package designed for speed and to work well with dplyr syntax, so I'm just putting it here in case you want to consider it:

suppressPackageStartupMessages({
  library(dplyr)
  library(collapse)
  library(sotu)
  library(readtext)
  library(tidytext)
})

file_paths <- sotu_dir()
sotu_texts <- readtext(file_paths)

sotu_whole <- 
  sotu_meta %>%  
  arrange(president) %>% # sort metadata
  bind_cols(sotu_texts) %>% # combine with texts
  as_tibble()

tidy_sotu <- sotu_whole %>%
  unnest_tokens(word, text) |> 
  fcount(doc_id, word, sort = TRUE, name = "n")


bench::mark(
  current_tidytext = bind_tf_idf(tidy_sotu, word, doc_id, n),
  
  use_collapse = tidy_sotu |> 
    fgroup_by(doc_id) |> 
    fmutate(tf = n / sum(n)) %>% 
    fungroup() |> 
    fcount(word, name = "doc_total", add = TRUE) |> 
    fmutate(idf = log(n_distinct(doc_id) / doc_total),
            tf_idf = tf * idf) |>
    fselect(-doc_total),
  
  use_dplyr = tidy_sotu |> 
    mutate(tf = n / sum(n), .by = "doc_id") %>% 
    mutate(doc_total = n(), .by = "word") %>% 
    mutate(idf = log(n_distinct(doc_id) / doc_total),
           tf_idf = tf * idf) |>
    select(-doc_total)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 × 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 current_tidytext  415.5ms  421.6ms      2.37    73.5MB     2.37
#> 2 use_collapse       29.2ms   35.8ms     22.7     27.5MB    11.3 
#> 3 use_dplyr         331.5ms  351.5ms      2.84    46.4MB     2.84

etiennebacher avatar Sep 29 '23 15:09 etiennebacher

Thanks @etiennebacher! I should also try out using vctrs directly for comparison.
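
For reference, a rough sketch of what a vctrs-plus-base-R version could look like (this is only an illustration, not the actual bind_tf_idf() internals; tf_idf_vctrs() and its string column arguments are made-up names):

library(vctrs)

# assumes one row per (document, term) pair, as produced by count()
tf_idf_vctrs <- function(tbl, term, document, n) {
  doc_ids  <- vec_group_id(tbl[[document]])   # dense integer id per document
  term_ids <- vec_group_id(tbl[[term]])       # dense integer id per term
  counts   <- tbl[[n]]

  doc_totals <- rowsum(counts, doc_ids)[, 1]  # total term count per document
  doc_freq   <- tabulate(term_ids)            # number of documents containing each term
  n_docs     <- max(doc_ids)

  tbl$tf     <- counts / doc_totals[doc_ids]
  tbl$idf    <- log(n_docs / doc_freq[term_ids])
  tbl$tf_idf <- tbl$tf * tbl$idf
  tbl
}

# e.g. tf_idf_vctrs(tidy_sotu, "word", "doc_id", "n")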

juliasilge avatar Sep 29 '23 19:09 juliasilge