tidytext

Suggestion to add BM25 Score

Open OmaymaS opened this issue 7 years ago • 8 comments

I suggest adding a function to bind BM25 scores (BM25 is based on a probabilistic term-weighting model). It is useful in some cases because it gives control over:

  • Term frequency saturation
  • Document/Field length normalization

It is commonly used as a ranking function by search engines.
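To illustrate the two controls listed above, here is a minimal base R sketch of the BM25 term-frequency component. The helper name `bm25_tf` and its arguments are hypothetical and used only for illustration; the defaults mirror the `k = 1.2, b = 1` used in the proposed function.

```r
# Hypothetical helper: BM25 term-frequency component for a raw count `n`
# in a document of length `dl`, given the average document length `avg_dl`.
bm25_tf <- function(n, dl, avg_dl, k = 1.2, b = 1) {
  ((k + 1) * n) / (n + k * ((1 - b) + b * dl / avg_dl))
}

# Saturation: for an average-length document the value rises toward k + 1
# as the count grows, instead of growing linearly with the count.
saturation <- bm25_tf(n = c(1, 2, 5, 20, 100), dl = 100, avg_dl = 100)
print(round(saturation, 3))

# Length normalization: same count, longer document, lower value.
short_doc <- bm25_tf(n = 3, dl = 50, avg_dl = 100)
long_doc <- bm25_tf(n = 3, dl = 200, avg_dl = 100)
print(c(short = short_doc, long = long_doc))

# With b = 0 the document length no longer matters.
no_norm_short <- bm25_tf(n = 3, dl = 50, avg_dl = 100, b = 0)
no_norm_long <- bm25_tf(n = 3, dl = 200, avg_dl = 100, b = 0)
```

Here `k` sets how quickly repeated occurrences stop adding weight, and `b` sets how strongly long documents are penalized.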

I implemented a function bind_bm25 in the forked repo HERE

# bind_bm25 is given bare names -------------------

bind_bm25 <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  bind_bm25_(tbl,
             col_name(substitute(term_col)),
             col_name(substitute(document_col)),
             col_name(substitute(n_col)),
             k = k,
             b = b)
}

# bind_bm25_ is given strings -------------------------

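bind_bm25_ <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  # as.character() so the tapply()/table() results are indexed by name, not position
  terms <- as.character(tbl[[term_col]])
  documents <- as.character(tbl[[document_col]])
  n <- tbl[[n_col]]

  # document lengths (total term count per document) and their average
  doc_totals <- tapply(n, documents, sum)
  avg_dl <- mean(doc_totals)

  # inverse document frequency; assumes one row per (document, term) pair
  idf <- log(length(doc_totals) / table(terms))

  # length normalization: b = 0 disables it, b = 1 applies it fully
  len_norm <- (1 - b) + b * (as.numeric(doc_totals[documents]) / avg_dl)

  # saturating term frequency: approaches k + 1 as n grows
  tbl$tf_bm25 <- ((k + 1) * n) / (n + k * len_norm)
  tbl$idf <- as.numeric(idf[terms])
  tbl$bm25 <- tbl$tf_bm25 * tbl$idf

  tbl
}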
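
A minimal usage sketch on toy data, assuming one row per (document, term) pair. The function is restated compactly so the snippet runs on its own; the `as.character()` calls and the toy `counts` table are assumptions added for illustration.

```r
# Compact restatement of the string-based function so this snippet is
# self-contained (as.character() is an added safeguard for name-based indexing).
bind_bm25_ <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  terms <- as.character(tbl[[term_col]])
  documents <- as.character(tbl[[document_col]])
  n <- tbl[[n_col]]
  doc_totals <- tapply(n, documents, sum)
  idf <- log(length(doc_totals) / table(terms))
  len_norm <- (1 - b) + b * as.numeric(doc_totals[documents]) / mean(doc_totals)
  tbl$tf_bm25 <- ((k + 1) * n) / (n + k * len_norm)
  tbl$idf <- as.numeric(idf[terms])
  tbl$bm25 <- tbl$tf_bm25 * tbl$idf
  tbl
}

# Toy term counts: "cat" occurs in both documents, "dog" and "fish" in one each.
counts <- data.frame(
  document = c("a", "a", "b", "b"),
  term = c("cat", "dog", "cat", "fish"),
  n = c(3, 1, 1, 4),
  stringsAsFactors = FALSE
)

scored <- bind_bm25_(counts, "term", "document", "n")
print(scored)
```

With this IDF, log(N / df), a term that appears in every document ("cat" here) gets an IDF of 0 and therefore a BM25 score of 0, the same behavior as bind_tf_idf.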

OmaymaS, Apr 23 '17 14:04

This seems super useful! I might suggest adding substitution for lazy evaluation (so it matches the rest of the code) and experimenting around with S3 methods in case this falls over for data.tables, but I'm happy to do that work and fully integrate it if @juliasilge and/or @dgrtwo give a thumbs up to the general ticket scope?

Ironholds, May 01 '17 08:05

Thanks! I just want to understand what issues could appear with data.table. I think it will work properly, like bind_tf_idf, or did you mean something else?

OmaymaS, May 01 '17 11:05

Oh, just the indices-based selection can sometimes get gnarly since it behaves somewhat differently. It'll probably be fine, but I'll check to make sure once David/Julia sign off (hinthint)

Ironholds, May 02 '17 01:05

We are working on getting broken things fixed, cleaned up, etc for our 0.1.3 release, but let's come back and get this implemented for tidytext 0.1.4!

juliasilge, May 02 '17 21:05

If that's the goal, I'll add it to the to-do! Anything I can do to help with the fixing, cleanup, etc?

Ironholds, May 02 '17 21:05

Is there any update on this? It would be good to have a TF-IDF alternative.

jl5000, Jun 24 '21 10:06

No recent work on this, but if you are looking for an alternative to tf-idf that may fit your needs better, check out weighted log odds with the tidylo package.
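
For anyone trying that route, a minimal sketch of calling tidylo's `bind_log_odds()` on the same kind of term-count table might look like this. The toy `counts` data is an assumption for illustration, and the call is guarded so the snippet does nothing if tidylo is not installed.

```r
# Toy counts: one row per (document, term) pair.
counts <- data.frame(
  document = c("a", "a", "b", "b"),
  term = c("cat", "dog", "cat", "fish"),
  n = c(3, 1, 1, 4),
  stringsAsFactors = FALSE
)

# Guarded so the snippet is a no-op when tidylo is not available.
if (requireNamespace("tidylo", quietly = TRUE)) {
  # `set` is the grouping to compare across, `feature` is the term column.
  scored <- tidylo::bind_log_odds(counts, set = document, feature = term, n = n)
  print(scored)
}
```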

juliasilge, Jun 24 '21 14:06

That's very helpful, thank you.

jl5000, Jun 24 '21 14:06