tidytext
Suggestion to add BM25 Score
I suggest adding a function that binds a BM25 score (which is based on a probabilistic term-weighting model). It is useful in some cases because it gives control over:
- Term frequency saturation
- Document/Field length normalization
It is commonly used as a ranking function by search engines.
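For reference, here is how the two controls show up in the score, written out to match the function further down. The symbols are my own notation and not from any package: n_{t,d} is the count of term t in document d, |d| is the total term count of d, avgdl is the mean document length, N is the number of documents, and df(t) is the number of documents containing t. Note that the idf here is the plain one used by bind_tf_idf, not the smoothed variant that many BM25 write-ups use.

```latex
\mathrm{tf}_{\mathrm{bm25}}(t,d) =
  \frac{(k+1)\, n_{t,d}}
       {n_{t,d} + k\left(1 - b + b\,\dfrac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{bm25}(t,d) = \mathrm{tf}_{\mathrm{bm25}}(t,d)\cdot\mathrm{idf}(t)
```

As n_{t,d} grows, the first factor approaches k + 1, which is the saturation; b = 0 switches length normalization off and b = 1 applies it fully.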
I implemented a function, bind_bm25, in the forked repo HERE:
# bind_bm25 is given bare names -------------------
bind_bm25 <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  bind_bm25_(tbl,
             col_name(substitute(term_col)),
             col_name(substitute(document_col)),
             col_name(substitute(n_col)),
             k = k,
             b = b)
}
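# bind_bm25_ is given strings -------------------------
bind_bm25_ <- function(tbl, term_col, document_col, n_col, k = 1.2, b = 1) {
  # coerce to character so the lookups below index by name, not by position
  terms <- as.character(tbl[[term_col]])
  documents <- as.character(tbl[[document_col]])
  n <- tbl[[n_col]]

  # document lengths (total term count per document) and their average
  doc_totals <- tapply(n, documents, sum)
  avg_dl <- mean(doc_totals)
  doc_len <- as.numeric(doc_totals[documents])

  # same idf as bind_tf_idf; assumes one row per term per document
  idf <- log(length(doc_totals) / table(terms))

  # k controls term frequency saturation, b controls length normalization
  tbl$tf_bm25 <- ((k + 1) * n) / (n + k * ((1 - b) + b * doc_len / avg_dl))
  tbl$idf <- as.numeric(idf[terms])
  tbl$bm25 <- tbl$tf_bm25 * tbl$idf

  tbl
}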
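To make the effect of k and b concrete, here is a minimal base R sketch. The helper bm25_tf is hypothetical: it only mirrors the tf_bm25 expression from bind_bm25_ so it can be tried without tidytext, and the input numbers are made up for illustration.

```r
# Hypothetical standalone helper mirroring the tf_bm25 expression in bind_bm25_().
# n: term count, doc_len: total terms in the document, avg_dl: mean document length.
bm25_tf <- function(n, doc_len, avg_dl, k = 1.2, b = 1) {
  ((k + 1) * n) / (n + k * ((1 - b) + b * doc_len / avg_dl))
}

# Saturation: for an average-length document the value keeps rising with n
# but stays below k + 1 (2.2 with the default k).
saturation <- bm25_tf(c(1, 10, 100, 1000), doc_len = 100, avg_dl = 100)

# Length normalization: same term count in a short and a long document.
no_norm   <- bm25_tf(5, doc_len = c(10, 1000), avg_dl = 100, b = 0)  # length ignored
full_norm <- bm25_tf(5, doc_len = c(10, 1000), avg_dl = 100, b = 1)  # long doc penalized

print(saturation)
print(no_norm)
print(full_norm)
```

With b = 0 the two documents get the same value, while with b = 1 the short document scores higher than the long one for the same count.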
This seems super useful! I might suggest adding substitution for lazy evaluation (so it matches the rest of the code) and experimenting around with S3 methods in case this falls over for data.tables, but I'm happy to do that work and fully integrate it if @juliasilge and/or @dgrtwo give a thumbs up to the general ticket scope?
Thanks. I just need to understand what issues could appear with data.table. I think it will work properly, like bind_tf_idf, or did you mean something else?
Oh, just the indices-based selection can sometimes get gnarly since it behaves somewhat differently. It'll probably be fine, but I'll check to make sure once David/Julia sign off (hinthint)
We are working on getting broken things fixed, cleaned up, etc for our 0.1.3 release, but let's come back and get this implemented for tidytext 0.1.4!
If that's the goal, I'll add it to the to-do! Anything I can do to help with the fixing, cleanup, etc?
Is there any update on this? It would be good to have a TF-IDF alternative.
No recent work on this, but if you are looking for an alternative to tf-idf that may fit your needs better, check out weighted log odds with the tidylo package.
That's very helpful, thank you.