
Implement true, simple sum of matches

Open mayeulk opened this issue 10 months ago • 1 comments

This is different from the resolution of https://github.com/ropensci/textreuse/issues/7

https://github.com/ropensci/textreuse/commit/13f11caf3fb0995b77948421bea90a4a79d15af7 does not in fact implement a sum of matches, but a ratio of matches (if I read the documentation and the code correctly).

https://github.com/ropensci/textreuse/blob/895b5ff2990e4c73fbf16cc4a9829424cc2e436d/R/similarity.R#L128C1-L131C2

 ratio_of_matches.default <- function(a, b) {
   assert_that(all(class(a) == class(b)))
   sum(b %in% a) / length(b)
 }

I would need sum(b %in% a). I believe sum(b %in% a) == sum(a %in% b), at least when neither vector contains repeated values, so this is an undirected measure.
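A minimal sketch of what I have in mind, mirroring the existing default method. The name `sum_of_matches` is hypothetical (it is not part of textreuse), and base R's `stopifnot` stands in for `assert_that` so the snippet runs without extra packages. It also shows that the two directions agree only when neither vector has repeated values, whereas `length(intersect(a, b))` is symmetric regardless.

```r
# Hypothetical helper, not part of textreuse: count the elements of b
# that also occur in a (the numerator of ratio_of_matches).
sum_of_matches <- function(a, b) {
  stopifnot(identical(class(a), class(b)))
  sum(b %in% a)
}

# Made-up tokens for illustration; no repeated values.
a <- c("n1", "n2", "n3", "n4", "n5")
b <- c("n4", "n5", "n6")

sum_of_matches(a, b)  # 2
sum_of_matches(b, a)  # 2

# With repeated values the two directions can differ.
a_dup <- c("n1", "n1", "n2")
b_dup <- c("n1", "n3")

sum_of_matches(a_dup, b_dup)  # 1: only "n1" of b_dup occurs in a_dup
sum_of_matches(b_dup, a_dup)  # 2: both "n1" entries of a_dup occur in b_dup

# Counting distinct shared values is symmetric in every case.
length(intersect(a_dup, b_dup))  # 1
length(intersect(b_dup, a_dup))  # 1
```

If the token vectors can contain duplicates, the `intersect`-based count may be the safer choice for a truly non-directional measure.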

My use case as a teacher: among my students who submitted their work (assignments), I want to find those who borrowed part of their work from fellow students. Say students A and B have 2 paragraphs in common. Assignment A has 5 paragraphs; assignment B has 20 paragraphs. What matters to me is that, with 2 identical paragraphs (say, 20 n-grams), I'm sure they shared some text. sum(b %in% a) seems easier to interpret in this case, because it is an absolute value, and because it is symmetric (non-directional).
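To put numbers on this, here is a small illustration. The sizes are assumptions made up for the example (50 distinct n-grams in A, 200 in B, 20 shared), and plain integers stand in for hashed n-grams.

```r
# Assumed sizes, for illustration only; integers stand in for hashed n-grams.
a <- 1:50     # assignment A: 50 distinct n-grams
b <- 31:230   # assignment B: 200 distinct n-grams; 31..50 are shared with A

# The ratio depends on which document is used as the denominator.
sum(b %in% a) / length(b)  # 0.1: share of B's n-grams found in A
sum(a %in% b) / length(a)  # 0.4: share of A's n-grams found in B

# The plain count is the same in both directions.
sum(b %in% a)  # 20
sum(a %in% b)  # 20
```

The count of 20 shared n-grams reads the same whichever assignment is taken as the reference, while the ratio shifts from 0.1 to 0.4.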

mayeulk, Feb 21 '25 13:02