TextAnalysis.jl

Implementation of cosine similarity?

Open hhaensel opened this issue 4 years ago • 3 comments

I needed to calculate cosine similarity. My first attempt was a direct implementation of the formula from the Wikipedia article, but it was not as fast as I needed (approx. 60 s). I then found a way to improve the speed by three orders of magnitude by using a matrix formulation. If I did my maths correctly, the following function does the job:

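using LinearAlgebra  # provides diag

function cos_similarity(tfidf::AbstractMatrix)
    # rows are documents: cs[i, j] is the dot product of documents i and j
    cs = tfidf * tfidf'
    # the diagonal holds the squared norms, so d[i] is the norm of document i
    d = sqrt.(diag(cs))
    # prevent division by zero (only occurs for empty documents)
    d[findall(iszero, d)] .= 1
    # divide each dot product by the product of the two norms
    cs .= cs ./ (d * d')
end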
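
As a sanity check, the matrix form can be compared against the pairwise definition on a small example. This is only a minimal sketch: the names `cos_similarity_naive` and `cos_similarity_matrix` and the sample matrix are made up for illustration, and giving empty documents a similarity of 0 is an assumption chosen to match the division-by-zero guard above.

```julia
using LinearAlgebra

# Pairwise definition: cos(a, b) = dot(a, b) / (norm(a) * norm(b)).
# Hypothetical helper; rows of `tfidf` are documents.
function cos_similarity_naive(tfidf::AbstractMatrix)
    n = size(tfidf, 1)
    cs = zeros(Float64, n, n)
    for i in 1:n, j in 1:n
        a = view(tfidf, i, :)
        b = view(tfidf, j, :)
        denom = norm(a) * norm(b)
        # assumption: empty documents (zero norm) get similarity 0
        cs[i, j] = iszero(denom) ? 0.0 : dot(a, b) / denom
    end
    return cs
end

# Matrix form: one product for all dot products, then scale by the norms.
# Hypothetical non-mutating variant that always returns a dense Float64 matrix.
function cos_similarity_matrix(tfidf::AbstractMatrix)
    cs = Matrix{Float64}(tfidf * tfidf')
    d = sqrt.(diag(cs))
    d[iszero.(d)] .= 1   # guard against division by zero for empty documents
    return cs ./ (d * d')
end

# Made-up data: documents 1 and 3 are parallel, 2 is orthogonal to both, 4 is empty
tfidf = [1.0 0.0 2.0;
         0.0 3.0 0.0;
         2.0 0.0 4.0;
         0.0 0.0 0.0]

naive = cos_similarity_naive(tfidf)
fast = cos_similarity_matrix(tfidf)
println(isapprox(naive, fast))
```

The naive version computes two norms and a dot product for every pair of documents, while the matrix version does a single matrix product, which is the likely source of the speedup.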

If others would find this useful, I'd be happy to submit a PR.

hhaensel, Aug 14 '20 14:08

This will be very useful. Please submit a PR, preferably with some tests and docs.

aviks, Nov 02 '20 15:11

@aviks Where's the best place to put it: utils.jl, tf_idf.jl, or a new file similarity.jl?

hhaensel, Jan 06 '21 13:01

tf_idf.jl would be best, I think.

aviks, Jan 06 '21 14:01