TextAnalysis.jl
Implementation of cosine similarity?
I needed to compute cosine similarity. My first attempt was a direct implementation of the formula from the Wikipedia article, but it was not as fast as I wanted (approx. 60 s). I eventually improved the speed by three orders of magnitude by using a matrix formulation. If I did my maths correctly, the following function does the job:
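using LinearAlgebra  # provides diag

function cos_similarity(tfidf::AbstractMatrix)
    # assuming rows are documents: cs[i, j] is the dot product of documents i and j
    cs = tfidf * tfidf'
    # the diagonal holds squared norms, so d[i] is the norm of document i
    d = sqrt.(diag(cs))
    # prevent division by zero (only occurs for empty documents)
    d[findall(iszero, d)] .= 1
    # normalize in place; the broadcast assignment returns cs
    cs .= cs ./ (d * d')
end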
If people find it useful, I'd be happy to submit a PR.
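As a rough sanity check of the maths, here is a minimal sketch that compares a pairwise loop with the matrix formulation. The names `cos_similarity_naive` and `cos_similarity_matrix` are hypothetical, the loop is only a guess at what a "direct" implementation might look like (not the code from my first attempt), and rows of `X` are assumed to be documents. The idea is that entry (i, j) of `X * X'` is the dot product of rows i and j, and its diagonal holds the squared row norms.

```julia
using LinearAlgebra

# Hypothetical naive version: one dot product and two norms per document pair.
# Rows of `X` are assumed to be documents, columns terms.
function cos_similarity_naive(X::AbstractMatrix)
    n = size(X, 1)
    cs = zeros(n, n)
    for i in 1:n, j in 1:n
        denom = norm(X[i, :]) * norm(X[j, :])
        # Treat similarity involving an empty (all-zero) document as 0.
        cs[i, j] = iszero(denom) ? 0.0 : dot(X[i, :], X[j, :]) / denom
    end
    return cs
end

# Matrix version, following the function in the post (non-mutating variant).
function cos_similarity_matrix(X::AbstractMatrix)
    cs = Matrix{Float64}(X * X')   # cs[i, j] = dot(row i, row j)
    d = sqrt.(diag(cs))            # d[i] = norm of row i
    d[iszero.(d)] .= 1             # avoid 0/0 for empty documents
    return cs ./ (d * d')          # divide each entry by norm(i) * norm(j)
end

X = [1.0 0.0 2.0;
     0.0 3.0 1.0;
     0.0 0.0 0.0]   # third row: an empty document

# Both versions should agree up to floating-point error.
@assert cos_similarity_naive(X) ≈ cos_similarity_matrix(X)
```

The matrix version replaces the double loop with a single matrix product, which is most likely where the speed-up comes from; the actual factor will depend on the data and on whether the matrix is sparse.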
Will be very useful, please submit a PR, preferably with some tests and docs.
@aviks Where's the best location to put it: utils.jl, tf_idf.jl, or a new file similarity.jl?
tf_idf.jl would be best, I think.