TextAnalysis.jl

improper stemming of NGram documents

Open tanmaykm opened this issue 6 years ago • 4 comments

Stemming an NGramDocument stems only the last word of each n-gram. Notice below how "repository" is stemmed to "repositori" in "this repositori" but left intact in "repository of".

julia> td = NGramDocument("this repository of julia language", 3)
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("language"=>1,"repository"=>1,"this"=>1,"this repository of"=>1,"of julia language"=>1,"julia language"=>1,"of"=>1,"julia"=>1,"this repository"=>1,"repository of"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

julia> stem!(td); td
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("languag"=>1,"this"=>1,"this repository of"=>1,"of julia languag"=>1,"this repositori"=>1,"of"=>1,"julia"=>1,"repositori"=>1,"repository of"=>1,"of julia"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

By contrast, stemming a StringDocument stems each word:

julia> sd = StringDocument("this repository of julia language")
StringDocument{String}("this repository of julia language", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

julia> stem!(sd); sd
StringDocument{String}("this repositori of julia languag", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))

tanmaykm avatar May 03 '19 10:05 tanmaykm

Is work still needed on this issue? @aviks

sean-gauss avatar Jan 21 '20 20:01 sean-gauss

@aviks is this issue fixed, or is help still needed?

bnriiitb avatar Jul 24 '20 15:07 bnriiitb

I intended to finish this; however, I am a bit busy with my internship at the moment. If you can resolve this issue, feel free to proceed.

sean-gauss avatar Jul 24 '20 17:07 sean-gauss

@aviks Hi! I think I figured out what's going on here. It comes down to the stem function in line 38 of stemmer.jl below, which stems the n-gram (token), resulting in its stemmed version (new_token):

https://github.com/JuliaText/TextAnalysis.jl/blob/a38d8d70e9588c77b889c52b8f1f623920e34630/src/stemmer.jl#L36-L48

The problem is that token (the n-gram) is stored as a plain string. The name "token" is a bit of a misnomer: each n-gram is really a sequence of tokens that we want stemmed, so we either want to treat it as a StringDocument and stem each word in the string, or treat it as a TokenDocument and stem each token of the n-gram individually. Right now the n-gram is stemmed as a single String, so it is interpreted as one entity whose ending gets stemmed, rather than as a list of n entities to be stemmed individually.
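To make the difference concrete, here is a minimal, self-contained sketch that does not use TextAnalysis at all. The names `toy_stem`, `stem_each`, and `restem` are hypothetical, and `toy_stem` is only a stand-in with two made-up suffix rules, not the Snowball stemmer the package uses. It just illustrates what happens when a stemmer that only looks at the end of its input is handed a whole n-gram versus each word separately.

```julia
# Toy stand-in for a real stemmer (hypothetical rules, NOT Snowball's):
# it only ever rewrites the END of whatever string it is given.
function toy_stem(word::AbstractString)
    endswith(word, "ory") && return String(chop(word)) * "i"  # "repository" -> "repositori"
    endswith(word, "age") && return String(chop(word))        # "language"   -> "languag"
    return String(word)
end

# Stem every whitespace-separated word of an n-gram, then rejoin with spaces.
stem_each(ngram::AbstractString) = join((toy_stem(w) for w in split(ngram)), " ")

# Re-key a dictionary of n-gram counts with `f`, summing counts when two
# n-grams map to the same stemmed key.
function restem(ngrams::AbstractDict, f)
    out = Dict{String,Int}()
    for (ngram, count) in ngrams
        key = f(ngram)
        out[key] = get(out, key, 0) + count
    end
    return out
end

ngrams = Dict("this repository" => 1, "repository of" => 1, "repository" => 1)

whole = restem(ngrams, toy_stem)   # each n-gram treated as one long "word"
each  = restem(ngrams, stem_each)  # each word of each n-gram stemmed

println(sort(collect(keys(whole))))
println(sort(collect(keys(each))))
```

With the whole-string approach, "repository of" keeps its unstemmed first word because only the end of the string is examined, which is the same pattern as in the report above. The per-word version stems both occurrences. Building a fresh dictionary in `restem` is a choice made for this sketch; it avoids inserting and deleting keys in a dictionary while iterating over it.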

This might mean fundamentally altering the nature of NGramDocuments to be made up of either StringDocuments or vectors of strings like TokenDocuments are (the former probably being easier to actually implement, the latter perhaps being a little more meaningful?). I'd be glad to help implement a change in either direction!

(Or, if you want a lazy fix that doesn't think about anything else that's going on, you can just change

new_token = stem(stemmer, token)

to

new_token = stem_all(stemmer, token)

and be done with it, which is also an option...)

mostol avatar Feb 16 '22 21:02 mostol