Improper stemming of NGramDocuments
Stemming an NGramDocument stems only the last word of each n-gram. Notice below how repository is stemmed to repositori in some n-grams ("repositori", "this repositori") but left intact in others ("this repository of", "repository of").
julia> td = NGramDocument("this repository of julia language", 3)
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("language"=>1,"repository"=>1,"this"=>1,"this repository of"=>1,"of julia language"=>1,"julia language"=>1,"of"=>1,"julia"=>1,"this repository"=>1,"repository of"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
julia> stem!(td); td
NGramDocument{AbstractString}(Dict{AbstractString,Int64}("languag"=>1,"this"=>1,"this repository of"=>1,"of julia languag"=>1,"this repositori"=>1,"of"=>1,"julia"=>1,"repositori"=>1,"repository of"=>1,"of julia"=>1…), 3, TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
In contrast, stemming a StringDocument stems each word:
julia> sd = StringDocument("this repository of julia language")
StringDocument{String}("this repository of julia language", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
julia> stem!(sd); sd
StringDocument{String}("this repositori of julia languag", TextAnalysis.DocumentMetadata(Languages.English(), "Untitled Document", "Unknown Author", "Unknown Time"))
Is work still needed on this issue? @aviks
@aviks is this issue fixed, or is help still needed?
I intended to finish this; however, at the moment I am a bit busy with my internship. If you can resolve this issue, feel free to proceed.
@aviks Hi! I think I figured out what's going on here. It comes down to the stem function in line 38 of stemmer.jl below, which stems the n-gram (token), resulting in its stemmed version (new_token):
https://github.com/JuliaText/TextAnalysis.jl/blob/a38d8d70e9588c77b889c52b8f1f623920e34630/src/stemmer.jl#L36-L48
The problem arises from the fact that token (the n-gram) is actually just stored as a string. The name "token" is maybe a bit of a misnomer: each n-gram is really a string of tokens that we want stemmed. So we either want to think about it as a StringDocument and stem each word in the string, or think about it as a TokenDocument and stem each token of the n-gram individually. Right now, the n-gram is stemmed as just a String, which means it is interpreted as one single entity that has its end stemmed, rather than as a list of n entities to be stemmed individually.
This might mean fundamentally altering the nature of NGramDocuments to be made up of either StringDocuments or vectors of strings like TokenDocuments are (the former probably being easier to actually implement, the latter perhaps being a little more meaningful?). I'd be glad to help implement a change in either direction!
(Or, if you want a lazy fix that doesn't think about anything else that's going on, you can just change
new_token = stem(stemmer, token)
to
new_token = stem_all(stemmer, token)
and be done with it, which is also an option...)
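To make the per-word idea concrete, here is a minimal, self-contained sketch that does not use TextAnalysis.jl at all. The names toy_stem, stem_ngram and stem_ngrams are hypothetical, and toy_stem is only a stand-in that handles the two endings from the example above (it is not the real stemmer). The sketch also assumes words within an n-gram are separated by whitespace, and it builds a new dictionary rather than modifying the one being iterated over:

```julia
# Hypothetical stand-in for a real stemmer: it only handles the two word
# endings that show up in this thread's example ("repository", "language").
function toy_stem(word::AbstractString)
    if endswith(word, "y")
        return string(chop(word), "i")   # "repository" becomes "repositori"
    elseif endswith(word, "e")
        return String(chop(word))        # "language" becomes "languag"
    end
    return String(word)
end

# Stem every whitespace-separated word of an n-gram, then rejoin with spaces.
stem_ngram(ngram::AbstractString) = join(map(toy_stem, split(ngram)), " ")

# Build a fresh dictionary of stemmed n-grams, merging the counts whenever
# two n-grams collapse to the same stemmed key.
function stem_ngrams(ngrams::Dict{String,Int})
    stemmed = Dict{String,Int}()
    for (ngram, count) in ngrams
        key = stem_ngram(ngram)
        stemmed[key] = get(stemmed, key, 0) + count
    end
    return stemmed
end

ngrams = Dict(
    "repository" => 1,
    "this repository" => 1,
    "repository of" => 1,
    "this repository of" => 1,
    "of julia language" => 1,
)
stemmed = stem_ngrams(ngrams)
```

With this approach every occurrence of repository is stemmed regardless of its position in the n-gram, which is the behaviour the StringDocument case already has.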