Filtering outliers from Corpus with strange behavior
Description
While building a `Corpus` using the `litstudy.build_corpus()` method, I found that `min_docs` and `max_docs_ratio` are not working as expected. For example, when forcing outliers to be kept in the `Corpus` by setting `min_docs=1` and `max_docs_ratio=1`, the outliers are still removed. The following example shows a situation in which no filter should be applied (except smart stemming and stop-word removal):
```python
Corpus = litstudy.build_corpus(docs=curtailment_docs,
                               remove_words=None,
                               min_word_length=None,
                               min_docs=1,
                               max_docs_ratio=1,
                               max_tokens=1000,
                               replace_words=None,
                               custom_bigrams=None,
                               ngram_threshold=None)
```
Expected behavior
After performing a "dumb filter" on my database, prior to building the Corpus:

```python
curtailment_docs = docs.filter_docs(lambda d: d.abstract is not None)
curtailment_docs = curtailment_docs.filter_docs(lambda d: 'curtailment' in d.abstract.lower())
```
I was expecting to see 'curtailment' as a "forced outlier":

```python
'curtailment' in [token[1] for token in list(Corpus.dictionary.items())]
```

But it gives me:

```
False
```
Observations
Please keep in mind that this is not very easy to test. You need a very specific word: one that is not a stop word and that appears frequently in a reasonable number of papers. In my case, I have been reviewing papers about "Curtailment in Power Systems", so I managed to collect a list of about 1000 papers that contain the word 'curtailment' in the abstract; that is the `curtailment_docs` I am working with.
Thanks for using litstudy!
Interesting problem; I'm not sure what is causing it. I'll look into this. The lack of proper tests for `build_corpus` and `Corpus` does not help, unfortunately :-(. Now might be the time to invest in those.
Looking at the code, do you have any feeling about what the problem could be? The only thing that looks suspicious to me is the call to `filter_extremes`.
Indeed. It seems that `dic.filter_extremes(keep_n=max_tokens)` provides functionality similar to `preprocess_outliers()`, so even if the `preprocess_outliers()` filter behaves as expected (which I believe it does), the subsequent call to `filter_extremes()` overrides the desired behavior.
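To illustrate the overlap, here is a standalone gensim sketch (toy data, not litstudy code; this is my assumption about how the failure manifests, since the thread does not pin it down): calling `filter_extremes` with only `keep_n` still applies gensim's default document-frequency thresholds, which can silently remove a ubiquitous token such as 'curtailment':

```python
from gensim.corpora import Dictionary

# Toy corpus of 10 documents: 'curtailment' appears in every one,
# like the reporter's pre-filtered database.
docs = ([["curtailment", "power"]] * 5 +
        [["curtailment", "grid"]] * 5)
dic = Dictionary(docs)

# Calling filter_extremes with only keep_n leaves gensim's defaults
# no_below=5 and no_above=0.5 in effect; a token present in more than
# half of the documents is dropped, regardless of any earlier, more
# permissive document-frequency filtering.
dic.filter_extremes(keep_n=1000)

print('curtailment' in dic.token2id)  # False: removed by the default no_above=0.5
print('power' in dic.token2id)        # True: df == 5 satisfies both defaults
```

Under these defaults, a word that appears in every abstract (exactly the "forced outlier" case above) cannot survive.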
I think it would be better to keep only `filter_extremes()` and incorporate the idea of using `min_docs`, `max_docs_ratio`, and `max_tokens` into this method. I've checked the documentation and it should work:
Documentation: `gensim.corpora.Dictionary.filter_extremes`
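A rough sketch of the mapping I have in mind (my reading of the gensim signature; not merged litstudy code):

```python
from gensim.corpora import Dictionary

# Values mirroring the build_corpus() call from the report above.
min_docs, max_docs_ratio, max_tokens = 1, 1.0, 1000

dic = Dictionary([["curtailment", "power"], ["curtailment", "grid"]])

# A single call could replace the separate preprocess_outliers() pass:
dic.filter_extremes(
    no_below=min_docs,        # keep tokens appearing in at least min_docs documents
    no_above=max_docs_ratio,  # ... and in at most this fraction of all documents
    keep_n=max_tokens,        # then keep only the max_tokens most frequent tokens
)
```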
@stijnh, can you assign this issue to me? I'll look into it and try to improve the tests for `build_corpus`.
Thanks for looking into this. I was not aware that `filter_extremes` would also filter tokens based on the number of documents.