
Filtering outliers from Corpus with strange behavior

Open ettoreaquino opened this issue 1 year ago • 4 comments

Description

While building a Corpus using the litstudy.build_corpus() method, I have found that min_docs and max_docs_ratio do not work as expected.

For example, when forcing outliers to be kept in the Corpus by setting min_docs=1 and max_docs_ratio=1, the outliers are still removed. The following example shows a situation in which no filter should be applied (apart from stemming and stopword removal):

Corpus = litstudy.build_corpus(docs=curtailment_docs,
                               remove_words=None,
                               min_word_length=None,
                               min_docs=1,
                               max_docs_ratio=1,
                               max_tokens=1000,
                               replace_words=None,
                               custom_bigrams=None,
                               ngram_threshold=None)

Expected behavior

After performing a "dumb filter" on my database, prior to building the Corpus:

curtailment_docs = docs.filter_docs(lambda d: d.abstract is not None)
curtailment_docs = curtailment_docs.filter_docs(lambda d: 'curtailment' in d.abstract.lower())

I was expecting to see 'curtailment' as a "forced outlier".

'curtailment' in [token[1] for token in list(Corpus.dictionary.items())]

But it gives me:

False

Observations

Please keep in mind that this is not very easy to test. You need a very specific word that is not a STOPWORD and is very frequent in a reasonable number of papers. In my case, I have been reviewing papers about "Curtailment in Power Systems", so I managed to get a list of about 1000 papers that contain the word curtailment in the abstract, and that is the curtailment_docs that I am working with.

ettoreaquino avatar May 11 '23 09:05 ettoreaquino

Thanks for using litstudy!

Interesting problem; I'm not sure what is causing it. I'll look into this. The lack of proper tests for build_corpus and Corpus does not help, unfortunately :-(. Now might be the time to invest in those.

Looking at the code, do you have any feeling for what the problem could be? The only thing that looks suspicious to me is the call to filter_extremes.

stijnh avatar May 11 '23 09:05 stijnh

Indeed. It seems that dic.filter_extremes(keep_n=max_tokens) provides functionality similar to preprocess_outliers(), so even if the preprocess_outliers() filter is behaving as expected (which I believe it is), once filter_extremes() is called it overrides the desired behavior.
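A minimal standard-library sketch (no gensim required) may make the overlap concrete. Gensim's documentation for Dictionary.filter_extremes lists defaults of no_below=5 and no_above=0.5, so calling dic.filter_extremes(keep_n=max_tokens) still drops tokens that occur in more than 50% of documents, which is exactly the case for 'curtailment' when every abstract was pre-filtered to contain it. The document frequencies below are made up for illustration:

```python
def filter_extremes_sketch(doc_freqs, num_docs, no_below=5, no_above=0.5, keep_n=1000):
    """Mimic of the documented filter_extremes behaviour on a
    {token: document_frequency} mapping."""
    kept = {t: df for t, df in doc_freqs.items()
            if df >= no_below and df / num_docs <= no_above}
    # keep only the keep_n most frequent of the surviving tokens
    return dict(sorted(kept.items(), key=lambda kv: -kv[1])[:keep_n])

# Hypothetical document frequencies for a 1000-abstract corpus where
# every abstract mentions 'curtailment':
doc_freqs = {"curtailment": 1000, "wind": 400, "solar": 350, "rare": 3}
kept = filter_extremes_sketch(doc_freqs, num_docs=1000)
print("curtailment" in kept)  # → False: dropped by no_above, despite a generous keep_n
```

So even with min_docs=1 and max_docs_ratio=1 handled correctly upstream, the defaults of the second filter would remove the token.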

I think it would be better to keep only filter_extremes() and incorporate the idea of min_docs, max_docs and max_tokens into that call. I've checked the documentation and it might work:

Documentation: gensim.corpora.Dictionary.filter_extremes
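A hedged sketch of that consolidation: forward litstudy's min_docs / max_docs_ratio / max_tokens straight to the corresponding filter_extremes arguments instead of filtering twice. The function name apply_token_filters and the duck-typed dic argument are illustrative, not the actual litstudy implementation:

```python
def apply_token_filters(dic, min_docs=1, max_docs_ratio=1.0, max_tokens=None):
    """Map litstudy-style parameters onto gensim's
    Dictionary.filter_extremes(no_below, no_above, keep_n)."""
    dic.filter_extremes(
        no_below=min_docs,          # absolute document count
        no_above=max_docs_ratio,    # fraction of documents
        keep_n=max_tokens,          # cap on vocabulary size (None = keep all)
    )
    return dic

# Tiny stand-in for gensim's Dictionary, just to show the resulting call:
class _RecordingDict:
    def filter_extremes(self, **kwargs):
        self.last_call = kwargs

d = apply_token_filters(_RecordingDict(), min_docs=1, max_docs_ratio=1.0, max_tokens=1000)
print(d.last_call)
```

With this mapping, passing min_docs=1 and max_docs_ratio=1 genuinely disables the document-frequency filters, since the defaults are no longer applied behind the caller's back.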

ettoreaquino avatar May 11 '23 10:05 ettoreaquino

@stijnh, can you assign this issue to me? I'll look into it and try to improve the tests for build_corpus.

ettoreaquino avatar May 11 '23 12:05 ettoreaquino

Thanks for looking into this. I was not aware that filter_extremes would also filter tokens based on the number of documents.

stijnh avatar May 11 '23 13:05 stijnh