
Phrase models vs n-grams in pre-processing

Open rtbs-dev opened this issue 7 years ago • 3 comments

Expected Behavior

It would be nice if some of the features of the gensim.models.Phrases() tool could be implemented in the doc.to_terms_list() method, or even elsewhere (especially in the textacy.preprocess.preprocess_text() method).

Current Behavior

Currently, we can use the ngrams=(1,2,3) kwarg in doc.to_terms_list() to include n-grams up to order 3 in the "bag of words" frequency representation later on. I don't currently see a way in Textacy to model phrases so that tokens which really belong together (like ice_cream) are combined to begin with.

Context

For example, here's a great instance of using a phrase model as opposed to just including all n-grams in a bag of words. The phrase model merges the n-grams into single tokens before any vectorization.
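The distinction can be sketched in plain Python (a hypothetical illustration, not textacy or gensim code): an n-gram bag keeps the sub-tokens alongside the n-grams, while a phrase model replaces the sub-tokens with a single merged token.

```python
def bag_with_ngrams(tokens, n=2):
    """Unigrams plus n-grams: the sub-tokens stay in the bag."""
    grams = list(tokens)
    for i in range(len(tokens) - n + 1):
        grams.append(" ".join(tokens[i:i + n]))
    return grams

def merge_phrases(tokens, phrases, sep="_"):
    """Phrase-model style: merged tokens replace their parts."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["i", "like", "ice", "cream"]
print(bag_with_ngrams(tokens))
# ['i', 'like', 'ice', 'cream', 'i like', 'like ice', 'ice cream']
print(merge_phrases(tokens, {("ice", "cream")}))
# ['i', 'like', 'ice_cream']  -- "ice" and "cream" are gone
```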

rtbs-dev avatar Mar 17 '17 17:03 rtbs-dev

Hi @tbsexton , I'm familiar with gensim's collocation model, and indeed it is nice! I've not planned to reproduce its functionality in textacy, but that's not to say I'm opposed to it. Would you like to give it a shot, and submit a PR? You'll want to run over a large corpus to generate the uni- and bi-gram counts; could be a good use case for textacy.corpora.WikiReader().

In the meantime, you might consider joining named entities into single tokens during text parsing, so that ["Barack", "Obama", ...] => ["Barack Obama", ...]. Check out https://spacy.io/docs/usage/customizing-pipeline and textacy.spacy_pipelines.merged_entities_pipeline. I realize that's not exactly what you want, but may move you in the right direction.
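A rough sketch of that merging step, using spaCy's retokenizer (a later API than the pipeline hooks in the linked docs) and a hard-coded entity span, since this example loads no NER model:

```python
import spacy

# Blank pipeline: tokenization only, no model download required.
nlp = spacy.blank("en")
doc = nlp("Barack Obama visited Paris")

# Suppose an NER step had flagged tokens 0-2 as one entity;
# merge that span into a single token.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print([t.text for t in doc])  # ['Barack Obama', 'visited', 'Paris']
```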

bdewilde avatar Mar 18 '17 16:03 bdewilde

@bdewilde, so I haven't forgotten about this, but it hasn't been high on my todo list while I can just use gensim as a preprocessor and output the needed .txt files before using Textacy. I do think this would be really useful as native Textacy functionality.

I'm not sure it's intended to be used as a pre-trained tool (though it certainly could be); rather, it's meant to operate on the token/bigram counts within the corpus of interest (see the paper here, page 6, eq. 6). This is especially true with domain-specific content like academic papers or technical reports. With that in mind, I'm wondering how exactly you envision this being implemented within the library. I think there are two main ways, either:

  1. As a preprocessing method, applied before spaCy has parsed the input. This would (maybe?) be preferable to me, since it modifies the text in place. As far as I can tell, that is the main point of this type of phrase modeling: it "removes" the sub-tokens from "existence" once they're combined a priori using some kind of token separator (e.g. _, as seen in this example, cells 22 - 36).

  2. Done at parse or vectorization time, as part of the Vectorizer estimator class (I love the improved sklearn compatibility, btw). This way we would already have the token/bigram counts, which would be unintuitive to calculate before we've run spaCy.
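For reference, the bigram scoring from the paper mentioned above (eq. 6) is cheap to compute once those counts exist; a minimal sketch, with a hypothetical helper name, not any textacy or gensim API:

```python
from collections import Counter

def phrase_scores(sentences, delta=1.0):
    """Score bigrams as in Mikolov et al. (2013), eq. 6:
    score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj)).
    delta discounts very rare bigrams; bigrams scoring above a chosen
    threshold would then be merged into single phrase tokens."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
        if n > delta  # drop bigrams seen too rarely to score
    }

sents = [["ice", "cream", "is", "good"],
         ["i", "like", "ice", "cream"],
         ["ice", "cream", "again"]]
print(phrase_scores(sents))  # only ('ice', 'cream') survives the discount
```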

Option (2) seems straightforward, perhaps via a kwarg like collocation=None by default, with an integer input giving the number of iterations used to combine tokens into bi-grams. This could be done either with a separator character _ or with a merge method to join a span. Either way, I'd like to have the functionality of removing the "sub-tokens" from the text itself, which isn't done when we do simple n-gram tf-idf, for example.

Thoughts?

rtbs-dev avatar Oct 27 '17 15:10 rtbs-dev

Option (2) looks good to me: essentially what gensim is doing, but using spaCy's analysis.

PS: I'm currently using a pipeline similar to the one in the post you linked, where I pass the data (all in memory) back and forth between spaCy (parsing, lemmatization, stopwords, POS, etc.) -> gensim (phrase model) -> textacy (most_discriminant).

adrianog avatar May 21 '18 19:05 adrianog