Phrase models vs n-grams in pre-processing
Expected Behavior
It would be nice if some of the features of the `gensim.models.Phrases()` tool could be implemented in the `doc.to_terms_list()` method, or even elsewhere (especially in the `textacy.preprocess.preprocess_text()` method).
Current Behavior
Currently, we can use the `ngrams=(1, 2, 3)` kwarg in `doc.to_terms_list()` to get up to 3rd-order n-grams included in the "bag of words" frequency representation later on. I don't currently see a way in textacy of modeling phrases to combine tokens that should really be together to begin with (like `ice_cream`).
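For reference, the current approach looks something like this (a minimal sketch against textacy's API; the exact kwargs may differ between versions):

```python
import textacy

# Parse a toy text and extract a flat terms list mixing 1-, 2-, and 3-grams.
doc = textacy.Doc("I love ice cream. Ice cream is the best dessert.", lang="en")
terms = list(doc.to_terms_list(ngrams=(1, 2, 3), named_entities=False, as_strings=True))
print(terms)
# "ice" and "cream" show up both on their own and inside the "ice cream"
# bigram; nothing collapses them into a single ice_cream token.
```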
Context
For example, here's a great instance of using a phrase model as opposed to just including all n-grams in a bag of words: the phrase model merges a collocation's tokens into a single token before any vectorization.
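For concreteness, the gensim usage looks roughly like this (toy corpus; in practice you'd train on the full token stream):

```python
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Tokenized sentences; in practice this is the whole corpus.
sentences = [
    ["i", "love", "ice", "cream"],
    ["ice", "cream", "is", "the", "best"],
    ["she", "bought", "ice", "cream", "yesterday"],
]

# Learn bigram collocations from corpus counts, then freeze the model
# into a lightweight Phraser for fast transformation.
bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))

print(bigram[["i", "love", "ice", "cream"]])
# e.g. ['i', 'love', 'ice_cream'] (whether a pair merges depends on corpus counts)
```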
Hi @tbsexton, I'm familiar with `gensim`'s collocation model, and indeed it is nice! I haven't planned to reproduce its functionality in `textacy`, but that's not to say I'm opposed to it. Would you like to give it a shot and submit a PR? You'll want to run over a large corpus to generate the uni- and bi-gram counts; this could be a good use case for `textacy.corpora.WikiReader()`.
In the meantime, you might consider joining named entities into single tokens during text parsing, so that `["Barack", "Obama", ...]` => `["Barack Obama", ...]`. Check out https://spacy.io/docs/usage/customizing-pipeline and `textacy.spacy_pipelines.merged_entities_pipeline`. I realize that's not exactly what you want, but it may move you in the right direction.
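As a rough sketch of the same idea, entity spans can be merged into single tokens with spaCy's retokenizer (the API may differ from the spaCy version in the docs linked above; the pipeline component wraps the same operation):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama met Angela Merkel in New York City.")

# Merge each named-entity span into a single token, in place.
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(ent)

print([tok.text for tok in doc])
# ['Barack Obama', 'met', 'Angela Merkel', 'in', 'New York City', '.']
```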
@bdewilde, I haven't forgotten about this, but it hasn't been high on my to-do list since I can just use gensim as a preprocessor and output the needed .txt files before using textacy. I do still think this would be really useful as native textacy functionality.
I'm not sure it's intended to be used as a pre-trained tool (though it certainly could be); rather, it operates on the token/bigram counts within the corpus of interest (see the paper here, page 6, eq. 6). This is especially true for domain-specific content like academic papers or technical reports. With that in mind, I'm wondering how exactly you envision this being implemented within the library. I think there are two main ways (a rough sketch of the scoring itself follows this list), either:
- As a preprocessing method, done before spaCy has parsed the input. This would (maybe?) be preferable to me, since it modifies the text in place. As far as I can tell, this is the main point of this type of phrase modeling, which "removes" the sub-tokens from "existence" once they're combined a priori using some kind of token separator (e.g. `_`, as seen in this example, cells 22-36).
- Done at parse or vectorization time, as part of the `Vectorizer` estimator class (I love the improved sklearn compatibility, btw). This means we would already have the token/bigram counts, which would be unintuitive to calculate before we've loaded spaCy.
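To make the scoring concrete, here's a minimal sketch of eq. 6 from the paper, computed purely from within-corpus counts (gensim additionally scales by vocabulary size and supports repeated passes for longer phrases):

```python
from collections import Counter

def score_bigrams(tokens, delta=5):
    """Eq. 6: score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj)).
    delta discounts counts so rare pairs can't form phrases out of noise."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {pair: (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
            for pair, n in bigrams.items()}

def merge_phrases(tokens, scores, threshold=1e-4, sep="_"):
    """Greedily join adjacent pairs whose score clears the threshold."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and scores.get((tokens[i], tokens[i + 1]), 0) > threshold:
            out.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```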
Option (2) seems straightforward, perhaps with a kwarg like `collocation=None` as the default, where an integer input sets the number of merge passes (each pass combines adjacent token pairs into bi-grams, so two passes can yield up to 4-gram phrases). This could be done either with a separator character `_` or with a `merge` method to join a `span`; a hypothetical sketch follows. Either way, I'd like the functionality of removing the "sub-tokens" from the text itself, which isn't done when we do simple n-gram tf-idf, for example.
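Purely as an illustration of what I mean for option (2): the `collocation` kwarg is the proposal and doesn't exist, and I'm glossing over `Vectorizer`'s real signature:

```python
from textacy.vsm import Vectorizer

tokenized_docs = [["i", "love", "ice", "cream"], ["ice", "cream", "rules"]]

# Hypothetical kwarg: collocation=2 would run two merge passes over the
# corpus counts (allowing up to 4-gram phrases) and drop the merged
# sub-tokens from the stream before any doc-term counting happens.
vectorizer = Vectorizer(collocation=2)
doc_term_matrix = vectorizer.fit_transform(tokenized_docs)
```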
Thoughts?
Option (2) looks good to me: whatever gensim is doing, but using spaCy's analytics.
PS: I'm currently using a pipeline similar to the post you provided here, where I pass the data (all in memory) back and forth between spaCy (parsing, lemmatization, stopwords, POS, etc.) -> gensim (phrase model) -> textacy (`most_discriminant`).
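In case it's useful to anyone else, a bare-bones version of that hand-off looks something like this (toy texts; the spaCy model name is just an example):

```python
import spacy
from gensim.models import Phrases
from gensim.models.phrases import Phraser

nlp = spacy.load("en_core_web_sm")

def spacy_tokens(text):
    # spaCy step: parse, lemmatize, drop stopwords/punctuation/whitespace.
    return [tok.lemma_.lower() for tok in nlp(text)
            if not (tok.is_stop or tok.is_punct or tok.is_space)]

texts = ["I love ice cream.", "Ice cream is the best dessert."]
token_lists = [spacy_tokens(t) for t in texts]

# gensim step: train and apply the phrase model.
phraser = Phraser(Phrases(token_lists, min_count=1, threshold=1))
phrased = [phraser[toks] for toks in token_lists]

# textacy step: hand the phrase-merged token lists back for
# vectorization / most-discriminating-terms analysis.
```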