Model.Phrases - Specify what is considered a MWE component/word
Problem description
When using the Phrases model, words and punctuation are treated alike. While the corpus can be cleaned beforehand, that destroys the corpus structure that is useful for some tasks. Just as it is possible to specify a list of connector words (ENGLISH_CONNECTOR_WORDS), it would be nice to be able to exclude other tokens from being part of a MWE.
Possible solutions
If you feel this is something valuable to gensim, I am happy to provide a PR; I just need to know which solution you prefer:
- Allow the user to specify, a priori, the complete vocabulary. I really do not like this idea, but it is a possibility.
- Allow the user to specify a function that, given a token, returns a bool indicating whether that token can be part of a MWE (see the sketch after this list).
- Add extra parameters to the scoring functions, so that they can score 0 if any of the words should not be taken into account (while it works, I do not like it either).
- Add a regexp that decides whether a token is a word or not. I would use something like: if the token matches `[!?.:;,#|0-9/\\\]\[{}()]`, it would be discarded... or any other option you think best.
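To make the second option concrete, here is a minimal sketch of the kind of predicate a user could supply; the function name `can_be_mwe_part` and the exact character class are illustrative, not an existing gensim API:

```python
import re

# Hypothetical predicate for the "filter function" option: True if the token
# may take part in a MWE, False for punctuation/number-like tokens.
NON_WORD_PATTERN = re.compile(r"[!?.:;,#|0-9/\\\]\[{}()]")

def can_be_mwe_part(token):
    return not NON_WORD_PATTERN.search(token)
```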
Once we agree on a roadmap, let me know, and I will prepare a PR that we can then polish.
Cheers
Phrases takes a sequence of lists-of-tokens.
It's completely up to the user what's in those lists-of-tokens, and most projects will do some project-specific preprocessing to ensure the units (possibly including punctuation) most useful to their purposes are retained.
If there are extra filters desired, my sense is that it's better to apply them outside of Phrases, in a generic manner that allows the same filters to be reused elsewhere if desired. As far as I can tell, all the proposed functionality can be done in a few lines as a wrapper around any raw corpus. For the Phrases class, which only needs one pass over the corpus, this can just be a generator. (For other models that need multiple iterations, like Word2Vec, the wrapper would have to be a little more complicated; a sketch of such a wrapper follows the examples below.)
For example:
- if `corpus` is the original corpus, and `restricted_vocab` the subset of acceptable tokens:

```python
filtered_corpus = ([token for token in item if token in restricted_vocab] for item in corpus)
```

- if `filter_func` is the desired test:

```python
filtered_corpus = ([token for token in item if filter_func(token)] for item in corpus)
```

- to apply the proposed regular expression for rejection:

```python
import re

pattern = re.compile(r"[!?.:;,#|0-9/\\\]\[{}()]")
filtered_corpus = ([token for token in item if not pattern.match(token)] for item in corpus)
```
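For the multi-pass case mentioned above (e.g. Word2Vec), a minimal sketch of a restartable wrapper; the class name `FilteredCorpus` is illustrative, and `corpus` / `filter_func` are assumed to be user-supplied:

```python
class FilteredCorpus:
    """Restartable filtered view over a corpus of token lists (sketch)."""

    def __init__(self, corpus, filter_func):
        self.corpus = corpus          # must itself be restartable (a list, or a re-openable stream)
        self.filter_func = filter_func

    def __iter__(self):
        for item in self.corpus:
            yield [token for token in item if self.filter_func(token)]
```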
Unless there's some performance/discoverability/comprehension benefit to rolling this into Phrases, doing it outside seems the cleaner & better approach.
If considering this, though, I would caution that:
- a per-word test might add a lot of overhead to the full scan used by `Phrases`, slowing a run noticeably for little change in output (especially if only a small percentage of all tokens are being skipped)
- it's not clear all non-words should be excluded from the sort of statistical combining happening here, which, even when it provides a benefit to downstream IR/classification/etc. tasks, tends to create lots of phrases that 'look wrong' to human sensibilities. If a number-token, or punctuation-token, appears so often in certain token-pairs that it would pass the statistical test, maybe it is logically best considered part of its neighbor for all downstream analysis steps (even if non-aesthetic).
- applying a filter to just the `Phrases` analysis could mean pairing-stats that don't match the 'true' pairings (as could happen when the `Phrases` model is later applied to a full unfiltered corpus). I wouldn't expect this to be a very noticeable effect, except in extreme situations... but I similarly expect such extra filtering to only be needed in weird situations.
Dear @gojomo, thank you for taking the time to answer me.
I may be wrong about how Phrases works. Suppose I have the original sentences:
He was present at the European Commission . There was a lot of people .
If we remove punctuation, Phrases will get the sequence of tokens:
He was present at the European Commission There was a lot of people
In this way, Phrases will treat European Commission the same way it will treat Commission There. Of course I expect that, probabilistically speaking, the first would occur a lot more often. But suppose it doesn't. The model will suggest Commission There as a multiword, and probably not suggest European Commission as it should.
With the original punctuation, it might happen that the suggestion is Commission . and not European Commission (ok, my example is not the best... and that will probably not happen for such a clear MWE).
On the other hand, if I had a way to tell Phrases that, whenever it asks for the PMI of (Commission, .) or (., There), the result should be 0, then at the end they would not be considered a MWE.
If you are worried about performance, passing the two words in the call to the scorer function will keep the same performance for the current behavior; performance will only degrade if the user overrides the scorer (see the sketch below).
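For illustration only, a sketch of what such a token-aware scorer might look like; gensim's current scoring functions receive only counts, so the extra `worda`/`wordb` parameters are an assumed API change, not something Phrases supports today (the fallback assumes `original_scorer` keeps its documented signature):

```python
from gensim.models.phrases import original_scorer

def punctuation_aware_scorer(worda_count, wordb_count, bigram_count,
                             len_vocab, min_count, corpus_word_count,
                             worda, wordb):
    # Hypothetical extension: veto any pair that contains a non-word token.
    if not worda.isalpha() or not wordb.isalpha():
        return float('-inf')  # well below any threshold, so never promoted
    # Otherwise fall back to the standard scoring.
    return original_scorer(worda_count, wordb_count, bigram_count,
                           len_vocab, min_count, corpus_word_count)
```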
Am I misunderstanding any step of this process?
> In this way, `Phrases` will treat `European Commission` the same way it will treat `Commission There`.
No – you pass in sentences (lists of tokens) to Phrases, not strings of space-separated-tokens.
Right. So you are suggesting I break my sentences at any non-word token. That might be a solution I had not thought of. I will get back to you, probably tomorrow, as my main job is unfortunately not in NLP.
I don't know about non-word tokens. But definitely on full stops, to avoid that example of Commission There cross-sentence overlap.
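A possible sketch of that suggestion: split each token list at full stops (and similar sentence-final punctuation) before handing it to Phrases; the helper name `split_on_stops` is illustrative:

```python
SENTENCE_ENDS = {'.', '!', '?'}

def split_on_stops(corpus):
    # Yield one token list per sentence, so cross-sentence pairs such as
    # ('Commission', 'There') are never counted by Phrases.
    for item in corpus:
        sentence = []
        for token in item:
            if token in SENTENCE_ENDS:
                if sentence:
                    yield sentence
                sentence = []
            else:
                sentence.append(token)
        if sentence:
            yield sentence
```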
Yes, it depends on what the goals are.
If the statistics suggest ('Commission', '.') is a 'good' bigram – occurring significantly more often than the individual word-frequencies would suggest – then it might benefit downstream info-retrieval, classification, or clustering steps to create a `Commission_.` pseudoword token. That is, it's not self-evident to me, based on aesthetics alone, that such extra constraints would offer a benefit in any real situation. Have you encountered such a situation?
If providing a differently-filtered set of texts to Phrases somehow actually manages to make ('Commission', 'There') look, statistically, like a good bigram (even if it wouldn't before those extra post-filtering artificial pairings were created), that seems to me to risk a bigger issue than the one the filtering might be solving.
So I'm not convinced this would improve the results of Phrases, except on a non-quantitative aesthetic level – guaranteeing certain things a person might not think of as a Multi-Word-Expression never appear – and that might be deleterious in quantitative evaluations.
And, if someone has a particular corpus, & set of goals, where such extra-filtering is proven to help, it's easy enough – and in some respects cleaner – to apply as a separate filtering step/wrapper, before passing to Phrases.
I believe that your suggestion that the scoring function see the tokens (worda, wordb), not just their counts, might enable new possibilities as well, even separate from what a generic filter/token-disqualification rule might enable. But I'm not sure such possibilities would ever be necessary compared to other, simpler approaches. Still, those words themselves could conceivably be offered to the scoring function, and all existing scorers could ignore them fairly efficiently (so little complexity overhead added). So a vivid example of that working to provide a tangible benefit, with little cost to normal usage, would be welcome.
(But thinking about that just made me realize another way for a user to veto unwanted bigrams. After the survey-pass, iterate over the internal vocab dict, removing keys with unwanted tokens/characters/etc. Then, no later ops will create such bigrams. Though this would use a bit more memory to collect those counts only to be discarded, it might go faster by matching against only the unique terms in the final tally, rather than every term, repeatedly, as it comes up over and over again in the original texts.)
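A rough sketch of that post-survey pruning idea, assuming gensim 4.x (where `phrases.vocab` is a plain dict mapping unigrams and delimiter-joined bigrams to counts) and a user-supplied `corpus` and rejection pattern:

```python
import re

from gensim.models.phrases import Phrases

unwanted = re.compile(r"[!?.:;,#|0-9/\\\]\[{}()]")

phrases = Phrases(corpus, min_count=5, threshold=10.0)  # corpus: iterable of token lists

for key in list(phrases.vocab):
    # Only veto bigram keys (those containing the delimiter), so the unigram
    # counts used by the scoring function stay intact.
    if phrases.delimiter in key and unwanted.search(key):
        del phrases.vocab[key]
```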