
Is the parser a relevant pipeline component beyond noun chunking?

ghost opened this issue 4 years ago • 4 comments

Hello,

I'm using pytextrank with texts in Portuguese. Thanks to issue #54, I'm able to use POS information to produce some basic noun chunking, instead of relying on syntactic information from the parser.

My question is: in this case, where I'm producing chunks from POS tags, am I losing anything if I disable the parser and create a new pipeline component just for chunking? Is there other relevant information produced by the parser that pytextrank uses?
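
For reference, here is a minimal sketch of the kind of POS-based chunking I mean, using spaCy's Matcher (the pattern and names here are just illustrative, assuming spaCy v3; the exact approach in issue #54 may differ):

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load("pt_core_news_sm")

# Illustrative POS pattern: optional determiner, then a run of
# adjectives/nouns ending in a noun or proper noun. Real Portuguese
# noun phrases would need a more careful pattern.
matcher = Matcher(nlp.vocab)
matcher.add("NP", [[
    {"POS": "DET", "OP": "?"},
    {"POS": {"IN": ["ADJ", "NOUN", "PROPN"]}, "OP": "*"},
    {"POS": {"IN": ["NOUN", "PROPN"]}},
]])

doc = nlp("O processamento de linguagem natural é uma área fascinante.")

# Keep the longest non-overlapping matches as chunk candidates.
chunks = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
print([chunk.text for chunk in chunks])
```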

ghost avatar Mar 12 '20 13:03 ghost

Hi @imeano,

Given how TextRank works, there are strict requirements for what the parser must produce:

  • sentence and word segmentation
  • part-of-speech tagging
  • lemmatization

The noun chunking part is an extension that I added (along with the use of lemmatization) to make the algorithm more effective.
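
As a quick sketch of what the standard setup looks like (this uses the newer spaCy 3.x / pytextrank 3.x API, where pytextrank registers a "textrank" factory; older versions wire the component differently):

```python
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" pipeline factory

nlp = spacy.load("pt_core_news_sm")
nlp.add_pipe("textrank", last=True)

doc = nlp("O TextRank extrai as frases-chave mais relevantes de um texto.")

# Ranked phrases rely on sentence boundaries, POS tags, lemmas,
# and noun chunks from the upstream pipeline.
for phrase in doc._.phrases[:5]:
    print(phrase.rank, phrase.text)
```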

Does that help?

Also, does https://spacy.io/models/pt provide an effective parser for Portuguese?

ceteri avatar Mar 12 '20 21:03 ceteri

Thanks for the response. It does answer my question, even though I didn't ask it as well as I could have.

I used spaCy's terminology without specifying it clearly. Because spaCy's DependencyParser, as a pipeline component, is called simply "parser", I tend to just call it the parser as well. From testing, I came up with the following:

  • Word segmentation: Done by the Tokenizer, which doesn't need any other pipeline component.
  • Sentence segmentation: Done by the DependencyParser. Can be bypassed by a custom function added to the pipeline.
  • POS tagging: Done by the Tagger.
  • Lemmatization: Done by the Lemmatizer. The Lemmatizer can make use of POS information, but this depends on how it's implemented for the Language in question.
  • Noun chunking: Done by the 'noun_chunks' syntax iterator in Languages where it is defined. Syntax iterators rely on the DependencyParser's output, so disabling the parser prevents noun chunking from occurring. Can also be bypassed by a custom function added to the pipeline.

So, assuming those features are the only ones pytextrank needs to work properly, it seems I can disable the DependencyParser as long as I include noun chunking and sentence segmentation pipeline components.

I was fairly sure I could get it to work with these alterations, but was afraid of getting different results.
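
Roughly what I have in mind is something like this (a sketch only, assuming spaCy v3; the custom chunker would be along the lines of the Matcher sketch above, and how its chunks get handed to pytextrank is version-dependent):

```python
import spacy

# Load the Portuguese model without running the DependencyParser.
nlp = spacy.load("pt_core_news_sm", disable=["parser"])

# Rule-based sentence segmentation in place of the parser-derived one.
nlp.add_pipe("sentencizer", first=True)

# A custom chunking component would be added here as well.
# The parser should no longer appear among the active components:
print(nlp.pipe_names)
```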

Also, does https://spacy.io/models/pt provide an effective parser for Portuguese?

Mostly effective, I would say. I've worked with linguists and they couldn't make much use of the syntactic trees produced (errors in the syntactic parser tend to accumulate the farther from ROOT you get). Sentence segmentation is quite good for sentences that aren't too long. As for the Tagger, the POS output is quite good, but the TAG_MAP is too large, in my opinion.

ghost avatar Mar 13 '20 16:03 ghost

Can you please elaborate on this one more explicitly?

i.e., if we can't remove:

  • sentence and word segmentation
  • part-of-speech tagging
  • lemmatization

then is doc = nlp(text, disable=['ner', 'parser']) (for instance) an acceptable disable= situation or not?

Please spell out the redundant parts more explicitly :bow:

It matters a lot when the text is big; removing any redundant pipeline component would help a lot, memory-wise.

guy4261 avatar Jun 27 '22 09:06 guy4261

Hi @guy4261,

No, none of the textgraph algorithms would work with the parser disabled.

Disabling NER might be an option. It depends on the language, the versions of the other pipeline components, and so on, so you'd need to experiment.
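
For instance, something along these lines (a sketch only; the model name and batch size are illustrative, and you should measure the memory impact on your own data):

```python
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" pipeline factory

# Keep the parser (required by the textgraph algorithms),
# drop NER to save memory.
nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("textrank")

texts = ["First long document ...", "Second long document ..."]

# Stream documents with nlp.pipe() instead of building one giant Doc.
for doc in nlp.pipe(texts, batch_size=8):
    print([phrase.text for phrase in doc._.phrases[:3]])
```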

ceteri avatar Jul 25 '22 18:07 ceteri