pytextrank

Update Sample Usage document: stop words must be lowercase

Open 0dB opened this issue 2 years ago • 3 comments

In the Sample Usage document (https://derwen.ai/docs/ptr/sample/), I propose updating:

For each entry, you'll need to add a key that is the lemma and a value that's a list of its part-of-speech tags.

to

For each entry, you'll need to add a key that is the lemma (all lower-case) and a value that's a list of its part-of-speech tags.

0dB avatar Aug 06 '23 13:08 0dB

Hi @0dB, the existing documentation is exactly as it's supposed to be. The lemma of a token is not necessarily lower-case. For example, proper nouns like London have a lemma_ of London, not london. So the suggested change would not accurately represent what the stopwords field expects.

If the user also wants to omit London as a stopword, the code will look like:

nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"], "London": ["PROPN"] }}) 

Ankush-Chander avatar Aug 06 '23 15:08 Ankush-Chander

Ok, I understand. In my case it was the token "HGB" (an acronym for a set of German laws for the B2B sector) that I had to lowercase to scrub it, so I thought this holds for all tokens. Ok, but I did trip over that 😊 Is it worth mentioning to others? You could point out what you wrote, no?

0dB avatar Aug 06 '23 19:08 0dB

Definitely, it's great to mention these points for others.

That examples/sample.ipynb notebook is sort of the "backbone" for our MkDocs, and it could have more cases illustrated.

ceteri avatar Aug 07 '23 05:08 ceteri
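
For anyone else who trips over this: the two reports in this thread ("London" as a key versus "HGB" needing to be lower-cased) come down to whether the stopword lookup compares the lemma exactly or after lower-casing it. The sketch below is plain Python and does not call pytextrank or spaCy. The helper names `is_stopword` and `lookup_outcomes` are hypothetical, the PROPN tag for "HGB" is an assumption, and the thread does not settle which policy pytextrank actually applies. It only shows how the same stopwords dictionary behaves under each policy.

```python
from typing import Dict, List, Tuple

# Same shape as the "stopwords" config in this thread: lemma -> list of POS tags.
# The "hgb" entry and its PROPN tag are assumptions made for illustration.
STOPWORDS: Dict[str, List[str]] = {
    "word": ["NOUN"],
    "London": ["PROPN"],
    "hgb": ["PROPN"],
}


def is_stopword(
    lemma: str,
    pos: str,
    stopwords: Dict[str, List[str]],
    lowercase_lemma: bool,
) -> bool:
    """Hypothetical lookup: a token is scrubbed when its lemma is a key
    and its POS tag appears in that key's list."""
    key = lemma.lower() if lowercase_lemma else lemma
    return pos in stopwords.get(key, [])


def lookup_outcomes(
    lemma: str, pos: str, stopwords: Dict[str, List[str]]
) -> Tuple[bool, bool]:
    """Return (exact-match result, lower-cased-lemma result) for one token."""
    return (
        is_stopword(lemma, pos, stopwords, lowercase_lemma=False),
        is_stopword(lemma, pos, stopwords, lowercase_lemma=True),
    )


if __name__ == "__main__":
    for lemma, pos in [("word", "NOUN"), ("London", "PROPN"), ("HGB", "PROPN")]:
        exact, lowered = lookup_outcomes(lemma, pos, STOPWORDS)
        print(f"{lemma!r:10} {pos:6} exact={exact} lowercased={lowered}")
```

Under an exact comparison, only a key spelled like the lemma matches. Under a lower-casing comparison, only an all-lower-case key can match. If a stopword entry seems to be ignored, trying both spellings of the key is a quick check.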