pytextrank
Update Sample Usage document: stop words must be lowercase
In the Sample Usage document (https://derwen.ai/docs/ptr/sample/), I propose to update:
For each entry, you'll need to add a key that is the lemma and a value that's a list of its part-of-speech tags.
to
For each entry, you'll need to add a key that is the lemma (all lower-case) and a value that's a list of its part-of-speech tags.
Hi @0dB,
The existing documentation is exactly as it's supposed to be.
The lemma of a token is not necessarily lower-case. For example, proper nouns like London have lemma_ London, not london. So the suggested change would not accurately represent what the stopwords field expects.
If the user also wants to omit London as a stopword, the code would look like:
nlp.add_pipe("textrank", config={ "stopwords": { "word": ["NOUN"], "London": ["PROPN"] }})
Ok, I understand. In my case it was the token "HGB" (an acronym for a set of German laws for the B2B sector) that I had to lowercase to scrub it, so I thought this held for all tokens. I did trip over that, though 😊 Is it worth mentioning to others? You could point out what you wrote, no?
Definitely, it's great to mention these points for others.
That examples/sample.ipynb notebook is sort of the "backbone" for our MkDocs, and it could have more cases illustrated.
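One such case could be sketched as follows. This is a minimal sketch, not taken from the pytextrank docs: `build_stopwords` is a hypothetical helper, and the `PROPN` tag for "HGB" is an assumption (check `token.lemma_` and `token.pos_` on your own text). Since this thread reports both a mixed-case key ("London") and a key that only worked lower-cased ("HGB"), the helper registers each lemma under both spellings, so the entry should match whichever form the lookup uses.

```python
def build_stopwords(entries):
    """Hypothetical helper: expand a {lemma: [POS tags]} mapping so that
    every lemma is present both as written and lower-cased."""
    stopwords = {}
    for lemma, pos_tags in entries.items():
        for key in (lemma, lemma.lower()):
            merged = stopwords.setdefault(key, [])
            for tag in pos_tags:
                # Avoid duplicate tags when both spellings are identical.
                if tag not in merged:
                    merged.append(tag)
    return stopwords


config = {
    "stopwords": build_stopwords({
        "word": ["NOUN"],
        "London": ["PROPN"],
        "HGB": ["PROPN"],  # assumed tag; verify with token.pos_
    })
}

# Assuming spaCy and pytextrank are installed and `nlp` is a loaded pipeline:
# nlp.add_pipe("textrank", config=config)
```

Registering both spellings is slightly redundant, but it avoids depending on how a particular pytextrank version normalizes lemmas before the stopword lookup.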