py-readability-metrics
Add support for additional languages beyond English?
Hi,
Thank you very much, this is awesome! I would like to know whether this also works for Romance languages such as French, Spanish, or Portuguese, or whether it supports only English.
Thanks so much for your interest @Prattjames.
Currently, the package supports English. The package uses `sent_tokenize`, which uses `PunktSentenceTokenizer` (punkt) under the covers by default. punkt appears to support other languages.
After a quick review, it seems that to enable other languages, we would need to update the [sent_tokenize](https://github.com/cdimascio/py-readability-metrics/blob/master/readability/text/analyzer.py#L66) call to specify another punkt-supported language, e.g.
sent_tokenize(text, language='spanish')  # where spanish is any language supported by punkt
It seems that making this configurable through this package would enable us to support more languages.
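As a rough illustration of what "configurable" could look like, here is a minimal sketch. The function `split_sentences` and its `tokenizer` parameter are hypothetical and are not part of py-readability-metrics; the only real API assumed is NLTK's `sent_tokenize(text, language=...)`, which needs NLTK and its punkt data installed.

```python
from typing import Callable, List, Optional


def split_sentences(
    text: str,
    language: str = "english",
    tokenizer: Optional[Callable[..., List[str]]] = None,
) -> List[str]:
    """Split text into sentences for the given punkt language.

    Hypothetical helper, not the package's actual code. The tokenizer
    can be injected, which keeps the function testable without NLTK.
    """
    if tokenizer is None:
        # Imported lazily so the module loads even if NLTK is missing.
        from nltk.tokenize import sent_tokenize
        tokenizer = sent_tokenize
    # The language name is passed straight through to the tokenizer.
    return tokenizer(text, language=language)
```

A caller would then write something like `split_sentences(text, language="spanish")`, and the default of `"english"` keeps today's behavior unchanged.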
Such a change would enable all of the current scorers except for `dale_chall`. In order to support `dale_chall` properly, we need its word list for each language. We could just ignore `dale_chall` for now.
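One possible way to "ignore `dale_chall` for now" is to filter it out whenever the chosen language has no word list. This is only a sketch under assumptions: `supported_scorers`, `NEEDS_WORD_LIST`, and `LANGUAGES_WITH_WORD_LIST` are hypothetical names that do not exist in the package, and the scorer names in the example call are just illustrative.

```python
from typing import Iterable, List

# Hypothetical: scorers that depend on a language-specific word list.
NEEDS_WORD_LIST = frozenset({"dale_chall"})

# Hypothetical: languages for which such a word list is bundled.
LANGUAGES_WITH_WORD_LIST = frozenset({"english"})


def supported_scorers(requested: Iterable[str], language: str) -> List[str]:
    """Return the requested scorers that can run for this language.

    Scorers needing a word list are dropped when the language has none;
    the order of the remaining scorers is preserved.
    """
    if language in LANGUAGES_WITH_WORD_LIST:
        return list(requested)
    return [name for name in requested if name not in NEEDS_WORD_LIST]


print(supported_scorers(["flesch", "dale_chall", "smog"], "spanish"))
```

Adding a word list for a new language would then only mean extending `LANGUAGES_WITH_WORD_LIST`, without touching the other scorers.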
If you have any thoughts, are interested in helping out, or would even like to submit a PR, I'd welcome it.
Looking for help on this. We certainly don't need to support all languages. If support can be added for at least one additional language that will be a fantastic start!
Does it support Chinese?
It does not currently support Chinese. I'm looking for help if folks are interested. PRs are always welcome.
Any updates on this? Does it support other languages?