py-readability-metrics

Add support for additional languages beyond English?

Open delzennejc opened this issue 6 years ago • 5 comments

Hi,

Thank you very much, this is awesome! I would like to know whether this also works for Romance languages like French, Spanish, and Portuguese, or whether it supports only English.

delzennejc avatar Jan 17 '19 19:01 delzennejc

Thanks so much for your interest @Prattjames.

Currently, the package supports only English. The package uses sent_tokenize, which uses PunktSentenceTokenizer (punkt) under the covers by default. punkt appears to support other languages.

After a quick review, it seems that to enable other languages, we would need to update the [sent_tokenize](https://github.com/cdimascio/py-readability-metrics/blob/master/readability/text/analyzer.py#L66) call to specify another punkt-supported language, e.g.

sent_tokenize(text, language='spanish')  # where spanish is any language supported by punkt

It seems that making this configurable through this package would enable us to support more languages.

Such a change would enable all of the current scorers except for dale_chall. In order to support dale_chall properly, we would need its word list for each language. We could just ignore dale_chall for now.
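A minimal sketch of what a configurable language option could look like. The names `split_sentences`, `available_scorers`, and `WORD_LIST_SCORERS` are hypothetical and are not part of this package; the only real API assumed is NLTK's `sent_tokenize(text, language=...)`, which also needs the punkt data to be downloaded for the chosen language.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical: scorers that rely on an English-only word list.
# dale_chall is the one case named in this thread.
WORD_LIST_SCORERS = {"dale_chall"}


def split_sentences(
    text: str,
    language: str = "english",
    tokenizer: Optional[Callable[..., List[str]]] = None,
) -> List[str]:
    """Split text into sentences using the punkt model for `language`."""
    if tokenizer is None:
        # Imported lazily so the rest of this sketch works without NLTK.
        from nltk.tokenize import sent_tokenize
        tokenizer = sent_tokenize
    return tokenizer(text, language=language)


def available_scorers(scorers: Iterable[str], language: str = "english") -> List[str]:
    """Return the scorers usable for `language`, skipping word-list scorers
    for anything other than English."""
    if language == "english":
        return list(scorers)
    return [name for name in scorers if name not in WORD_LIST_SCORERS]
```

With something like this, a caller could pass `language='spanish'` once and have dale_chall skipped automatically rather than producing a misleading score.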

If you have any thoughts, or are interested in helping out or even submitting a PR, I'd welcome it.

cdimascio avatar Jan 17 '19 20:01 cdimascio

Looking for help on this. We certainly don't need to support all languages. If support can be added for at least one additional language that will be a fantastic start!

cdimascio avatar Oct 13 '19 03:10 cdimascio

Does it support Chinese?

wcc526 avatar Mar 22 '20 11:03 wcc526

It does not currently support Chinese. I'm looking for help if folks are interested. PRs are always welcome.

cdimascio avatar Apr 02 '20 03:04 cdimascio

Any updates on this? Does it support other languages?

OanaIgnat avatar Mar 27 '24 19:03 OanaIgnat