Languages other than English

Open ghost opened this issue 7 years ago • 7 comments

Hi there, thanks for the cool project. The bottom of the README says the support for other languages is a thing to look forward to -- could you elaborate on it a bit? Any particular plans? Let me know if you're looking for contributors that could handle different languages.

ghost avatar Aug 16 '16 17:08 ghost

Sure. So ExplainToMe currently does 3 things.

  1. Grabs the HTML from a webpage.
  2. Extracts the main article components.
  3. Generates a semantic graph and computes its centroid.

Currently #1 and #2 do not care about language; they mostly deal with HTML and webpage metadata. #3 does care about language, but mostly in terms of stopwords and text cleaning. If the user specifies the language of the article in advance (sometimes we can discover it from the HTML), we can provide the matching stopwords, and most Romance languages should generate a decent summary.

We'll most likely start by supporting those languages.

I am interested in doing non-Romance languages, but we'll see how far we get.
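
As a rough illustration of the stopword step described above, here is a minimal sketch of language-dependent filtering. The `STOPWORDS` table and the `content_words` helper are hypothetical names made up for this example (they are not taken from ExplainToMe or sumy), and the word lists are tiny samples rather than real stopword lists.

```python
import re

# Tiny illustrative samples only; real stopword lists are far longer.
STOPWORDS = {
    "english": {"the", "a", "of", "and", "is", "in"},
    "spanish": {"el", "la", "de", "y", "es", "en"},
}


def content_words(text, language):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    stopwords = STOPWORDS[language]
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(content_words("The cat is in the garden", "english"))
print(content_words("El gato es de la casa", "spanish"))
```

Only the lookup in `STOPWORDS` depends on the language, which is why supplying the right list goes a long way for languages that tokenize cleanly on whitespace and punctuation.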

jjangsangy avatar Aug 16 '16 18:08 jjangsangy

Cool. I take it you only use sumy as the summarisation platform? It seems to support Czech, French, German, Portuguese, Slovak, and Spanish out-of-the-box (the stop words for these languages are included in the package).

andykaycodes avatar Aug 16 '16 18:08 andykaycodes

Correct. Sumy provides the right framework for building a document summarizer, and it ships with implementations of the most popular techniques.

My main concern about adding more languages is that I can't really attest to their accuracy in an intuitive way. My experience with cross-language NLP is that techniques vary in effectiveness based on latent cultural features.

jjangsangy avatar Aug 17 '16 02:08 jjangsangy

I'd love to help with Portuguese (Brazilian Portuguese). I've been looking for something like this in Portuguese for ages.

gioferreira avatar May 31 '17 15:05 gioferreira

Awesome. I would start by looking in textrank.py. There is a function called run_summarizer that takes a keyword argument language. Currently there is no function for detecting the language, so you'll have to write one based on page metadata, an HTML meta tag, or a language-detection library.
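
A minimal sketch of the HTML-based option, using only the standard library. Everything here is an assumption made for illustration: `detect_language`, `_LangParser`, and the `LANGUAGE_NAMES` mapping are hypothetical, and whether run_summarizer accepts these exact language names would need to be checked against textrank.py.

```python
from html.parser import HTMLParser

# Assumed mapping from ISO 639-1 codes to full language names.
LANGUAGE_NAMES = {
    "en": "english",
    "pt": "portuguese",
    "es": "spanish",
    "fr": "french",
    "de": "german",
    "cs": "czech",
    "sk": "slovak",
}


class _LangParser(HTMLParser):
    """Records the first language hint found in the markup."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if self.lang is not None:
            return
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            # <html lang="pt-BR">
            self.lang = attrs["lang"]
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            # <meta http-equiv="Content-Language" content="es">
            self.lang = attrs.get("content")


def detect_language(html, default="english"):
    """Return a language name from HTML hints, or `default` if none is found."""
    parser = _LangParser()
    parser.feed(html)
    # Reduce tags such as "pt-BR" or "pt_BR" to the primary code "pt".
    code = (parser.lang or "").replace("_", "-").split("-")[0].strip().lower()
    return LANGUAGE_NAMES.get(code, default)
```

The result could then be passed along as the language keyword argument. Pages that declare no language fall back to the default, so a detection library would still be needed for those.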

jjangsangy avatar May 31 '17 20:05 jjangsangy

Heads up: I'm making some changes that will be pushed upstream, maybe this week or next. They shouldn't affect any code in textrank.py or the original API.

The changes do, however, move a lot of files around. Mostly I've split the application into a Flask server that only displays the webpage and a summarization backend that runs asynchronously on AWS Lambda. I've mostly been running the public Heroku server as a demo, but it's getting costly to maintain, even if it's not that much every month.

jjangsangy avatar Jun 11 '17 19:06 jjangsangy