python-dandelion-eu

Cannot detect language on EN text

Open • manentai opened this issue 2 years ago • 4 comments

Hello, I am getting this error when I try to process this English text:

I'm getting a bit confused by tech companies' thinking around the future of remote working, and I imagine I'm not the only one.

Months of working from home have made many businesses and their employees question whether the typical 9-5 working model is necessary in an age where work is increasingly done in front of a computer that provides instantaneous connection to anyone, anywhere in the world.

In this special feature, ZDNet examines technology's role in helping business leaders build tomorrow's workforce, and employees keep their skills up to date and grow their careers.

The article is longer, you can find it here: https://www.zdnet.com/article/is-remote-working-good-or-bad-big-tech-companies-just-cant-seem-to-decide/

The error message:

Traceback (most recent call last):
  File "/home/manentai/mambaforge/envs/flaskenv/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/manentai/mambaforge/envs/flaskenv/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/home/manentai/mambaforge/envs/flaskenv/lib/python3.8/site-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/home/manentai/mambaforge/envs/flaskenv/lib/python3.8/site-packages/flask/views.py", line 84, in view
    return current_app.ensure_sync(self.dispatch_request)(*args, **kwargs)
  File "/home/manentai/mambaforge/envs/flaskenv/lib/python3.8/site-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/home/manentai/flaskAI/app.py", line 119, in post
    app.PD.process(document_id)
  File "/home/manentai/flaskAI/process_data.py", line 97, in process
    response = datatxt.nex(sentence)#, include_categories=True, include_types=True)
  File "/home/manentai/mambaforge/envs/flaskenv/lib/python3.8/site-packages/dandelion/datatxt.py", line 14, in nex
    return self.do_request(
  File "/home/manentai/mambaforge/envs/flaskenv/lib/python3.8/site-packages/dandelion/base.py", line 102, in do_request
    raise DandelionException(obj)
dandelion.base.DandelionException: Cannot detect language

What can cause this confusion on an English text? Am I supposed to tell Dandelion that it's in English?

manentai, Apr 08 '22 15:04

Hi Simone, I have tested it and it seems to work. Can you please check your code? You should not have problems. You may post the code snippet here, if you want.

giacbrd, Apr 08 '22 18:04

Hi, thanks for getting back to me.

Actually the snippet I have is working on other texts I am trying, so I guess it might be a problem with the encoding of the text:

# parse article with SpaCy
doc = nlp(document["text"])

# Get the list of sentences in all the articles
sentences = [i.text for i in doc.sents]

# extract NER and keywords 
for sentence in sentences:
    # extract NER with Dandelion.eu
    response = datatxt.nex(sentence)

I am at a loss, actually: with some texts it works, and with others it raises the error...

manentai, Apr 09 '22 10:04

ok I think I solved it... If I parse sentences like this:

sentences = [i.text for i in doc.sents]

I also get empty sentences, and the API call fails on them... If this is specified in the docs, I missed it, sorry...

So this one is sufficient to fix the issue:

sentences = [i.text.strip() for i in doc.sents if i.text.strip()!=""]
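
A minimal sketch of the same filter as a reusable helper, which strips each sentence only once. The name non_empty_sentences and the sample list are hypothetical and only stand in for [i.text for i in doc.sents]:

```python
def non_empty_sentences(texts):
    """Strip each sentence and drop the ones that end up empty."""
    stripped = (text.strip() for text in texts)
    return [text for text in stripped if text]


# Hypothetical stand-in for [i.text for i in doc.sents];
# the whitespace-only items play the role of the empty sentences.
raw_sentences = ["First sentence.", "\n\n", "  Second sentence. ", ""]

sentences = non_empty_sentences(raw_sentences)
print(sentences)
```

With spaCy this would be called as non_empty_sentences(i.text for i in doc.sents), so that only non-blank text reaches datatxt.nex.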

manentai, Apr 09 '22 10:04

Hi Simone, I am glad you have solved it.

I have checked: on empty texts, the full error message in the API response is "Cannot detect language:text is empty or null". It seems that the Python exception cuts off the second, more meaningful part.
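
Since the exception text alone does not say which input was at fault, one option is to guard the call on the client side. The following is only a sketch: safe_extract and fake_extract are hypothetical names, and fake_extract stands in for datatxt.nex so the example runs without an API token. In real code the except clause would more likely catch dandelion.base.DandelionException, the class shown in the traceback.

```python
def safe_extract(extract, sentence):
    """Call extract() only on non-blank text and add context to failures.

    Returns None for blank input instead of sending it to the API.
    """
    text = sentence.strip()
    if not text:
        return None
    try:
        return extract(text)
    except Exception as exc:
        # Include the start of the offending text so the failure is traceable.
        raise RuntimeError(f"Extraction failed for {text[:40]!r}: {exc}") from exc


def fake_extract(text):
    """Hypothetical stand-in for datatxt.nex; returns a dummy result."""
    return {"text": text}


print(safe_extract(fake_extract, "   "))
print(safe_extract(fake_extract, " Some English text. "))
```

Blank input yields None without any request being made, and any error raised by the extractor is re-raised with the beginning of the sentence attached.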

giacbrd, Apr 09 '22 12:04