
Don't show word space for non-English Explorer queries

Open pypt opened this issue 4 years ago • 6 comments

Hi,

/api/explorer/words/count doesn't appear to handle a Hindi UTF-8 response from /api/v2/wc/list. For example, try:

https://api.mediacloud.org/api/v2/wc/list?q=%28%22%E0%A4%86%E0%A4%B0%E0%A5%8D%E0%A4%A5%E0%A4%BF%E0%A4%95+%E0%A4%B8%E0%A4%B6%E0%A4%95%E0%A5%8D%E0%A4%A4%E0%A4%BF%E0%A4%95%E0%A4%B0%E0%A4%A3%22+OR+%22%E0%A4%86%E0%A4%B0%E0%A5%8D%E0%A4%A5%E0%A4%BF%E0%A4%95+%E0%A4%B6%E0%A4%95%E0%A5%8D%E0%A4%A4%E0%A4%BF%22+OR+%22%E0%A4%B2%E0%A4%98%E0%A5%81+%E0%A4%B5%E0%A5%8D%E0%A4%AF%E0%A4%BE%E0%A4%AA%E0%A4%BE%E0%A4%B0%22+OR+%22%E0%A4%B2%E0%A4%98%E0%A5%81+%E0%A4%B5%E0%A5%8D%E0%A4%AF%E0%A4%B5%E0%A4%B8%E0%A4%BE%E0%A4%AF%22%29+AND+%28%28+tags_id_media%3A%289325106%29%29%29&num_words=100&sample_size=1000&include_stopwords=0&include_stats=0&ngram_size=1&fq=publish_day%3A%5B2019-01-01T00%3A00%3A00Z+TO+2019-06-30T00%3A00%3A00Z%5D&key=<...>

Relevant dokku/mc-explorer:latest log:

[19:58:47][DEBUG] mediacloud.api api.py:_query:426 | query GET to https://api.mediacloud.org/api/v2/wc/list with {'q': '("आर्थिक सशक्तिकरण" OR "आर्थिक शक्ति" OR "लघु व्यापार" OR "लघु व्यवसाय") AND (( tags_id_media:(9325106)))', 'l': None, 'num_words': 100, 'sample_size': 1000, 'include_stopwords': 0, 'include_stats': 0, 'ngram_size': 1, 'random_seed': None, 'fq': 'publish_day:[2019-01-01T00:00:00Z TO 2019-06-30T00:00:00Z]'} and None
[19:58:50][DEBUG] mediacloud.api api.py:_query:479 | Profiling: 3.0286505222320557s for GET to https://api.mediacloud.org/api/v2/wc/list (with {'q': '("आर्थिक सशक्तिकरण" OR "आर्थिक शक्ति" OR "लघु व्यापार" OR "लघु व्यवसाय") AND (( tags_id_media:(9325106)))', 'l': None, 'num_words': 100, 'sample_size': 1000, 'include_stopwords': 0, 'include_stats': 0, 'ngram_size': 1, 'random_seed': None, 'fq': 'publish_day:[2019-01-01T00:00:00Z TO 2019-06-30T00:00:00Z]', 'key': '<...>'} / null)
[19:58:50][ERROR] server app.py:log_exception:1891 | Exception on /api/explorer/words/count [GET]
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/app/.heroku/python/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/.heroku/python/lib/python3.7/site-packages/flask_login/utils.py", line 261, in decorated_view
    return func(*args, **kwargs)
  File "/app/server/util/request.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/app/server/views/explorer/words.py", line 21, in api_explorer_words
    return _get_word_count()
  File "/app/server/views/explorer/words.py", line 37, in _get_word_count
    word_data = query_wordcount(solr_q, solr_fq, sample_size=sample_size)
  File "/app/server/views/explorer/words.py", line 104, in query_wordcount
    google_word2vec_data = apicache.word2vec_google_2d(words)
  File "/app/server/views/explorer/apicache.py", line 158, in word2vec_google_2d
    return _cached_word2vec_google_2d(words)
  File "</app/.heroku/python/lib/python3.7/site-packages/decorator.py:decorator-gen-31>", line 2, in _cached_word2vec_google_2d
  File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/cache/region.py", line 1272, in get_or_create_for_user_func
    should_cache_fn, (arg, kw))
  File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/cache/region.py", line 879, in get_or_create
    async_creator) as value:
  File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/lock.py", line 186, in __enter__
    return self._enter()
  File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/lock.py", line 93, in _enter
    generated = self._enter_create(value, createdtime)
  File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/lock.py", line 179, in _enter_create
    return self.creator()
  File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/cache/region.py", line 839, in gen_value
    created_value = creator(*creator_args[0], **creator_args[1])
  File "/app/server/views/explorer/apicache.py", line 165, in _cached_word2vec_google_2d
    word2vec_results = wordembeddings.google_news_2d(words)
  File "/app/server/util/wordembeddings.py", line 10, in google_news_2d
    {'words[]': words})
  File "/app/server/util/wordembeddings.py", line 28, in _query_for_json
    response_json = response.json()
  File "/app/.heroku/python/lib/python3.7/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/app/.heroku/python/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/app/.heroku/python/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/app/.heroku/python/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

So it looks like a call to /api/v2/google-news/2d (served by dokku/word-embeddings:latest) fails. Relevant lines from its log:

[19:58:50][DEBUG] server.request request.py:wrapper:29 | ImmutableMultiDict([('words[]', u'\u0906\u0930\u094d\u0925\u093f\u0915'), ('words[]', u'\u0936\u0915\u094d\u0924\u093f'), ('words[]', u'\u092d\u093e\u0930\u0924'), ('words[]', u'\u0938\u0936\u0915\u094d\u0924\u093f\u0915\u0930\u0923'), ('words[]', u'\u0926\u0947\u0936'), ('words[]', u'\u0930\u0942\u092a'), ('words[]', u'\u0935\u094d\u092f\u093e\u092a\u093e\u0930'), ('words[]', u'\u092e\u0939\u093f\u0932\u093e\u0913\u0902'), ('words[]', u'\u091c\u093e\u090f\u0917\u093e'), ('words[]', u'\u0932\u0918\u0941'), ('words[]', u'\u092e\u0939\u093f\u0932\u093e'), ('words[]', u'\u0935\u093f\u0915\u093e\u0938'), ('words[]', u'\u0938\u0930\u0915\u093e\u0930'), ('words[]', u'\u0935\u094d\u092f\u0935\u0938\u093e\u092f'), ('words[]', u'\u092e\u094b\u0926\u0940'), ('words[]', u'\u0926\u0941\u0928\u093f\u092f\u093e'), ('words[]', u'\u092c\u0928\u093e\u0928\u0947'), ('words[]', u'\u092a\u093e\u0902\u091a'), ('words[]', u'\u0938\u093e\u092e\u093e\u091c\u093f\u0915'), ('words[]', u'\u0909\u0928\u094d\u0939\u094b\u0902\u0928\u0947'), ('words[]', u'\u092f\u094b\u091c\u0928\u093e'), ('words[]', u'\u0935\u093f\u0936\u094d\u0935'), ('words[]', u'\u092e\u0902\u0924\u094d\u0930\u0940'), ('words[]', u'\u0938\u092e\u093e\u091c'), ('words[]', u'\u092c\u0921\u093c\u0940'), ('words[]', u'\u091a\u0940\u0928'), ('words[]', u'\u0915\u093e\u0930\u094d\u092f'), ('words[]', u'\u0915\u094d\u0937\u0947\u0924\u094d\u0930'), ('words[]', u'\u0915\u093e\u0930\u094d\u092f\u0915\u094d\u0930\u092e'), ('words[]', u'\u0939\u092e\u093e\u0930\u0940'), ('words[]', u'\u0930\u094b\u091c\u0917\u093e\u0930'), ('words[]', u'\u0905\u0932\u094d\u092a\u0938\u0902\u0916\u094d\u092f\u0915'), ('words[]', u'\u0915\u0947\u0902\u0926\u094d\u0930'), ('words[]', u'\u0936\u093e\u092e\u093f\u0932'), ('words[]', u'\u0928\u0940\u0924\u093f'), ('words[]', u'\u0915\u093e\u092e'), ('words[]', u'\u092e\u091c\u092c\u0942\u0924'), ('words[]', 
u'\u0938\u0902\u0938\u094d\u0925\u093e\u0928\u094b\u0902'), ('words[]', u'\u0936\u093f\u0915\u094d\u0937\u093e'), ('words[]', u'\u0930\u0939\u0940'), ('words[]', u'\u0930\u093e\u091c\u094d\u092f'), ('words[]', u'\u0915\u093f\u0938\u093e\u0928\u094b\u0902'), ('words[]', u'\u0906\u0917\u0947'), ('words[]', u'\u0935\u094d\u092f\u093e\u092a\u093e\u0930\u093f\u092f\u094b\u0902'), ('words[]', u'\u092e\u093e\u0927\u094d\u092f\u092e'), ('words[]', u'\u0935\u0930\u094d\u0937\u094b\u0902'), ('words[]', u'\u0930\u093e\u091c\u0928\u0940\u0924\u093f\u0915'), ('words[]', u'\u092c\u0928'), ('words[]', u'\u092a\u093e\u0915\u093f\u0938\u094d\u0924\u093e\u0928'), ('words[]', u'\u0909\u092d\u0930\u0924\u0940'), ('words[]', u'\u0930\u0941\u092a\u092f\u0947'), ('words[]', u'\u092c\u0924\u093e\u092f\u093e'), ('words[]', u'\u091c\u094d\u092f\u093e\u0926\u093e'), ('words[]', u'\u0905\u092e\u0947\u0930\u093f\u0915\u093e'), ('words[]', u'\u0939\u094b\u0917\u093e'), ('words[]', u'\u0935\u0948\u0936\u094d\u0935\u093f\u0915'), ('words[]', u'\u092c\u093e\u0924'), ('words[]', u'\u0926\u0940'), ('words[]', u'\u0915\u0930\u094b\u0921\u093c'), ('words[]', u'\u0939\u092e'), ('words[]', u'\u0932\u093e\u0916'), ('words[]', u'\u0930\u093e\u0937\u094d\u091f\u094d\u0930'), ('words[]', u'\u092f\u0941\u0926\u094d\u0927'), ('words[]', u'\u092c\u0922\u093c\u0924\u0940'), ('words[]', u'\u092c\u0928\u0928\u0947'), ('words[]', u'\u0928\u0947\u0924\u0943\u0924\u094d\u0935'), ('words[]', u'\u0939\u0947\u0924\u0941'), ('words[]', u'\u0935\u093e\u0932\u0940'), ('words[]', u'\u0928\u0915\u0935\u0940'), ('words[]', u'\u0936\u0948\u0915\u094d\u0937\u0923\u093f\u0915'), ('words[]', u'\u0935\u0930\u094d\u0937'), ('words[]', u'\u092a\u094d\u0930\u0927\u093e\u0928\u092e\u0902\u0924\u094d\u0930\u0940'), ('words[]', u'\u0926\u0947\u0928\u0947'), ('words[]', u'\u0909\u0926\u094d\u092f\u092e\u093f\u092f\u094b\u0902'), ('words[]', u'\u092e\u093f\u0932'), ('words[]', 
u'\u0905\u0930\u094d\u0925\u0935\u094d\u092f\u0935\u0938\u094d\u0925\u093e'), ('words[]', u'\u0938\u093e\u092e\u093e\u091c\u093f\u0915-\u0906\u0930\u094d\u0925\u093f\u0915'), ('words[]', u'\u0932\u095c\u0915\u093f\u092f\u094b\u0902'), ('words[]', u'\u0924\u0939\u0924'), ('words[]', u'\u091a\u093e\u0939\u0924\u0947'), ('words[]', u'\u0905\u092d\u093f\u092f\u093e\u0928'), ('words[]', u'\u0905\u0917\u0932\u0947'), ('words[]', u'\u0938\u094d\u0925\u093e\u092a\u093f\u0924'), ('words[]', u'\u092c\u0928\u0947'), ('words[]', u'\u0915\u093e\u0930\u0923'), ('words[]', u'\u0939\u094b\u0917\u0940'), ('words[]', u'\u092e\u093e\u0928\u093e'), ('words[]', u'\u092d\u093e\u0930\u0924\u0940\u092f'), ('words[]', u'\u092e\u0926\u0926'), ('words[]', u'\u092c\u0948\u0920\u0915'), ('words[]', u'\u0928\u0930\u0947\u0902\u0926\u094d\u0930'), ('words[]', u'\u0926\u0947\u0916'), ('words[]', u'\u0924\u0940\u0928'), ('words[]', u'\u0938\u0902\u0917\u0920\u0928'), ('words[]', u'\u092a\u094d\u0930\u0926\u0947\u0936'), ('words[]', u'\u0905\u0927\u093f\u0915'), ('words[]', u'\u0935\u093f\u0926\u094d\u092f\u093e\u0930\u094d\u0925\u093f\u092f\u094b\u0902'), ('words[]', u'\u092e\u094c\u0915\u0947'), ('words[]', u'\u092c\u091a\u094d\u091a\u094b\u0902'), ('words[]', u'\u092a\u094d\u0930\u092e\u0941\u0916')])
[19:58:50][ERROR] flask.app app.py:log_exception:1761 | Exception on /api/v2/google-news/2d [POST]
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/server/request.py", line 31, in wrapper
    return func(*args, **kwargs)
  File "/app/server/views/api.py", line 23, in google_embeddings_2d
    results = _embeddings_2d(word_vectors, words)
  File "/app/server/views/api.py", line 83, in _embeddings_2d
    words_with_model_info.append({'word': words_in_model[i]['word'], 'x': two_d_embeddings[i][0], 'y': two_d_embeddings[i][1]})
IndexError: list index out of range

References #1731.

pypt, Mar 02 '20 20:03

These embedding models only work with English. The fix is to detect the query language somehow and, when it is not English, not show the embeddings widget at all.

rahulbot, Mar 02 '20 21:03

We have a few choices in Python; lang-detect looks like the most straightforward:

https://stackoverflow.com/questions/43377265/determine-if-text-is-in-english/48436520#48436520

cindyloo, Mar 03 '20 17:03

We use cld2 on the backend.

hroberts, Mar 03 '20 17:03

Guessing from the error itself (IndexError: list index out of range), the code probably assumes that it will always find an element at a particular index in that list, which is not the case here. A simple bounds check (or even a try/except block) might be enough. Guessing the language of (potentially short) queries is just too unreliable.

pypt, Mar 03 '20 18:03
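
The bounds check suggested above could look roughly like the following. This is a minimal sketch, not the service's actual code: the function name `pair_words_with_embeddings` is hypothetical, and the shapes of `words_in_model` (a list of dicts with a `'word'` key) and `two_d_embeddings` (a sequence of `[x, y]` pairs) are assumed from the line shown in the traceback.

```python
def pair_words_with_embeddings(words_in_model, two_d_embeddings):
    """Pair each in-vocabulary word with its 2D point.

    Hypothetical helper: tolerates empty or mismatched inputs instead of
    indexing both lists with the same counter.
    """
    results = []
    # zip() stops at the shorter sequence, so no index can run out of range
    for word_info, point in zip(words_in_model, two_d_embeddings):
        if len(point) < 2:
            # Skip malformed points rather than failing the whole request
            continue
        results.append({'word': word_info['word'], 'x': point[0], 'y': point[1]})
    return results
```

With no words in the model's vocabulary (as would be expected for an all-Hindi word list against an English model), this returns an empty list rather than raising.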

Yes, good point. I just tested the queries with lang-detect and found that, because our logical operators (AND, OR) are in English, it is difficult to reliably detect that a query is not English.

cindyloo, Mar 03 '20 18:03
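
Since the English boolean operators are what skew detection, one option is to strip the query syntax before handing the text to a detector. A rough sketch under assumptions: the helper name `strip_query_syntax` is hypothetical, the patterns only cover the Solr-style syntax visible in the logged query, and no language detector is called here.

```python
import re

# Assumption: Solr-style queries like the one in the log above.
# Field clauses such as tags_id_media:(9325106) or publish_day:[... TO ...]
_FIELD_CLAUSE = re.compile(r'\b\w+:(\([^)]*\)|\[[^\]]*\]|\S+)')
# Upper-case boolean/range keywords that are query syntax, not content
_BOOLEAN_OPS = re.compile(r'\b(AND|OR|NOT|TO)\b')
# Grouping, quoting and modifier characters
_PUNCTUATION = re.compile(r'[()"\[\]+\-*~^]')


def strip_query_syntax(query):
    """Return only the free-text terms of a query, separated by single spaces."""
    text = _FIELD_CLAUSE.sub(' ', query)
    text = _BOOLEAN_OPS.sub(' ', text)
    text = _PUNCTUATION.sub(' ', text)
    return ' '.join(text.split())
```

The remaining text could then be passed to whichever detector is chosen; short queries may still be too little text for a reliable guess.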

We can avoid the call to apicache.word2vec_google_2d(words) in both Explorer and Topics by first testing for an empty word list or empty term list.

cindyloo, Mar 03 '20 20:03
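
A caller-side guard along those lines might be sketched as follows. The names `fetch_2d_embeddings` and `query_service` are hypothetical; `query_service` stands in for whatever actually performs the request (for example `apicache.word2vec_google_2d`). The sketch also catches `ValueError`, an addition beyond the empty-list test, because the `JSONDecodeError` in the first traceback is a subclass of it.

```python
def fetch_2d_embeddings(words, query_service):
    """Return 2D embedding results, or [] when there is nothing usable to plot.

    `query_service` is a hypothetical callable taking a list of words and
    returning parsed results; it is an assumption of this sketch.
    """
    # Drop empty or whitespace-only terms before deciding whether to call at all
    cleaned = [w for w in (words or []) if w and w.strip()]
    if not cleaned:
        return []
    try:
        return query_service(cleaned)
    except ValueError:
        # json.JSONDecodeError subclasses ValueError, so a non-JSON
        # error response from the embeddings service ends up here
        return []
```

Note that in the log above the word list sent to the embeddings service is not empty, so the empty-list test alone would not cover that request; the `except` branch is what would keep the caller from failing in that case.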