web-tools
don't show word space for non-english Explorer queries
Hi,
/api/explorer/words/count doesn't appear to like it too much when it receives a Hindi UTF-8 response from /api/v2/wc/list, e.g. try:
https://api.mediacloud.org/api/v2/wc/list?q=%28%22%E0%A4%86%E0%A4%B0%E0%A5%8D%E0%A4%A5%E0%A4%BF%E0%A4%95+%E0%A4%B8%E0%A4%B6%E0%A4%95%E0%A5%8D%E0%A4%A4%E0%A4%BF%E0%A4%95%E0%A4%B0%E0%A4%A3%22+OR+%22%E0%A4%86%E0%A4%B0%E0%A5%8D%E0%A4%A5%E0%A4%BF%E0%A4%95+%E0%A4%B6%E0%A4%95%E0%A5%8D%E0%A4%A4%E0%A4%BF%22+OR+%22%E0%A4%B2%E0%A4%98%E0%A5%81+%E0%A4%B5%E0%A5%8D%E0%A4%AF%E0%A4%BE%E0%A4%AA%E0%A4%BE%E0%A4%B0%22+OR+%22%E0%A4%B2%E0%A4%98%E0%A5%81+%E0%A4%B5%E0%A5%8D%E0%A4%AF%E0%A4%B5%E0%A4%B8%E0%A4%BE%E0%A4%AF%22%29+AND+%28%28+tags_id_media%3A%289325106%29%29%29&num_words=100&sample_size=1000&include_stopwords=0&include_stats=0&ngram_size=1&fq=publish_day%3A%5B2019-01-01T00%3A00%3A00Z+TO+2019-06-30T00%3A00%3A00Z%5D&key=<...>
Relevant dokku/mc-explorer:latest log:
[19:58:47][DEBUG] mediacloud.api api.py:_query:426 | query GET to https://api.mediacloud.org/api/v2/wc/list with {'q': '("आर्थिक सशक्तिकरण" OR "आर्थिक शक्ति" OR "लघु व्यापार" OR "लघु व्यवसाय") AND (( tags_id_media:(9325106)))', 'l': None, 'num_words': 100, 'sample_size': 1000, 'include_stopwords': 0, 'include_stats': 0, 'ngram_size': 1, 'random_seed': None, 'fq': 'publish_day:[2019-01-01T00:00:00Z TO 2019-06-30T00:00:00Z]'} and None
[19:58:50][DEBUG] mediacloud.api api.py:_query:479 | Profiling: 3.0286505222320557s for GET to https://api.mediacloud.org/api/v2/wc/list (with {'q': '("आर्थिक सशक्तिकरण" OR "आर्थिक शक्ति" OR "लघु व्यापार" OR "लघु व्यवसाय") AND (( tags_id_media:(9325106)))', 'l': None, 'num_words': 100, 'sample_size': 1000, 'include_stopwords': 0, 'include_stats': 0, 'ngram_size': 1, 'random_seed': None, 'fq': 'publish_day:[2019-01-01T00:00:00Z TO 2019-06-30T00:00:00Z]', 'key': '<...>'} / null)
[19:58:50][ERROR] server app.py:log_exception:1891 | Exception on /api/explorer/words/count [GET]
Traceback (most recent call last):
File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/app/.heroku/python/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/app/.heroku/python/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/.heroku/python/lib/python3.7/site-packages/flask_login/utils.py", line 261, in decorated_view
return func(*args, **kwargs)
File "/app/server/util/request.py", line 81, in wrapper
return func(*args, **kwargs)
File "/app/server/views/explorer/words.py", line 21, in api_explorer_words
return _get_word_count()
File "/app/server/views/explorer/words.py", line 37, in _get_word_count
word_data = query_wordcount(solr_q, solr_fq, sample_size=sample_size)
File "/app/server/views/explorer/words.py", line 104, in query_wordcount
google_word2vec_data = apicache.word2vec_google_2d(words)
File "/app/server/views/explorer/apicache.py", line 158, in word2vec_google_2d
return _cached_word2vec_google_2d(words)
File "</app/.heroku/python/lib/python3.7/site-packages/decorator.py:decorator-gen-31>", line 2, in _cached_word2vec_google_2d
File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/cache/region.py", line 1272, in get_or_create_for_user_func
should_cache_fn, (arg, kw))
File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/cache/region.py", line 879, in get_or_create
async_creator) as value:
File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/lock.py", line 186, in __enter__
return self._enter()
File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/lock.py", line 93, in _enter
generated = self._enter_create(value, createdtime)
File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/lock.py", line 179, in _enter_create
return self.creator()
File "/app/.heroku/python/lib/python3.7/site-packages/dogpile/cache/region.py", line 839, in gen_value
created_value = creator(*creator_args[0], **creator_args[1])
File "/app/server/views/explorer/apicache.py", line 165, in _cached_word2vec_google_2d
word2vec_results = wordembeddings.google_news_2d(words)
File "/app/server/util/wordembeddings.py", line 10, in google_news_2d
{'words[]': words})
File "/app/server/util/wordembeddings.py", line 28, in _query_for_json
response_json = response.json()
File "/app/.heroku/python/lib/python3.7/site-packages/requests/models.py", line 897, in json
return complexjson.loads(self.text, **kwargs)
File "/app/.heroku/python/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/app/.heroku/python/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/app/.heroku/python/lib/python3.7/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So it looks like a call to /api/v2/google-news/2d (served by dokku/word-embeddings:latest) fails. Relevant lines from its log:
[19:58:50][DEBUG] server.request request.py:wrapper:29 | ImmutableMultiDict([('words[]', u'\u0906\u0930\u094d\u0925\u093f\u0915'), ('words[]', u'\u0936\u0915\u094d\u0924\u093f'), ('words[]', u'\u092d\u093e\u0930\u0924'), ('words[]', u'\u0938\u0936\u0915\u094d\u0924\u093f\u0915\u0930\u0923'), ('words[]', u'\u0926\u0947\u0936'), ('words[]', u'\u0930\u0942\u092a'), ('words[]', u'\u0935\u094d\u092f\u093e\u092a\u093e\u0930'), ('words[]', u'\u092e\u0939\u093f\u0932\u093e\u0913\u0902'), ('words[]', u'\u091c\u093e\u090f\u0917\u093e'), ('words[]', u'\u0932\u0918\u0941'), ('words[]', u'\u092e\u0939\u093f\u0932\u093e'), ('words[]', u'\u0935\u093f\u0915\u093e\u0938'), ('words[]', u'\u0938\u0930\u0915\u093e\u0930'), ('words[]', u'\u0935\u094d\u092f\u0935\u0938\u093e\u092f'), ('words[]', u'\u092e\u094b\u0926\u0940'), ('words[]', u'\u0926\u0941\u0928\u093f\u092f\u093e'), ('words[]', u'\u092c\u0928\u093e\u0928\u0947'), ('words[]', u'\u092a\u093e\u0902\u091a'), ('words[]', u'\u0938\u093e\u092e\u093e\u091c\u093f\u0915'), ('words[]', u'\u0909\u0928\u094d\u0939\u094b\u0902\u0928\u0947'), ('words[]', u'\u092f\u094b\u091c\u0928\u093e'), ('words[]', u'\u0935\u093f\u0936\u094d\u0935'), ('words[]', u'\u092e\u0902\u0924\u094d\u0930\u0940'), ('words[]', u'\u0938\u092e\u093e\u091c'), ('words[]', u'\u092c\u0921\u093c\u0940'), ('words[]', u'\u091a\u0940\u0928'), ('words[]', u'\u0915\u093e\u0930\u094d\u092f'), ('words[]', u'\u0915\u094d\u0937\u0947\u0924\u094d\u0930'), ('words[]', u'\u0915\u093e\u0930\u094d\u092f\u0915\u094d\u0930\u092e'), ('words[]', u'\u0939\u092e\u093e\u0930\u0940'), ('words[]', u'\u0930\u094b\u091c\u0917\u093e\u0930'), ('words[]', u'\u0905\u0932\u094d\u092a\u0938\u0902\u0916\u094d\u092f\u0915'), ('words[]', u'\u0915\u0947\u0902\u0926\u094d\u0930'), ('words[]', u'\u0936\u093e\u092e\u093f\u0932'), ('words[]', u'\u0928\u0940\u0924\u093f'), ('words[]', u'\u0915\u093e\u092e'), ('words[]', u'\u092e\u091c\u092c\u0942\u0924'), ('words[]', 
u'\u0938\u0902\u0938\u094d\u0925\u093e\u0928\u094b\u0902'), ('words[]', u'\u0936\u093f\u0915\u094d\u0937\u093e'), ('words[]', u'\u0930\u0939\u0940'), ('words[]', u'\u0930\u093e\u091c\u094d\u092f'), ('words[]', u'\u0915\u093f\u0938\u093e\u0928\u094b\u0902'), ('words[]', u'\u0906\u0917\u0947'), ('words[]', u'\u0935\u094d\u092f\u093e\u092a\u093e\u0930\u093f\u092f\u094b\u0902'), ('words[]', u'\u092e\u093e\u0927\u094d\u092f\u092e'), ('words[]', u'\u0935\u0930\u094d\u0937\u094b\u0902'), ('words[]', u'\u0930\u093e\u091c\u0928\u0940\u0924\u093f\u0915'), ('words[]', u'\u092c\u0928'), ('words[]', u'\u092a\u093e\u0915\u093f\u0938\u094d\u0924\u093e\u0928'), ('words[]', u'\u0909\u092d\u0930\u0924\u0940'), ('words[]', u'\u0930\u0941\u092a\u092f\u0947'), ('words[]', u'\u092c\u0924\u093e\u092f\u093e'), ('words[]', u'\u091c\u094d\u092f\u093e\u0926\u093e'), ('words[]', u'\u0905\u092e\u0947\u0930\u093f\u0915\u093e'), ('words[]', u'\u0939\u094b\u0917\u093e'), ('words[]', u'\u0935\u0948\u0936\u094d\u0935\u093f\u0915'), ('words[]', u'\u092c\u093e\u0924'), ('words[]', u'\u0926\u0940'), ('words[]', u'\u0915\u0930\u094b\u0921\u093c'), ('words[]', u'\u0939\u092e'), ('words[]', u'\u0932\u093e\u0916'), ('words[]', u'\u0930\u093e\u0937\u094d\u091f\u094d\u0930'), ('words[]', u'\u092f\u0941\u0926\u094d\u0927'), ('words[]', u'\u092c\u0922\u093c\u0924\u0940'), ('words[]', u'\u092c\u0928\u0928\u0947'), ('words[]', u'\u0928\u0947\u0924\u0943\u0924\u094d\u0935'), ('words[]', u'\u0939\u0947\u0924\u0941'), ('words[]', u'\u0935\u093e\u0932\u0940'), ('words[]', u'\u0928\u0915\u0935\u0940'), ('words[]', u'\u0936\u0948\u0915\u094d\u0937\u0923\u093f\u0915'), ('words[]', u'\u0935\u0930\u094d\u0937'), ('words[]', u'\u092a\u094d\u0930\u0927\u093e\u0928\u092e\u0902\u0924\u094d\u0930\u0940'), ('words[]', u'\u0926\u0947\u0928\u0947'), ('words[]', u'\u0909\u0926\u094d\u092f\u092e\u093f\u092f\u094b\u0902'), ('words[]', u'\u092e\u093f\u0932'), ('words[]', 
u'\u0905\u0930\u094d\u0925\u0935\u094d\u092f\u0935\u0938\u094d\u0925\u093e'), ('words[]', u'\u0938\u093e\u092e\u093e\u091c\u093f\u0915-\u0906\u0930\u094d\u0925\u093f\u0915'), ('words[]', u'\u0932\u095c\u0915\u093f\u092f\u094b\u0902'), ('words[]', u'\u0924\u0939\u0924'), ('words[]', u'\u091a\u093e\u0939\u0924\u0947'), ('words[]', u'\u0905\u092d\u093f\u092f\u093e\u0928'), ('words[]', u'\u0905\u0917\u0932\u0947'), ('words[]', u'\u0938\u094d\u0925\u093e\u092a\u093f\u0924'), ('words[]', u'\u092c\u0928\u0947'), ('words[]', u'\u0915\u093e\u0930\u0923'), ('words[]', u'\u0939\u094b\u0917\u0940'), ('words[]', u'\u092e\u093e\u0928\u093e'), ('words[]', u'\u092d\u093e\u0930\u0924\u0940\u092f'), ('words[]', u'\u092e\u0926\u0926'), ('words[]', u'\u092c\u0948\u0920\u0915'), ('words[]', u'\u0928\u0930\u0947\u0902\u0926\u094d\u0930'), ('words[]', u'\u0926\u0947\u0916'), ('words[]', u'\u0924\u0940\u0928'), ('words[]', u'\u0938\u0902\u0917\u0920\u0928'), ('words[]', u'\u092a\u094d\u0930\u0926\u0947\u0936'), ('words[]', u'\u0905\u0927\u093f\u0915'), ('words[]', u'\u0935\u093f\u0926\u094d\u092f\u093e\u0930\u094d\u0925\u093f\u092f\u094b\u0902'), ('words[]', u'\u092e\u094c\u0915\u0947'), ('words[]', u'\u092c\u091a\u094d\u091a\u094b\u0902'), ('words[]', u'\u092a\u094d\u0930\u092e\u0941\u0916')])
[19:58:50][ERROR] flask.app app.py:log_exception:1761 | Exception on /api/v2/google-news/2d [POST]
Traceback (most recent call last):
File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 2292, in wsgi_app
response = self.full_dispatch_request()
File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 1718, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
rv = self.dispatch_request()
File "/app/.heroku/python/lib/python2.7/site-packages/flask/app.py", line 1799, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/server/request.py", line 31, in wrapper
return func(*args, **kwargs)
File "/app/server/views/api.py", line 23, in google_embeddings_2d
results = _embeddings_2d(word_vectors, words)
File "/app/server/views/api.py", line 83, in _embeddings_2d
words_with_model_info.append({'word': words_in_model[i]['word'], 'x': two_d_embeddings[i][0], 'y': two_d_embeddings[i][1]})
IndexError: list index out of range
References #1731.
These embedding models only work with English. The fix is to determine the language somehow and not show the embeddings widget at all for non-English queries.
We have a few choices in Python; lang-detect looks like the most straightforward:
https://stackoverflow.com/questions/43377265/determine-if-text-is-in-english/48436520#48436520
We use cld2 on the backend.
Guessing from the error itself (IndexError: list index out of range), the code probably assumes that it will always find something in that array at a particular index, which is not the case, so maybe a simple bounds check (or even a try/except block) would work here. Guessing languages for (potentially short) queries is just too non-deterministic.
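To make the bounds-check idea concrete, here is a minimal sketch. The function name embeddings_2d and the shapes of both arguments are assumptions inferred from the traceback line in server/views/api.py, not the service's actual code. Pairing the two lists with zip instead of indexing by position means a length mismatch (for example, no Hindi words found in the English model) yields fewer or zero results rather than an IndexError.

```python
def embeddings_2d(words_in_model, two_d_embeddings):
    """Pair each in-vocabulary word with its 2D coordinates.

    Hypothetical stand-in for the service's _embeddings_2d; the argument
    shapes are assumed from the traceback, not taken from the real code.
    """
    # zip stops at the shorter input, so mismatched or empty lists
    # produce a shorter (possibly empty) result instead of raising.
    return [
        {'word': entry['word'], 'x': coords[0], 'y': coords[1]}
        for entry, coords in zip(words_in_model, two_d_embeddings)
    ]
```

With this shape, a request containing only out-of-vocabulary words would return an empty list, which the caller can treat as "nothing to plot".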
Yes, good point. I just tested the queries using lang-detect and found that, because the logical operators in our queries (AND, OR) are in English, it is difficult to reliably detect that a query is not English.
We can avoid the call to apicache.word2vec_google_2d(words) in both Explorer and Topics by testing for empty word or term lists.
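A rough sketch of that guard on the web-tools side follows. The wrapper name safe_word2vec_2d and the fetch parameter are hypothetical; fetch stands in for apicache.word2vec_google_2d so the example runs without the real service. It also assumes a non-JSON reply surfaces as a ValueError, which holds for the json.decoder.JSONDecodeError seen in the traceback above since that class subclasses ValueError.

```python
def safe_word2vec_2d(words, fetch):
    """Return 2D embedding results, or [] when there is nothing usable.

    `fetch` is a hypothetical callable standing in for
    apicache.word2vec_google_2d(words).
    """
    # Skip the remote call entirely when there are no words to embed.
    if not words:
        return []
    try:
        return fetch(words)
    except ValueError:
        # The embeddings service replied with something that is not JSON
        # (e.g. an error page), so fall back to "no embeddings".
        return []
```

Returning an empty list in both cases lets the front end decide to hide the word space widget instead of the whole /api/explorer/words/count request failing.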