elyzer
Unicode support
Great work! It would be nice if this supported Unicode text input (for example, café). Passing non-ASCII text currently fails with the error below:
➜ elyzer git:(master) python __main__.py --es "http://localhost:9200" --index my_index --analyzer my_analyzer "café"
TOKENIZER: kuromoji_tokenizer
Traceback (most recent call last):
  File "__main__.py", line 47, in <module>
    main()
  File "__main__.py", line 36, in main
    es=es))
  File "/Users/toiwa/Projects/Private/elyzer/elyzer/elyzer.py", line 72, in stepWise
    analyzeResp = es.indices.analyze(index=indexName, body=body)
  File "/Library/Python/2.7/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Library/Python/2.7/site-packages/elasticsearch/client/indices.py", line 32, in analyze
    '_analyze'), params=params, body=body)
  File "/Library/Python/2.7/site-packages/elasticsearch/transport.py", line 284, in perform_request
    body = self.serializer.dumps(body)
  File "/Library/Python/2.7/site-packages/elasticsearch/serializer.py", line 50, in dumps
    raise SerializationError(data, e)
elasticsearch.exceptions.SerializationError: ({'text': 'caf\xc3\xa9', 'char_filter': [], 'tokenizer': u'kuromoji_tokenizer'}, UnicodeDecodeError('ascii', '"caf\xc3\xa9"', 4, 5, 'ordinal not in range(128)'))
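
Judging from the traceback, the 'text' value reaches the serializer as a UTF-8 byte string ('caf\xc3\xa9'), while the tokenizer name is already a unicode string, and the JSON step then tries to decode those bytes as ASCII. A possible workaround is to decode the command-line text before building the request body. The sketch below is only an illustration: to_text is a hypothetical helper, not part of elyzer, and it assumes the terminal passes arguments encoded as UTF-8.

```python
import json


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding byte strings if needed.

    Hypothetical helper (not part of elyzer). Assumes the bytes are
    encoded with `encoding`, UTF-8 by default.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# The same bytes that appear in the traceback for "café".
raw = b"caf\xc3\xa9"

# Build a request body shaped like the one in the error message,
# but with the text decoded before it is serialized.
body = {
    "text": to_text(raw),
    "char_filter": [],
    "tokenizer": "kuromoji_tokenizer",
}

# With a decoded text value, JSON serialization no longer has to guess
# an encoding for the non-ASCII bytes.
payload = json.dumps(body)
```

On Python 2.7, where command-line arguments arrive as byte strings, the decode step would presumably go wherever the text argument is read, before it is handed to es.indices.analyze. On Python 3 the argument is already text, so the helper returns it unchanged.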