elasticsearch-py
helpers.bulk: UnicodeEncodeError: 'utf-8' codec can't encode character '\udab4' in position x: surrogates not allowed
Elasticsearch version (bin/elasticsearch --version): N/A
elasticsearch-py version (elasticsearch.__versionstr__): 7.11.0
Description of the problem including expected versus actual behavior:
Using elasticsearch.helpers.bulk() produces the following backtrace:
Traceback (most recent call last):
  File "/usr/src/app/./test.py", line 55, in <module>
    elasticsearch.helpers.bulk(es, actions)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch-7.11.0-py3.9.egg/elasticsearch/helpers/actions.py", line 399, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/lib/python3.9/site-packages/elasticsearch-7.11.0-py3.9.egg/elasticsearch/helpers/actions.py", line 310, in streaming_bulk
    for bulk_data, bulk_actions in _chunk_actions(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch-7.11.0-py3.9.egg/elasticsearch/helpers/actions.py", line 159, in _chunk_actions
    ret = chunker.feed(action, data)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch-7.11.0-py3.9.egg/elasticsearch/helpers/actions.py", line 120, in feed
    cur_size += len(data.encode("utf-8")) + 1
UnicodeEncodeError: 'utf-8' codec can't encode character '\udab4' in position 63016: surrogates not allowed
This seems to be caused by the input data in a pipeline. Unfortunately, I cannot share the exact data here. However, similar issues seem to have been fixed in the past for single-document indexing:
#611 https://github.com/elastic/elasticsearch-py-async/issues/62
This commit fixes it for me: https://github.com/mfl-q/elasticsearch-py/commit/ac06d77b7d11af823d4caf7af237192bb716e4e6. However, I assume additional test cases are required, and I am not sure whether other places need fixing as well.
Steps to reproduce:
Use elasticsearch.helpers.bulk(es, actions) where actions contains a document whose data includes a lone surrogate, for example '\udab4'.
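Since the original data cannot be shared, here is a minimal sketch that should trigger the same exception without a running cluster. It only mimics the size calculation from the last frame of the traceback. The document contents are made up, and serializing with ensure_ascii=False is an assumption about how the client serializes documents (with the default ensure_ascii=True, json.dumps would escape the surrogate and the encode would succeed).

```python
import json

# Hypothetical document; "\udab4" is a lone (unpaired) surrogate code point.
doc = {"title": "broken \udab4 text"}

# Assumption: the document is serialized without ASCII-escaping,
# so the surrogate stays in the resulting str as-is.
data = json.dumps(doc, ensure_ascii=False)

try:
    # Same expression as the failing line in the traceback above.
    size = len(data.encode("utf-8")) + 1
    error = None
except UnicodeEncodeError as exc:
    size = None
    error = exc

# Show which exception (if any) was raised and why.
print(type(error).__name__, error.reason if error else None)
```

If this fails in the same way as the traceback, the problem is in the strict UTF-8 encoding step rather than in anything specific to the bulk request itself.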
The suggested solution (passing "surrogatepass" as the second argument to encode()) is not good for len() checking; I would recommend "replace".
a = "\udab4"
surrogatepass = a.encode('utf8', "surrogatepass")
replace = a.encode('utf8', "replace")
print(surrogatepass, len(surrogatepass))
print(replace, len(replace))
--
b'\xed\xaa\xb4' 3
b'?' 1
I am not sure I can follow your argument here. Isn't that the expected behavior? That is, len(bytesobject) will always return the number of bytes in the object:
In [3]: len("ü".encode("utf-8"))
Out[3]: 2
So this would be bad and should have returned 1?
By the way, it is consistent with the fix here: https://github.com/elastic/elasticsearch-py/pull/612/files/41fb14550c9732dc3708c82e74d5658c78c35d6e
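To make the comparison concrete, here is a rough sketch of a size calculation with a configurable error handler. encoded_size is a hypothetical helper, not part of elasticsearch-py, and treating the extra byte as a newline separator is an assumption based on the "+ 1" in the traceback. Which handler gives the "right" size depends on how the bytes that actually go over the wire are encoded, which this sketch does not check.

```python
def encoded_size(data: str, errors: str = "surrogatepass") -> int:
    """Return the UTF-8 byte length of data plus one byte.

    The extra byte mirrors the "+ 1" in the traceback (assumed to be
    the newline separator). Hypothetical helper for illustration only.
    """
    return len(data.encode("utf-8", errors)) + 1


# Compare both handlers on a normal character and on a lone surrogate.
for text in ("ü", "\udab4"):
    print(
        repr(text),
        encoded_size(text),                    # surrogatepass
        encoded_size(text, errors="replace"),  # replace
    )
```

For ordinary text both handlers give the same number; they only differ for lone surrogates, where surrogatepass counts three bytes and replace counts one.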
Hi, I have the same bug. Is there any progress?