elasticsearch-py helpers.bulk: UnicodeEncodeError: 'utf-8' codec can't encode character '\udab4' in position x: surrogates not allowed

Open mfl-q opened this issue 3 years ago • 3 comments

Elasticsearch version (bin/elasticsearch --version): N/A

elasticsearch-py version (elasticsearch.__versionstr__): 7.11.0

Description of the problem including expected versus actual behavior:

Using elasticsearch.helpers.bulk() produces the following backtrace:

Traceback (most recent call last):
  File "/usr/src/app/./test.py", line 55, in <module>
    elasticsearch.helpers.bulk(es, actions)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch-7.11.0-py3.9.egg/elasticsearch/helpers/actions.py", line 399, in bulk
    for ok, item in streaming_bulk(client, actions, *args, **kwargs):
  File "/usr/local/lib/python3.9/site-packages/elasticsearch-7.11.0-py3.9.egg/elasticsearch/helpers/actions.py", line 310, in streaming_bulk
    for bulk_data, bulk_actions in _chunk_actions(
  File "/usr/local/lib/python3.9/site-packages/elasticsearch-7.11.0-py3.9.egg/elasticsearch/helpers/actions.py", line 159, in _chunk_actions
    ret = chunker.feed(action, data)
  File "/usr/local/lib/python3.9/site-packages/elasticsearch-7.11.0-py3.9.egg/elasticsearch/helpers/actions.py", line 120, in feed
    cur_size += len(data.encode("utf-8")) + 1
UnicodeEncodeError: 'utf-8' codec can't encode character '\udab4' in position 63016: surrogates not allowed

This appears to be caused by the input data in a pipeline. Unfortunately, I cannot share the exact data here. However, similar issues seem to have been fixed in the past for single-document indexing:

#611 https://github.com/elastic/elasticsearch-py-async/issues/62

This commit fixes it for me: https://github.com/mfl-q/elasticsearch-py/commit/ac06d77b7d11af823d4caf7af237192bb716e4e6. However, I assume additional test cases are required, and I am not sure whether other places need fixing as well.

Steps to reproduce:

Call elasticsearch.helpers.bulk(es, actions) where one of the documents in actions contains a lone surrogate character, for example '\udab4'.
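The failure can likely be reproduced without a running cluster. The sketch below is an illustration, not the library's code: the document and the encoded_size helper are hypothetical, and it assumes the action is serialized with non-ASCII characters left unescaped (ensure_ascii=False), so that the lone surrogate reaches the encode call seen in the traceback.

```python
import json

# Hypothetical document; the lone surrogate '\udab4' stands in for the bad input data.
doc = {"title": "broken \udab4 text"}

# Assumption: the serializer keeps non-ASCII characters unescaped,
# so the surrogate survives as a raw character in the JSON string.
data = json.dumps(doc, ensure_ascii=False)


def encoded_size(data):
    # Same expression as the failing line in the traceback (strict UTF-8 encoding).
    return len(data.encode("utf-8")) + 1


try:
    encoded_size(data)
    raised = False
    reason = None
except UnicodeEncodeError as exc:
    raised = True
    reason = exc.reason

print(raised, reason)
```

If the serialized string contained the escaped form \\udab4 instead (ensure_ascii=True), the encode call would succeed, which is why the serialization setting matters for the reproduction.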

Mar 10 '21 19:03 mfl-q

The suggested solution (passing "surrogatepass" as the second argument to encode()) is not good for len() checking; I would recommend "replace" instead.

a = "\udab4"

surrogatepass = a.encode('utf8', "surrogatepass")
replace = a.encode('utf8', "replace")

print(surrogatepass, len(surrogatepass))
print(replace, len(replace))
-- 
b'\xed\xaa\xb4' 3
b'?' 1

Mar 30 '21 10:03 Sparkycz

I am not sure I follow your argument here. Isn't that the expected behavior? len() on a bytes object always returns the number of bytes it holds.

In [3]: len("ü".encode("utf-8"))
Out[3]: 2

So this would be bad and should have returned 1?

By the way, it is consistent with the fix here: https://github.com/elastic/elasticsearch-py/pull/612/files/41fb14550c9732dc3708c82e74d5658c78c35d6e
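To make the comparison concrete, here is a minimal sketch of a size helper with a selectable error handler. The name encoded_size and its errors parameter are hypothetical and not part of elasticsearch-py; the + 1 mirrors the expression in the traceback. The size check is presumably only meaningful if it uses the same error handler as the code that later encodes the request body, which is an assumption about the rest of the client.

```python
def encoded_size(data, errors="surrogatepass"):
    """Return the UTF-8 byte length of data plus one, as in the traceback line."""
    return len(data.encode("utf-8", errors)) + 1


sizes = {
    # Ordinary two-byte character: 2 bytes + 1
    "u-umlaut": encoded_size("ü"),
    # Lone surrogate kept as a three-byte sequence: 3 bytes + 1
    "surrogatepass": encoded_size("\udab4"),
    # Lone surrogate replaced by b'?': 1 byte + 1
    "replace": encoded_size("\udab4", errors="replace"),
}

for name, size in sizes.items():
    print(name, size)
```

With "surrogatepass" the counted length matches the bytes that are kept; with "replace" the count is smaller because the character itself is swapped for a single "?", so the indexed data changes as well.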

Jul 08 '21 10:07 mfl-q

Hi, I have the same bug. Is there any progress?

Sep 17 '21 10:09 nlyf
