redis-py
byte vector is incorrectly decoded as utf-8 string in ft result class
Version:
$ pip3 show redis
Name: redis
Version: 4.3.4
Platform: Python 3.9.2 on Debian GNU/Linux 11
Description: Bytes values in vector search results are converted to strings, and this conversion is lossy. Bytes that are not valid UTF-8, such as b'\x80', are converted to a wrong string.
Example Code
from redis import Redis
from redis.commands.search.field import VectorField
from redis.commands.search.query import Query
r = Redis(host='localhost',port=6379)
schema = (VectorField("v", "HNSW", {"TYPE": "FLOAT32", "DIM": 1, "DISTANCE_METRIC": "L2"}),)
r.ft().create_index(schema)
r.hset(f'{1}',mapping={'v':b'\x80\x00\x00\x00'})
q = Query("*=>[KNN 1 @v $vec AS vector_score]").dialect(2)
results = r.ft().search(q, query_params={"vec": b'\x80\x00\x00\x00'}).docs
for m in results:
    print(m.v)
    print('match emb =', bytes(m.v,'utf-8'))
The original bytes b'\x80\x00\x00\x00' are converted to the string '\x00\x00\x00'.
Reason
# /redis/commands/search/result.py
dict(
    dict(
        zip(
            map(to_string, res[i + fields_offset][::2]),
            map(to_string, res[i + fields_offset][1::2]),
        )
    )
)
# /redis/commands/search/_util.py
def to_string(s):
    if isinstance(s, str):
        return s
    elif isinstance(s, bytes):
        return s.decode("utf-8", "ignore")  # here!
    else:
        return s
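The lossy step can be reproduced without a Redis server. The following is a minimal sketch that copies the `to_string` helper quoted above and applies it to the vector from the example; the variable names `original`, `decoded`, and `roundtrip` are only for illustration.

```python
def to_string(s):
    # Copy of the helper quoted above from redis/commands/search/_util.py.
    if isinstance(s, str):
        return s
    elif isinstance(s, bytes):
        return s.decode("utf-8", "ignore")
    else:
        return s


original = b"\x80\x00\x00\x00"
decoded = to_string(original)

# 0x80 is not a valid UTF-8 start byte, so the "ignore" handler drops it.
print(repr(decoded))  # '\x00\x00\x00'

# Re-encoding cannot bring the dropped byte back: only 3 of 4 bytes remain.
roundtrip = decoded.encode("utf-8")
print(len(original), len(roundtrip))  # 4 3
```

Because one byte is gone, the result is no longer a valid 4-byte FLOAT32 value.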
@AnneYang720 did you find a workaround?
What about using "backslashreplace" mode instead of "ignore"?
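As a rough check of that idea, the sketch below decodes the example vector with several of Python's standard error handlers and tests whether the original bytes can be recovered by encoding again with the same handler. It only uses the built-in codec machinery and says nothing about what redis-py should adopt; `surrogateescape` is included purely for comparison and is not something proposed in this thread.

```python
original = b"\x80\x00\x00\x00"

recoverable = {}
for errors in ("ignore", "backslashreplace", "surrogateescape"):
    text = original.decode("utf-8", errors)
    # Encode with the same handler and compare against the input bytes.
    recovered = text.encode("utf-8", errors)
    recoverable[errors] = recovered == original
    print(errors, len(text), len(recovered), recoverable[errors])
```

With `backslashreplace` the invalid byte becomes the four literal characters `\x80`, so the string is longer than the input and encoding it does not return the original bytes. `surrogateescape` does round-trip, but the resulting string contains a lone surrogate that cannot be encoded with the default strict handler.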
@kamyabzad I think in this case we should get the original bytes back as the result, rather than attempt any kind of Unicode decoding, since the user may need to convert the value back to a numpy array or float array.
I don't see a good solution or workaround in the current search result parsing code, though; maybe we need some ideas from the maintainers.
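To make the "return the original bytes" idea concrete, here is a minimal sketch of one possible direction. `to_string_or_bytes` is a hypothetical helper, not part of redis-py, and `raw_fields` is made-up sample data shaped like the flat field/value list that result.py zips together.

```python
import struct


def to_string_or_bytes(s):
    """Hypothetical alternative to to_string: keep bytes that are not valid UTF-8."""
    if isinstance(s, bytes):
        try:
            return s.decode("utf-8")  # strict decoding, no error handler
        except UnicodeDecodeError:
            return s  # hand the raw bytes back unchanged
    return s


# Made-up sample: [field, value, field, value, ...]
raw_fields = [b"vector_score", b"0", b"v", b"\x80\x00\x00\x00"]

fields = dict(
    zip(
        map(to_string_or_bytes, raw_fields[::2]),
        map(to_string_or_bytes, raw_fields[1::2]),
    )
)

# The vector survives intact and can be unpacked as one little-endian float32.
(value,) = struct.unpack("<f", fields["v"])
print(type(fields["vector_score"]).__name__, type(fields["v"]).__name__)
```

This is only a heuristic: a vector whose bytes happen to be valid UTF-8 would still come back as a `str`, although in that case strict decoding loses nothing and `.encode("utf-8")` restores the bytes. A complete fix would probably need to know which fields are binary, which is a decision for the maintainers.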