redis-rb
Encoding issue with data from Redis
See https://github.com/mperham/sidekiq/issues/1597 for backstory.
The suspicious code IMO is in command_helper.rb. What is the purpose of force_encoding here? Is this a matter of data going into Redis in a bad format and throwing an error when it comes out, due to this force_encoding call?
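For context, here is a small sketch of what `force_encoding` does in plain Ruby, independent of redis-rb. It only relabels the bytes and never converts them, so non-UTF-8 bytes that went into Redis come back tagged as UTF-8 but invalid. The sample string and the assumption that the data was ISO-8859-1 are illustrative, not taken from command_helper.rb.

```ruby
# "ä" in ISO-8859-1 is the single byte 0xE4 (sample data, assumed encoding).
latin1 = "H\xE4morrhoiden".b # binary (ASCII-8BIT) copy of the raw bytes

# force_encoding changes the encoding label only; the bytes are untouched.
relabeled = latin1.dup.force_encoding(Encoding::UTF_8)
same_bytes = relabeled.bytes == latin1.bytes # true
valid = relabeled.valid_encoding?            # false: a lone 0xE4 is not valid UTF-8

# Transcoding rewrites the bytes, which is what a real conversion needs.
converted = latin1.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
```

After the last line, `converted` is a valid UTF-8 string with the umlaut intact, whereas `relabeled` will raise or misbehave in operations that require valid UTF-8.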
We are having a similar problem.
What we are doing:
- Store a document in Redis.
- When crawling a page, take the document from Redis if it is already there; otherwise fetch the page.
- Somewhere in between, the document is not properly converted.
Lang: en_US.UTF-8
Text once:
HämorrhoidenFrei - HämorrhoidenFrei - wirksames Mittel gegen Hämorrhoiden, gute Mittel gegen Hämorrhoiden, Hämorrhoiden Hausmittel und Hämorrhoiden behandeln, Haemorrhoiden, Hämoriden Behandlung, Haemoriden Behandlung, haemoriden, behandlung, hausmitt
Text again later:
HmorrhoidenFrei-HmorrhoidenFrei - wirksames Mittel gegen Hmorrhoiden, gute Mittel gegen Hmorrhoiden, Hmorrhoiden Hausmittel und Hmorrhoiden behandeln, Haemorrhoiden, Hmoriden Behandlung, Haemoriden Behandlung, haemoriden, behandlung, hausmitt
I think this happens when storing the document in Redis.
Also in the text I frequently see things like: \xDCber uns
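Both symptoms would be consistent with ISO-8859-1 bytes being treated as UTF-8, though that is an assumption about the crawled page rather than something confirmed in this thread. In the sketch below, the `xDC` fragment is read as an escaped `0xDC` byte, which is "Ü" in ISO-8859-1, and dropping invalid bytes reproduces the missing umlauts.

```ruby
# Assumption: the page was served as ISO-8859-1, where "Ü" is the byte 0xDC.
raw = "\xDCber uns".b

# Labeled as UTF-8, the byte is invalid and shows up as an escape when inspected.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
shown = as_utf8.inspect

# Removing invalid bytes silently drops the umlaut ("Hämorrhoiden" -> "Hmorrhoiden").
scrubbed = as_utf8.scrub("")

# Decoding with the assumed source encoding keeps it.
decoded = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
```

Here `scrubbed` is "ber uns" and `decoded` is "Über uns".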
The code:
def set_document_and_raw_from_params_or_request!
  redis = Redis.new
  redis_doc = redis.get(url)
  if @document
    @document = @document
    @raw = @document.to_s
  # check if the document is currently in redis
  elsif redis_doc
    Rails.logger.info "Taking document from redis: #{url}"
    # Encoding issue seems to happen here
    @document = Nokogiri::HTML::Document.parse(redis_doc, nil, 'utf-8')
    @raw = @document.to_s
  else
    if @page
      raw = @page
    else
      # Some servers are blocking us. Set a timeout.
      timeout(TIMEOUT_REQUESTS_AFTER) do
        request = open_with_default_values(url)
        raw = request
      end
    end
    @raw = raw.read
    @document = Nokogiri::HTML::Document.parse(@raw, nil, 'utf-8')
    persist_doc_in_redis(@raw)
  end
  @document && @raw
end
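One possible workaround is to normalize the fetched bytes to valid UTF-8 before they are persisted, so that whatever comes back from Redis is already valid UTF-8. The helper below is hypothetical: `normalize_to_utf8` and its `fallback` parameter are not part of the code above, and ISO-8859-1 as the fallback is only a guess at what the crawled pages use.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of the fetched bytes.
# `fallback` is an assumed source encoding for data that is not valid UTF-8.
def normalize_to_utf8(raw, fallback: Encoding::ISO_8859_1)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: reinterpret with the fallback encoding and transcode.
  raw.dup.force_encoding(fallback)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```

In the method above this would be used as `persist_doc_in_redis(normalize_to_utf8(@raw))`. A more robust version would take the source encoding from the HTTP `Content-Type` header or the page's meta tag instead of a fixed fallback.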
For now we have temporarily commented out the Redis storage feature in our code. Hopefully this helps. Does anyone have an idea?
For reference, here's the original commit that added the referenced behaviour: https://github.com/redis/redis-rb/commit/61fa1f884a643cd7dea8e0e56498860594058a39
This should be solved in 5.0 (to be released soon). redis-client assumes UTF-8 first, but checks for validity and returns Encoding::BINARY strings otherwise.
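The behaviour described can be sketched in a few lines of plain Ruby. This is an illustration of the idea only, not redis-client's actual implementation, and `utf8_or_binary` is a made-up name.

```ruby
# Illustrative only: tag the reply as UTF-8 when the bytes are valid UTF-8,
# otherwise hand it back as a binary (ASCII-8BIT) string.
def utf8_or_binary(bytes)
  str = bytes.dup.force_encoding(Encoding::UTF_8)
  str.valid_encoding? ? str : str.force_encoding(Encoding::BINARY)
end

ok = utf8_or_binary("\xC3\xA4".b)  # valid UTF-8 bytes for "ä"
bad = utf8_or_binary("\xE4".b)     # lone ISO-8859-1 byte, invalid as UTF-8
```

With this approach the caller never receives a string that claims to be UTF-8 but is not; `bad.encoding` is `Encoding::BINARY`, which signals that the application has to decide how to decode it.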