redis-rb

Encoding issue with data from Redis

Open mperham opened this issue 11 years ago • 2 comments

See https://github.com/mperham/sidekiq/issues/1597 for backstory.

The suspicious code IMO is in command_helper.rb. What is the purpose of force_encoding here? Is this a matter of data going into Redis in a bad format and throwing an error when it comes out, due to this force_encoding call?
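To make the question concrete, here is a minimal sketch of what `force_encoding` does in plain Ruby. It is not taken from command_helper.rb; the variable names and the Latin-1 sample bytes are assumptions chosen for illustration. `force_encoding` only changes the encoding label on a string and never touches the bytes, so bad data passes through silently and fails later, when something tries to transcode or otherwise interpret it.

```ruby
# Bytes for "Über" in ISO-8859-1 (Latin-1). The byte 0xDC is "Ü" there,
# but 0xDC followed by "b" is not a valid UTF-8 sequence.
latin1_bytes = "\xDCber".b

# force_encoding relabels the string; the bytes stay identical.
relabeled  = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
same_bytes = relabeled.bytes == latin1_bytes.bytes
valid      = relabeled.valid_encoding?

# Nothing has raised so far. The failure only shows up once the string
# is transcoded (or otherwise interpreted as UTF-8).
late_error = nil
begin
  relabeled.encode(Encoding::UTF_16LE)
rescue Encoding::InvalidByteSequenceError => e
  late_error = e
end
```

Here `same_bytes` is true and `valid` is false, which fits the pattern described above: data goes in without complaint and raises only on the way out.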

mperham, Mar 30 '14 23:03

We are having a similar problem.

What we are doing:

  • Store a document in Redis.
  • Crawl a page: if the document is already in Redis, take it from Redis; otherwise fetch the page.
  • Somewhere in between, the document is not converted properly.

Lang: en_US.UTF-8

Text once:

HämorrhoidenFrei - HämorrhoidenFrei - wirksames Mittel gegen Hämorrhoiden, gute Mittel gegen Hämorrhoiden, Hämorrhoiden Hausmittel und Hämorrhoiden behandeln, Haemorrhoiden, Hämoriden Behandlung, Haemoriden Behandlung, haemoriden, behandlung, hausmitt

Text again later:

HmorrhoidenFrei-HmorrhoidenFrei - wirksames Mittel gegen Hmorrhoiden, gute Mittel gegen Hmorrhoiden, Hmorrhoiden Hausmittel und Hmorrhoiden behandeln, Haemorrhoiden, Hmoriden Behandlung, Haemoriden Behandlung, haemoriden, behandlung, hausmitt

I think this happens when storing the document in Redis.

Also in the text I frequently see things like: \xDCber uns
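One plausible explanation for both symptoms, assuming the crawled page was served as ISO-8859-1 rather than UTF-8 (the thread does not confirm this), is that Latin-1 bytes were labelled as UTF-8 somewhere along the way. The sketch below reproduces the two effects in plain Ruby without touching Redis; all variable names are made up for illustration.

```ruby
# The sample words as an ISO-8859-1 page would deliver them (raw bytes).
title_bytes = "H\u00E4morrhoiden".encode(Encoding::ISO_8859_1).b
about_bytes = "\u00DCber uns".encode(Encoding::ISO_8859_1).b

# Stand-in for any layer that labels incoming bytes as UTF-8 without checking.
title_as_utf8 = title_bytes.dup.force_encoding(Encoding::UTF_8)
about_as_utf8 = about_bytes.dup.force_encoding(Encoding::UTF_8)

# A consumer that drops invalid bytes loses the umlaut entirely.
dropped = title_as_utf8.scrub("")

# A consumer that escapes invalid bytes shows them as \xNN.
escaped = about_as_utf8.inspect
```

With these inputs `dropped` is "Hmorrhoiden" and `escaped` contains `\xDCber uns`, which matches the damaged text quoted above.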

The code:

def set_document_and_raw_from_params_or_request!
  redis = Redis.new
  redis_doc = redis.get(url)
  if @document
    # A document was already supplied; reuse it as-is.
    @raw = @document.to_s
  # Check whether the document is currently in Redis.
  elsif redis_doc
  elsif redis_doc
    Rails.logger.info "Taking document from redis: #{url}"
    # Encoding issue seems to happen here
    @document = Nokogiri::HTML::Document.parse(redis_doc, nil, 'utf-8')
    @raw = @document.to_s
  else
    if @page
      raw = @page
    else
      # Some servers are blocking us. Set a timeout.
      Timeout.timeout(TIMEOUT_REQUESTS_AFTER) do
        request = open_with_default_values(url)
        raw = request
      end
    end
    @raw = raw.read
    @document = Nokogiri::HTML::Document.parse(@raw, nil, 'utf-8')
    persist_doc_in_redis(@raw)
  end
  @document && @raw
end
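A possible mitigation, sketched under assumptions: normalize the fetched body to valid UTF-8 before calling `persist_doc_in_redis`, so that whatever comes back from Redis can safely be labelled UTF-8. The helper name `normalize_to_utf8` and its `fallback:` parameter are hypothetical, and ISO-8859-1 is only a guess for the source encoding; where the HTTP response declares a charset, that should be used instead.

```ruby
# Hypothetical helper: return a valid UTF-8 copy of +raw+.
# +fallback+ is the encoding assumed when the bytes are not valid UTF-8.
def normalize_to_utf8(raw, fallback: Encoding::ISO_8859_1)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Reinterpret the original bytes in the fallback encoding, then transcode.
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

fixed     = normalize_to_utf8("\xDCber uns".b)    # Latin-1 input
unchanged = normalize_to_utf8("\u00DCber uns")    # already UTF-8
```

In the method above this would be applied to `@raw` before it is parsed and persisted. With a fallback other than ISO-8859-1 the final `encode` call can itself raise if the bytes are not valid in that encoding.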

We have temporarily commented out the Redis storage feature in our code; hopefully this helps. Does anyone have an idea?

hendricius, Apr 01 '14 10:04

For reference, here's the original commit that added the referenced behaviour: https://github.com/redis/redis-rb/commit/61fa1f884a643cd7dea8e0e56498860594058a39

yaauie, Jun 03 '14 22:06

This should be solved in 5.0 (to be released soon). redis-client assumes UTF-8 first but checks for validity, and returns Encoding::BINARY strings otherwise.
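The behaviour described here can be approximated in a few lines of plain Ruby. This is only a sketch of the idea, not redis-client's actual code, and `tag_reply` is a hypothetical name.

```ruby
# Hypothetical sketch: label a reply as UTF-8 when the bytes are valid UTF-8,
# otherwise hand it back as a binary (ASCII-8BIT) string. Bytes are never altered.
def tag_reply(bytes)
  utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  utf8.valid_encoding? ? utf8 : utf8.force_encoding(Encoding::BINARY)
end

text_reply   = tag_reply("\u00DCber uns".b)  # valid UTF-8 bytes
binary_reply = tag_reply("\xDCber uns".b)    # Latin-1 bytes, invalid as UTF-8
```

Under this scheme a caller never receives a string labelled UTF-8 that is not valid UTF-8. Non-UTF-8 data arrives as binary, and the caller decides how to decode it.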

byroot, Aug 17 '22 18:08