xmpp4r throws the REXML::ParseException on Cyrillic characters

Open dustalov opened this issue 15 years ago • 22 comments

When the simplest Jabber client (like the one at http://dumpz.org/9810/) receives a UTF-8 string that contains Cyrillic (e.g. Russian) characters, xmpp4r fails with REXML::ParseException and stops working.

Backtrace here: http://dumpz.org/9806/

Thanks.

dustalov avatar Jun 16 '09 20:06 dustalov

The dumpz links don't work anymore

lnussbaum avatar Jun 18 '09 09:06 lnussbaum

Erm, could you try to write a simple test case for that bug, based on one of the existing test cases? According to the exception, REXML receives ASCII-8BIT characters, not UTF-8 characters.

lnussbaum avatar Jun 19 '09 15:06 lnussbaum

Same situation. http://dumpz.org/17011/ ruby 1.9.1p378 (2010-01-10 revision 26273) [x86_64-linux] xmpp4r-0.5 Tested for [email protected]

kidoz avatar Feb 15 '10 18:02 kidoz

Yes. I would need a test case.

lnussbaum avatar Mar 22 '10 13:03 lnussbaum

http://megabytov.net/293 - test case.

vicpo avatar May 23 '10 13:05 vicpo

That could be a REXML problem. The current version of ejabberd opens an XML stream whose XML header has no encoding attribute, so REXML fails to detect the encoding and baseparser sets source.encoding to nil, which falls back to ASCII-8BIT. That's wrong behaviour according to http://www.opentag.com/xfaq_enc.htm (the XML specification treats a missing encoding declaration as UTF-8, absent a byte order mark), and I'm also unsure about the force_utf8 detection algorithm. An immediate fix for me was to patch REXML's:

  • source.rb: http://pastie.org/1458174
  • parsers/baseparser.rb: http://pastie.org/1454110

Although nothing related seemed to break for now, I only used it in the development environment, not production. Regards
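
A minimal sketch of the fallback described above, assuming a hypothetical helper named `declared_encoding` (it is not part of REXML or xmpp4r, and it is not the content of the linked pastes). It only looks at the XML declaration and ignores byte order marks; the point is that a missing encoding attribute yields UTF-8 rather than nil or ASCII-8BIT.

```ruby
# Hypothetical helper for illustration only; REXML does not expose this.
# Captures the value of the encoding attribute in an XML declaration.
XML_DECL_ENCODING = /\A<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/

def declared_encoding(header)
  name = header[XML_DECL_ENCODING, 1]
  # No encoding attribute: fall back to UTF-8 instead of nil/ASCII-8BIT.
  # Encoding.find raises ArgumentError for names Ruby does not know.
  name ? Encoding.find(name) : Encoding::UTF_8
end

# A header without an encoding attribute, as described for ejabberd above.
without_attr = declared_encoding("<?xml version='1.0'?>")

# A header that declares its encoding explicitly.
with_attr = declared_encoding(%q(<?xml version="1.0" encoding="ISO-8859-1"?>))

puts without_attr.name
puts with_attr.name
```

The result could then be applied to incoming data with `String#force_encoding`, which retags the bytes without converting them.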

dotdoom avatar Jan 12 '11 21:01 dotdoom

I'm having this same issue. Is there any support for swapping out the xml parsing backend for something like nokogiri?

ajsharp avatar Aug 05 '11 23:08 ajsharp

I don't know. The last thing I came up with was

if RUBY_VERSION < "1.9"
# ...
else
    # Encoding patch
    require 'socket'
    class TCPSocket
        def external_encoding
            Encoding::BINARY
        end
    end

    require 'rexml/source'
    class REXML::IOSource
        alias_method :encoding_assign, :encoding=
        def encoding=(value)
            encoding_assign(value) if value
        end
    end

    begin
        # OpenSSL is optional and can be missing
        require 'openssl'
        class OpenSSL::SSL::SSLSocket
            def external_encoding
                Encoding::BINARY
            end
        end
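    rescue LoadError
        # A bare rescue would not catch LoadError; skip the SSLSocket patch if openssl is missing
    end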
end

monkey patch. But then more and more problems appeared (not related to encoding but nvl), so I decided to rewrite the whole thing.
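
The `encoding=` guard in the patch above can be sketched in isolation, assuming a hypothetical stand-in class `FakeSource` (not REXML's real `IOSource`): the original setter is aliased, and the replacement forwards only non-nil values, so a failed detection cannot wipe out an encoding that was already set.

```ruby
# Hypothetical stand-in for a parser source; not REXML's real class.
class FakeSource
  attr_reader :encoding

  def initialize
    @encoding = Encoding::UTF_8
  end

  # Original setter: stores whatever it is given, including nil.
  def encoding=(value)
    @encoding = value
  end
end

# Reopen the class, as the patch does with REXML::IOSource.
class FakeSource
  alias_method :encoding_assign, :encoding=

  # Forward to the original setter only when a real value is supplied.
  def encoding=(value)
    encoding_assign(value) if value
  end
end

source = FakeSource.new
source.encoding = nil                      # ignored by the guard
kept = source.encoding                     # still Encoding::UTF_8
source.encoding = Encoding::ISO_8859_1     # real values pass through
changed = source.encoding
```

This is the usual alias-and-wrap form of a monkey patch; it depends on the wrapped method keeping its name and signature in later library versions.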

dotdoom avatar Aug 05 '11 23:08 dotdoom

Which part did you rewrite? Is it open source?

ajsharp avatar Aug 05 '11 23:08 ajsharp

I've just recently started and have not much to commit yet. Will create the repository later.

dotdoom avatar Aug 05 '11 23:08 dotdoom

Hi, this seems related. Is there some workaround?

#<Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)>
/usr/lib/ruby/1.9.1/rexml/source.rb:212:in 'match'
/usr/lib/ruby/1.9.1/rexml/source.rb:212:in 'match'
/usr/lib/ruby/1.9.1/rexml/parsers/baseparser.rb:425:in 'pull'
/usr/lib/ruby/1.9.1/rexml/parsers/sax2parser.rb:92:in 'parse'
/home/brian/.gem/gems/xmpp4r-0.5/lib/xmpp4r/streamparser.rb:79:in 'parse'
/home/brian/.gem/gems/xmpp4r-0.5/lib/xmpp4r/stream.rb:75:in 'block in start'
...
Exception parsing
Line: 
Position: 0
Last 80 unconsumed characters:
范</body>
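
The error in this trace can be reproduced without xmpp4r or REXML. The sketch below assumes only core Ruby (1.9 or later); the names `raw`, `pattern`, and `match_error` are illustrative. A regexp containing non-ASCII characters is fixed to UTF-8, and matching it against bytes tagged ASCII-8BIT raises Encoding::CompatibilityError, while the same bytes retagged as UTF-8 match normally.

```ruby
# Bytes as they might arrive from a socket: valid UTF-8, but tagged as binary.
raw = "\u8303</body>".dup.force_encoding(Encoding::BINARY)

# A regexp built from non-ASCII text is fixed to UTF-8.
pattern = Regexp.new("\u8303")

# Returns the raised error, or nil if the match ran without an encoding error.
def match_error(pattern, string)
  pattern.match(string)
  nil
rescue Encoding::CompatibilityError => e
  e
end

error_before = match_error(pattern, raw)

# Retag the same bytes as UTF-8; force_encoding does not convert anything.
fixed = raw.dup.force_encoding(Encoding::UTF_8)
error_after = match_error(pattern, fixed)

puts error_before.class
puts error_after.inspect
```

Returning Encoding::BINARY from `external_encoding` on the socket, as the patch earlier in this thread does, is a different lever on the same mismatch: it controls how the incoming bytes are tagged before REXML sees them.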

bsl avatar Sep 17 '11 03:09 bsl

@bsl not to my knowledge, unfortunately. I think what we need at this point is an actively maintained fork of the xmpp4r library.

ajsharp avatar Sep 19 '11 17:09 ajsharp

btw, didn't the patch from https://github.com/ln/xmpp4r/issues/3#issuecomment-1739952 help?

dotdoom avatar Sep 19 '11 18:09 dotdoom

@dotdoom It does seem to fix it. Thank you!

bsl avatar Sep 19 '11 21:09 bsl

@dotdoom, works great for me. Thanks for the patch.

vosechu avatar Oct 03 '11 15:10 vosechu

@dotdoom it works, thanks. It only needs to be merged in. Is this project dead?

gutenye avatar Feb 29 '12 13:02 gutenye

The latest update was more than one year ago, so yes, it looks like it's dead... Maybe one of the forks is in better shape.

romanbsd avatar Mar 10 '12 19:03 romanbsd

This recently bit me, so I started working on a fork with better 1.9 compatibility at https://github.com/hoxworth/xmpp4r. Right now I simply replaced the stream parser with a Nokogiri SAX parser, which handles character encodings far better. I'd like to pull out as much REXML as possible, but XMPP4R was pretty tightly coupled to REXML.

hoxworth avatar Dec 14 '12 18:12 hoxworth

@hoxworth A while back I also started working on modernizing xmpp4r, https://github.com/whitehat101/xmpp4r, I merged most of the pulls and hacked some. I've been using that branch in "production" for months, and the only unresolved issue I'm bothered by is the utf-8 crashes. I'll check your stuff out, when I get a chance, and you might want to see mine.

I missed the monkeypatch in this issue b/c I only looked at pulls.

whitehat101 avatar Jan 20 '13 01:01 whitehat101

Awesome, @whitehat101, I'll definitely take a look. I didn't really like the monkey patch myself, and it only worked for half of my use cases. We've been using my Nokogiri patch for a while now with numerous UTF-8 XMPP sources, and haven't had a crash since.

hoxworth avatar Jan 20 '13 01:01 hoxworth

Wow, @dotdoom, your code saved my life, thanks! :+1:

csfmeridian avatar Feb 01 '13 09:02 csfmeridian

@dotdoom oh!!! Good!!! you are so generous!! oh you save me and my money. Thank you so much!

sang2087 avatar Nov 13 '13 19:11 sang2087