SleekXMPP icon indicating copy to clipboard operation
SleekXMPP copied to clipboard

fix for malformed xml from Google and other sinners

Open jfarrimo opened this issue 12 years ago • 13 comments

I've run into this bug:

https://bugs.freedesktop.org/show_bug.cgi?id=33857

Basically, the iTeleportService client is sending malformed xml through gchat, and gchat is merrily sending it along. Since this is a Google service, there is almost no hope of getting it fixed. Therefore, I'm forced to hack deep into the bowels of SleekXMPP to take care of it. It makes me feel dirty to do this. There is no shower that will make me feel clean again.

I don't have the luxury of de-friending this user. I don't see a good way to catch the malformed xml other than to sit in the socket and watch the stream as it comes by. Ugh.

Please tell me there is a better way to do this. Please.

BTW, the worst part is that there is no error message SleekXMPP emits that gives even the foggiest clue that this is the problem. It took me a good long while to drill down into the code to figure out what was going on. This is only partly SleekXMPP's fault. It's also the fault of the xml parser.

jfarrimo avatar Sep 12 '12 02:09 jfarrimo

Hrm. Ok, looks like the pull request covers three commits. It's only meant to cover the last one that mentions iTeleportService in the subject.

jfarrimo avatar Sep 12 '12 02:09 jfarrimo

Sadly, that looks to be the best solution for now (for Python 2, this won't work for 3 because filesocket isn't used there). The only other way to do it with the current architecture would be to make another wrapper for the socket object to do the filtering, but that's pretty much what this is.

legastero avatar Sep 13 '12 18:09 legastero

I've done more digging, and the fault isn't so much that of SleekXMPP as it is that the expat parser doesn't provide any way to handle errors. That is probably ok most of the time, and it's the way the XML specification says things should work, but sometimes we have to be able to handle errors anyway. When Google passes us bad XML in iq messages, we have to roll with it since they're an 800 pound gorilla. It would be just the same if Facebook was the culprit.

It looks like the lxml parser has a mode where it will try to correct errors, but I'm not entirely comfortable with that either. In a perfect world, the expat parser would hand you the offending xml, let you deal with it as best you can, then hand it back to retry.

The one thing that really irks me is that the expat parser won't even give you back the offending piece of XML. I had to put logging in the socket function just to get it to tell me what it was blowing up on. All that they give you is a useless line number and offset. Ugh!

jfarrimo avatar Sep 13 '12 18:09 jfarrimo

I'm working on moving away from iterparse() and working directly with the expat parser (see https://gist.github.com/3716705). If you see a way to get the error correction hooks you need in there, we can add it in.

legastero avatar Sep 13 '12 18:09 legastero

FYI, this has brought to light a bigger bug that I haven't found a workaround for yet. I'm counting up exceptions caught during session_start() and trying to exit the program after a certain number in a row (I'm taking this to mean something's badly broken). This, however, doesn't work. Here's what's going on:

  1. in the session_start() event, call self.send_presence() and self.get_roster(), as usual, catching exceptions they throw
  2. immediately get the xml stream error from the bad presence data received from iTeleportService
  3. call self.disconnect() to end it, since things are un-recoverably hosed at this point
  4. infinite loop encountered, program never stops, threads never die, etc...

I'm pretty sure this is because XMLStream._process() automatically calls self.reconnect() on getting this error (as it should in case of network errors, etc...). This somehow over-rides my self.disconnect() call. Basically, I can't make it stop.

I'm doing all of this so that, in the future, when more problems with malformed XML in presence stanzas crop up, I'll notice it and shut down. Otherwise the program will continually try to reconnect and it'll be difficult to detect that things have gone wrong.

jfarrimo avatar Sep 13 '12 22:09 jfarrimo

Here's the offending code:

def session_start(self, blank_event):
    self.logger.debug("starting session")
    try:
        self.send_presence()
        self.get_roster()
    except IqError as err:
        self.start_errors += 1
        self.logger.error(
            'There was an error getting the roster (%r): error count = %s',
            err.iq['error']['condition'], self.start_errors)
        if self.start_errors >= self.error_limit:
            self.logger.error("Too many startup errors encountered, giving up")
            self.disconnect()
    except IqTimeout:
        self.start_errors += 1
        self.logger.error(
            'Server is taking too long to respond: error count = %s',
            self.start_errors)
        if self.start_errors >= self.error_limit:
            self.logger.error("Too many startup errors encountered, giving up")
            self.disconnect()
    else:
        self.start_errors = 0
        self.online = True
        self.logger.info("session started")

jfarrimo avatar Sep 13 '12 22:09 jfarrimo

Interesting, I'll look into this. I know there are several issues related to state changes like this that need addressing.

For now, this looks like a good candidate for the new abort() method, which just shuts everything down without going through the full stream closing sequence.

legastero avatar Sep 13 '12 22:09 legastero

Oh, sweetness! I hadn't noticed that. I'll plug it in now and give it a shot.

Thanks!

jfarrimo avatar Sep 13 '12 22:09 jfarrimo

So, it turns out that abort() doesn't work out of the box in this situation. I had to comment out the self.stop.clear() call at the beginning of XMLStream._connect(). Apparently, there is a race condition between the event thread and the main thread where the event thread calls abort between the time that the main thread calls reconnect() and reconnect() transitions from the disconnected to the connected state (looks like a 2 second wait is in there). Do you know of any dangers from removing this self.stop.clear() call? The other option would be to add a similar self.aborting member and look for it before calling the self.stop.clear() call.

jfarrimo avatar Sep 14 '12 23:09 jfarrimo

Good catch. The self.stop.clear() call really ought to be in connect() instead of _connect(). That should eliminate the race condition.

legastero avatar Sep 18 '12 17:09 legastero

Ok, cool.

jfarrimo avatar Sep 19 '12 00:09 jfarrimo

Hi. Sorry to dig out a 3 years old issue. I am using Firebase Cloud Messaging and I am getting errors at the end of exchanges with the FCM server, because it seems it does not send </stream:stream> when it does not have anything more to send.

@jfarrimo , is this the kind of errors you were talking about? I have put more details on stack overflow.

azmeuk avatar Sep 05 '16 14:09 azmeuk

I don't remember the details of this issue at this point.

jfarrimo avatar Sep 06 '16 17:09 jfarrimo