sseclient icon indicating copy to clipboard operation
sseclient copied to clipboard

sseclient 0.0.24 loses events due to Unterminated string... ValueError

Open xqt opened this issue 6 years ago • 20 comments

sseclient 0.0.24 fails randomly when json.load is processed at the EventSource's data. Randomly means that you can find this failure pretty sure if you process say 50 events from stream. The problem occures with Python 3.7; I've not tested other Python releases. This problem does not occur with sseclient 0.0.22.

I've have implemented a test suite to check this behavior, see https://gerrit.wikimedia.org/r/#/c/pywikibot/core/+/509132/2/tests/eventstreams_tests.py

Additional details you can find on our bug tracker https://phabricator.wikimedia.org/T222885

xqt avatar May 09 '19 19:05 xqt

Digging deeper into this matter, this failure was introduced with 0.0.23 if using this line chunk = self.resp.raw._fp.fp.read1(self.chunk_size) ignoring this and using chunk = self.resp.raw._fp.fp.read1(self.chunk_size) my test passes but the sse socket wasn't closed.

xqt avatar May 09 '19 20:05 xqt

I made a pull request PR#29 to check this ugly behaviour

xqt avatar May 10 '19 10:05 xqt

Running into this exact same issue as well with read1 not actually reading full events before passing it off to the handler.

OverlordQ avatar Aug 03 '19 03:08 OverlordQ

Okay, so I think it comes down to this.

Calling read1 using the raw file-pointer is context agnostic. You say read, it goes gobble gobble and spits back everything it finds.

Problem is, SSE streams require context. If you dump the headers returned from connecting to an SSE Endpoint, you'll probably notice a Transfer-Encoding: chunked header.

What gets pre-pended to every chunk? Size of the chunk.

resp.raw is still a http.client.HTTPResponse object, calling read on that will take into account chunked encoding and strip off the relavent bits and return the actual data.

Using resp.raw._fp.fp.read1 bypasses all of that.

So for example, if you're feeding JSON data through an SSE stream, unless it always matches up perfectly to a chunk, using read1 is going to insert the chunksize and accompanying newlines into the middle of your data, breaking JSON formatting.

There's a reason you shouldn't poke into the guts.

OverlordQ avatar Aug 03 '19 09:08 OverlordQ

Thanks for digging into that @OverlordQ.

Looks like we probably need to revert the "short read" change from @mutantmonkey from a few months ago. (See https://github.com/btubbs/sseclient/issues/8 and https://github.com/btubbs/sseclient/commit/8585bac65841c0ba9f6b79d59eb77c438abac86d.)

btubbs avatar Aug 03 '19 18:08 btubbs

Regarding Transfer-Encoding: chunked, the Server-Sent Events specification recommends against doing this:

Authors are also cautioned that HTTP chunking can have unexpected negative effects on the reliability of this protocol. Where possible, chunking should be disabled for serving event streams unless the rate of messages is high enough for this not to matter.

If this is the issue, then it sounds like we'll need to handle chunked encoding directly in this library, or detect when it is being used and disable the short read functionality. @OverlordQ Do you happen to have an example SSE endpoint that uses chunked encoding that I would be able to test with?

I understand the concern with poking into the guts, but isn't the purpose of this library to support server-sent events without requiring a low-level understanding of HTTP? There's no other way to properly deal with timely delivery of messages shorter than the specified chunk_size while still using http.client.

mutantmonkey avatar Aug 05 '19 05:08 mutantmonkey

I would guess that the Wikimedia test case @xqt gave will exhibit this, but http://stream.pushshift.io/ was where I was running into this behavior.

And yes while it does say should, it doesn't say must, so it is a case that would to be handled

OverlordQ avatar Aug 05 '19 06:08 OverlordQ

@OverlordQ I just pushed a branch that falls back to the old method when chunked encoding is enabled. Can you give it a try and confirm that it fixes your issue? https://github.com/mutantmonkey/sseclient/tree/disable_short_reads_when_chunked

Ideally, chunked encoding support should be implemented directly in this library, but that's going to be more involved. Just disabling short reads when using chunked encoding is a much quicker fix since this behavior is broken.

mutantmonkey avatar Aug 05 '19 07:08 mutantmonkey

@OverlordQ please see the above?

TheSandDoctor avatar Feb 27 '20 18:02 TheSandDoctor

@mutantmonkey @TheSandDoctor

I hit the same issue as @OverlordQ today, and stumbled onto this. The "disable_short_reads_when_chunked" fixed the issue for me.

I can't speak to the internals because I haven't spent the time digging in that Overlord did, because I found this pretty quickly and its working... but its working.

sudologic avatar Apr 01 '20 18:04 sudologic

mutantmonkeys fix works for me also. i have read that handling chunked encoding is a requirement for a http1.1 client see https://en.wikipedia.org/wiki/Chunked_transfer_encoding

Shoelace avatar Sep 08 '20 05:09 Shoelace

I checked this again with sseclient 0.0.23 and had 136 such issues within 10 minutes. During the same time I runned the same script with sseclient 0.0.22 and had no such error. Seems this is caused by this change.

See also https://phabricator.wikimedia.org/T222885

xqt avatar May 26 '22 16:05 xqt

Yes, I advise to try mutantmonkey's patch (#35), it fixed the issue for me

eino avatar May 27 '22 06:05 eino

Yes, I advise to try mutantmonkey's patch (#35), it fixed the issue for me

There are two patches which would solve it but they are pending for 3 years.

xqt avatar May 27 '22 19:05 xqt

Can confirm that this bug is still present in May 2023

grantcurell avatar May 31 '23 16:05 grantcurell

This is still happening in sseclient-0.0.27. Mini repo

import json
import sseclient

for event  in sseclient.SSEClient('https://stream.wikimedia.org/v2/stream/recentchange'):
  if event.event == 'message' and event.data:
    try:
      change = json.loads(event.data)
    except json.JSONDecodeError as err:
      print(event, err)
      raise

Fails with

{"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://www.wikidata.org/wiki/Q71160922","request_id":"b17b0516-1468-4893-ac7f-42f402e76f44","id":"f06b24db-9549-46e3-ad47-9e23e3b9738b","dt":"2024-08-08T17:22:54Z","domain":"[www.wiki](http://www.wiki/) Unterminated string starting at: line 1 column 233 (char 232)

This appears to be due to a partial read as the input is incomplete JSON.

sorensenjs avatar Aug 08 '24 17:08 sorensenjs

Yes. Issue still present. Expecting ',' delimiter: line 1 column 4073 (char 4072). Any workaround @sorensenjs @TheSandDoctor

pi3ch avatar Feb 24 '25 09:02 pi3ch

I abandoned trying to use this sseclient and did the following which worked well enough for me. It seems to me not hard to introduce incomplete chunk reading into the unit tests but decided against investing in that here as the project seems unmaintained.

_FIELD_SEPARATOR = ':'

class SSEClient(object):
  """Minimal SSE client.

  Adapted from https://integrand.io/blog/how-to-consume-sse-python
  """

  def __init__(self, event_source: Generator[bytes, None, None]):
    self._event_source = event_source

  def _read(self):
    data = b''
    for chunk in self._event_source:
     for line in chunk.splitlines(True):
       data += line
       if data.endswith((b'\r\r', b'\n\n', b'\r\n\r\n')):
         yield data
         data = b''
    if data:
      yield data

  def events(self):
    for chunk in self._read():
      event = { 'id': None, 'event': 'message', 'data': '' }
      for raw_line in chunk.splitlines():
        line = raw_line.decode('UTF-8')
        if not line.strip() or line.startswith(_FIELD_SEPARATOR):
          continue
        data = line.split(_FIELD_SEPARATOR, 1)
        field = data[0]
        # Ignore unknown fields.
        if field not in event.keys():
          continue
        value = ''
        if len(data) > 1:
          value = data[1].lstrip()
        if field == 'data':
          event['data'] += value + '\n'
        else:
          event[field] = value
      if not event['data']:
        continue
      yield event

  def close(self) -> None:
    self._event_source.close()

sorensenjs avatar Feb 24 '25 12:02 sorensenjs

It took running this for several hours to appear and in over a day's worth of logs (650MB) only occurred once, but it is still an issue that can occur even with #35 merged. Here is the one that I ran into:

2025-05-31 23:29:25,898 ERROR: JSONDecodeError: Unterminated string starting at: line 1 column 502 (char 501) | Raw: {"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://en.wikipedia.org/wiki/User:Dholua777","request_id":"90e0e9a1-9f56-40b3-af9c-bc265bb18a48","id":"1a7d3ccb-9ac9-490e-b6a3-0d11ef7c5048","dt":"2025-06-01T06:29:25Z","domain":"en.wikipedia.org","stream":"mediawiki.recentchange","topic":"eqiad.mediawiki.recentchange","partition":0,"offset":5632775405},"id":1909280348,"type":"new","namespace":2,"title":"User:Dholua777","title_url":"https://en.wikipedia.org/wiki/User:Dholua777","comment":"[[WP:AES|←]]Created page with 'NIPON DHOLUA …. ……………[ 🎬✨ A Voice for the Voiceless, A Vision for the Future ✨🎬

I managed to craft a unit test that exposes the bug (yay) and will work on getting a patch written. Both the eventual patch and the new unit test will become part of the repo.

TheSandDoctor avatar Jun 02 '25 04:06 TheSandDoctor

For the record, the previous PRs didn't actually fix it sadly. I should have a fix for this ready tomorrow, but I've run out of time tonight to finish it up. I think that there are enough fixes as of late in this to bump the version, even if I sadly don't have rights to actually release it. @btubbs can you assist on that end by any chance when the patch is up?

TheSandDoctor avatar Jun 03 '25 04:06 TheSandDoctor