sseclient 0.0.24 loses events due to Unterminated string... ValueError
sseclient 0.0.24 fails intermittently when json.loads is applied to the EventSource's data; intermittently meaning you are almost certain to hit the failure if you process, say, 50 events from the stream. The problem occurs with Python 3.7; I have not tested other Python releases. This problem does not occur with sseclient 0.0.22.
I have implemented a test suite to check this behaviour; see https://gerrit.wikimedia.org/r/#/c/pywikibot/core/+/509132/2/tests/eventstreams_tests.py
Additional details you can find on our bug tracker https://phabricator.wikimedia.org/T222885
Digging deeper into this matter, the failure was introduced in 0.0.23 by this line:
chunk = self.resp.raw._fp.fp.read1(self.chunk_size)
Bypassing that line and falling back to the previous read call, my test passes, but then the SSE socket isn't closed.
I made pull request PR#29 to exercise this ugly behaviour.
Running into this exact same issue as well with read1 not actually reading full events before passing it off to the handler.
Okay, so I think it comes down to this.
Calling read1 using the raw file-pointer is context agnostic. You say read, it goes gobble gobble and spits back everything it finds.
Problem is, SSE streams require context. If you dump the headers returned from connecting to an SSE Endpoint, you'll probably notice a Transfer-Encoding: chunked header.
What gets prepended to every chunk? The size of the chunk.
resp.raw is still an http.client.HTTPResponse object; calling read on that will take chunked encoding into account, strip off the relevant framing, and return the actual data.
Using resp.raw._fp.fp.read1 bypasses all of that.
So for example, if you're feeding JSON data through an SSE stream, unless it always matches up perfectly to a chunk, using read1 is going to insert the chunksize and accompanying newlines into the middle of your data, breaking JSON formatting.
There's a reason you shouldn't poke into the guts.
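To make the failure mode concrete, here is a small self-contained sketch (made-up payload, not actual sseclient code) of what the wire bytes of a `Transfer-Encoding: chunked` response look like, and why parsing them verbatim breaks JSON:

```python
import json

# Each chunk is prefixed with its size in hex and delimited by CRLF.
body = b'{"title": "Example", "comment": "hello world"}'
first, second = body[:20], body[20:]
on_the_wire = (
    b'%x\r\n%s\r\n' % (len(first), first)
    + b'%x\r\n%s\r\n' % (len(second), second)
    + b'0\r\n\r\n'  # zero-length chunk terminates the body
)

# http.client.HTTPResponse.read() de-chunks, so the payload parses fine:
assert json.loads(body) == {'title': 'Example', 'comment': 'hello world'}

# resp.raw._fp.fp.read1() returns the wire bytes verbatim; the embedded
# b'\r\n1a\r\n' framing lands in the middle of the JSON and parsing fails:
try:
    json.loads(on_the_wire.split(b'\r\n', 1)[1].decode())
except json.JSONDecodeError as err:
    print('broken:', err)
```

Unless an event happens to line up exactly with one chunk, those size lines end up spliced into the event data, which is exactly the "Unterminated string" failure above.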
Thanks for digging into that @OverlordQ.
Looks like we probably need to revert the "short read" change from @mutantmonkey from a few months ago. (See https://github.com/btubbs/sseclient/issues/8 and https://github.com/btubbs/sseclient/commit/8585bac65841c0ba9f6b79d59eb77c438abac86d.)
Regarding Transfer-Encoding: chunked, the Server-Sent Events specification recommends against doing this:
> Authors are also cautioned that HTTP chunking can have unexpected negative effects on the reliability of this protocol. Where possible, chunking should be disabled for serving event streams unless the rate of messages is high enough for this not to matter.
If this is the issue, then it sounds like we'll need to handle chunked encoding directly in this library, or detect when it is being used and disable the short read functionality. @OverlordQ Do you happen to have an example SSE endpoint that uses chunked encoding that I would be able to test with?
I understand the concern with poking into the guts, but isn't the purpose of this library to support server-sent events without requiring a low-level understanding of HTTP? There's no other way to properly deal with timely delivery of messages shorter than the specified chunk_size while still using http.client.
I would guess that the Wikimedia test case @xqt gave will exhibit this, but http://stream.pushshift.io/ was where I was running into this behavior.
And yes, while it does say should, it doesn't say must, so it is a case that needs to be handled.
@OverlordQ I just pushed a branch that falls back to the old method when chunked encoding is enabled. Can you give it a try and confirm that it fixes your issue? https://github.com/mutantmonkey/sseclient/tree/disable_short_reads_when_chunked
Ideally, chunked encoding support should be implemented directly in this library, but that's going to be more involved. Just disabling short reads when using chunked encoding is a much quicker fix since this behavior is broken.
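For what it's worth, the detection side of that fallback can be as simple as inspecting the response headers. A sketch under my own naming (the actual branch may well do this differently):

```python
def is_chunked(resp) -> bool:
    """Return True if the server used Transfer-Encoding: chunked, in which
    case raw short reads would splice chunk framing into the event data."""
    return resp.headers.get('Transfer-Encoding', '').lower() == 'chunked'
```

A client could then take the `read1` fast path only when `is_chunked(resp)` is false, and fall back to the de-chunking `resp.raw.read()` otherwise.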
@OverlordQ please see the above?
@mutantmonkey @TheSandDoctor
I hit the same issue as @OverlordQ today, and stumbled onto this. The disable_short_reads_when_chunked branch fixed the issue for me.
I can't speak to the internals because I haven't spent the time digging in that @OverlordQ did; I found this pretty quickly and it's working... but it's working.
mutantmonkey's fix works for me too. I have read that handling chunked encoding is a requirement for an HTTP/1.1 client; see https://en.wikipedia.org/wiki/Chunked_transfer_encoding
I checked this again with sseclient 0.0.23 and hit 136 such issues within 10 minutes. During the same time I ran the same script with sseclient 0.0.22 and had no such error. It seems this is caused by that change.
See also https://phabricator.wikimedia.org/T222885
Yes, I advise to try mutantmonkey's patch (#35), it fixed the issue for me
There are two patches which would solve it, but they have been pending for 3 years.
Can confirm that this bug is still present in May 2023
This is still happening in sseclient-0.0.27. Minimal repro:
```python
import json
import sseclient

for event in sseclient.SSEClient('https://stream.wikimedia.org/v2/stream/recentchange'):
    if event.event == 'message' and event.data:
        try:
            change = json.loads(event.data)
        except json.JSONDecodeError as err:
            print(event, err)
            raise
```
Fails with
{"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://www.wikidata.org/wiki/Q71160922","request_id":"b17b0516-1468-4893-ac7f-42f402e76f44","id":"f06b24db-9549-46e3-ad47-9e23e3b9738b","dt":"2024-08-08T17:22:54Z","domain":"www.wiki Unterminated string starting at: line 1 column 233 (char 232)
This appears to be due to a partial read as the input is incomplete JSON.
Yes, issue still present: Expecting ',' delimiter: line 1 column 4073 (char 4072). Any workaround, @sorensenjs @TheSandDoctor?
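Not a fix, but a consumer-side stopgap I can suggest (hypothetical helper; note it drops the truncated event rather than recovering it): skip payloads that fail to parse instead of letting the consumer crash:

```python
import json

def parse_or_skip(raw: str):
    """Decode an event payload, or return None if the read was truncated."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None  # partial read: the JSON was cut off mid-document

print(parse_or_skip('{"ok": true}'))         # {'ok': True}
print(parse_or_skip('{"domain":"www.wiki'))  # None
```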
I abandoned trying to use this sseclient and did the following, which worked well enough for me. It seems to me it would not be hard to add incomplete chunk reads to the unit tests, but I decided against investing in that here as the project seems unmaintained.
```python
from typing import Generator

_FIELD_SEPARATOR = ':'


class SSEClient(object):
    """Minimal SSE client.

    Adapted from https://integrand.io/blog/how-to-consume-sse-python
    """

    def __init__(self, event_source: Generator[bytes, None, None]):
        self._event_source = event_source

    def _read(self):
        """Buffer incoming chunks and yield one block per complete event."""
        data = b''
        for chunk in self._event_source:
            for line in chunk.splitlines(True):
                data += line
                if data.endswith((b'\r\r', b'\n\n', b'\r\n\r\n')):
                    yield data
                    data = b''
        if data:
            yield data

    def events(self):
        for chunk in self._read():
            event = {'id': None, 'event': 'message', 'data': ''}
            for raw_line in chunk.splitlines():
                line = raw_line.decode('UTF-8')
                # Skip blank lines and comment lines (leading colon).
                if not line.strip() or line.startswith(_FIELD_SEPARATOR):
                    continue
                data = line.split(_FIELD_SEPARATOR, 1)
                field = data[0]
                # Ignore unknown fields.
                if field not in event.keys():
                    continue
                value = ''
                if len(data) > 1:
                    value = data[1].lstrip()
                if field == 'data':
                    event['data'] += value + '\n'
                else:
                    event[field] = value
            if not event['data']:
                continue
            yield event

    def close(self) -> None:
        self._event_source.close()
```
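The load-bearing piece of the class above is the buffering in `_read`: an SSE event ends at a blank line, so bytes accumulate until a terminator appears, regardless of how the transport split the chunks. The same idea in isolation (standalone sketch, not library code):

```python
def split_events(chunks):
    """Buffer incoming byte chunks and yield one complete SSE event block
    per blank-line terminator, however the chunk boundaries fall."""
    data = b''
    for chunk in chunks:
        for line in chunk.splitlines(True):
            data += line
            if data.endswith((b'\r\r', b'\n\n', b'\r\n\r\n')):
                yield data
                data = b''
    if data:
        yield data

# An event split across two chunks is reassembled before being yielded:
chunks = [b'data: {"title": "Exa', b'mple"}\n\ndata: second\n\n']
print(list(split_events(chunks)))
# -> [b'data: {"title": "Example"}\n\n', b'data: second\n\n']
```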
It took several hours of running for this to appear, and in over a day's worth of logs (650 MB) it occurred only once, but it is still an issue that can occur even with #35 merged. Here is the one that I ran into:
2025-05-31 23:29:25,898 ERROR: JSONDecodeError: Unterminated string starting at: line 1 column 502 (char 501) | Raw: {"$schema":"/mediawiki/recentchange/1.0.0","meta":{"uri":"https://en.wikipedia.org/wiki/User:Dholua777","request_id":"90e0e9a1-9f56-40b3-af9c-bc265bb18a48","id":"1a7d3ccb-9ac9-490e-b6a3-0d11ef7c5048","dt":"2025-06-01T06:29:25Z","domain":"en.wikipedia.org","stream":"mediawiki.recentchange","topic":"eqiad.mediawiki.recentchange","partition":0,"offset":5632775405},"id":1909280348,"type":"new","namespace":2,"title":"User:Dholua777","title_url":"https://en.wikipedia.org/wiki/User:Dholua777","comment":"[[WP:AES|←]]Created page with 'NIPON DHOLUA …. ……………[ 🎬✨ A Voice for the Voiceless, A Vision for the Future ✨🎬
I managed to craft a unit test that exposes the bug (yay) and will work on getting a patch written. Both the eventual patch and the new unit test will become part of the repo.
For the record, the previous PRs didn't actually fix it, sadly. I should have a fix for this ready tomorrow, but I've run out of time tonight to finish it up. I think there are enough recent fixes to bump the version, even if I sadly don't have rights to actually release it. @btubbs can you assist on that end by any chance when the patch is up?