
Add option to receive strings as file-like streams

Open RamiAwar opened this issue 2 years ago • 22 comments

I noticed this in the future work section: supporting long string values as streams themselves. Is this already implemented by any chance?

If not, do you have a plan on how you're going to implement it? Maybe you can point me in the right direction and I can give it a try.

My use case is streaming chunks of a JSON string and reading the key values in a "persistent" way as described in your docs, but also reading partial values. For example, if I were to receive {"name": "John Doe"} in two chunks, first {"name": "J and then ohn Doe"}, I want to be able to iterate over it in Python like:

for x in response["name"]:
    print(x)

# J
# ohn Doe

Or some similar API that achieves the same thing.

RamiAwar avatar Jun 03 '23 13:06 RamiAwar

Hey, thanks for the issue.

If you're looking for something that will stream chunks exactly as they arrived in network packets, then this isn't going to be the library for you.

However, assuming what you really want is for partially transmitted string data (embedded in JSON) to be made available as soon as possible, then something could be done, as per my "future work" idea.

I think the best API would be if a file-like object was returned in place of the string. You can then read this like any file, and it will EOF when the particular json string is completed.

The API would be something like:

data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
while s := long_string.read(1024):
    print(s)

I will think about this...

daggaz avatar Jun 09 '23 17:06 daggaz

@smheidrich I feel this might result in significant changes to the python tokenizer, which implies work for you, so I'll only proceed if you think you can support this in the rust tokenizer too.

daggaz avatar Jun 09 '23 17:06 daggaz

@daggaz Sure, that looks like it should be possible :+1:

smheidrich avatar Jun 09 '23 18:06 smheidrich

In pursuance of this, I have just created two PRs.

The first, #46, changes the tokenizer to read the stream into a buffer instead of reading one character at a time.

The second builds on this buffering to allow strings to be returned as file-like objects, basically implementing the API above.

@smheidrich there are some pretty big changes here so we can only proceed if you agree

daggaz avatar Jun 14 '23 22:06 daggaz

@daggaz I'll have a look as soon as possible and try to write an analogous MR, but I have to do some other stuff first (namely fix the build and align the Unicode surrogate pair code with https://github.com/daggaz/json-stream/pull/42), so it will take a while.

smheidrich avatar Jun 15 '23 20:06 smheidrich

Thanks for the quick response! A file-like will get the job done for me. Much cleaner than the quick implementation I went for.

Would it be possible to also wrap this file-like obj with a generator? Then it's even easier to consume. Do you see any downsides with that vs the file-like read?

RamiAwar avatar Jun 16 '23 11:06 RamiAwar
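The generator wrapper asked about above is straightforward to build on top of any file-like object. A minimal sketch (the strings_as_files file-like object is an assumption based on the API proposed earlier in the thread; StringIO stands in for it here):

```python
import io

def iter_chunks(fileobj, chunk_size=1024):
    """Yield successive chunks from a file-like object until EOF."""
    while chunk := fileobj.read(chunk_size):
        yield chunk

# Stand-in demo with StringIO; with the proposed API, the file-like
# object returned in place of the string would be passed in instead:
chunks = list(iter_chunks(io.StringIO("John Doe"), chunk_size=4))
```

One downside versus the raw file API is that a generator only supports forward iteration, so read(n) with a caller-chosen size per call is lost.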

@smheidrich it occurs to me that this option should only be available for transient streams (or...we make them seekable in that case?) and that this has implications for #30

daggaz avatar Jun 16 '23 12:06 daggaz

@RamiAwar line-based file-like objects are already iterable, but JsonStreamReader is missing the required readline() implementation. I will update to add this. The API would then be:

data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
for line in long_string:
    print(line)

If your data is not line based, obviously this doesn't work, in which case I think having access to the full file API is best.

daggaz avatar Jun 16 '23 12:06 daggaz

> @RamiAwar line-based file-like objects are already iterable, but JsonStreamReader is missing the required readline() implementation. I will update to add this. The API would then be:
>
> data = json_stream.load(f, strings_as_files=True)
> long_string = data['name']
> for line in long_string:
>     print(line)
>
> If your data is not line based, obviously this doesn't work, in which case I think having access to the full file API is best.

I like that yeah! Thanks!

RamiAwar avatar Jun 18 '23 07:06 RamiAwar

I've added the readline() support (and hence the iterator support).

daggaz avatar Jun 26 '23 22:06 daggaz

@smheidrich the tests will not pass on this branch until the rust tokenizer is implemented

daggaz avatar Jun 26 '23 22:06 daggaz

@smheidrich I'm considering the .completed part of the API optional. If you don't want to implement it, I can strip it out of the tests

daggaz avatar Jun 26 '23 22:06 daggaz

Also, totally open to ideas/suggestions/improvements!

daggaz avatar Jun 26 '23 22:06 daggaz

@smheidrich I've just updated the two PRs with some changes.

I've added a buffering argument with similar semantics to the built-in open() function. This allows the end user to control the buffer size and, in particular, to pass 0 to disable buffering entirely (which led to the same issue as described in #49).

Of course, this causes everything to crash with TypeError: RustTokenizer.__new__() got an unexpected keyword argument 'buffering'.

It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer.

Perhaps it can be made to ignore unknown args?

daggaz avatar Jun 30 '23 09:06 daggaz
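For illustration, the open()-style buffering semantics described above could look like the following sketch (effective_buffer_size is a hypothetical helper, not part of json-stream; the exact values are assumptions modeled on the built-in open()):

```python
import io

def effective_buffer_size(buffering=-1):
    """Map an open()-style buffering argument to a read size.

    -1  -> use a default buffer size
     0  -> unbuffered: read one character at a time
     n  -> buffer of n characters
    """
    if buffering < 0:
        return io.DEFAULT_BUFFER_SIZE
    if buffering == 0:
        return 1  # degenerate buffer: one character per read
    return buffering
```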

@daggaz Sorry, I was pretty busy this week and have only started to wrap my head around the changes.

> It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer.
>
> Perhaps it can be made to ignore unknown args?

I'm not sure that would be a good idea, swallowing potential errors silently etc...

If the goal is to be able to release this feature and others like it without waiting for the Rust tokenizer to have it too, wouldn't it make more sense to implement a switch in json-stream's tokenizer selection logic that looks at whether buffering or strings_as_files were requested? Right now, default_tokenizer is just set to one of the two tokenizers, but an alternative idea would be to make default_tokenizer itself a function that decides based on the supplied arguments which tokenizer to return. Then once the features have been implemented in the Rust tokenizer, the selection logic can change accordingly.

smheidrich avatar Jun 30 '23 20:06 smheidrich
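The selection idea above can be sketched as follows. This is a hypothetical illustration, not json-stream's actual code; the tokenizer objects and the set of unsupported features are assumptions:

```python
def select_tokenizer(rust_tokenizer, py_tokenizer, **kwargs):
    """Return the rust tokenizer unless an unsupported feature was requested.

    rust_tokenizer may be None if json-stream-rs-tokenizer is not installed.
    """
    # Features assumed to be Python-only for now (illustrative list):
    unsupported_by_rust = {"strings_as_files", "buffering"}
    if rust_tokenizer is None or unsupported_by_rust & set(kwargs):
        return py_tokenizer  # fall back to the pure-Python tokenizer
    return rust_tokenizer

# usage sketch: strings rather than real tokenizer classes, for brevity
chosen = select_tokenizer("rust", "py", strings_as_files=True)
```

Once the Rust tokenizer gains a feature, removing it from the set changes the selection without touching callers.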

Oh, absolutely! That's a great plan.

daggaz avatar Jun 30 '23 21:06 daggaz

I was thinking about this a bit further, and thought that maybe it should be up to the version of the rust tokenizer to decide whether it supports x or y.

Perhaps rust_tokenizer_or_raise() could take a list of feature flags, and only return a tokenizer if it supports all of them?

That way the python code wouldn't need to understand the state of the rust tokenizer's feature support at all, especially if, for example, a person doesn't have the latest version.

daggaz avatar Jul 02 '23 16:07 daggaz

@daggaz Sure, that's not a problem from my end, and I can see the advantage that it won't require updates to json-stream which only bump the version of the dependency to json-stream-rs-tokenizer and change the selection logic. But the disadvantage is that then you always have to wait for me to add such flags to rust_tokenizer_or_raise when you introduce a new feature that I'm lagging behind on, whereas if it's all decided on the json-stream end, you can do what you want.

But I guess it's not that important since such cases will probably be rare in the future, so I'll just merge it for now, unless you have other suggestions for what the API of that call should look like?

smheidrich avatar Jul 02 '23 19:07 smheidrich

Ah, no, that's not quite what I meant, I mean that you take the list of requested feature flags (or just the kwargs that control them) and raise if there are any you don't support/understand.

This also means that you could support things ahead of the python tokenizer.

daggaz avatar Jul 02 '23 20:07 daggaz
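The negotiation described above could be sketched like this. The function name matches the one mentioned in the thread, but the body, the supported-feature set, and the returned placeholder are illustrative assumptions:

```python
SUPPORTED_FEATURES = {"buffering"}  # assumed set for this sketch

def rust_tokenizer_or_raise(**requested):
    """Return the tokenizer only if every requested feature is supported.

    Raises NotImplementedError for any kwarg this build doesn't understand,
    so the caller can fall back to the pure-Python tokenizer.
    """
    unknown = set(requested) - SUPPORTED_FEATURES
    if unknown:
        raise NotImplementedError(
            f"rust tokenizer does not support: {sorted(unknown)}"
        )
    return "RustTokenizer"  # placeholder for the real tokenizer class
```

Because rejection is based on the kwargs rather than a hard-coded version check, a newer Rust tokenizer can also accept features before the Python side knows about them.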

Ohh I see, that makes more sense :+1: I've updated the PR

smheidrich avatar Jul 02 '23 20:07 smheidrich

My changes to run all the tests with both tokenizers have highlighted a bug with this PR.

Working on a fix!

daggaz avatar Jul 05 '23 21:07 daggaz

@daggaz sorry for the tag, but I need this feature; for now my workaround is splitting my strings into a list.

nacho00112 avatar Dec 23 '23 17:12 nacho00112