json-stream
                        Add option to receive strings as file-like streams
I noticed this in the future work section: supporting long string values as streams themselves. Is this already implemented by any chance?
If not, do you have a plan on how you're going to implement it? Maybe you can point me in the right direction and I can give it a try.
My use-case is streaming chunks of a JSON string and reading the key values in a "persistent" way as described in your docs, but also reading partial values. For example, if the full document is {"name": "John Doe"}, I might first receive the chunk {"name": "J and then the chunk ohn Doe"},
and I want to be able to iterate over it in Python like:
for x in response["name"]:
    print(x)
# J
# ohn Doe
Or some similar API that achieves the same thing.
Hey, thanks for the issue.
If you're looking for something that's going to stream chunks as they are received exactly as it was chunked in network packets, then this isn't going to be the library for you.
However, assuming what you really want is for partially transmitted string data (embedded in JSON) to be made available as soon as possible, then something could be done, along the lines of my "future work" idea.
I think the best API would be to return a file-like object in place of the string. You can then read it like any file, and it will hit EOF when that particular JSON string is complete.
The API would be something like:
data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
while s := long_string.read(1024):
    print(s)
I will think about this...
@smheidrich I feel this might end up in significant changes to the python tokenizer, which implies work for you, so I'll only proceed if you think you can support this in the rust tokenizer too.
@daggaz Sure, that looks like it should be possible :+1:
In pursuance of this, I have just created two PRs.
The first, #46, changes the tokenizer to read the stream into a buffer instead of reading a character at a time.
The second builds on this buffering to allow strings to be returned as file-like objects, basically implementing the API above.
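Roughly, the change in #46 means something like this (an illustrative sketch, not the actual PR code):
class BufferedSource:
    # Instead of calling stream.read(1) once per character, fill a buffer
    # and hand out characters from it
    def __init__(self, stream, buf_size=8192):
        self.stream = stream
        self.buf_size = buf_size
        self.buf = ""
        self.pos = 0

    def next_char(self):
        # Refill the buffer only when it has been fully consumed
        if self.pos >= len(self.buf):
            self.buf = self.stream.read(self.buf_size)
            self.pos = 0
            if not self.buf:
                return ""  # EOF
        ch = self.buf[self.pos]
        self.pos += 1
        return ch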
@smheidrich there are some pretty big changes here so we can only proceed if you agree
@daggaz I'll have a look as soon as possible and try to write an analogous MR, but I have to do some other stuff first (namely fix the build and align the Unicode surrogate pair code with https://github.com/daggaz/json-stream/pull/42), so it will take a while.
Thanks for the quick response! A file-like will get the job done for me. Much cleaner than the quick implementation I went for.
Would it be possible to also wrap this file-like object in a generator? Then it's even easier to consume. Do you see any downsides to that vs the file-like read()?
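For context, here's roughly what I mean; a minimal sketch assuming the file-like API above (chunks is a hypothetical helper, not part of json-stream):
def chunks(fobj, size=1024):
    # Yield successive chunks from a file-like object until EOF
    while chunk := fobj.read(size):
        yield chunk

for chunk in chunks(long_string):
    print(chunk)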
@smheidrich it occurs to me that this option should only be available for transient streams (or...we make them seekable in that case?) and that this has implications for #30
@RamiAwar line-based file-like objects are already iterable, but JsonStreamReader is missing the required readline() implementation. I will update to add this. The API would then be:
data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
for line in long_string:
    print(line)
If your data is not line based, obviously this doesn't work, in which case I think having access to the full file API is best.
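For reference, a rough sketch of how a readline() could be built on top of the existing read() (illustrative only; the real JsonStreamReader implementation may differ, and LineReaderMixin is a made-up name):
class LineReaderMixin:
    # Assumes the host class provides read(); that's all line iteration needs
    def readline(self, size=-1):
        # Accumulate characters until a newline, EOF, or the size limit
        chars = []
        while size < 0 or len(chars) < size:
            ch = self.read(1)
            if not ch:
                break  # EOF
            chars.append(ch)
            if ch == "\n":
                break
        return "".join(chars)

    def __iter__(self):
        # Yield lines until readline() returns the empty string (EOF)
        return iter(self.readline, "")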
I like that yeah! Thanks!
I've added the readline() support (and hence the iterator support).
@smheidrich the tests will not pass on this branch until the rust tokenizer is implemented
@smheidrich I'm considering the .completed part of the API optional. If you don't want to implement it, I can strip it out of the tests
Also, totally open to ideas/suggestions/improvements!
@smheidrich I've just updated the two PRs with some changes.
I've added a buffering argument with similar semantics to the open() function's. This allows the end user to control the buffer size and, in particular, to pass 0 to disable buffering entirely (buffering led to the same issue as described in #49).
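For example (the calls here are hypothetical usage, mirroring open()'s buffering semantics):
data = json_stream.load(f, strings_as_files=True, buffering=0)      # unbuffered
data = json_stream.load(f, strings_as_files=True, buffering=65536)  # 64 KiB buffer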
Of course, this causes everything to crash with TypeError: RustTokenizer.__new__() got an unexpected keyword argument 'buffering'.
It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer.
Perhaps it can be made to ignore unknown args?
@daggaz Sorry, I was pretty busy this week and have only started to wrap my head around the changes.
> It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer.
> Perhaps it can be made to ignore unknown args?
I'm not sure that would be a good idea, swallowing potential errors silently etc...
If the goal is to be able to release this feature and others like it without waiting for the Rust tokenizer to have it too, wouldn't it make more sense to implement a switch in json-stream's tokenizer selection logic that looks at whether buffering or strings_as_files were requested?

Right now, default_tokenizer is just set to one of the two tokenizers, but an alternative idea would be to make default_tokenizer itself a function that decides, based on the supplied arguments, which tokenizer to return. Then, once the features have been implemented in the Rust tokenizer, the selection logic can change accordingly.
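Something along these lines, as a rough sketch (all names here are illustrative, not the actual json-stream internals):
def default_tokenizer(*args, buffering=-1, strings_as_files=False, **kwargs):
    # Pick a tokenizer based on the requested features, rather than being
    # a fixed reference to one of the two tokenizers
    if buffering != -1 or strings_as_files:
        # Fall back to the pure-Python tokenizer for features the Rust
        # tokenizer doesn't support yet
        return PythonTokenizer(*args, buffering=buffering,
                               strings_as_files=strings_as_files, **kwargs)
    return RustTokenizer(*args, **kwargs)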
Oh, absolutely! That's a great plan.
I was thinking about this a bit further, and thought that maybe it should be up to the version of the rust tokenizer to decide if it supports x or y.
Perhaps rust_tokenizer_or_raise() could take a list of feature flags, and only return a tokenizer if it supports all the flags?
That way the python code doesn't need to understand the state of the rust tokenizer's feature support at all, especially if for example a person doesn't have the latest version?
@daggaz Sure, that's not a problem from my end, and I can see the advantage that it won't require updates to json-stream that only bump the version of the json-stream-rs-tokenizer dependency and change the selection logic. But the disadvantage is that then you always have to wait for me to add such flags to rust_tokenizer_or_raise when you introduce a new feature that I'm lagging behind on, whereas if it's all decided on the json-stream end, you can do what you want.
But I guess it's not that important, since such cases will probably be rare in the future. So I'll just merge it for now, unless you have other suggestions for what the API of that call should look like?
Ah, no, that's not quite what I meant. I mean that you take the list of requested feature flags (or just the kwargs that control them) and raise if there are any you don't support/understand.
This also means that you could support things ahead of the python tokenizer.
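Something like this, as an illustrative sketch (the exception name and feature set are made up, and RustTokenizer refers to the existing class):
SUPPORTED_FEATURES = {"buffering", "strings_as_files"}

class TokenizerUnsupportedFeatures(Exception):
    pass

def rust_tokenizer_or_raise(**features):
    # Raise if any requested feature kwarg isn't supported by this version
    # of the rust tokenizer; otherwise hand back the tokenizer itself
    unsupported = set(features) - SUPPORTED_FEATURES
    if unsupported:
        raise TokenizerUnsupportedFeatures(
            f"unsupported features: {', '.join(sorted(unsupported))}")
    return RustTokenizer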
Ohh I see, that makes more sense :+1: I've updated the PR
My changes to run all the tests with both tokenizers have highlighted a bug with this PR.
Working on a fix!
@daggaz sorry for the tag, but I need this feature. For now, my workaround is splitting my strings into a list.