String streams
In response to #45
While working on a 2nd attempt at implementing this in the Rust tokenizer (https://github.com/smheidrich/py-json-stream-rs-tokenizer/pull/89), I noticed that my benchmarking test (which uses large randomly generated JSON files) exhibits transient failures for the Python tokenizer from this branch:
pytest error log
self = <TransientStreamingJSONList: TRANSIENT, STREAMING>

    def _iter_items(self):
        while True:
            if not self.streaming:
                return
            self._clear_child()
            try:
                item = self._load_item()
            except StopIteration:
                if self.streaming:
>                   raise ValueError(self.INCOMPLETE_ERROR)
E                   ValueError: Unterminated list at end of file

json-stream/src/json_stream/base.py:53: ValueError
----- Captured stdout call -----
generating random json...
generated random json /tmp/tmpshsa5y6s/random.json with size 1.000e+05 bytes
running with rust tokenizer
rust time: 0.03 s
running with python tokenizer
----- Captured stderr call -----
100%|██████████| 100000.0/100000.0 [00:00<00:00, 1289610.69it/s]
100%|██████████| 100/100 [00:00<00:00, 3080.47it/s]
100%|██████████| 100/100 [00:00<00:00, 1221.22it/s]
===== short test summary info =====
FAILED tests/test_via_benchmark.py::test_via_benchmark - ValueError: Unterminated list at end of file
===== 1 failed in 0.28s =====
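For context, the error above comes from fully consuming the top-level streamed list: the tokenizer signals end of input while the list is still open. A minimal sketch of that consumption pattern (the file path and JSON shape are just illustrative; the real test generates the file randomly and runs both tokenizers):

```python
# Illustrative only: drain a large top-level JSON array with json-stream's
# streaming API. If the tokenizer reports EOF before the closing "]", this
# surfaces as the "Unterminated list at end of file" ValueError above.
import json_stream

with open("/tmp/random.json") as f:      # placeholder path for a generated file
    for item in json_stream.load(f):     # top-level value assumed to be a list
        pass                             # just consume the stream
```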
@daggaz Could that be related to the bug you mentioned in https://github.com/daggaz/json-stream/issues/45#issuecomment-1622523515?
hmm...I need to get back on this!
Maybe worth mentioning: while doing benchmarks to check for performance regressions in https://github.com/smheidrich/py-json-stream-rs-tokenizer/pull/91, I noticed that this branch is only ~3-4 times slower than the Rust tokenizer. At first I thought I had a regression, but the other branches still came in at 10-15 times slower. So I guess doing read(1) instead of the proper buffering introduced in this PR was the major bottleneck all along, not the "purely computational" Python instructions as I had thought.
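To make the read(1) vs. buffering point concrete, here's a toy comparison (not json-stream's actual code) showing how per-call overhead dominates when pulling one character at a time from a stream:

```python
# Toy micro-benchmark: reading a stream one character at a time vs. in
# buffered chunks. The Python-level call overhead of read(1) dominates,
# which is consistent with the speedup observed on this branch.
import io
import time

data = io.StringIO("x" * 1_000_000)

# read(1) style: one method call per character
start = time.perf_counter()
while data.read(1):
    pass
print("read(1):", time.perf_counter() - start, "s")

data.seek(0)

# buffered style: pull a chunk, then iterate over it in memory
start = time.perf_counter()
while True:
    chunk = data.read(65536)
    if not chunk:
        break
    for ch in chunk:
        pass
print("chunked:", time.perf_counter() - start, "s")
```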