String streams
In response to #45
While working on a 2nd attempt at implementing this in the Rust tokenizer (https://github.com/smheidrich/py-json-stream-rs-tokenizer/pull/89), I noticed that my benchmarking test (which uses large randomly generated JSON files) exhibits transient failures for the Python tokenizer from this branch:
pytest error log
self = <TransientStreamingJSONList: TRANSIENT, STREAMING>

    def _iter_items(self):
        while True:
            if not self.streaming:
                return
            self._clear_child()
            try:
                item = self._load_item()
            except StopIteration:
                if self.streaming:
>                   raise ValueError(self.INCOMPLETE_ERROR)
E                   ValueError: Unterminated list at end of file

json-stream/src/json_stream/base.py:53: ValueError
----- Captured stdout call -----
generating random json...
generated random json /tmp/tmpshsa5y6s/random.json with size 1.000e+05 bytes
running with rust tokenizer
rust time: 0.03 s
running with python tokenizer
----- Captured stderr call -----
100%|██████████| 100000.0/100000.0 [00:00<00:00, 1289610.69it/s]
100%|██████████| 100/100 [00:00<00:00, 3080.47it/s]
100%|██████████| 100/100 [00:00<00:00, 1221.22it/s]
===== short test summary info =====
FAILED tests/test_via_benchmark.py::test_via_benchmark - ValueError: Unterminated list at end of file
===== 1 failed in 0.28s =====
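For context, the error above comes from fully consuming the top-level streamed list: the tokenizer signals end of input while the list is still open. A minimal sketch of that consumption pattern (the file path and JSON shape are just illustrative; the real test generates the file randomly and runs both tokenizers):

```python
# Illustrative only: drain a large top-level JSON array with json-stream's
# streaming API. If the tokenizer reports EOF before the closing "]", this
# surfaces as the "Unterminated list at end of file" ValueError above.
import json_stream

with open("/tmp/random.json") as f:      # placeholder path for a generated file
    for item in json_stream.load(f):     # top-level value assumed to be a list
        pass                             # just consume the stream
```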
@daggaz Could that be related to the bug you mentioned in https://github.com/daggaz/json-stream/issues/45#issuecomment-1622523515?
hmm...I need to get back on this!
Maybe worth mentioning: while doing benchmarks to check for performance regressions in https://github.com/smheidrich/py-json-stream-rs-tokenizer/pull/91, I noticed that this branch is only ~3-4 times slower than the Rust tokenizer. At first I thought I had a regression, but the other branches still came in at 10-15 times slower. So I guess doing read(1) instead of the proper buffering introduced in this PR was the major bottleneck all along, not the "purely computational" Python instructions as I had thought.
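To make the read(1) vs. buffering point concrete, here's a toy comparison (not json-stream's actual code) showing how per-call overhead dominates when pulling one character at a time from a stream:

```python
# Toy micro-benchmark: reading a stream one character at a time vs. in
# buffered chunks. The Python-level call overhead of read(1) dominates,
# which is consistent with the speedup observed on this branch.
import io
import time

data = io.StringIO("x" * 1_000_000)

# read(1) style: one method call per character
start = time.perf_counter()
while data.read(1):
    pass
print("read(1):", time.perf_counter() - start, "s")

data.seek(0)

# buffered style: pull a chunk, then iterate over it in memory
start = time.perf_counter()
while True:
    chunk = data.read(65536)
    if not chunk:
        break
    for ch in chunk:
        pass
print("chunked:", time.perf_counter() - start, "s")
```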