Unique type definitions can consume all memory

Open philrz opened this issue 2 years ago • 0 comments

At the time this issue is being opened, Zed is at commit 1ec7052.

@mattnibs recently pointed out that the kinds of changes in #4555 make it particularly easy to make Zed consume all available memory by creating lots of unique type definitions. For an easy repro, consider the following script.

$ cat manytypes.py 
#!/usr/bin/env python3

num=1
while True:
  print('{"' + str(num) + '": ' + str(num) + '}')
  num += 1

When run on an AWS t2.xlarge (16 GB of memory), it gets past 19 million values before all memory is consumed and the system hangs.

$ ./manytypes.py | zq -z -
...
{"19198499":19198499}
{"19198500":19198500}
{"19198501":19198501}

We can certainly document this as a known limitation to encourage users to structure their data in ways that won't bump into this (e.g., use a Zed "map" type). However, this kind of data is legal in formats like JSON, and I think Zed currently needs to be able to read such JSON data in full to turn it into a map. Also, jq doesn't have this same limitation (in a test I observed its memory usage at a flat 872 KB to reach this same point), which is unsurprising given its approach to "stateless dataflow". Therefore, it might be worth finding a way to tolerate this kind of input and/or fail more gracefully.
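
To illustrate why this input is costly for a reader that tracks types, while per-value processing like jq's is not, here is a rough Python sketch. It is not Zed's actual implementation: `TypeTable`, `record_type_of`, and `count_types` are hypothetical names, and the sketch assumes only that each distinct record shape is interned once and retained for the life of the stream.

```python
import json


class TypeTable:
    """Hypothetical stand-in for a per-stream table of interned record types."""

    def __init__(self):
        self._ids = {}

    def record_type_of(self, value):
        # Assumption: a record's type is its ordered field names plus value types.
        shape = tuple((k, type(v).__name__) for k, v in value.items())
        # Reuse the id of a shape seen before, otherwise allocate the next id.
        return self._ids.setdefault(shape, len(self._ids))

    def __len__(self):
        return len(self._ids)


def count_types(lines):
    """Return how many distinct record types the table holds after reading lines."""
    table = TypeTable()
    for line in lines:
        table.record_type_of(json.loads(line))
    return len(table)


# Same shape as the repro: every record introduces a new field name.
unique_keys = ['{"%d": %d}' % (n, n) for n in range(1, 1001)]
# The varying part moved into values, so every record shares one shape.
same_keys = ['{"key": "%d", "value": %d}' % (n, n) for n in range(1, 1001)]

print(count_types(unique_keys))  # 1000
print(count_types(same_keys))    # 1
```

In this sketch the table grows by one entry per record for the repro-style input and stays at a single entry once the varying part lives in values rather than field names, which is roughly the effect that restructuring the data would be expected to have.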

Nov 22 '23 01:11 philrz