
ak.from_json should take a `limit` parameter to only read what is necessary

Open jpivarski opened this issue 1 year ago • 0 comments

Description of new feature

  • limit=NONNEGATIVE-INT should be passed to C++ so that JSON reading stops as soon as the number of entries in the ArrayBuilder reaches limit.
  • limit=(NONNEGATIVE-INT, NONNEGATIVE-INT) should pass both values; the first is a lower limit. Both are non-negative: they cannot count from the end of a JSON document or stream. The lower limit doesn't prevent any reading or parsing, but it does prevent data before it from being passed into the ArrayBuilder, which can save a lot of memory.

These should apply equally well to single-document and line-delimited mode.
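To make the intended semantics concrete, here is a minimal pure-Python sketch for the line-delimited case. It is not the proposed C++ implementation: `normalize_limit` and `read_json_lines` are hypothetical names, a plain list stands in for the ArrayBuilder, and treating the pair as (start, stop) entry positions, like a slice, is an assumption about what the two values would mean.

```python
import io
import json


def normalize_limit(limit):
    """Turn None, an int, or a (lower, upper) pair into (lower, upper)."""
    if limit is None:
        return 0, None
    if isinstance(limit, int):
        lower, upper = 0, limit
    else:
        lower, upper = limit
    if lower < 0 or upper < 0:
        raise ValueError("limits must be non-negative")
    return lower, upper


def read_json_lines(stream, limit=None):
    """Read line-delimited JSON, keeping only entries in [lower, upper).

    Hypothetical stand-in for the requested behavior: a list plays the
    role of the ArrayBuilder.
    """
    lower, upper = normalize_limit(limit)
    kept = []
    index = 0
    # Stop before reading another line once the upper limit is reached.
    while upper is None or index < upper:
        line = stream.readline()
        if line == "":  # end of stream
            break
        if not line.strip():  # skip blank lines
            continue
        record = json.loads(line)  # entries below `lower` are still parsed...
        if index >= lower:
            kept.append(record)  # ...but only these are stored
        index += 1
    return kept


stream = io.StringIO('{"x": 1}\n{"x": 2}\n{"x": 3}\n{"x": 4}\n')
first_two = read_json_lines(stream, limit=2)
rest_of_stream = stream.read()  # lines past the limit were never consumed
```

The same loop would apply to single-document mode if the entries of the top-level JSON array were counted instead of lines.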

A similar feature for ak.from_iter is not needed because Python already has an itertools.islice that users can use. (If we were to implement limit on Python iterators for symmetry, we'd just use itertools.islice internally.)
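For reference, a sketch of the itertools.islice approach that users can already apply today. The `records` generator and the `limited` helper are hypothetical, and the (lower, upper) pair is again assumed to mean start and stop positions.

```python
import itertools


def records():
    """Hypothetical data source: a long stream of Python records."""
    for i in range(1_000_000):
        yield {"x": i, "y": [i] * (i % 3)}


def limited(iterable, limit):
    """Apply an int or (lower, upper) limit to any Python iterable."""
    if isinstance(limit, int):
        return itertools.islice(iterable, limit)
    lower, upper = limit
    return itertools.islice(iterable, lower, upper)


# Only the first three records are ever generated.
first_three = list(limited(records(), 3))

# Records 0 and 1 are generated but discarded; generation stops after record 4.
window = list(limited(records(), (2, 5)))
```

Because ak.from_iter accepts any Python iterable, the result of `limited(...)` can be passed to it directly instead of being collected into a list first.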

ak.from_parquet has a way to select row groups, but it would be more intuitive to work with the same sort of limit argument. We'd just need to look in the Parquet metadata to translate the entry limit into row group numbers, and then slice off the unwanted parts of the first and last row groups, as Uproot already does with entry_start and entry_stop. For a format like Parquet, negative limits, counting from the end of the file, would be doable.
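The translation from an entry range to row groups could look roughly like the sketch below. `row_groups_for_range` is a hypothetical helper, not part of Awkward Array; it assumes the number of rows in each row group has already been read from the Parquet metadata and is passed in as a plain list.

```python
def row_groups_for_range(row_group_sizes, start, stop):
    """Map an entry range onto row groups.

    Returns (row_group_numbers, local_slice): the row groups that must be
    read, and the slice to apply to their concatenation to drop the
    unwanted head of the first group and tail of the last one.
    """
    total = sum(row_group_sizes)
    # Resolve None and negative values against the total number of entries,
    # which is known up front for Parquet (unlike a JSON stream).
    start, stop, _ = slice(start, stop).indices(total)
    if stop <= start:
        return [], slice(0, 0)

    selected = []
    first_offset = None
    offset = 0
    for number, size in enumerate(row_group_sizes):
        group_start, group_stop = offset, offset + size
        # Keep non-empty row groups that overlap [start, stop).
        if size > 0 and group_start < stop and group_stop > start:
            if first_offset is None:
                first_offset = group_start
            selected.append(number)
        offset = group_stop

    return selected, slice(start - first_offset, stop - first_offset)


sizes = [100, 100, 100]  # rows per row group, as reported by the metadata
groups, local = row_groups_for_range(sizes, 150, 250)
tail_groups, tail_local = row_groups_for_range(sizes, -50, None)
```

Here entries 150 to 250 touch row groups 1 and 2, and the local slice trims 50 entries from each end of the 200 entries read. The negative start in the second call is resolved against the file's total length.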

jpivarski, Apr 20 '23 18:04