
How to handle data which does not fit into memory?

Open jobarr-amzn opened this issue 9 months ago • 5 comments

When jaq slurps a file (e.g. in JSON Lines format), it uses json_array, which collect()s from json_slice.

My Rust is not strong, but it looks to me like this requires two passes over the data: one to parse all of it into memory and a second to process it. Is that right, or am I missing something?

jobarr-amzn avatar Mar 28 '25 15:03 jobarr-amzn

You are right, this takes two passes. This is unfortunate, but there is no easy way around it, I'm afraid. Some months ago, I tried to come up with a value data type that would support lazily loaded data. However, I did not find a way to achieve this. Part of the problem is that it is really hard to know which part of a value has to be kept around and which does not. For example, if you have a filter like .[2], .[0], then the interpreter has to figure out that it expects an array, of which it has to preserve the first element, then output the third element, and only then output the first. Given that jq is a lazy, Turing-complete language with side effects and sharing, this makes such endeavours a nightmare, especially because of sharing. And even if such an endeavour succeeded, it would probably significantly reduce performance for use cases where all the data does fit into memory.
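
To make the ordering concrete: with a three-element array, .[2], .[0] outputs the third element before the first, so a lazy evaluator would have to buffer element 0 while scanning ahead to element 2:

    $ echo '[10, 20, 30]' | jaq '.[2], .[0]'
    30
    10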

The "jq way" of dealing with data that does not fit into memory is: Do not slurp your input; rather fold over your inputs, e.g. with reduce or foreach.
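
For instance, to sum a numeric field across a large JSON Lines input while holding only the accumulator in memory (big.jsonl and its size field are made up for illustration):

    # reduce keeps only the running total in memory
    jaq -n 'reduce inputs as $rec (0; . + $rec.size)' big.jsonl

    # foreach additionally emits each intermediate total
    jaq -n 'foreach inputs as $rec (0; . + $rec.size; .)' big.jsonl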

01mf02 avatar Mar 31 '25 11:03 01mf02

You can find an example of input reduction in the manual.

01mf02 avatar Mar 31 '25 11:03 01mf02

Thanks for the reference! I'll have to see if I can successfully apply that to my domain.

The reason I ask is that my team is working on an integration of jaq with the Ion data format (see amazon-ion/ion-cli#193). Ion's type system is a superset of JSON's, and the notion of a value stream of independent top-level values (similar to the JSON Lines format) is core to many Ion data sets. Ion data sets can also grow quite large; it is not at all uncommon for our customers to have input files in the tens of gigabytes.

Looking at the jq manual, I see a section on streaming, and also that streaming is among the advanced jq features that jaq does not aim to support.

This makes sense to me. I've never used jq or jaq for large inputs in this style, because we've never had this integration before. We have home-grown tools, less versatile than jq, for doing some data manipulation; as you noted, there's a tension between expressive power and performance constraints.

Even then, we generally won't need a streaming feature; I think most use cases can be met without it.

jobarr-amzn avatar Mar 31 '25 15:03 jobarr-amzn

Just a heads up: don't confuse jq's streaming functionality, which reads partial JSON values (the thing you linked), with input streaming, which reads many full but separate JSON values (e.g. echo '1 2 3' | jaq -n 'reduce inputs as $i (0; . + $i)').
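
Roughly, the difference looks like this (the --stream example uses jq, since jaq does not implement that option):

    # jq's --stream decomposes a single value into [path, leaf] events:
    $ echo '{"a":[1,2]}' | jq -cn --stream 'inputs'
    [["a",0],1]
    [["a",1],2]
    [["a",1]]
    [["a"]]

    # input streaming reads many whole values, one at a time:
    $ echo '1 2 3' | jaq -cn '[inputs]'
    [1,2,3]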

Maybe in your Ion use case, what you're looking for is a custom input iterator that iterates over separate Ion values? But I'm not sure if jaq supports that yet. @01mf02

wader avatar Mar 31 '25 19:03 wader

@wader, thanks for your clarification regarding streaming. Indeed, Ion's value streams seem to be equivalent to JSON streams, which both jq and jaq support. What jaq does not support is jq's --stream option, but from what you write, @jobarr-amzn, that is not what you are looking for anyway.

If you can write a function fn parse_ion_stream(...) -> impl Iterator<Item = IonVal> such that IonVal: jaq_core::ValT, then jaq should be able to process it "lazily", including support for the inputs function.
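
For illustration, a minimal sketch of that shape (IonVal here is a placeholder; real Ion parsing, e.g. via the ion-rs crate, and the jaq_core::ValT implementation are elided):

    use std::io::BufRead;

    // Placeholder for a real Ion value type; an actual integration would
    // wrap parsed Ion values and implement jaq_core::ValT for this type.
    #[derive(Clone, Debug)]
    struct IonVal(String);

    // Yield one top-level value at a time. Nothing past the current value
    // is read until the interpreter pulls the next item, so memory use is
    // bounded by a single value rather than the whole stream.
    fn parse_ion_stream<R: BufRead>(reader: R) -> impl Iterator<Item = std::io::Result<IonVal>> {
        // "One value per line" stands in for a real incremental Ion reader.
        reader.lines().map(|line| line.map(IonVal))
    }

    fn main() {
        let input = std::io::Cursor::new("1\n2\n3\n");
        for val in parse_ion_stream(input) {
            println!("{:?}", val.unwrap());
        }
    }

The point is just the shape: because the values come from an Iterator, the interpreter can pull them on demand (e.g. via inputs) instead of collecting them up front.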

01mf02 avatar Apr 08 '25 07:04 01mf02