
Put query is failing on large json zng file

muthu-rk opened this issue 2 years ago • 12 comments

@nwt: This is related to #3881. The large JSON files used in this issue are the ones attached to #3881.

I am using zq built from the main branch after #3881 was merged. Unfortunately, -version is showing unknown.

update.zed:

over arr | switch id (
case 2 => put level0.level1.value:="2-changed"
default => pass
)
| merge id
| arr:=collect(this)

$ ~/go/bin/zq -I update.zed 500mb.zng | zq -Z 'over arr | id==2' -
stdio:stdin: format detection error
    zeek: line 1: bad types/fields definition in zeek header
    zjson: line 1: invalid character 'O' looking for beginning of value
    zson: zson syntax error
    zng: zngio: uncompressed length exceeds MaxSize
    zng21: zng type ID out of range
    csv: record on line 1023: wrong number of fields
    json: invalid character 'O' looking for beginning of value
    parquet: auto-detection not supported
    zst: auto-detection not supported

The same issue is observed for the 100 MB file too. However, the above query works for a 10 MB zng file: 10mb.zng.zip

Interestingly, a cut operation like the one below works on the 1 GB zng file too.

$ ~/go/bin/zq -z -i zng 'over arr | where id==101 | cut level0.level1' 1gb.zng

Are there any command-line parameters I can tweak to make this query work on large files?

muthu-rk avatar May 13 '22 17:05 muthu-rk

The auto-detector can't handle the large files, and it's the second zq in the pipeline above that's causing the problem. You can avoid the auto-detector with -i. If you run this, it should work:

~/go/bin/zq -I update.zed 500mb.zng | zq -i zng -Z 'over arr | id==2' -
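
For context on why the error output above lists a separate failure for every format: auto-detection tries each candidate reader in turn and only reports an error once all of them have failed. Below is a minimal Go sketch of that pattern (illustrative only, not the actual zq source; the probe errors are stand-ins echoing the report above):

package main

import (
	"errors"
	"fmt"
	"strings"
)

// probe pairs a format name with a function that tries to read the input
// as that format and reports why it can't.
type probe struct {
	name string
	try  func(data []byte) error
}

// detect attempts each format in turn; if none succeeds, it returns all
// of the per-format failures, which is why the error lists one line per format.
func detect(data []byte, probes []probe) (string, error) {
	var failures []string
	for _, p := range probes {
		err := p.try(data)
		if err == nil {
			return p.name, nil
		}
		failures = append(failures, fmt.Sprintf("%s: %v", p.name, err))
	}
	return "", fmt.Errorf("format detection error\n\t%s", strings.Join(failures, "\n\t"))
}

func main() {
	probes := []probe{
		{"zng", func([]byte) error { return errors.New("zngio: uncompressed length exceeds MaxSize") }},
		{"json", func([]byte) error { return errors.New("invalid character 'O' looking for beginning of value") }},
	}
	if _, err := detect(nil, probes); err != nil {
		fmt.Println(err)
	}
}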

mccanne avatar May 13 '22 20:05 mccanne

We should probably improve the ZNG auto-detector...

mccanne avatar May 13 '22 20:05 mccanne

@mccanne: With the -i option, I don't see the error anymore. But I don't get any output printed on stdout from the second zq.

Also, I had missed calling the latest zq binary the second time. I fixed the command and ran this:

~/go/bin/zq -I update.zed 500mb.zng | ~/go/bin/zq -i zng -Z 'over arr | id==2' -

Can you try this at your end to see if it works for you? I see no output for any of these files:

100mb 500mb 1gb

muthu-rk avatar May 13 '22 20:05 muthu-rk

@muthu-rk: The reason you're seeing no output is that there are no records in the input files where the id field is equal to 2, so the lack of output is a correct reflection of "no matches" for that == expression. The following shows that the lowest id values in each file start much higher.

$ for file in 100mb.zng 500mb.zng 1gb.zng ; do
    echo "${file}:"
    zq -I update.zed -i zng $file | zq -i zng -f table 'over arr | count() by id | sort id | head 5' -
    echo
  done

100mb.zng:
id    count
30828 1
30829 1
30830 1
30831 1
30832 1

500mb.zng:
id     count
270834 1
270835 1
270836 1
270837 1
270838 1

1gb.zng:
id     count
570834 1
570835 1
570836 1
570837 1
570838 1

So targeting those id values should show what you seek.

$ zq -I update.zed -i zng 100mb.zng | zq -i zng -Z 'over arr | id==30828' -
{
    id: 30828,
    next: 30829,
...

philrz avatar May 13 '22 21:05 philrz

@philrz: Interesting.

Could you help me understand why I get the output below for this query? This suggests that a value with id==2 is present in that zng file. Is my understanding correct?

$  ~/go/bin/zq  -i zng -Z 'over arr | id==2' 500mb.zng 
{
    id: 2,
    next: 3,
    _id: "625664e3ea574290b931f172",
    index: 0,
    guid: "e300c649-6f2c-4a60-9b51-bc1be08d0a14",
    isActive: false,
    balance: ",764.44",
    picture: "http://placehold.it/32x32",
    age: 38,
    eyeColor: "brown",
    name: "Hart Kline",
    gender: "male",
    company: "LUNCHPAD",
    email: "[email protected]",
    phone: "+1 (840) 496-2259",
    address: "643 Clara Street, Groveville, North Carolina, 4785",
    registered: "2015-11-02T04:02:38 -06:-30",
    latitude: 82.284556,
    longitude: -53.359112,
    tags: [
        "ex",
        "duis",
        "commodo",
        "et",
        "ad",
        "voluptate",
        "cupidatat"
    ],
    friends: [
        {
            id: 0,
            name: "Bradford Shaffer"
        },
        {
            id: 1,
            name: "Monroe Kent"
        },
        {
            id: 2,
            name: "John Carey"
        }
    ],
    greeting: "Hello, Hart Kline! You have 9 unread messages.",
    favoriteFruit: "strawberry",
    level0: {
        tags: [
            1,
            2,
            3
        ],
        value: "0",
        level1: {
            tags: [
                1,
                2,
                3
            ],
            value: "1",
            level2: {
                tags: [
                    1,
                    2,
                    3
                ],
...

muthu-rk avatar May 13 '22 21:05 muthu-rk

@muthu-rk: Indeed, it looks like there's something wrong with collect() such that it's only gathering a subset of the records into the array, and the record with id==2 happens to be one that's getting dropped.

To pick this apart, I made a variation of your "update" Zed script that skips the collect() step.

$ cat update_dont_collect.zed
over arr | switch id (
case 2 => put level0.level1.value:="2-changed"
default => pass
)
| merge id

Using this to turn the 100mb input file into individual records, we get an even count of 60k.

$ zq -i zng -I update_dont_collect.zed 100mb.zng > 60k.zng

$ zq 'count()' 60k.zng 
{count:60000(uint64)}

However, if I then assemble those individual records into an array and output its length, it comes up way short.

$ zq 'arr:=collect(this) | yield len(arr)' 60k.zng 
29173

I'm guessing some memory limit is being hit and it's silently giving bad results rather than erroring out, but I can't say for sure. Someone who knows the code, like @mccanne, @nwt, or @mattnibs, could hopefully speak to that.

philrz avatar May 13 '22 22:05 philrz

Ah yes. Even with my novice abilities to peek at the code, I'm reminded of #1813 and #1494 that speak of limits that likely come into play here. The comment in the code says:

For now we silently discard entries to maintain the size limit.

Speaking from my own experience, the "silent" part made this pretty hairy to debug. I'm not sure if other changes in Zed in the interim might have made it easier to raise or eliminate the limit altogether. As before, hopefully the core Zed dev team can chime in.
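
To make the failure mode concrete, here's a rough Go sketch of the pattern that code comment describes (hypothetical; this is not the actual zq source, and the names and the 32-byte cap are placeholders):

package main

import "fmt"

// collector is a stand-in for an aggregator with a fixed byte budget.
type collector struct {
	maxBytes int // the kind of size limit discussed in #1813 / #1494
	used     int
	values   [][]byte
}

// add appends a value unless doing so would exceed maxBytes; past that
// point values are dropped with no error or warning, so a short
// len(arr) is the only visible symptom downstream.
func (c *collector) add(v []byte) {
	if c.used+len(v) > c.maxBytes {
		return // silently discarded
	}
	c.used += len(v)
	c.values = append(c.values, v)
}

func main() {
	c := &collector{maxBytes: 32}
	for i := 0; i < 10; i++ {
		c.add(make([]byte, 8)) // offer 10 values of 8 bytes; only 4 fit
	}
	fmt.Println(len(c.values)) // prints 4, not 10
}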

philrz avatar May 14 '22 00:05 philrz

Oops. Yeah memory limit in collect. We'll fix this!

@muthu-rk thanks for helping us QA. We haven't tested extensively enough on large individual values like yours, as most of the use cases to date have involved large numbers of smaller values, but this is an important area for us, so we appreciate the help.

mccanne avatar May 14 '22 12:05 mccanne

@mccanne: Thanks for the response. Kindly let me know if you need more info. I would be glad to help out.

muthu-rk avatar May 16 '22 05:05 muthu-rk

@mccanne: This issue is a blocker for continuing the large JSON tests. Can you tell me when this will get fixed?

Thanks.

muthu-rk avatar May 23 '22 18:05 muthu-rk

@muthu-rk: We should have something for you on this in the next 24 hours.

nwt avatar May 24 '22 16:05 nwt

@muthu-rk: We've merged #3914, which should get you past this problem. We're also working on a couple more related changes:

  1. A flag to let you set that limit from the command line (in case 1 GB still isn't enough)
  2. Informing you if collect() exceeds that limit (a rough sketch of this is below)
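
Regarding item 2, here is a rough Go sketch of what surfacing that condition could look like (hypothetical; names, the limit value, and the message are placeholders, not the actual implementation):

package main

import "fmt"

// boundedCollector remembers whether it had to drop anything so the
// limit overrun can be reported instead of passing silently.
type boundedCollector struct {
	maxBytes  int
	used      int
	values    [][]byte
	truncated bool
}

func (c *boundedCollector) add(v []byte) {
	if c.used+len(v) > c.maxBytes {
		c.truncated = true // record the overflow instead of hiding it
		return
	}
	c.used += len(v)
	c.values = append(c.values, v)
}

// warning returns a message to surface to the user when the limit was
// exceeded, e.g. alongside the query results.
func (c *boundedCollector) warning() string {
	if !c.truncated {
		return ""
	}
	return fmt.Sprintf("collect() exceeded its %d-byte limit; array truncated", c.maxBytes)
}

func main() {
	c := &boundedCollector{maxBytes: 16}
	for i := 0; i < 5; i++ {
		c.add(make([]byte, 8))
	}
	if msg := c.warning(); msg != "" {
		fmt.Println(msg)
	}
}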

nwt avatar May 25 '22 19:05 nwt

Of the two items in the most recent comment above, the first was addressed via #3921, and new issue #4102 has been opened to track the second. Therefore I'm closing this issue.

philrz avatar Sep 21 '22 22:09 philrz