union with count limit

Open allukaZod opened this issue 2 years ago • 3 comments

Hi, thank you for the great tool. Here is my problem:

When dealing with a large set of data, such as

{"tag": "1", "ip": "1.1.1.1", "category": "some_cat1"}
{"tag": "2", "ip": "1.1.1.2", "category": "some_cat1"}
{"tag": "3", "ip": "1.1.1.3", "category": "some_cat1"}
{"tag": "4", "ip": "1.1.1.4", "category": "some_cat2"}
...
{"tag": "100000", "ip": "11.1.1.1", "category": "some_cat2"}

When I use the union function to collect ip by category or tag, the zq command is: union(ip) by category. The result is one large list of IPs per group, such as

{"union": ["1.1.1.1", "1.1.1.2", "1.1.1.3", ... "11.1.1.1"], "category": "some_cat1"}

What I want is to set a limit on union that caps the length of "union" and splits the result into multiple records, each holding at most that many values. For the example above, if I set the limit to 2, as in union(ip) by category limit 2, the result would be:

{"union": ["1.1.1.1", "1.1.1.2"], "category": "some_cat1"}
{"union": ["1.1.1.3", "1.1.1.4"], "category": "some_cat1"}
...
{"union": ["11.1.1.1", "11.1.1.2"], "category": "some_cat1"}
{"union": ["1.11.1.3", "11.1.1.4"], "category": "some_cat1"}

allukaZod avatar Dec 14 '23 07:12 allukaZod

Hi @allukaZod! Thanks for your interest in Zed.

In your final example output, I think you probably meant to have some entries with some_cat2? In any case, I think I understand the question, and if so, it's an interesting one. I'm looking into whether Zed already has the building blocks to do this. I'll circle back after I've done more research.

philrz avatar Dec 14 '23 21:12 philrz

Hello again @allukaZod. I've got something close to what you were seeking. Using the building blocks currently available in Zed, this approach creates the full sets via union() and then emits them in batches of the requested size after the fact. It's admittedly a little hacky, and we might design a more direct way to achieve this in the future. For now I've wrapped the functionality in a User-Defined Operator so you don't have to deal with it in the main part of your Zed pipeline. It also offers an opportunity to understand some of the more advanced parts of the language, like the spread operator and lateral subqueries.

Here's the user-defined operator in a file called batches.zed:

op emit_batches(complex_val, batch_size, group):
(
  over [...complex_val] with group =>
  (
    {id:(count()-1)/batch_size,val:this}
    | collect(val) by id
    | yield {group:group, batch:collect}
  )
)
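
For readers less familiar with Zed, the batching idea inside the operator can be sketched in Python. This is only an illustrative analogue, not Zed code and not how Zed implements it: the function name `emit_batches` is reused for readability, and `enumerate` stands in for the running `count()`. Each value gets a zero-based position, integer division by the batch size yields a batch id, and values sharing an id are collected together.

```python
from collections import defaultdict


def emit_batches(values, batch_size, group):
    """Split `values` into batches of at most `batch_size` items.

    Hypothetical Python analogue of the Zed operator above: number the
    items 0, 1, 2, ... and integer-divide by batch_size so that
    consecutive items share a batch id.
    """
    batches = defaultdict(list)
    for position, val in enumerate(values):
        # Plays the same role as (count()-1)/batch_size in the operator.
        batch_id = position // batch_size
        batches[batch_id].append(val)
    # Emit one record per batch id, in ascending id order.
    return [{"group": group, "batch": batches[i]} for i in sorted(batches)]


if __name__ == "__main__":
    ips = [f"1.1.1.{n}" for n in range(1, 11)]
    for record in emit_batches(ips, 3, "some_cat1"):
        print(record)
```

The sketch returns batches in a fixed order; the Zed output shown in this thread does not guarantee any particular ordering of groups or batches.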

And some sample data in a file data.json similar to what you showed:

{"tag": "1", "ip": "1.1.1.1", "category": "some_cat1"}
{"tag": "2", "ip": "1.1.1.2", "category": "some_cat1"}
{"tag": "3", "ip": "1.1.1.3", "category": "some_cat1"}
{"tag": "4", "ip": "1.1.1.4", "category": "some_cat1"}
{"tag": "5", "ip": "1.1.1.5", "category": "some_cat1"}
{"tag": "6", "ip": "1.1.1.6", "category": "some_cat1"}
{"tag": "7", "ip": "1.1.1.7", "category": "some_cat1"}
{"tag": "8", "ip": "1.1.1.8", "category": "some_cat1"}
{"tag": "9", "ip": "1.1.1.9", "category": "some_cat1"}
{"tag": "10", "ip": "1.1.1.10", "category": "some_cat1"}
{"tag": "11", "ip": "1.1.1.11", "category": "some_cat2"}
{"tag": "12", "ip": "1.1.1.12", "category": "some_cat2"}
{"tag": "13", "ip": "1.1.1.13", "category": "some_cat2"}
{"tag": "14", "ip": "1.1.1.14", "category": "some_cat2"}
{"tag": "15", "ip": "1.1.1.15", "category": "some_cat2"}
{"tag": "16", "ip": "1.1.1.16", "category": "some_cat2"}
{"tag": "17", "ip": "1.1.1.17", "category": "some_cat2"}
{"tag": "18", "ip": "1.1.1.18", "category": "some_cat2"}
{"tag": "19", "ip": "1.1.1.19", "category": "some_cat2"}
{"tag": "20", "ip": "1.1.1.20", "category": "some_cat2"}

And an example that ties it all together:

$ zq -I batches.zed 'union(ip) by category | emit_batches(union, 2, category)' data.json
{group:"some_cat2",batch:["1.1.1.11","1.1.1.12"]}
{group:"some_cat2",batch:["1.1.1.13","1.1.1.14"]}
{group:"some_cat2",batch:["1.1.1.15","1.1.1.16"]}
{group:"some_cat2",batch:["1.1.1.17","1.1.1.18"]}
{group:"some_cat2",batch:["1.1.1.19","1.1.1.20"]}
{group:"some_cat1",batch:["1.1.1.3","1.1.1.4"]}
{group:"some_cat1",batch:["1.1.1.5","1.1.1.6"]}
{group:"some_cat1",batch:["1.1.1.7","1.1.1.8"]}
{group:"some_cat1",batch:["1.1.1.9","1.1.1.10"]}
{group:"some_cat1",batch:["1.1.1.1","1.1.1.2"]}

However, I did bump into a new bug, #4943, while working on this. The effects are evident if we batch the same input data into groups of three.

$ zq -I batches.zed 'union(ip) by category | emit_batches(union, 3, category)' data.json
{group:"some_cat1",batch:["1.1.1.1","1.1.1.2","1.1.1.3"]}
{group:"some_cat1",batch:["1.1.1.4","1.1.1.5","1.1.1.6"]}
{group:"some_cat1",batch:["1.1.1.7","1.1.1.8","1.1.1.9"]}
{group:"some_cat1",batch:["1.1.1.10"]}
{group:"some_cat2",batch:["1.1.1.11","1.1.1.12"]}
{group:"some_cat2",batch:["1.1.1.13","1.1.1.14","1.1.1.15"]}
{group:"some_cat2",batch:["1.1.1.16","1.1.1.17","1.1.1.18"]}
{group:"some_cat2",batch:["1.1.1.19","1.1.1.20"]}

i.e., for some_cat2 we should have had three groups of 3 and one group of 1 like we had for some_cat1.
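
As a quick sanity check on that expectation, the batch sizes for a group follow from plain integer arithmetic. The helper below is a hypothetical Python sketch (the name `expected_batch_sizes` is made up for illustration and is not part of Zed): a group of 10 values split into batches of 3 should give three full batches plus one remainder batch.

```python
def expected_batch_sizes(total, batch_size):
    """Return the sizes of the batches that `total` items should split into."""
    full, remainder = divmod(total, batch_size)
    # `full` complete batches, plus one short batch if anything is left over.
    return [batch_size] * full + ([remainder] if remainder else [])


if __name__ == "__main__":
    print(expected_batch_sizes(10, 3))  # [3, 3, 3, 1]
    print(expected_batch_sizes(10, 2))  # [2, 2, 2, 2, 2]
```

The buggy some_cat2 output above has sizes 2, 3, 3, 2, which still adds up to 10 values but does not match this split.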

Anyway, I figured I'd share what I've got thus far in case you can make use of it despite that bug. I'll update again when we have that fix for #4943. Let me know if you have any other questions in the meantime.

philrz avatar Dec 17 '23 21:12 philrz

Thanks a lot for the suggestion, that's exactly what I'm looking for!

And yes, some_cat2 should be the other group.

allukaZod avatar Dec 19 '23 02:12 allukaZod

@allukaZod: Not sure if you're still watching this issue, but FYI, the issue #4943 I mentioned above has been fixed, so the last example shown previously now generates the expected output.

$ zq -I batches.zed 'union(ip) by category | emit_batches(union, 3, category)' data.json
{group:"some_cat1",batch:["1.1.1.1","1.1.1.2","1.1.1.3"]}
{group:"some_cat1",batch:["1.1.1.4","1.1.1.5","1.1.1.6"]}
{group:"some_cat1",batch:["1.1.1.7","1.1.1.8","1.1.1.9"]}
{group:"some_cat1",batch:["1.1.1.10"]}
{group:"some_cat2",batch:["1.1.1.11","1.1.1.12","1.1.1.13"]}
{group:"some_cat2",batch:["1.1.1.14","1.1.1.15","1.1.1.16"]}
{group:"some_cat2",batch:["1.1.1.17","1.1.1.18","1.1.1.19"]}
{group:"some_cat2",batch:["1.1.1.20"]}

This fix is currently in Zed's tip of main and will be included in the next GA release, which I estimate will come out next week.

philrz avatar Jun 07 '24 21:06 philrz