zed icon indicating copy to clipboard operation
zed copied to clipboard

Optimize filters on pool keys

Open mattnibs opened this issue 2 years ago • 1 comments

If a user is filtering on a pool key field (i.e. ts >= now()-3h, the optimizer should recognize this and limit the range of scanned data.

mattnibs avatar Mar 23 '22 22:03 mattnibs

This came up here https://github.com/brimdata/brim/issues/2284

mattnibs avatar Mar 23 '22 22:03 mattnibs

Verified in Zed commit 56d5de4.

As test data, I loaded several GB of FDNS "a" records.

$ zed create -orderby timestamp:asc fdns
pool created: fdns 2DjfJ7r3yt4CiQ73h0qwHW25b7a

$ zed use fdns
Switched to branch "main" on pool "fdns"

$ zed load fdns-a.zng.gz 
(1/1) 3.00GB/3.00GB 0B/s 100.00%

$ zed query -Z 'head 1'
{
    name: "58.47.65.78",
    timestamp: 1677-09-21T00:12:43.145224192Z,
    type: "a",
    value: 58.47.65.78
}

I then crafted a query over the timestamp range that isolated 18 records. At Zed commit 7916524 that came right before the changes in #3930, we can see the 12+ second baseline for the query to run unoptimized.

$ zed -version
Version: v1.2.0-26-g79165249

$ curl http://localhost:9867/version
{"version":"v1.2.0-26-g79165249"}

$ time zed query -z 'timestamp > 2020-11-27T22:16:38Z and timestamp < 2020-11-27T22:17:01Z | count()'
{count:18(uint64)}

real	0m12.531s
user	0m0.019s
sys	0m0.012s

Then at Zed commit 56d5de4 that has the changes from #3930, now the same query completes in 67 milliseconds.

$ zed -version
Version: v1.2.0-27-g56d5de40

$ curl http://localhost:9867/version
{"version":"v1.2.0-27-g56d5de40"}

$ time zed query -z 'timestamp > 2020-11-27T22:16:38Z and timestamp < 2020-11-27T22:17:01Z | count()'
{count:18(uint64)}

real	0m0.067s
user	0m0.012s
sys	0m0.009s

Thanks @mattnibs!

philrz avatar Aug 23 '22 02:08 philrz

Also, per https://github.com/brimdata/zed/pull/3930#issuecomment-1189170142, we can see the importance of having all the filtering done in a single pipeline element. For example, if we reverse the order of the comparisons and put a pipe between them, we can make it run slow even in the current code that has the optimization.

$ zed -version
Version: v1.2.0-27-g56d5de40

$ curl http://localhost:9867/version
{"version":"v1.2.0-27-g56d5de40"}

$ time zed query -z 'timestamp < 2020-11-27T22:17:01Z | timestamp > 2020-11-27T22:16:38Z | count()'
{count:18(uint64)}

real	8m32.360s
user	0m0.236s
sys	0m0.093s

But once we and them into a single pipeline element, now it's once again super fast.

$ time zed query -z 'timestamp < 2020-11-27T22:17:01Z and timestamp > 2020-11-27T22:16:38Z | count()'
{count:18(uint64)}

real	0m0.093s
user	0m0.016s
sys	0m0.014s

philrz avatar Aug 23 '22 02:08 philrz