zed
zed copied to clipboard
Optimize filters on pool keys
If a user is filtering on a pool key field (i.e. ts >= now()-3h
, the optimizer should recognize this and limit the range of scanned data.
This came up here https://github.com/brimdata/brim/issues/2284
Verified in Zed commit 56d5de4.
As test data, I loaded several GB of FDNS "a" records.
$ zed create -orderby timestamp:asc fdns
pool created: fdns 2DjfJ7r3yt4CiQ73h0qwHW25b7a
$ zed use fdns
Switched to branch "main" on pool "fdns"
$ zed load fdns-a.zng.gz
(1/1) 3.00GB/3.00GB 0B/s 100.00%
$ zed query -Z 'head 1'
{
name: "58.47.65.78",
timestamp: 1677-09-21T00:12:43.145224192Z,
type: "a",
value: 58.47.65.78
}
I then crafted a query over the timestamp
range that isolated 18 records. At Zed commit 7916524 that came right before the changes in #3930, we can see the 12+ second baseline for the query to run unoptimized.
$ zed -version
Version: v1.2.0-26-g79165249
$ curl http://localhost:9867/version
{"version":"v1.2.0-26-g79165249"}
$ time zed query -z 'timestamp > 2020-11-27T22:16:38Z and timestamp < 2020-11-27T22:17:01Z | count()'
{count:18(uint64)}
real 0m12.531s
user 0m0.019s
sys 0m0.012s
Then at Zed commit 56d5de4 that has the changes from #3930, now the same query completes in 67 milliseconds.
$ zed -version
Version: v1.2.0-27-g56d5de40
$ curl http://localhost:9867/version
{"version":"v1.2.0-27-g56d5de40"}
$ time zed query -z 'timestamp > 2020-11-27T22:16:38Z and timestamp < 2020-11-27T22:17:01Z | count()'
{count:18(uint64)}
real 0m0.067s
user 0m0.012s
sys 0m0.009s
Thanks @mattnibs!
Also, per https://github.com/brimdata/zed/pull/3930#issuecomment-1189170142, we can see the importance of having all the filtering done in a single pipeline element. For example, if we reverse the order of the comparisons and put a pipe between them, we can make it run slow even in the current code that has the optimization.
$ zed -version
Version: v1.2.0-27-g56d5de40
$ curl http://localhost:9867/version
{"version":"v1.2.0-27-g56d5de40"}
$ time zed query -z 'timestamp < 2020-11-27T22:17:01Z | timestamp > 2020-11-27T22:16:38Z | count()'
{count:18(uint64)}
real 8m32.360s
user 0m0.236s
sys 0m0.093s
But once we and
them into a single pipeline element, now it's once again super fast.
$ time zed query -z 'timestamp < 2020-11-27T22:17:01Z and timestamp > 2020-11-27T22:16:38Z | count()'
{count:18(uint64)}
real 0m0.093s
user 0m0.016s
sys 0m0.014s