
Performance degradation as commits accumulate

philrz opened this issue • 6 comments

@mccanne identified a performance exposure as the commit log gets long, so I ran the following script on my Mac laptop to quantify the degradation. The test data is the first Zeek conn record from the zed-sample-data in ZNG format.

#!/bin/sh
# Repeatedly load a single-record ZNG file, timing each commit and a
# minimal query. ZED_LAKE_ROOT must point at a scratch lake directory.
rm -rf "$ZED_LAKE_ROOT"
zed lake init
zed lake create logs
zed lake checkout logs@main
NUM=1
while true; do
  echo "Performing commit #${NUM}:"
  time zed lake load one_conn.zng
  echo
  echo "Performing minimal query:"
  time zed lake query 'head 1' > /dev/null
  echo
  NUM=$(expr $NUM + 1)
done

As shown in the chart below, the simple head 1 query goes from 23 milliseconds when the pool contains just the one commit to over 4 seconds when there are 21k commits.
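For anyone repeating this, here's one way to turn the loop's output into data for such a chart. It's just a sketch (not necessarily how the chart above was produced) and assumes the script's combined stdout/stderr was captured to a log, e.g. ./repro.sh > timings.log 2>&1:

# Convert the interleaved `time` output into CSV rows of
# commit number, load time (sec), query time (sec).
awk '
  /^Performing commit/  { mode = "commit" }
  /^Performing minimal/ { mode = "query" }
  /^real/ {
    split($2, t, "[ms]")               # "0m0.023s" -> minutes, seconds
    secs = t[1] * 60 + t[2]
    if (mode == "commit") { n++; load = secs }
    else                  { print n "," load "," secs }
  }' timings.log > timings.csv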

philrz avatar Sep 07 '21 14:09 philrz

Just to be clear, this is all "by design" and shall be taken care of by the not-yet-implemented concatenation ("cat") and snapshot operations on the commit history. Since everything is immutable (and since ZNG can be simply concatenated), it is trivial to concatenate a commit history at any object and to create snapshots of the current state of the object store at any commit.
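The concatenation property mentioned here is easy to see outside the lake. A minimal sketch with zq, using hypothetical file names (zq emits binary ZNG to a non-terminal by default, but -f zng makes it explicit):

echo '{msg:"first"}'  | zq -f zng '*' - > a.zng
echo '{msg:"second"}' | zq -f zng '*' - > b.zng
cat a.zng b.zng > combined.zng      # byte-level concatenation is valid ZNG
zq -z '*' combined.zng              # reads back both records as ZSON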

mccanne avatar Sep 07 '21 15:09 mccanne

I'm also curious about our commit-per-second speed. There are some easy things we can do here to speed this up too...

mccanne avatar Sep 07 '21 15:09 mccanne

(The commit history overhead slows down the commits because you have to build the commit state for each commit made...)

mccanne avatar Sep 07 '21 15:09 mccanne

Also, query times are slow because of the massive number of single-record ZNG objects in the store. Small commits will be addressed with LSM-style rollups via the forthcoming compact command (which is trivial to implement... just read a bunch of small objects in a merge scan and write them back).
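The rollup idea can be sketched in miniature outside the lake with zq; the file names here are hypothetical, a full sort stands in for the merge scan, and a real pool would merge in pool-key order (e.g. ts):

# Many small ZNG objects in, one merged object out.
zq -f zng 'sort ts' small-*.zng > rolled-up.zng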

mccanne avatar Sep 07 '21 15:09 mccanne

The addition of zed compact via #3940 justifies revisiting the tests performed here to see if compacting the objects from the large number of commits might improve query performance.

As the initial tests were performed back in September of 2021, it first seemed appropriate to re-baseline by repeating the original test with current Zed. The Zed commands have changed a bit in the time since, so here's an update of the repro script that's in sync with current Zed commit 28029f0.

#!/bin/sh
export ZED_LAKE="$HOME/.zed-3027-lake"
rm -rf "$ZED_LAKE"
zed init
zed create logs
zed use logs@main
NUM=1
while true; do
  echo "Performing commit #${NUM}:"
  time zed load one_conn.zng
  echo
  echo "Performing minimal query:"
  time zed query 'head 1' > /dev/null
  echo
  NUM=$(expr $NUM + 1)
done

Since the prior results indicated a linear degradation, I didn't let it run for as many hours this time and repeated the test up through 7000 commits. Here are results that chart the slower commit and query times as the number of commits increases.

[chart: commit and query times vs. number of commits, up through 7000 commits]

While the results are roughly similar to what we saw in the past, comparing just the commit times then vs. now shows that the time per commit is actually a bit worse now than it was before.

[chart: per-commit times, September 2021 vs. July 2022]

The query times, by contrast, are slightly better now.

[chart: query times, September 2021 vs. July 2022]

Finally, we see that we're able to compact the objects from the 7000 commits in 1.438 seconds.

$ ulimit -n 1000000
$ time zed compact $(zed query -f text 'from logs@main:objects | yield "0x${hex(id)}"')
2BXimaptwtcBm5U9vcixBureyxA compaction committed

real    0m1.438s
user    0m2.575s
sys    0m1.232s

Now the time for our head 1 query is 0.444 seconds, which is almost 3x better than the 1.26 seconds it took for the same query before compacting.

$ time zed query 'head 1' > /dev/null
real    0m0.444s
user    0m0.471s
sys    0m0.260s

However, if the same data is re-loaded into a fresh pool, the query is now super fast.

$ zed create fresh
pool created: fresh 2BaZBV1wQB3wx3qhn2kiwrwqjRO

$ zed query '*' | zed load -use fresh@main -
(2/1) 3030B 3030B/s
2BaZGkfNrORS6pMmU5Fyj68P8Rx committed

$ zed use fresh@main
Switched to branch "main" on pool "fresh"

$ time zed query 'head 1' > /dev/null
real    0m0.024s
user    0m0.019s
sys    0m0.011s

There are a lot more files still left behind under the compacted pool directory, which is unsurprising to me since it retains all the history. I assume the perf delta is due to still walking those files in some way. What's unclear to me is whether some of this could be avoided in the name of faster perf, given that compaction has left all the data stored in a single object.
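To put a rough number on "a lot more files", something like this sketch works (ZED_LAKE as set in the repro script; the layout under the lake root is an implementation detail):

$ find "$ZED_LAKE" -type f | wc -l    # compare against a fresh single-commit pool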

I've uploaded a copy of the Zed lake directory pre-compact (so, still 7000 objects each formed by a separate commit) to s3://zed-issues/3969/zed-3027-lake.tgz if anyone wants to save on repro time.
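For reference, restoring that archive looks roughly like this, assuming read access to the bucket and that the tarball unpacks to a .zed-3027-lake directory:

$ aws s3 cp s3://zed-issues/3969/zed-3027-lake.tgz .
$ tar xzf zed-3027-lake.tgz -C "$HOME"
$ export ZED_LAKE="$HOME/.zed-3027-lake"
$ zed use logs@main
$ time zed query 'head 1' > /dev/null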

philrz avatar Jul 05 '22 22:07 philrz

At team sync today we discussed the most recent results above. There was consensus that slow queries were unsurprising while the pool was still formed from 7000 individual commits, but the post-compact performance was still much slower than we'd expect and was deemed worthy of closer study.

philrz avatar Jul 07 '22 19:07 philrz

There have been many changes in recent months, so I revisited this one with some binary searching to see if conditions have improved. The tl;dr is that indeed they have, and I think this issue can be closed.
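For the curious, the binary search was the usual git bisect exercise. Here's a sketch of the kind of predicate script involved; the threshold, endpoints, and file names are illustrative, not the exact ones used:

# check-commit-time.sh: exit 0 while commits are still slow ("old"),
# nonzero once they're fast ("new"), per git bisect run conventions.
go build -o /tmp/zed-bisect ./cmd/zed || exit 125   # 125 skips unbuildable commits
export ZED_LAKE=/tmp/zed-bisect-lake
rm -rf "$ZED_LAKE"
/tmp/zed-bisect init
/tmp/zed-bisect create logs
/tmp/zed-bisect use logs@main
start=$(date +%s)
for i in $(seq 100); do /tmp/zed-bisect load one_conn.zng > /dev/null; done
elapsed=$(( $(date +%s) - start ))
[ "$elapsed" -ge 10 ]   # illustrative threshold: 100 loads in under 10s counts as fast

Driven by something like git bisect start --term-old=slow --term-new=fast, marking a known-slow and a known-fast commit, then git bisect run ./check-commit-time.sh.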

The most significant transition was a huge improvement in the commit times, which arrived at commit 834ab5f, associated with #4274. https://github.com/brimdata/zed/pull/4274#issuecomment-1364721730 showed similar validation with a different tool, but for an apples-to-apples comparison using the same repro script as this issue, here's a side-by-side against the commit times from the original repro back in September 2021.

[chart: commit times at 834ab5f vs. the original September 2021 results]

Without hunting down the precise commit where it happened, the same test run at 834ab5f showed the query times had also improved relative to when this was last tested.

[chart: query times at 834ab5f vs. the July 2022 results]

The commit times have remained at that ideal performance ever since. The query performance, however, took a curious detour at commit 772322c, which is associated with #4352: the query time improved by about 5x (the head 1 after 7000 mini commits taking 0.14 seconds instead of 0.72 seconds).

[chart: query times at 772322c, showing the ~5x improvement]

...but then it bounced back to the prior performance at commit 94ccfe4, which is associated with #4415.

[chart: query times at 94ccfe4, back to prior levels]

This is where query performance remains at the recent tip of main. We could perhaps look into the root cause of the detour, but it may not be worth the effort. What I think further justifies closing is that compacting the data (which we expect to be a typical part of production use, particularly after accumulating lots of small commits) brings performance back near the ideal. In the run at commit 94ccfe4, after accumulating those 7000 mini commits and seeing the head 1 query take 0.75 seconds, I ran compaction; the first head 1 query afterward takes 0.091 seconds and subsequent ones take about 0.03 seconds, which is very close to the head 1 query performance after only one commit.

$ zed -version
Version: v1.5.0-56-g94ccfe45

$ time zed compact $(zed query -f text 'from logs@main:objects | yield "0x${hex(id)}"')
2O1RqN9vgiDBL4aMWH7ACjWcHQ3 compaction committed

real	0m0.932s
user	0m1.903s
sys	0m0.790s

$ time zed query 'head 1' > /dev/null

real	0m0.091s
user	0m0.087s
sys	0m0.022s

$ time zed query 'head 1' > /dev/null

real	0m0.030s
user	0m0.024s
sys	0m0.014s

$ time zed query 'head 1' > /dev/null

real	0m0.031s
user	0m0.024s
sys	0m0.013s

Based on these results, I think we're ok to close this one.

philrz avatar Apr 05 '23 20:04 philrz