lucene GITHUB#11795: Add FilterDirectory to track write amplification factor

Added a WriteAmplificationTrackingDirectoryWrapper that simply keeps track of bytes flushed and bytes merged. This allows us to get an estimate of the write amplification factor by doing (bytesFlushed + bytesMerged) / bytesFlushed).

Sep 20 '22 22:09 mdmarshmallow

This class feels like it'd be a good fit for the misc module rather than core?

Sep 22 '22 08:09 jpountz

I love this approach/idea!

It's simple so we should start with this ... but it will necessarily be a lagging indicator since merging takes some time to kick off and run to completely while flushing keeps happening if docs are being indexed. Also, it reports the "for all time" WAF instead of adding some decay and being closer to an instantaneous measure. But we can try to improve those later.

Have you tried turning this on for a luceneutil indexing run to see what WAF it reports? It might be tricky to do it right because by default I think luceneutil indexing does not wait for merges while indexing.

Sep 22 '22 15:09 mikemccand

An alternative implementation would be to add the bytes only in the IndexOutput.close method instead of on each method that writes bytes? It might be less error-proned, but, also less real-time since it won't be until the file is closed that we count any bytes in the shared counters.

Sep 22 '22 15:09 mikemccand

So by doing this on IndexOutput.close(), we would avoid including half-done merges/flushes in the write amplification factor? As you said, this does track all-time WAF so I guess being real-time is not as much of a concern.

Sep 22 '22 17:09 mdmarshmallow

I see you'd already responded to a bunch of my comments. I should've refreshed my PR page. Will resolve those.

Sep 22 '22 17:09 vigyasharma

For folks more familiar with WAF calculations for Search applications, is the formula of (flushedBytes + mergedBytes) / flushedBytes always correct?

For example, does the IOContext.Context.MERGE operation not include all the bytes written during a FLUSH operation (i.e when we are writing to disk)? or should it be something like mergedBytes/flushedBytes when there have been merges and 1 otherwise when flushedBytes are 0?

Edit: Hmm, I misunderstood this concept I suppose. Obviously, when you have say a 3rd segment after 1, 2 have been merged. flushedBytes will have the total values of bytes written for originally writing segments 1, 2, 3 and mergedBytes will have the additional cost of rewriting a merged 1-2 segment.

A link that I had found useful: https://github.com/EighteenZi/rocksdb_wiki/blob/master/RocksDB-Tuning-Guide.md

Sep 22 '22 18:09 gautamworah96

An alternative implementation would be to add the bytes only in the IndexOutput.close method instead of on each method that writes bytes? It might be less error-proned, but, also less real-time since it won't be until the file is closed that we count any bytes in the shared counters.

I'm a bit conflicted about this. I like the completeness we get after close() is called. But as an API, consumers now have to be careful with getApproximateWriteAmplificationFactor()..

There is nothing stopping them from calling it before IndexOutput is closed. The counter will just return 0. Or the value it held before it was reused. The onus of ensuring close() was called is on the caller here i.e. the dir wrapper.

However, in the dir wrapper, we don't keep references to different IndexOutputs for flush and merge. We directly read the counter values in getApproximateWriteAmplificationFactor(), and so there's no way to throw an error, if someone calls write amplification early.

In other words, we cannot ensure that when write amplification returns 1, it really is 1. It could be because IndexOutput is still open.

Maybe, we should let BytesTrackingIndexOutput expose a bytesWritten() method, which internally verifies that close was called. A subsequent real-time writes impl. could change this. The dir. wrapper would then keep IndexOutput references around, and use them instead of directly reading counters.

Then we don't need to pass shared Atomic counters. We can directly aggregate values across IndexOutput references if we want to.

Update: Thinking about this more, we may be okay with just adding documentation around getApproximateWriteAmplificationFactor(). Something that says that only writes from completed and closed index outputs are counted. And returning 0 write amp. when there have been no writes instead of 1?

A directory will do multiple flushes and merges (often concurrent), and end up opening multiple IndexOutputs. We don't want to keep track them all just for this calculation. So sharing counters may be okay, if we explain the nature of the API.

Sep 23 '22 04:09 vigyasharma

Thanks for persisting with this @mdmarshmallow. I think we're close now, just a couple of discussion threads to resolve. This change will be super useful :)

Sep 23 '22 04:09 vigyasharma

This implementation ignores temporary index outputs from write amplification, which I wonder whether this is correct (maybe it is, I struggle making an opinion on this question).

Sep 23 '22 13:09 jpountz

This implementation ignores temporary index outputs from write amplification, which I wonder whether this is correct (maybe it is, I struggle making an opinion on this question).

Interesting point.. Thinking how/when we'd like to track the impact of temp output files. From what I understand, they won't be a part of commit and fsync. So if we're trying to measure increased disk or remote store I/O, we probably want to skip them?

Maybe, when we make optimizations that write more temp files (like #11411 ?), we'll use this to measure some impact. Although we delete the temp files right after, but on a small box, maybe we gives us a sense of increased file writes or page fault.

We could add a flag to optionally include temp files.. It would require overriding createTempOutput() right?

Sep 23 '22 16:09 vigyasharma

This is nifty!
I wonder if it'd be worthwhile for Lucene itself to track this small bit of metadata so that it's persistent?

Sep 23 '22 16:09 dsmiley

Sorry for the delayed response, I haven't had internet for a few days. I added the option to include temp output in tracking, as well as the a method for real time byte tracking as well so we can let users get exactly what they need.

Also @dsmiley that's an interesting suggestion! I'm not as familiar with Lucene as some of the other people commenting here but I would be open to adding this to metadata if there are no objections.

Sep 27 '22 02:09 mdmarshmallow

Interesting point.. Thinking how/when we'd like to track the impact of temp output files. From what I understand, they won't be a part of commit and fsync. So if we're trying to measure increased disk or remote store I/O, we probably want to skip them?

Indeed temporary files are never part of a commit and fsynced, but this may also be the case for a number of flushed segments: if flushed segments get included in a merge before the next commit, then they would never be part of a commit and fsynced either.

Although we delete the temp files right after, but on a small box, maybe we gives us a sense of increased file writes or page fault.

Some temporary files are also not necessarily short-lived, like the ones we create for stored fields when index sorting is enabled.

I'm considering exposing write amplification separately for flushes (as flushedBytes / totalIndexSize), merges (as (totalIndexSize + mergedBytes) / totalIndexSize) and temporary files (as (totalIndexSize + tempBytes) / totalIndexSize) and pushing the responsibility to users of whether and how they would like to combine these various metrics?

Sep 27 '22 08:09 jpountz

@jpountz , in response to this:

I'm considering exposing write amplification separately for flushes (as flushedBytes / totalIndexSize), merges (as (totalIndexSize + mergedBytes) / totalIndexSize) and temporary files (as (totalIndexSize + tempBytes) / totalIndexSize) and pushing the responsibility to users of whether and how they would like to combine these various metrics?

We could add these in addition to the write amplification factor methods that are already present?

Sep 28 '22 22:09 mdmarshmallow

Also @dsmiley that's an interesting suggestion! I'm not as familiar with Lucene as some of the other people commenting here but I would be open to adding this to metadata if there are no objections.

It would indeed be awesome if this was a first class metric in Lucene. But let's first start with this Directory wrapper approach and see how it's used / improved, and later upgrade it to first class if it is helpful? Progress not perfection!

Sep 29 '22 11:09 mikemccand

I'm considering exposing write amplification separately for flushes (as flushedBytes / totalIndexSize), merges (as (totalIndexSize + mergedBytes) / totalIndexSize) and temporary files (as (totalIndexSize + tempBytes) / totalIndexSize) and pushing the responsibility to users of whether and how they would like to combine these various metrics?

+1

Sep 29 '22 11:09 mikemccand

why does this need to do anything more than getFilePointer() on close() or something to capture how much was written?

Sep 29 '22 12:09 rmuir

Robert is right, why do we need to see the values live? getFilePointer() always works.

Sep 29 '22 13:09 uschindler

I'm considering exposing write amplification separately for flushes (as flushedBytes / totalIndexSize), merges (as (totalIndexSize + mergedBytes) / totalIndexSize) and temporary files (as (totalIndexSize + tempBytes) / totalIndexSize) and pushing the responsibility to users of whether and how they would like to combine these various metrics?

We could add these in addition to the write amplification factor methods that are already present?

I think that one question I was raising was also whether the current way we're measuring write amplification is correct. Say you write 10 1MB segments, which then get merged into a 9MB segment (smaller the the sum of the sizes of flushed segments, possibly because many terms are no longer duplicated across segments). What is the write amplification of merging? Is it (10 + 9) / 10 = 1.9 like the formula that this PR is currently using, computing write ampfilication as a function of the total flush size. Or is it (9 + 9) / 9 = 2, computing write amplification as a function of the total index size? The latter feels more intuitive to me.

Sep 29 '22 13:09 jpountz

The former is more intuitive to me -- how much more data do we write beyond the initial segment flush. This is the added cost of a system immutable files with log structured merge.

Sep 29 '22 14:09 dsmiley

I agree, the former is also more intuitive to me as well for the same reasons. In your example, 10MB were initially written, and 9MB were written again in merges, so 1.9 to me means that we did 1.9 times the work of initially just writing the 10MB to get to the final 9MB index state.

Though since this does seem like it might be a point of confusion for users in that there are multiple ways to calculate this, do you think it would be better to just expose getFlushedBytes() and getMergedBytes() and let users calculate the write amplification as they see fit?

Sep 29 '22 16:09 mdmarshmallow

Not sure why the tests are failing as ./gradlew check passes on my machine. I checked the macos-latest distributions test logs and found this:

2022-09-29T19:53:49.4803130Z Failed to resolve action download info. Error: nodename nor servname provided, or not known (pipelines.actions.githubusercontent.com:443)
2022-09-29T19:53:49.4820400Z Retrying in 13.157 seconds
2022-09-29T19:55:42.7778120Z Failed to resolve action download info. Error: The HTTP request timed out after 00:01:40.
2022-09-29T19:55:42.7787910Z Retrying in 15.793 seconds
2022-09-29T19:56:49.5196790Z ##[error]Failed to resolve action download info.

It seems like a transient error to me, but I'm not knowledgable enough about this to be sure.

Edit: looks like this new commit passed fine.

Sep 30 '22 19:09 mdmarshmallow

Though since this does seem like it might be a point of confusion for users in that there are multiple ways to calculate this, do you think it would be better to just expose getFlushedBytes() and getMergedBytes() and let users calculate the write amplification as they see fit?

This sounds like a good idea to me.

Oct 04 '22 07:10 jpountz

Sounds good, removed the getApproximateWriteAmplficationFactor() function.

Oct 04 '22 20:10 mdmarshmallow

I resolved a merge conflict with CHANGES.txt. I think I've also addressed all the issues people have pointed out as well. Would it be possible for someone to take a look at this and merge it if there are no more objections? Thanks!

Oct 14 '22 19:10 mdmarshmallow

This looks great to me! I love all the engagement (83+ comments!) and how it iterated to such a simple solution. I left a small comment for a follow-on issue ... and it looks like CHANGES.txt is conflicting again

@mdmarshmallow maybe open another follow-on issue in luceneutil to add this to nightly benchmarks? It'd be great to see impact on WAF over time of interesting index-time changes...

I'll push this in a few days if nobody objects.

Oct 20 '22 16:10 mikemccand

Thanks Mike, I added an issue to luceneutil: https://github.com/mikemccand/luceneutil/issues/208

Oct 21 '22 01:10 mdmarshmallow

Thanks @mdmarshmallow! Sorry for the delay merging ... I will backport to 9.x then let's get this in nightly benchmarks :)

Oct 27 '22 19:10 mikemccand

9.x backport done: https://github.com/apache/lucene/commit/373d2e84c13ee67e8e1247338e69b53946b7f726

Oct 27 '22 20:10 mikemccand

lucene lucene copied to clipboard

GITHUB#11795: Add FilterDirectory to track write amplification factor

lucene
lucene copied to clipboard