lucene
lucene copied to clipboard
GITHUB#11795: Add FilterDirectory to track write amplification factor
Added a WriteAmplificationTrackingDirectoryWrapper
that simply keeps track of bytes flushed and bytes merged. This allows us to get an estimate of the write amplification factor by doing (bytesFlushed + bytesMerged) / bytesFlushed)
.
This class feels like it'd be a good fit for the misc
module rather than core
?
I love this approach/idea!
It's simple so we should start with this ... but it will necessarily be a lagging indicator since merging takes some time to kick off and run to completely while flushing keeps happening if docs are being indexed. Also, it reports the "for all time" WAF instead of adding some decay and being closer to an instantaneous measure. But we can try to improve those later.
Have you tried turning this on for a luceneutil
indexing run to see what WAF it reports? It might be tricky to do it right because by default I think luceneutil
indexing does not wait for merges while indexing.
An alternative implementation would be to add the bytes only in the IndexOutput.close
method instead of on each method that writes bytes? It might be less error-proned, but, also less real-time since it won't be until the file is closed that we count any bytes in the shared counters.
So by doing this on IndexOutput.close()
, we would avoid including half-done merges/flushes in the write amplification factor? As you said, this does track all-time WAF so I guess being real-time is not as much of a concern.
I see you'd already responded to a bunch of my comments. I should've refreshed my PR page. Will resolve those.
For folks more familiar with WAF calculations for Search applications, is the formula of (flushedBytes + mergedBytes) / flushedBytes
always correct?
For example, does the IOContext.Context.MERGE
operation not include all the bytes written during a FLUSH
operation (i.e when we are writing to disk)? or should it be something like mergedBytes/flushedBytes
when there have been merges and 1
otherwise when flushedBytes
are 0?
Edit: Hmm, I misunderstood this concept I suppose. Obviously, when you have say a 3rd segment after 1, 2 have been merged. flushedBytes
will have the total values of bytes written for originally writing segments 1, 2, 3 and mergedBytes
will have the additional cost of rewriting a merged 1-2
segment.
A link that I had found useful: https://github.com/EighteenZi/rocksdb_wiki/blob/master/RocksDB-Tuning-Guide.md
An alternative implementation would be to add the bytes only in the
IndexOutput.close
method instead of on each method that writes bytes? It might be less error-proned, but, also less real-time since it won't be until the file is closed that we count any bytes in the shared counters.
I'm a bit conflicted about this. I like the completeness we get after close()
is called. But as an API, consumers now have to be careful with getApproximateWriteAmplificationFactor()
..
There is nothing stopping them from calling it before IndexOutput is closed. The counter will just return 0. Or the value it held before it was reused. The onus of ensuring close() was called is on the caller here i.e. the dir wrapper.
However, in the dir wrapper, we don't keep references to different IndexOutputs for flush and merge. We directly read the counter values in getApproximateWriteAmplificationFactor(), and so there's no way to throw an error, if someone calls write amplification early.
In other words, we cannot ensure that when write amplification returns 1, it really is 1. It could be because IndexOutput is still open.
Maybe, we should let BytesTrackingIndexOutput expose a bytesWritten()
method, which internally verifies that close was called. A subsequent real-time writes impl. could change this. The dir. wrapper would then keep IndexOutput references around, and use them instead of directly reading counters.
Then we don't need to pass shared Atomic counters. We can directly aggregate values across IndexOutput references if we want to.
Update: Thinking about this more, we may be okay with just adding documentation around getApproximateWriteAmplificationFactor()
. Something that says that only writes from completed and closed index outputs are counted. And returning 0 write amp. when there have been no writes instead of 1?
A directory will do multiple flushes and merges (often concurrent), and end up opening multiple IndexOutputs. We don't want to keep track them all just for this calculation. So sharing counters may be okay, if we explain the nature of the API.
Thanks for persisting with this @mdmarshmallow. I think we're close now, just a couple of discussion threads to resolve. This change will be super useful :)
This implementation ignores temporary index outputs from write amplification, which I wonder whether this is correct (maybe it is, I struggle making an opinion on this question).
This implementation ignores temporary index outputs from write amplification, which I wonder whether this is correct (maybe it is, I struggle making an opinion on this question).
Interesting point.. Thinking how/when we'd like to track the impact of temp output files. From what I understand, they won't be a part of commit and fsync. So if we're trying to measure increased disk or remote store I/O, we probably want to skip them?
Maybe, when we make optimizations that write more temp files (like #11411 ?), we'll use this to measure some impact. Although we delete the temp files right after, but on a small box, maybe we gives us a sense of increased file writes or page fault.
We could add a flag to optionally include temp files.. It would require overriding createTempOutput()
right?
This is nifty!
I wonder if it'd be worthwhile for Lucene itself to track this small bit of metadata so that it's persistent?
Sorry for the delayed response, I haven't had internet for a few days. I added the option to include temp output in tracking, as well as the a method for real time byte tracking as well so we can let users get exactly what they need.
Also @dsmiley that's an interesting suggestion! I'm not as familiar with Lucene as some of the other people commenting here but I would be open to adding this to metadata if there are no objections.
Interesting point.. Thinking how/when we'd like to track the impact of temp output files. From what I understand, they won't be a part of commit and fsync. So if we're trying to measure increased disk or remote store I/O, we probably want to skip them?
Indeed temporary files are never part of a commit and fsynced, but this may also be the case for a number of flushed segments: if flushed segments get included in a merge before the next commit, then they would never be part of a commit and fsynced either.
Although we delete the temp files right after, but on a small box, maybe we gives us a sense of increased file writes or page fault.
Some temporary files are also not necessarily short-lived, like the ones we create for stored fields when index sorting is enabled.
I'm considering exposing write amplification separately for flushes (as flushedBytes / totalIndexSize
), merges (as (totalIndexSize + mergedBytes) / totalIndexSize
) and temporary files (as (totalIndexSize + tempBytes) / totalIndexSize
) and pushing the responsibility to users of whether and how they would like to combine these various metrics?
@jpountz , in response to this:
I'm considering exposing write amplification separately for flushes (as flushedBytes / totalIndexSize), merges (as (totalIndexSize + mergedBytes) / totalIndexSize) and temporary files (as (totalIndexSize + tempBytes) / totalIndexSize) and pushing the responsibility to users of whether and how they would like to combine these various metrics?
We could add these in addition to the write amplification factor methods that are already present?
Also @dsmiley that's an interesting suggestion! I'm not as familiar with Lucene as some of the other people commenting here but I would be open to adding this to metadata if there are no objections.
It would indeed be awesome if this was a first class metric in Lucene. But let's first start with this Directory
wrapper approach and see how it's used / improved, and later upgrade it to first class if it is helpful? Progress not perfection!
I'm considering exposing write amplification separately for flushes (as
flushedBytes / totalIndexSize
), merges (as(totalIndexSize + mergedBytes) / totalIndexSize
) and temporary files (as(totalIndexSize + tempBytes) / totalIndexSize
) and pushing the responsibility to users of whether and how they would like to combine these various metrics?
+1
why does this need to do anything more than getFilePointer()
on close()
or something to capture how much was written?
Robert is right, why do we need to see the values live? getFilePointer() always works.
I'm considering exposing write amplification separately for flushes (as flushedBytes / totalIndexSize), merges (as (totalIndexSize + mergedBytes) / totalIndexSize) and temporary files (as (totalIndexSize + tempBytes) / totalIndexSize) and pushing the responsibility to users of whether and how they would like to combine these various metrics?
We could add these in addition to the write amplification factor methods that are already present?
I think that one question I was raising was also whether the current way we're measuring write amplification is correct. Say you write 10 1MB segments, which then get merged into a 9MB segment (smaller the the sum of the sizes of flushed segments, possibly because many terms are no longer duplicated across segments). What is the write amplification of merging? Is it (10 + 9) / 10 = 1.9
like the formula that this PR is currently using, computing write ampfilication as a function of the total flush size. Or is it (9 + 9) / 9 = 2
, computing write amplification as a function of the total index size? The latter feels more intuitive to me.
The former is more intuitive to me -- how much more data do we write beyond the initial segment flush. This is the added cost of a system immutable files with log structured merge.
I agree, the former is also more intuitive to me as well for the same reasons. In your example, 10MB were initially written, and 9MB were written again in merges, so 1.9 to me means that we did 1.9 times the work of initially just writing the 10MB to get to the final 9MB index state.
Though since this does seem like it might be a point of confusion for users in that there are multiple ways to calculate this, do you think it would be better to just expose getFlushedBytes()
and getMergedBytes()
and let users calculate the write amplification as they see fit?
Not sure why the tests are failing as ./gradlew check
passes on my machine. I checked the macos-latest distributions test logs and found this:
2022-09-29T19:53:49.4803130Z Failed to resolve action download info. Error: nodename nor servname provided, or not known (pipelines.actions.githubusercontent.com:443)
2022-09-29T19:53:49.4820400Z Retrying in 13.157 seconds
2022-09-29T19:55:42.7778120Z Failed to resolve action download info. Error: The HTTP request timed out after 00:01:40.
2022-09-29T19:55:42.7787910Z Retrying in 15.793 seconds
2022-09-29T19:56:49.5196790Z ##[error]Failed to resolve action download info.
It seems like a transient error to me, but I'm not knowledgable enough about this to be sure.
Edit: looks like this new commit passed fine.
Though since this does seem like it might be a point of confusion for users in that there are multiple ways to calculate this, do you think it would be better to just expose getFlushedBytes() and getMergedBytes() and let users calculate the write amplification as they see fit?
This sounds like a good idea to me.
Sounds good, removed the getApproximateWriteAmplficationFactor()
function.
I resolved a merge conflict with CHANGES.txt
. I think I've also addressed all the issues people have pointed out as well. Would it be possible for someone to take a look at this and merge it if there are no more objections? Thanks!
This looks great to me! I love all the engagement (83+ comments!) and how it iterated to such a simple solution. I left a small comment for a follow-on issue ... and it looks like CHANGES.txt
is conflicting again
@mdmarshmallow maybe open another follow-on issue in luceneutil
to add this to nightly benchmarks? It'd be great to see impact on WAF over time of interesting index-time changes...
I'll push this in a few days if nobody objects.
Thanks Mike, I added an issue to luceneutil
: https://github.com/mikemccand/luceneutil/issues/208
Thanks @mdmarshmallow! Sorry for the delay merging ... I will backport to 9.x then let's get this in nightly benchmarks :)
9.x backport done: https://github.com/apache/lucene/commit/373d2e84c13ee67e8e1247338e69b53946b7f726