atlasdb
Add statistics to spans
Goals (and why):
Teams using AtlasDB can already diagnose issues from service-level statistics, but trace-level usage information has to be reconstructed indirectly. This PR is a first step towards easy-to-digest, per-trace statistics that can drive optimization.
==COMMIT_MSG== Add statistics to spans emitted from within atlasdb. The following tags will be added to spans:
- `atlasdb.emptyReads`: the number of AtlasDB tombstones read during this span.
- `atlasdb.bytesRead`: an estimate of the number of bytes read from the underlying database during this span. This is currently the number of column name and value bytes read, and does not include various overheads; it should be seen as a value correlated with the real figure.

Note that both statistics are best-effort: there is no guarantee of completeness, and they require correct usage of AtlasDB to work (e.g. closing `ClosableIterators`).
==COMMIT_MSG==
Implementation Description (bullets):
- Generally a relatively simple system using thread locals (see the sketch after these bullets).
- The bulk of the PR pushes `ClosableIterators` through the codebase and tracks them more cleanly; this helps us handle async iterators.
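For illustration, here is a minimal sketch of the thread-local approach, assuming the counters live in a static holder similar to `TraceStatistics`; the class and method names below are illustrative, not the actual API in this PR:

```java
// Illustrative sketch only: the real TraceStatistics class in this PR may look different.
public final class TraceStatisticsSketch {
    // One pair of counters per thread; the KVS layer bumps these as it reads,
    // and they are flushed onto the active span's tags when the span completes.
    private static final ThreadLocal<long[]> STATS =
            ThreadLocal.withInitial(() -> new long[2]); // [emptyReads, bytesRead]

    private TraceStatisticsSketch() {}

    public static void incrementEmptyReads(long count) {
        STATS.get()[0] += count;
    }

    public static void incrementBytesRead(long bytes) {
        STATS.get()[1] += bytes;
    }

    /** Returns and resets the counters, e.g. when a span finishes. */
    public static long[] getAndClear() {
        long[] current = STATS.get();
        long[] snapshot = current.clone();
        current[0] = 0;
        current[1] = 0;
        return snapshot;
    }
}
```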
Limitations:
- The statistics can only be kept accurate within reasonable bounds: async iterators need to be consumed on the same thread that created them, or callers need to use `PTExecutors.wrap()` around the callable/... that gets executed on a different thread (see the example below). This limitation is relatively fundamental unless we're willing to e.g. pass a context object around within the KVS code.
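To make that limitation concrete, here is a sketch of what a caller consuming an iterator on a worker thread would need to do. The exact `PTExecutors.wrap` overload and package are assumed here, and `readTask` is a hypothetical callable that drains the iterator:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.palantir.common.concurrent.PTExecutors;

final class WrappedConsumptionSketch {
    private WrappedConsumptionSketch() {}

    static <T> Future<T> consumeOnWorkerThread(ExecutorService executor, Callable<T> readTask) {
        // Wrapping the callable carries the thread-local statistics context over
        // to the worker thread; submitting readTask directly would leave those
        // reads unattributed to the originating trace.
        return executor.submit(PTExecutors.wrap(readTask));
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Long> bytesRead = consumeOnWorkerThread(executor, () -> {
            // ... drain the ClosableIterator returned by the KVS here ...
            return 0L;
        });
        System.out.println(bytesRead.get());
        executor.shutdown();
    }
}
```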
Testing (What was existing testing like? What have you done to improve it?):
- TODO
Concerns (what feedback would you like?):
- Are there hard-to-see risks associated with this code?
- How misleading do these become if e.g. the limitations are hit? Given that we have the full statistics available for bytesRead, we might be able to see gaps, but they won't always be easy to spot.
- Do we want to keep the `@MustBeClosed` annotations? They can be rolled back, but that increases the risk of bad wrapping (where we drop the inner `ClosableIterator`); we already have very few guarantees that close() will be called as-is. (Illustrated in the sketch after this list.)
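To illustrate the "bad wrapping" risk above, a hypothetical sketch (assuming `ClosableIterator` only requires `hasNext`/`next`/`close` and that `close()` throws no checked exception; the helper names are illustrative):

```java
import com.palantir.common.base.ClosableIterator;

// Hypothetical sketch of the wrapping hazard: a new iterator whose close() does
// not reach the delegate effectively drops the inner ClosableIterator, so any
// statistics flushed on close are lost and resources may leak.
final class WrappingSketch {
    private WrappingSketch() {}

    // Risky: close() forgets to delegate, so the inner iterator is never closed.
    static <T> ClosableIterator<T> badWrap(ClosableIterator<T> delegate) {
        return new ClosableIterator<T>() {
            @Override
            public boolean hasNext() {
                return delegate.hasNext();
            }

            @Override
            public T next() {
                return delegate.next();
            }

            @Override
            public void close() {
                // forgot delegate.close() -- statistics/resources tied to the
                // inner iterator are silently dropped
            }
        };
    }

    // Better: keep the ClosableIterator type and delegate close() explicitly.
    static <T> ClosableIterator<T> goodWrap(ClosableIterator<T> delegate) {
        return new ClosableIterator<T>() {
            @Override
            public boolean hasNext() {
                return delegate.hasNext();
            }

            @Override
            public T next() {
                return delegate.next();
            }

            @Override
            public void close() {
                delegate.close();
            }
        };
    }
}
```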
Where should we start reviewing?:
- `TraceStatistics`
- `KeyValueService` -> the reason for the `@MustBeClosed` progression within the code
Priority (whenever / two weeks / yesterday):
Generate changelog in changelog/@unreleased
Type
- [ ] Feature
- [ ] Improvement
- [x] Fix
- [ ] Break
- [ ] Deprecation
- [ ] Manual task
- [ ] Migration
Description
- `atlasdb.emptyReads`: the number of AtlasDB tombstones read during this span.
- `atlasdb.bytesRead`: an estimate of the number of bytes read from the underlying database during this span. This is currently the number of column name and value bytes read, and does not include various overheads; it should be seen as a value correlated with the real figure.

Note that both statistics are best-effort: there is no guarantee of completeness, and they require correct usage of AtlasDB to work (e.g. closing `ClosableIterators`).
Check the box to generate changelog(s)
- [x] Generate changelog entry
Looking for a review of the general approach before I go back to clean up the PR itself. The main points I would love to have covered are:
- Is this the best/most reliable point to inject at? The (much more reliable) alternative would be to thread some statistics system throughout the whole codebase, but I don't fully understand how much havoc that would cause on roll-out; do people rely on the internal public APIs?
- This has some fun/funky edge cases in terms of threading, meaning there will be services/usage patterns where the results are wrong. Do we have a sense of how much reading of async iterators happens on a thread different from the original invoker? What do we want to happen when that does happen?
- What do we want to happen on parallel async reads? The current code can't attribute reads/bytes correctly there. The trace-level metrics will still be correct, though, so I feel this is a case of "don't let perfect be the enemy of the good".
@jeremyk-91 Fixed the overflow issue; I think it's ready for CR when you get a chance.
If I'm reading the C* code right, `getRange` does end up using `ResultsExtractor`, via:
- `CassandraKeyValueServiceImpl:1346` (getRange)
- `RangeLoader:53` (getRange)
- `RangeLoader:74` (getRangeWithPageCreator)
- `RangeLoader:95` (getRangeWithPageCreator)
- through `AbstractPagingIterable.iterator`
- `CassandraRangePagingIterable:69/75` (getFirstPage/getNextPage)
- `CassandraRangePagingIterable:81` (getSinglePage)
- `CassandraRangePagingIterable:103` (getPage)
- `ResultsExtractor:84`, which is where the current counters are.

That said, I'm not entirely sure how exactly the column/row/value bytes map to network bandwidth (as in, are things deduped?).
Oh, never mind: `getRowsColumnRange` doesn't; it's a bit surprising to me that we have multiple paths for turning bytes into values/columns/rows.
@jeremyk-91 I added support for `getRowsColumnRange` for Cassandra and `getRange` for DBKVS. Yes, I think we should be good to merge (assuming I didn't do anything supremely silly in my current PR); we can always improve accuracy later.
Note: I've refrained from putting things at the lowest level (the DB layer) to ensure we don't accidentally bake in hidden assumptions (e.g. that all C* byte buffers are array-backed and therefore cheap to re-read; I don't think we should assume that). We could also fairly trivially improve correctness by creating a wrapper around `TBinaryProtocol` (a rough sketch follows).
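For reference, a rough sketch of what such a `TBinaryProtocol` wrapper could look like; this is not part of this PR, and the `TraceStatistics.incrementBytesRead` hook is hypothetical:

```java
import java.nio.ByteBuffer;

import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TTransport;

// Rough sketch: count bytes as they are deserialized from Thrift, instead of
// estimating from column names/values after the fact. Not part of this PR.
final class ByteCountingBinaryProtocol extends TBinaryProtocol {
    ByteCountingBinaryProtocol(TTransport transport) {
        super(transport);
    }

    @Override
    public ByteBuffer readBinary() throws TException {
        ByteBuffer buffer = super.readBinary();
        // Hypothetical statistics hook; the real entry point may differ.
        TraceStatistics.incrementBytesRead(buffer.remaining());
        return buffer;
    }

    @Override
    public String readString() throws TException {
        String value = super.readString();
        // Approximation: counts chars rather than encoded bytes.
        TraceStatistics.incrementBytesRead(value.length());
        return value;
    }
}
```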
I've been busy with planning and some team commitments, but will pick this up today or tomorrow. Apologies :(
Released 0.689.0