CNDB-7146: Improve cold cache reads / parallel reads
https://github.com/riptano/cndb/issues/7146
Kudos, SonarCloud Quality Gate passed!
0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells
94.1% Coverage
0.0% Duplication
The version of Java (11.0.15) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.
@tjake do you have time to review this?
I don't have tons of context on riptano/cndb#7146 to go with, but I'm not sure this is doing what we want.
Afaict, as is, the patch just parallelizes the creation of the sstable iterators (just the ctor of `UnfilteredRowIteratorWithLowerBound`, for instance, in the "usual" case). But that, in itself, does not trigger reading of data (unless I'm misreading something of course).
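To illustrate why that distinction matters, here is a minimal sketch (all types and names below are illustrative stand-ins, not the patch's actual code), assuming iterator construction is lazy: parallelizing the constructors only spreads out the cheap setup, while the expensive reads still happen sequentially when the result is consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class LazyIteratorParallelismSketch
{
    // Purely illustrative stand-ins: an iterator whose constructor does no I/O,
    // and an sstable that hands one out cheaply.
    interface LazyIterator { void readNextBlock() throws Exception; }
    interface SSTable { LazyIterator openLazyIterator(); }

    static void readAll(List<SSTable> sstables, ExecutorService pool) throws Exception
    {
        // This only parallelizes the (cheap, lazy) iterator construction...
        List<Future<LazyIterator>> futures = new ArrayList<>();
        for (SSTable sstable : sstables)
        {
            Callable<LazyIterator> open = sstable::openLazyIterator;
            futures.add(pool.submit(open));
        }

        List<LazyIterator> iterators = new ArrayList<>();
        for (Future<LazyIterator> future : futures)
            iterators.add(future.get());

        // ...while the expensive part, the actual data reads (possibly hitting S3),
        // still happens sequentially here, when the merged result is consumed.
        for (LazyIterator iterator : iterators)
            iterator.readNextBlock();
    }
}
```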
My understanding is that our goal is to try to parallelize actual data reads to different sstables, so that if we need to hit S3 for multiple sstables, this is done in parallel. But I think the way sstable reading is done makes that a bit tricky currently, and there is some reasoning behind how it works, with potential trade-offs if we change how it's done.
In particular, there are two main cases (for single-partition reads):
- For single-row reads, we use `queryMemtableAndSSTablesInTimestampOrder`. The idea here is that we want to check sstables sequentially (the most recent first), because if we get the row from the first sstable, we probably don't even have to look at the next ones since they will have older data (I'm simplifying here, it's more complex, but that's the spirit). Parallelizing this logic as is does not make sense since the whole point is the sequentiality of the logic. Of course, the cost of hitting S3 may mean we want to modify the logic somewhat, but it's not obvious to me how. I guess, if we had a way to tell in advance whether reading the row from the first few sstables will trigger an S3 download, then we could decide to forgo sequentiality, and so risk downloading stuff we don't need, to make sure we avoid sequential download latencies in the worst case. But that is not something that is easy to surface in converged Cassandra (we could add some extendable API that doesn't mention S3, I guess...). See the sketch after this list for the rough shape of that sequential logic.
- For slices, it's set up differently, but there is actually a similar kind of logic going on. The idea is that we try to be fast for time-series data, and rely on the fact that the sstable index tells us which range of clusterings each sstable has for the key we're looking at. For time series, the most recent sstable may have rows for today, the next one for yesterday, etc., but there will be little clustering overlap between the sstables. Now, say we're reading the 10 most recent rows: the logic will start reading from only the last sstable, and if it gets the 10 rows, it stops there and never actually reads any other sstable. Here too, we can decide that we want to change the logic, but we should imo be careful, because even in cndb, not all reads end up hitting S3 (we have both the chunk cache and the file cache trying hard to prevent that), and the implemented logic makes sense when you're not hitting S3. Here again, if we could check in advance that we are going to hit S3, then we could decide to forgo the existing logic and prefetch stuff we may or may not need (which might be wasteful in many cases, but would prevent a really bad worst-case latency).
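To make the first point a bit more concrete, here is roughly the shape of that sequential logic, as a deliberately simplified sketch (the real `queryMemtableAndSSTablesInTimestampOrder` is much more involved; `SSTable`, `RowData` and `ResultBuilder` below are made-up stand-ins, not Cassandra types):

```java
import java.util.Comparator;
import java.util.List;

public class TimestampOrderReadSketch
{
    // Illustrative stand-ins; not the real Cassandra types.
    interface RowData {}
    interface SSTable { long maxTimestamp(); RowData read(Object partitionKey); }
    interface ResultBuilder { void add(RowData data); boolean isComplete(); RowData build(); }

    static RowData readInTimestampOrder(Object key, List<SSTable> sstables, ResultBuilder result)
    {
        // Most recent data first: sort sstables by descending max timestamp.
        sstables.sort(Comparator.comparingLong(SSTable::maxTimestamp).reversed());

        for (SSTable sstable : sstables)
        {
            // Each of these reads is sequential, and each one may be a cold (S3) read.
            RowData data = sstable.read(key);
            if (data != null)
                result.add(data);

            // The whole point of the ordering: stop as soon as the queried data is
            // fully resolved, so older sstables are often never touched at all.
            if (result.isComplete())
                break;
        }
        return result.build();
    }
}
```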
> I guess, if we had a way to tell in advance whether reading the row from the first few sstables will trigger an S3 download, then we could decide to forgo sequentiality, and so risk downloading stuff we don't need, to make sure we avoid sequential download latencies in the worst case.

I don't think we can predict it in advance, but could we keep statistics on what we observe (how many sstables needed to satisfy a partition) and use that as an approximation?
> I don't think we can predict it in advance

"predict" might have been the wrong word here. What I was referring to is the fact that an sstable iterator ends up computing the offset at which it will access the data component before it actually accesses it. And if that same iterator can also interrogate the underlying file cache, it can know, before accessing the data, whether that access will have a high latency.
So "in theory", we could know whether we're at risk of the worst-case scenario we're trying to avoid, namely a bunch of sequential S3 queries that stack up to a timeout, before doing any costly work. When that's the case, we could decide to just force parallel reads from all sstables without any of the optimisations discussed in my previous comment. That might well result in slower performance (since we might read something we don't truly end up needing), but it will avoid a really bad worst-case latency.
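To sketch what I mean (the `FileCache.isCached(...)` probe below is hypothetical; nothing like it exists as a public API today), the iterator already knows the offset it is about to read, so it could ask the cache first:

```java
public class ColdReadDetectionSketch
{
    // Hypothetical cache probe; converged Cassandra does not expose this today.
    interface FileCache { boolean isCached(String file, long offset, int length); }

    interface DataReader { byte[] readAt(long offset, int length); }

    static byte[] readWithColdDetection(String file, long offset, int length,
                                        FileCache cache, DataReader reader, Runnable onColdRead)
    {
        // The iterator has already computed the offset into the data component,
        // so before paying for the read it can ask the cache whether this access
        // is likely to go all the way to S3.
        if (!cache.isCached(file, offset, length))
            onColdRead.run(); // e.g. tell the other sstable iterators to start prefetching

        return reader.readAt(offset, length);
    }
}
```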
All that said, how you implement that cleanly and without breaking all sorts of abstractions is a very different question. I "think" we might be able to get something not completely horrible if we pass down some sort of "customizable" object to each sstable iterator (the same one to all of them), which an iterator could use to signal that getting its upcoming data is going to be high latency, and that signal could trigger the other iterators to prefetch. But we would need to try it to see how bad it looks in practice.
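As a very rough sketch of what that shared object could look like (all names hypothetical, and the prefetch itself hand-waved into a `Runnable`):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

public class ColdReadCoordinatorSketch
{
    // One instance would be shared by all sstable iterators of a single read.
    static class ColdReadSignal
    {
        private final AtomicBoolean raised = new AtomicBoolean(false);
        private final List<Runnable> prefetchers = new CopyOnWriteArrayList<>();

        // Each iterator registers how it would prefetch its own upcoming data.
        void registerPrefetcher(Runnable prefetch)
        {
            prefetchers.add(prefetch);
            if (raised.get())
                prefetch.run(); // someone already signalled: prefetch immediately
        }

        // Called by an iterator that notices its next access is going to be high latency.
        void signalColdRead()
        {
            if (raised.compareAndSet(false, true))
                for (Runnable prefetch : prefetchers)
                    prefetch.run(); // kick off parallel prefetch in the other iterators
        }
    }
}
```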
Overall, not suggesting this is the perfect solution :)
> could we keep statistics on what we observe (how many sstables needed to satisfy a partition) and use that as an approximation?

Maybe. I mean, in principle, if we're reasonably sure a read is gonna hit most of the sstables anyway based on prior accesses, then forgoing optimisations that attempt to minimize the number of accessed sstables is fine.
But I do wonder a bit how good an approximation we can get from stats on this. I'm a bit unsure about tracking the number of sstables per partition for two reasons in particular:
- if we genuinely track every partition for every table, then it feels like it could be prohibitively costly memory-wise. And while a cache-like behaviour could limit memory usage, it would only make the next point worse.
- our goal (if I understand correctly) is to mitigate the worst-case cost of cold reads. But cold reads happen when we read a partition that isn't accessed much, and so a partition for which we either won't have the per-partition stat or will have one that is most likely a bad approximation.
It's possible that for some tables, simply tracking table-level stats is good enough. That is, some tables may have workloads such that most queries end up hitting most sstables anyway. But I kind of doubt this is the case for most tables, so I don't know how useful this would be.
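For what it's worth, a table-level signal does mostly exist already: we keep a per-table histogram of sstables touched per read (`sstablesPerReadHistogram`). If we did want to experiment with a table-level heuristic, the decision hook could be as simple as the sketch below (the percentile, threshold and names are all made up and would need tuning):

```java
public class SSTablesPerReadHeuristicSketch
{
    // Illustrative stand-in for the existing per-table sstables-per-read histogram.
    interface ReadHistogram { double percentile(double p); long count(); }

    /**
     * Decide whether to skip the "read sstables one at a time, most recent first"
     * optimisation and instead issue reads to all candidate sstables in parallel.
     */
    static boolean shouldReadAllInParallel(ReadHistogram sstablesPerRead, int candidateSSTables)
    {
        // Not enough data to say anything useful: keep the existing behaviour.
        if (sstablesPerRead.count() < 1000)
            return false;

        // If most reads of this table end up touching nearly all candidate sstables anyway,
        // the sequential optimisation buys little and only adds latency on cold reads.
        return sstablesPerRead.percentile(0.95) >= 0.8 * candidateSSTables;
    }
}
```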