OpenSearch
OpenSearch copied to clipboard
Idempotency for mostly append-only indexes/streams
Background
Document id serves as a primary key in an Opensearch index. The distribution of the document across shards is based on the document id. Opensearch ensures idempotency on the document id by maintaining an in-memory map until refresh is done. Post refresh the document id is indexed as a term in Lucene. Whenever a request with document id arrives, Opensearch looks it up in the in memory map, failing which document is looked up in Lucene as a term query. Segments are searched in the reverse order of creation time for the existence of document for frequently updated documents.
This is an expensive operation per index, especially for append-only ingestion. To get around this problem, auto generated document ids were introduced. When user does not pass document id, the system generates it. The system is able to differentiate between auto and custom document ids. This allows Opensearch to skip searching the document in the index before ingestion. Auto generated document id remains same across internal transport layer retries, but will be different if the retry is initiated from the client.
Why custom document ids are needed in append-only use cases? There are multiple reasons for it:
- Primary key lookups are faster than term queries. This is because the shard is known in advance for the respective document ids. Hence customers may use document id storage as a primary key search.
- The broken idempotency experience for auto generated ids for client-side retries. Sometimes the client itself could be a streaming system like kafka, which itself could introduce duplicates. Hence, to get around this problem, users would use the kafka sequence ids as document ids.
While custom document ids work well for a single index, the idempotency experience breaks across multiple write indices e.g. data streams or alias. A typical set up for a time series use case decouples the rollover from ingestion. Customers may set up automated policies for their indices and would point the writes directly to data streams or aliases. At the time of rollover, the index switches and the idempotency constraint breaks.
The indexing rate increases when the index is rolled over and continuously degrades with time until the next rollover window. Using document ids is the culprit in those cases. It degrades the performance without providing the guarantee. However, customers use it either because they are unaware of this problem or they are better off with a best effort guarantee rather than having no control over duplicates at all.
Problem
Customers do not have a mechanism to define idempotency window for their data in order to avoid duplicates. To achieve this, they use custom document ids as the idempotency key and rollover window as the idempotency window. When index rolls over, the idempotency is gone and it restarts. It may not have been intended this way, but this is how the system is being used today and that too with broken guarantees.
Customers may be fine with a smaller idempotency window than the rollover window since rolling over will increase the size of the shards. That will also help them control the performance degradation by controlling the size of the idempotency key index to be searched.
Data streams that use custom document ids as an idempotency mechanism suffer from a wide variety of problems:
- Unique document ids result in unnecessary storage consumption and the look-ups are costlier depending on how the document id is constructed. The look-ups result in searching the document id in Lucene as a term query.
- Customers don't need indefinite idempotency. The idempotency is generally required over a moving window period post which there is no need to maintain the idempotency key. The keys can be dropped over a moving time window. Also, once the index has rolled-over and doesn't require updates/additions(frozen), we should drop this idempotency data structure.
- Is FST the right data structure for checking existence/absence. BloomFilter format can be used as an inspiration for storing the idempotency key?
- No support for moving idempotency windows.
Solution
The high level solution is to decouple rollover window from idempotency window and provide a rolling window idempotency experience. The details of the solution have not been worked out yet.
@dblock @nknize What are your thoughts?
I'll dig in a little further in a follow-up but wanted to throw one quick thought:
Is FST the right data structure for checking existence/absence. BloomFilter format can be used as an inspiration for storing the idempotency key?
This is a reasonable question worth exploring. Two things to consider,
-
a hash function without collisions. I believe murmurhash2 and 3 can suffer collisions which is why it's limited in application. This is fine since Bloom filters were really only intended for detecting "possible" existence.
-
FST is space efficient at scale and can be highly compressible. We'd want to benchmark space consumption (which I think will also depend on the hash function) when using BloomFilter vs FST.
I'll dig in a little further and follow up shortly.
Bloom filter is probabilistic, so just using bloom filter will not be enough for determinism. We will use it in addition to some other data structure to optimize read performance, right? Created a separate thread to explore this question.
@muralikpbhat What do you think?
These improvements will prove useful even for customers using auto generated ids. The ids are unique per document and are stored as terms in FST. This causes the FST index for this field to grow much more than that of terms which are mostly common across all documents. The id index is almost never read, except during segment merging, at which point we pay unnecessary CPU and memory(mmaped) for this field when we know that the new merged segment is never going to be read ever. Another question that arises now is whether I should even merge this index. Unfortunately since storage of this index is coupled with Lucene segment storage- we do not have an option till we decouple the two.
With an idempotency window, we can drop or even avoid the FST term index after some time, thereby saving memory/CPU and disk costs.