Consistency with consolidation on S3
Our consolidation algorithm deletes the fragments it consolidates. Because S3 offers only eventual consistency for object deletions, deleted fragments may still be "visible" for quite some time after consolidation. This can degrade read performance and cause other errors in the read algorithm.
A potential fix is to add deletion markers (special files indicating that a fragment directory has been deleted) and have reads check for the marker's existence before loading the fragment metadata. There may be other solutions, so let's get the discussion going in this issue.
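For illustration, here is a minimal sketch of what such a read-side check could look like, assuming a hypothetical `.deleted` marker object per fragment directory and using boto3 (the marker name and layout are placeholders, not a decided design):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Hypothetical marker name; the actual name/format would be decided in this issue.
DELETION_MARKER = ".deleted"

def fragment_is_deleted(bucket: str, fragment_prefix: str) -> bool:
    """Return True if a deletion marker object exists for this fragment directory."""
    key = f"{fragment_prefix.rstrip('/')}/{DELETION_MARKER}"
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise

# A reader would then skip marked fragments before loading their metadata:
# fragments = [f for f in listed_fragments if not fragment_is_deleted(bucket, f)]
```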
Until a deletion marker file is added (or some other strategy is adopted): if all fragments of an array in a bucket fit on the disk of a single EC2 node, could read performance (and overall application throughput) be improved by copying those fragments out of the bucket, consolidating locally, and then copying the consolidated fragment(s) back into S3?
The comparison case would be simply reading a consolidated array on S3 where the obsolete fragments are still hanging around.
The problem is with the old (consolidated) fragments. Therefore, one "hacky" workaround is to perform ingestion and consolidation locally and then sync the consolidated array up to S3 (the local filesystem properly deletes the consolidated fragments); a minimal sketch of this option is below. Alternatively, you could perform consolidation on S3, but after every consolidation operation you would need to manually copy each consolidated fragment into a separate array that stores only consolidated fragments.
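As a rough sketch of the first option (assuming TileDB-Py's `tiledb.consolidate` and the AWS CLI are available; the paths and bucket are placeholders):

```python
import subprocess
import tiledb

# Placeholder locations; adjust to your layout.
LOCAL_ARRAY = "/mnt/scratch/my_array"   # array is created and ingested here, on local disk
S3_ARRAY = "s3://my-bucket/my_array"

# 1. ... run your usual ingestion/writes against LOCAL_ARRAY ...

# 2. Consolidate locally; the local filesystem deletes the old fragments immediately,
#    so no stale fragment directories ever reach S3.
tiledb.consolidate(LOCAL_ARRAY)

# 3. Sync the consolidated array up to the bucket.
subprocess.run(["aws", "s3", "sync", LOCAL_ARRAY, S3_ARRAY], check=True)
```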
Of course neither approach is ideal. We'll discuss internally whether we can escalate this and address it properly (with deletion markers) over the next 1-2 weeks.
Yes, since consolidation has to read the entire fragment files anyway, this should result in the highest throughput. The alternative is to stream the fragments from S3 without materializing them to disk; for that to match the I/O throughput of simply downloading the entire fragment and re-uploading the result, the consolidation batch size needs to be tuned, and in particular it has to be large enough to amortize the higher per-request latency to S3 (a rough back-of-the-envelope illustration is below). In addition, you pay for every API call to S3, so downloading whole fragments might help there as well.
It's important to make the streaming approach work in order to minimize the on-disk / EBS requirements for each node, which can be operationally complicated with technologies like Kubernetes or serverless. But it's most likely not going to be the highest-performance option.
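To make the batch-size point concrete, here is a back-of-the-envelope calculation with purely illustrative numbers (the latency and bandwidth figures are assumptions, not measurements):

```python
# How large does each S3 request/batch need to be so that per-request latency
# stays a small fraction of the total transfer time?
latency_s = 0.02          # assumed per-request latency to S3 (20 ms)
bandwidth_mb_s = 100.0    # assumed sustained throughput per stream (100 MB/s)
target_efficiency = 0.9   # want >= 90% of the raw bandwidth

# effective = size / (latency + size / bandwidth) >= target * bandwidth
# => size >= target / (1 - target) * bandwidth * latency
min_request_mb = target_efficiency / (1 - target_efficiency) * bandwidth_mb_s * latency_s
print(f"Each request/batch should cover at least ~{min_request_mb:.0f} MB")  # ~18 MB here
```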
Thanks for the quick comment.
Every user's deployment scenario will be slightly different, so (as the developers/creators of the technology) you will have to balance all of the desired operational environments. I can live with my own quick hacks for now, but you certainly should not design around that use case. :)
I am currently doing a sort of progressive consolidation in AWS Batch: lots of operations happen in parallel in a job array, with local consolidation happening in each job, and the results of each local consolidation are copied up to S3. The question was then what to do next. For now I am going to experiment with launching a dependent job that collects all of those individual fragments, performs local consolidation, and copies the single resulting fragment back to S3, roughly like the sketch below. A subsequent array job in the pipeline then performs a large number of parallel reads.
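For reference, the job wiring looks roughly like this with boto3 (the queue and job definition names are placeholders for my own setup):

```python
import boto3

batch = boto3.client("batch")
QUEUE = "my-job-queue"  # placeholder

# 1. Array job: N parallel jobs, each ingesting and consolidating locally,
#    then copying its consolidated fragment up to S3.
ingest = batch.submit_job(
    jobName="ingest-and-consolidate",
    jobQueue=QUEUE,
    jobDefinition="ingest-consolidate-jobdef",
    arrayProperties={"size": 64},
)

# 2. Single dependent job: collect all fragments, consolidate locally,
#    and copy the single resulting fragment back to S3.
collect = batch.submit_job(
    jobName="collect-and-consolidate",
    jobQueue=QUEUE,
    jobDefinition="collect-consolidate-jobdef",
    dependsOn=[{"jobId": ingest["jobId"]}],  # waits for the whole array job
)

# 3. Array job of parallel readers against the consolidated array.
batch.submit_job(
    jobName="parallel-reads",
    jobQueue=QUEUE,
    jobDefinition="reader-jobdef",
    arrayProperties={"size": 256},
    dependsOn=[{"jobId": collect["jobId"]}],
)
```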