
Split cases collection into collections by sourceId

Open abhidg opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe.
Currently all data is in a single cases collection. This causes problems for operations such as prune, which must take a write lock on the collection and update millions of cases by setting flags. Operations such as export also get slower as the collection grows.

Describe the solution you'd like
Split the cases collection into one collection per sourceId. Parallel ingestions should then be faster, since no two simultaneous ingestions would operate on the same collection. Operations such as export also become simpler, especially if we change the export unit to be by source rather than by country. Prune need not be a separate, time-consuming operation: for a single source we can .renameCollection() the current collection to collection-old and replace it with the staging collection. As renameCollection() does not involve a copy (it only changes metadata), this should be much faster. A further benefit is that housekeeping operations related to ingestion (export, prune) can be done as part of the ingestion process, or at the database level via triggers on collections, as suggested by @jim-sheldon. Smaller collections will make these database operations faster as well.
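
A rough sketch of the rename-based prune, to make the idea concrete. It assumes a PyMongo-style database handle (`list_collection_names()` and `Collection.rename()`). The `cases_<sourceId>` naming scheme, the `_staging` / `_old` suffixes and the function names are hypothetical and not part of the current codebase.

```python
def collection_name(source_id: str, suffix: str = "") -> str:
    """Hypothetical naming scheme: one collection per source."""
    return f"cases_{source_id}{suffix}"


def swap_in_staging(db, source_id: str) -> str:
    """Replace the live collection for one source with its staging collection.

    `db` is assumed to behave like a pymongo Database. The swap is two
    metadata-only renames rather than a flag update over every case.
    """
    live = collection_name(source_id)
    staging = collection_name(source_id, "_staging")
    old = collection_name(source_id, "_old")

    names = db.list_collection_names()
    if staging not in names:
        raise ValueError(f"no staging collection for source {source_id}")

    if live in names:
        # Keep the previous data as <live>_old, dropping any earlier backup.
        db[live].rename(old, dropTarget=True)

    # Promote the freshly ingested staging data to be the live collection.
    db[staging].rename(live)
    return live
```

Usage would be something like `swap_in_staging(client["covid19"], source_id)` at the end of a successful ingestion, where the database name is again only a placeholder. Note that the two renames are separate operations, so there is a short window in which the live collection does not exist; readers would need to tolerate that or retry.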

Describe alternatives you've considered
Keep the status quo. It mostly works, though we would expect scaling issues to get worse if we reach 2-3x the current number of cases (~100m).

abhidg, Feb 25 '22 14:02