Split cases collection into collections by sourceId
Is your feature request related to a problem? Please describe.
Currently all data is in a single cases collection. This causes problems for operations such as prune, which must take a write lock on the collection and update millions of cases by setting flags. Operations such as export also get slower as the collection grows.
Describe the solution you'd like
Split the cases collection by sourceId. Parallel ingestions should then be faster, because no two simultaneous ingestions will operate on the same collection. Operations such as export also become simpler, especially if we change the export unit to be by source rather than by country.
Prune need not be a separate, time-consuming operation: for a single source, we can use .renameCollection() to rename the current collection to collection-old and replace it with the staging collection. As renameCollection() does not involve a copy (it only changes the metadata), this should be much faster.
The benefit is that housekeeping operations related to ingestion (export, prune) can be done as part of the ingestion process or at the database level via triggers on collections, as suggested by @jim-sheldon. Making collections smaller will also make these database operations faster.
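The rename-based prune described above could look roughly like the following. This is only a sketch under stated assumptions: the `cases_<sourceId>` naming scheme, the `-staging` and `-old` suffixes, and both function names are hypothetical, and `db` is assumed to be a PyMongo-style database object exposing `list_collection_names()` and `Collection.rename()`.

```python
from typing import Any, List


def cases_collection_name(source_id: str) -> str:
    """Hypothetical naming scheme: one cases collection per source."""
    return f"cases_{source_id}"


def swap_in_staging(db: Any, source_id: str) -> List[str]:
    """Replace a source's live collection with its staging collection.

    `db` is assumed to behave like a pymongo Database. Returns a list
    describing the renames that were performed, in order.
    """
    live = cases_collection_name(source_id)
    staging = f"{live}-staging"  # assumed suffix for freshly ingested data
    old = f"{live}-old"          # assumed suffix for the previous data

    existing = set(db.list_collection_names())
    if staging not in existing:
        raise ValueError(f"no staging collection for source {source_id}")

    steps = []
    if live in existing:
        # Metadata-only rename; dropTarget discards any earlier "-old" copy.
        db[live].rename(old, dropTarget=True)
        steps.append(f"{live} -> {old}")

    # Promote the staging collection to be the live one.
    db[staging].rename(live)
    steps.append(f"{staging} -> {live}")
    return steps
```

Note that the two renames are separate commands, so readers may briefly see no live collection for that source between them; whether that gap is acceptable would need to be decided as part of the design.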
Describe alternatives you've considered
Keep the status quo. It mostly works, though we would expect scaling issues to get worse if we get 2-3x the current number of cases (~100m).