cf-abacus
Inconsistent data owing to non-transactional writes to the database.
The applications in the pipeline that reduce usage generate multiple output documents. One of the output docs is the duplicate detection doc and the others are the accumulated ones. With multiple DB partitions, the partition to which each doc is written is based on the id of the doc. If one of these writes fails, the docs that were already written to the DB leave the entity in an inconsistent state.
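Roughly, the partition for a doc is derived from its key, so sibling docs produced by one aggregation step can map to different partitions and databases. A minimal sketch of that idea (a hash-mod scheme assumed for illustration, not the actual abacus-partition code):

```js
// Hypothetical sketch of key-based partitioning, not the actual
// abacus-partition implementation: the partition number is derived from
// the doc key, so docs from one aggregation can land in different DBs.
const crypto = require('crypto');

// Number of partitions, as configured via DB_PARTITIONS below
const partitions = parseInt(process.env.DB_PARTITIONS || '4', 10);

// Map a doc key onto one of the partitions by hashing it
const partitionFor = (key) =>
  crypto.createHash('md5').update(key).digest().readUInt32BE(0) % partitions;

console.log(partitionFor('k/us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27/t/0001480723200000'));
```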
For example, if we start the pipeline with the following configuration
```sh
export SAMPLING=86400000
export SLACK=5D
export DB_PARTITIONS=4
npm start
```
and submit the following usage for November 30th on December 3rd
```json
{
  "start": 1480464000000,
  "end": 1480464000000,
  "organization_id": "us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27",
  "space_id": "aaeae239-f3f8-483c-9dd0-de5d41c38b6a",
  "consumer_id": "app:bbeae239-f3f8-483c-9dd0-de6781c38bab",
  "resource_id": "object-storage",
  "plan_id": "basic",
  "resource_instance_id": "0b39fa70-a65f-4183-bae8-385633ca5c87",
  "measured_usage": [
    {
      "measure": "storage",
      "quantity": 1073741824
    },
    {
      "measure": "light_api_calls",
      "quantity": 1000
    },
    {
      "measure": "heavy_api_calls",
      "quantity": 100
    }
  ]
}
```
the aggregator will produce 3 documents with the following ids
- k/us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27/t/0001480723200000 (written to database abacus-aggregator-aggregated-usage-2-201612)
- k/us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27/aaeae239-f3f8-483c-9dd0-de5d41c38b6a/app:bbeae239-f3f8-483c-9dd0-de6781c38bab/t/0001480723200000 (written to database abacus-aggregator-aggregated-usage-3-201612)
- k/us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27/0b39fa70-a65f-4183-bae8-385633ca5c87/app:bbeae239-f3f8-483c-9dd0-de6781c38bab/basic/basic-object-storage/object-rating-plan/object-pricing-basic/t/0001480464000000/0001480464000000 (written to database abacus-aggregator-aggregated-usage-3-201611)
If CouchDB is used as the backend database, the writes happen in 3 different HTTP requests. If some writes succeed and others fail, the result is inconsistent accumulated values.
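In other words, there is no cross-database transaction wrapping those writes. A minimal sketch under that assumption (the `db.put` client shown is a PouchDB/CouchDB-style stand-in, not necessarily the exact abacus-dbclient API):

```js
// Hypothetical sketch of the non-atomic write path: each doc is stored
// with its own request against the database selected by its key, and any
// subset of the requests can fail.
const storeAggregatedDocs = async (db, docs) => {
  const results = await Promise.allSettled(docs.map((doc) => db.put(doc)));
  const failed = results.filter((r) => r.status === 'rejected');
  if (failed.length > 0 && failed.length < docs.length)
    // Partial failure: some partitions already hold the new aggregation,
    // others do not -- the inconsistent state described above.
    throw new Error(`partial write: ${failed.length} of ${docs.length} docs failed`);
};
```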
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/135472301
The labels on this github issue will be updated when the story is started.
Do we want to maintain a consistent set of docs in a transactional manner, or just make sure that reporting does not fetch something inconsistent? Is it just reporting, or the sink extensions as well?
What problems does such inconsistency cause? I can imagine an incorrect report, but are there other side effects as well?
@hsiliev The inconsistency will result in incorrectly aggregated values. The side effects depend on the resource.
Let's say the accumulator posted an accumulated document for a runtime event to the aggregator; two things can go wrong (sketched in code after the list below):
- If the duplicate detection doc was not written while the org and consumer aggregated documents were written, the caller gets an error code, let's say 500, and the accumulator retries reporting the accumulated usage. This usage has already been aggregated at the org and consumer levels, yet it will not be rejected because the duplicate detection document is missing, so it will be aggregated at the org and consumer levels again. For a runtime start event this doubles the consumption; for a runtime stop event it shows a decreased (sometimes negative) consumption value.
- If the duplicate detection doc was written but the org or consumer aggregation fails, the caller will retry. The retried document will then be rejected as a duplicate, even though it was never aggregated at the org or consumer level.
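Both cases come down to the duplicate detection check and the aggregation writes not being atomic. A hedged sketch of that race (the db client, the doc id formats, and the `aggregate` helper are illustrative, not the actual aggregator code):

```js
// Hypothetical sketch of the two failure modes above.
const aggregate = async (db, usage) => {
  // Duplicate detection: reject usage that was already processed
  const dedupId = `dedup/${usage.resource_instance_id}/${usage.start}/${usage.end}`;
  const seen = await db.get(dedupId).catch(() => undefined);
  if (seen) {
    const err = new Error('duplicate usage');
    err.status = 409;
    throw err;
  }

  // Independent, non-atomic writes, potentially to different partitions:
  await Promise.all([
    db.put({ _id: `k/${usage.organization_id}/t/...` }),                          // org doc
    db.put({ _id: `k/${usage.organization_id}/${usage.consumer_id}/t/...` }),     // consumer doc
    db.put({ _id: dedupId })                                                      // dedup doc
  ]);
  // Case 1: the org/consumer docs land but the dedup write fails ->
  //   the caller gets a 500 and retries -> the usage is aggregated twice.
  // Case 2: the dedup doc lands but an org/consumer write fails ->
  //   the retry is rejected by the check above -> the usage is lost.
};
```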
Handling transactions with Couch or Mongo does not seem like a good idea. What if we split the aggregator into consumer and org micro-services?
If consumer aggregation fails we won't have org aggregation, but this will be more consistent than today's behaviour. In the worst scenario it should be the same as case 1 above, but it can also happen that we face problems only at the consumer level and not at the org level (or vice versa).
The drawback is of course a longer pipeline, which needs to be async and probably needs a replay running regularly.
Hello,
We have managed to reproduce the database inconsistency on MongoDB by sending the same usage (same timestamp, etc.) a few times asynchronously. Usually one of the requests returns `201 Created`, the others return status code `409`, and some of them contain the error `E11000 duplicate key error index` from the DB.
After this error we get inconsistent behavior. The doc with the new usage is usually correctly written in the DB, but when getting the org usage we sometimes get the correct usage and sometimes `HTTP/1.1 500 Internal Server Error`. We have also observed double aggregation when sending a lot of parallel requests.
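For reference, a minimal sketch of how such parallel identical submissions can be driven against a locally started pipeline (the collector port 9080 and the `/v1/metering/collected/usage` path are assumptions for a default local setup; adjust to your deployment):

```js
// Hypothetical reproduction sketch: POST the same usage doc N times in
// parallel and log the mix of 201/409/500 status codes that comes back.
const http = require('http');

const usage = JSON.stringify({
  start: 1480464000000,
  end: 1480464000000,
  organization_id: 'us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27',
  space_id: 'aaeae239-f3f8-483c-9dd0-de5d41c38b6a',
  consumer_id: 'app:bbeae239-f3f8-483c-9dd0-de6781c38bab',
  resource_id: 'object-storage',
  plan_id: 'basic',
  resource_instance_id: '0b39fa70-a65f-4183-bae8-385633ca5c87',
  measured_usage: [{ measure: 'light_api_calls', quantity: 1000 }]
});

const post = () => new Promise((resolve) => {
  const req = http.request({
    host: 'localhost', port: 9080,            // assumed local collector port
    path: '/v1/metering/collected/usage',     // assumed collector path
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }
  }, (res) => resolve(res.statusCode));
  req.end(usage);
});

// Fire 10 identical submissions at once
Promise.all(Array.from({ length: 10 }, post)).then((codes) => console.log(codes));
```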