m3
[DBNode][WIP] Don't block database initialization on namespace initialization
What this PR does / why we need it: This PR allows M3DB to come up and mark itself as bootstrapped without any configured namespaces. This has several benefits:
- Simplifies DB operation since new clusters can be created and brought to a healthy state without needing to add a "dummy" namespace.
- New clusters can bootstrap more efficiently, since the database can bootstrap and mark all of its shards as available before writing out all the files for any existing namespaces (which can sometimes take a very long time).
TODO(rartoul): Add tests for new code path
TODO(rartoul): Remove error logs that occur when database is initialized but there are no namespaces
Does this differentiate issues with delayed namespace configuration vs. new cluster creation? E.g., if etcd is down or takes a while to push an existing ns config, we shouldn't let the dbnode bootstrap.
Codecov Report
Merging #1750 into master will decrease coverage by <.1%. The diff coverage is 69.6%.
@@ Coverage Diff @@
## master #1750 +/- ##
========================================
- Coverage 71.9% 71.9% -0.1%
========================================
Files 982 982
Lines 82097 82106 +9
========================================
- Hits 59092 59087 -5
- Misses 19108 19119 +11
- Partials 3897 3900 +3
| Flag | Coverage Δ | |
|---|---|---|
| #aggregator | 82.4% <ø> (-0.1%) | :arrow_down: |
| #cluster | 85.7% <ø> (ø) | :arrow_up: |
| #collector | 63.9% <ø> (ø) | :arrow_up: |
| #dbnode | 80% <69.6%> (-0.1%) | :arrow_down: |
| #m3em | 73.2% <ø> (ø) | :arrow_up: |
| #m3ninx | 74.1% <ø> (ø) | :arrow_up: |
| #m3nsch | 51.1% <ø> (ø) | :arrow_up: |
| #metrics | 17.6% <ø> (ø) | :arrow_up: |
| #msg | 74.9% <ø> (+0.1%) | :arrow_up: |
| #query | 66.3% <ø> (ø) | :arrow_up: |
| #x | 85.1% <ø> (ø) | :arrow_up: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Powered by Codecov. Last update c897aa1...8c698c4.
@prateek yeah, that's the idea behind the "ForceGet" method. Basically we want to go straight to etcd, and only if etcd is available and concretely tells you there are no namespaces can you "skip" past waiting for an initial value. Otherwise you block waiting for an initial value like we do now.
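The decision described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the `namespaceStore` interface, its `forceGet` signature, `fakeStore`, `errStoreUnavailable` and `canSkipInitialWait` are all hypothetical names made up for the example; only the idea of a "ForceGet"-style direct read comes from the comment.

```go
package main

import (
	"errors"
	"fmt"
)

// errStoreUnavailable is a hypothetical error standing in for "etcd could not be reached".
var errStoreUnavailable = errors.New("config store unavailable")

// namespaceStore is a hypothetical stand-in for a direct (uncached) read
// of the namespace registry from etcd.
type namespaceStore interface {
	forceGet() ([]string, error)
}

// fakeStore is an in-memory namespaceStore used only for this illustration.
type fakeStore struct {
	namespaces []string
	err        error
}

func (s fakeStore) forceGet() ([]string, error) {
	return s.namespaces, s.err
}

// canSkipInitialWait reports whether initialization may proceed without
// waiting for an initial namespace value. It returns true only when the
// store answered successfully AND reported zero namespaces. Any error
// (for example the store being down) means we must keep waiting.
func canSkipInitialWait(s namespaceStore) bool {
	namespaces, err := s.forceGet()
	if err != nil {
		return false
	}
	return len(namespaces) == 0
}

func main() {
	fmt.Println("empty, healthy store:", canSkipInitialWait(fakeStore{}))
	fmt.Println("store unavailable:", canSkipInitialWait(fakeStore{err: errStoreUnavailable}))
	fmt.Println("existing namespace:", canSkipInitialWait(fakeStore{namespaces: []string{"metrics"}}))
}
```

The point of the sketch is that an unreachable store and an empty store are treated differently: only a definite "no namespaces" answer lets the node skip the wait, which is what separates new cluster creation from a delayed namespace config.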
Hey this is much better - do you mind outlining how (2) is possible with this change?
@richardartoul cool, makes sense
@robskillington It helps with #2 because what happens currently in the worst case is:
1. Wait for placement
2. Placement is set
3. Wait for namespace
4. Namespace with long retention is set
5. Node skips first flush (this is a separate issue which I'll fix in a separate PR; it happens because of the guard rails we put in place last time we had that major bootstrapping / data loss bug, but it's no longer needed now that Justin has refactored the buffer code)
6. Node does snapshot (and it has to snapshot ALL blocks)
7. At this point node MAY mark shards available depending on when the tick started. If not, it has to go through the flush/snapshotting process one more time
8. Node marks shards as available
After this PR lands:
1. Wait for placement
2. Placement is set
3. Node bootstraps
4. Node marks shards as available (we get here within seconds)
5. Namespace is added
6. Namespace is bootstrapped (extremely quick because there is no data)
So now we're up within a few seconds as long as you don't add the namespaces before you create a placement. The downside, of course, is that if writing out the initial set of files still takes forever, you could end up holding data in memory for longer than you expected. I'm not too worried about that, because part of the reason writing out those files takes so long is that sometimes we have to do it twice when the first flush gets skipped (due to this code, which I will delete in a separate PR: https://github.com/m3db/m3/blob/master/src/dbnode/storage/namespace.go#L969).
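For readers following along, here is a toy model of the post-PR ordering: placement first, then bootstrap with zero namespaces, then a namespace added afterwards. It is only a sketch under stated assumptions. The `node` type and all of its fields and methods are invented for illustration and do not correspond to M3DB's actual types, and the model deliberately ignores flushing and snapshotting.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a toy model of a DB node. None of these names come from M3DB.
type node struct {
	placementSet    bool
	bootstrapped    bool
	shardsAvailable bool
	namespaces      []string
	events          []string // ordered log of what happened
}

func (n *node) setPlacement() {
	n.placementSet = true
	n.events = append(n.events, "placement set")
}

// bootstrap needs a placement but, in this model, no namespaces.
// With nothing to write out, shards are marked available right away.
func (n *node) bootstrap() error {
	if !n.placementSet {
		return errors.New("cannot bootstrap before a placement is set")
	}
	n.bootstrapped = true
	n.events = append(n.events, "node bootstrapped")
	n.shardsAvailable = true
	n.events = append(n.events, "shards marked available")
	return nil
}

// addNamespace registers a namespace. If the node is already bootstrapped,
// the new (empty) namespace is considered bootstrapped immediately.
func (n *node) addNamespace(name string) {
	n.namespaces = append(n.namespaces, name)
	n.events = append(n.events, "namespace added: "+name)
	if n.bootstrapped {
		n.events = append(n.events, "namespace bootstrapped: "+name)
	}
}

func main() {
	n := &node{}
	n.setPlacement()
	if err := n.bootstrap(); err != nil {
		fmt.Println("bootstrap failed:", err)
		return
	}
	n.addNamespace("metrics")
	for i, e := range n.events {
		fmt.Printf("%d. %s\n", i+1, e)
	}
}
```

The model only captures the ordering argument: shards become available before any namespace exists, so the slow file-writing path is no longer on the critical path to availability.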
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
Richard Artoul seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.