Failed to scale m3aggregator
I have been running the m3aggregator for more than an hour, but the shard state is still INITIALIZING, so I cannot scale.
$ curl http://localhost:7201/api/v1/services/m3aggregator/placement
{
"placement":{
"instances":{
"m3aggregator-0:6000":{
"id":"m3aggregator-0:6000",
"isolationGroup":"availability-zone-a",
"zone":"embedded",
"weight":100,
"endpoint":"m3aggregator-0.m3aggregator.demo.svc:6000",
"shards":[
{
"id":0,
"state":"INITIALIZING",
"sourceId":"",
"cutoverNanos":"1561974000000000000",
"cutoffNanos":"0"
},
...
When I try to add new nodes to the m3aggregator placement, it returns an error.
$ curl -X POST http://localhost:7201/api/v1/services/m3aggregator/placement -d '{
"instances": [
{
"id": "m3aggregator-2:6000",
"isolation_group": "availability-zone-a",
"zone": "embedded",
"weight": 100,
"endpoint": "m3aggregator-2.m3aggregator.demo.svc:6000",
"hostname": "m3aggregator-2",
"port": 6000
},
{
"id": "m3aggregator-3:6000",
"isolation_group": "availability-zone-b",
"zone": "embedded",
"weight": 100,
"endpoint": "m3aggregator-3.m3aggregator.demo.svc:6000",
"hostname": "m3aggregator-3",
"port": 6000
}
]
}'
{"error":"instances [m3aggregator-0:6000,m3aggregator-1:6000] do not have all shards available"}
However, I am still receiving metric data for the aggregated namespace.
What could be the reason for this, and how can I fix it? What is responsible for updating the shard state?
@robskillington any idea about this? Do shards need to be marked as available manually?
That might be the case. @cw9, for initial placements, do shards need to be marked available manually for the aggregator?
Hmm, I don't think so; the shard state does not really matter in m3aggregator. Is the cutover/cutoff time correct?
@cw9 I used the configuration from this guide: https://github.com/m3db/m3/blob/6db13050aa78e365ec614126ef414734033c3408/docs/how_to/aggregator.md
If the shard state doesn't matter, why does the call fail with an error about shard state?
{"error":"instances [m3aggregator-0:6000,m3aggregator-1:6000] do not have all shards available"}
FYI, this API call is made to the m3coordinator.
By the way, what are the cutover/cutoff times and what is their role here?
Yeah, that seems to be a log in the coordinator code; the coordinator should probably do a MarkAllShardsAvailable and then Add/Remove/Replace.
@robskillington
The cutover/cutoff time is used to route traffic between collectors and aggregators during placement updates. For example, say shard1 used to live on instance1 and was moved to instance2 during a placement change. In the new placement, shard1 will still exist on instance1 with a cutoff time t, and instance1 will continue to process traffic for shard1 until t. It will also exist on instance2 with a cutover time t, and instance2 will start to process traffic for shard1 after t.
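In placement JSON terms, that move would look roughly like the excerpt below (hypothetical instance names and an illustrative nanosecond timestamp, following the format of the placement output above; the shard states are my understanding of how a move is represented, the key point being the cutoff on the old owner and the cutover on the new owner):
"instance1:6000":{
  "shards":[
    {
      "id":1,
      "state":"LEAVING",
      "cutoverNanos":"0",
      "cutoffNanos":"1561974000000000000"
    }
  ]
},
"instance2:6000":{
  "shards":[
    {
      "id":1,
      "state":"INITIALIZING",
      "cutoverNanos":"1561974000000000000",
      "cutoffNanos":"0"
    }
  ]
}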
Yep. While debugging a different issue, I ran the m3aggregator integration test.sh script locally and added a curl call to print the placement state before each iteration. The placement status never changed from INITIALIZING, although the test passed.
I am still seeing the problem as of 0.14.2. Aggregator scaling does not work because the shards sit in INITIALIZING forever.
I have the same problem on 0.15.
@gibbscullen the problem still exists
Reopening to investigate further.
By design, all shards must be available before any new instances can be added to the aggregator placement, because adding instances requires reshuffling shards across all existing instances: https://github.com/m3db/m3/blob/master/src/cluster/placement/algo/mirrored.go#L152
You might want to check that shards are marked available before adding new instances.
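As a quick sanity check before adding instances, something like the following should list any shards that are not yet AVAILABLE (assuming jq is installed; it queries the same placement endpoint used above):
$ curl -s http://localhost:7201/api/v1/services/m3aggregator/placement \
    | jq '.placement.instances[].shards[] | select(.state != "AVAILABLE")'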
@yackushevas -- thanks for following up on this. For the time being, use the placement set API on the coordinator to manually move the shards from initializing to available. We are aware of this issue, and it will be addressed in the medium-term future with better end to end tooling for using the aggregator.
How do I use the placement set API to mark the shards as available manually?
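For anyone else hitting this, my understanding of the suggested workaround is roughly the sketch below. The /placement/set endpoint, the top-level version field in the GET response, and the confirm flag are my reading of the coordinator's placement set API, not something confirmed in this thread, so please verify them against your coordinator version before running this:
# Fetch the current placement, flip every shard state to AVAILABLE, and keep the version.
$ curl -s http://localhost:7201/api/v1/services/m3aggregator/placement \
    | jq '{placement: (.placement | .instances[].shards[].state = "AVAILABLE"), version: .version, confirm: true}' \
    > set_available.json
# Post the modified placement back via the set API.
$ curl -X POST http://localhost:7201/api/v1/services/m3aggregator/placement/set -d @set_available.json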
Can this be fixed? Scaling the aggregator tier with this bug is very annoying.