dcos-cassandra-service
dcos-cassandra-service copied to clipboard
Bug: "Unable to find a volume with persistence id" while creating a new cluster
I am getting this error almost every time when I try to start a new cluster:
INFO [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: Found Offer meeting placement constraints: id {
value: "448075ee-ee69-4fba-9d9d-9157746b3841-O644"
}
framework_id {
value: "7a570f10-67bd-464b-be11-2696d6c2a5c6-0000"
}
slave_id {
value: "3db86480-98e4-41c7-a341-195f8cc10cef-S23"
}
hostname: "compute23-sjc1.prod.uber.internal"
resources {
name: "cpus"
type: SCALAR
scalar {
value: 24.0
}
role: "*"
}
resources {
name: "mem"
type: SCALAR
scalar {
value: 120635.0
}
role: "*"
}
resources {
name: "disk"
type: SCALAR
scalar {
value: 995927.0
}
role: "*"
}
resources {
name: "ports"
type: RANGES
ranges {
range {
begin: 31000
end: 32000
}
}
role: "*"
}
attributes {
name: "host"
type: TEXT
text {
value: "compute23-sjc1"
}
}
attributes {
name: "kernel"
type: SCALAR
scalar {
value: 3.18
}
}
attributes {
name: "pod"
type: TEXT
text {
value: "d"
}
}
attributes {
name: "rack"
type: SCALAR
scalar {
value: 1010.0
}
}
url {
scheme: "http"
address {
hostname: "compute23-sjc1.prod.uber.internal"
ip: "10.163.31.21"
port: 5051
}
path: "/slave(1)"
}
INFO [2016-05-04 03:55:56,117] org.apache.mesos.offer.ResourceUtils: Selected disk: type = ROOTresource = name: "disk"
type: SCALAR
scalar {
value: 995927.0
}
role: "*"
INFO [2016-05-04 03:55:56,117] org.apache.mesos.offer.ResourceUtils: Selected disk: type = ROOTresource = name: "disk"
type: SCALAR
scalar {
value: 995927.0
}
role: "*"
ERROR [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: Unable to find a volume with persistence id: c9bd2df4-c7ab-4bce-8540-b578c6bb2885
INFO [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: VolumeMode is EXISTING and VolumeType is ROOT hasExpectedVolumes is false
INFO [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: EnoughCPU: true EnoughMem: true EnoughDisk: true EnoughPorts: true HasExpectedVolumes: false
WARN [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: No Offers found meeting Resource constraints.
WARN [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: No acceptable offers due to insufficient resources.
Here is the complete log file (from starting the cluster): uber-cassandra-mesos.log.txt
@mohitsoni Can you please take a look?
This is the sequence of events:
- Scheduler attempts to launch the task with RESERVE+CREATE+LAUNCH
- Scheduler drops dead
- Scheduler comes back, performs reconcilation
- Receives a status from Mesos that task is still in staging. Reconcile finishes.
- Then this happens: https://github.com/mesosphere/dcos-cassandra-service/blob/master/cassandra-scheduler/src/main/java/com/mesosphere/dcos/cassandra/scheduler/plan/CassandraDaemonBlock.java#L200
- Scheduler tries to replace this task, and updates the ZK state with new taskId for node-1
- Keeps attempting to launch the task, and not finding volume (as you suggested)
- At a later time, scheduler gets above TASK_STATUS back from Mesos that the node-1 task from previous life is now running. But, the new scheduler ignore the update, as it no longer know about the old taskId
INFO [2016-05-04 03:46:38,053] com.mesosphere.dcos.cassandra.scheduler.tasks.CassandraTasks: Received status update for unrecorded task: status = task_id {
22682 value: "node-1_8a1e990b-32a2-41e5-91a2-5160cb89bc58"
22683 }
22684 state: TASK_RUNNING
22685 data: "\b\001\020\001"
22686 slave_id {
22687 value: "3db86480-98e4-41c7-a341-195f8cc10cef-S16"
22688 }
22689 timestamp: 1.462333597970708E9
22690 executor_id {
22691 value: "node-1_8a1e990b-32a2-41e5-91a2-5160cb89bc58_executor"
22692 }
22693 source: SOURCE_EXECUTOR
22694 uuid: "u\237\276\302\254$B6\254Wo\273\337\303\2330"
22695 container_status {
22696 network_infos {
22697 ip_address: "10.163.1.18"
22698 ip_addresses {
22699 ip_address: "10.163.1.18"
22700 }
22701 }
22702 }