dcos-cassandra-service icon indicating copy to clipboard operation
dcos-cassandra-service copied to clipboard

Bug: "Unable to find a volume with persistence id" while creating a new cluster

Open verma7 opened this issue 8 years ago • 2 comments

I am getting this error almost every time when I try to start a new cluster:

INFO  [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: Found Offer meeting placement constraints: id {
  value: "448075ee-ee69-4fba-9d9d-9157746b3841-O644"
}
framework_id {
  value: "7a570f10-67bd-464b-be11-2696d6c2a5c6-0000"
}
slave_id {
  value: "3db86480-98e4-41c7-a341-195f8cc10cef-S23"
}
hostname: "compute23-sjc1.prod.uber.internal"
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 24.0
  }
  role: "*"
}
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 120635.0
  }
  role: "*"
}
resources {
  name: "disk"
  type: SCALAR
  scalar {
    value: 995927.0
  }
  role: "*"
}
resources {
  name: "ports"
  type: RANGES
  ranges {
    range {
      begin: 31000
      end: 32000
    }
  }
  role: "*"
}
attributes {
  name: "host"
  type: TEXT
  text {
    value: "compute23-sjc1"
  }
}
attributes {
  name: "kernel"
  type: SCALAR
  scalar {
    value: 3.18
  }
}
attributes {
  name: "pod"
  type: TEXT
  text {
    value: "d"
  }
}
attributes {
  name: "rack"
  type: SCALAR
  scalar {
    value: 1010.0
  }
}
url {
  scheme: "http"
  address {
    hostname: "compute23-sjc1.prod.uber.internal"
    ip: "10.163.31.21"
    port: 5051
  }
  path: "/slave(1)"
}

INFO  [2016-05-04 03:55:56,117] org.apache.mesos.offer.ResourceUtils: Selected disk: type = ROOTresource = name: "disk"
type: SCALAR
scalar {
  value: 995927.0
}
role: "*"

INFO  [2016-05-04 03:55:56,117] org.apache.mesos.offer.ResourceUtils: Selected disk: type = ROOTresource = name: "disk"
type: SCALAR
scalar {
  value: 995927.0
}
role: "*"

ERROR [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: Unable to find a volume with persistence id: c9bd2df4-c7ab-4bce-8540-b578c6bb2885
INFO  [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: VolumeMode is EXISTING and VolumeType is ROOT hasExpectedVolumes is false
INFO  [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: EnoughCPU: true EnoughMem: true EnoughDisk: true EnoughPorts: true HasExpectedVolumes: false
WARN  [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: No Offers found meeting Resource constraints.
WARN  [2016-05-04 03:55:56,117] org.apache.mesos.offer.OfferEvaluator: No acceptable offers due to insufficient resources.

verma7 avatar May 04 '16 04:05 verma7

Here is the complete log file (from starting the cluster): uber-cassandra-mesos.log.txt

@mohitsoni Can you please take a look?

verma7 avatar May 04 '16 04:05 verma7

This is the sequence of events:

  1. Scheduler attempts to launch the task with RESERVE+CREATE+LAUNCH
  2. Scheduler drops dead
  3. Scheduler comes back, performs reconcilation
  4. Receives a status from Mesos that task is still in staging. Reconcile finishes.
  5. Then this happens: https://github.com/mesosphere/dcos-cassandra-service/blob/master/cassandra-scheduler/src/main/java/com/mesosphere/dcos/cassandra/scheduler/plan/CassandraDaemonBlock.java#L200
  6. Scheduler tries to replace this task, and updates the ZK state with new taskId for node-1
  7. Keeps attempting to launch the task, and not finding volume (as you suggested)
  8. At a later time, scheduler gets above TASK_STATUS back from Mesos that the node-1 task from previous life is now running. But, the new scheduler ignore the update, as it no longer know about the old taskId
INFO  [2016-05-04 03:46:38,053] com.mesosphere.dcos.cassandra.scheduler.tasks.CassandraTasks: Received status update for unrecorded task: status = task_id {
22682   value: "node-1_8a1e990b-32a2-41e5-91a2-5160cb89bc58"
22683 }
22684 state: TASK_RUNNING
22685 data: "\b\001\020\001"
22686 slave_id {
22687   value: "3db86480-98e4-41c7-a341-195f8cc10cef-S16"
22688 }
22689 timestamp: 1.462333597970708E9
22690 executor_id {
22691   value: "node-1_8a1e990b-32a2-41e5-91a2-5160cb89bc58_executor"
22692 }
22693 source: SOURCE_EXECUTOR
22694 uuid: "u\237\276\302\254$B6\254Wo\273\337\303\2330"
22695 container_status {
22696   network_infos {
22697     ip_address: "10.163.1.18"
22698     ip_addresses {
22699       ip_address: "10.163.1.18"
22700     }
22701   }
22702 }

mohitsoni avatar May 06 '16 00:05 mohitsoni