Can't get stack working with docker network
I'm trying to create a docker-compose file to bring up a dev stack on Docker 1.9.1, which has the new network support in favour of which --link has been deprecated. Ignoring the deprecation and using this docker-compose.yml with links works fine:
zookeeper:
  image: confluent/zookeeper
  ports:
    - "2181:2181"
kafka:
  image: confluent/kafka
  ports:
    - "9092:9092"
  links:
    - zookeeper
schema-registry:
  image: confluent/schema-registry
  ports:
    - "8081:8081"
  links:
    - zookeeper
    - kafka
rest-proxy:
  image: confluent/rest-proxy
  ports:
    - "8082:8082"
  links:
    - zookeeper
    - kafka
    - schema-registry
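For what it's worth, my quick sanity check of the links-based stack, assuming the host port mappings above, is roughly:

  # bring the links-based stack up in the background
  docker-compose up -d

  # the Schema Registry answers on the mapped port once it has finished starting;
  # GET /subjects returns the (initially empty) list of registered subjects
  curl http://localhost:8081/subjects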
But when I try to use a Docker 1.9 network bridge (called confluent) with this docker-compose.yml, the Schema Registry won't start:
zookeeper:
  image: confluent/zookeeper
  container_name: zookeeper
  net: confluent
  ports:
    - 2181:2181
kafka:
  image: confluent/kafka
  container_name: kafka
  net: confluent
  ports:
    - 9092:9092
  environment:
    # Use zookeeper:2181, which is defined in the container's /etc/hosts
    KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    # The kafka hostname should point to this container in the other containers' /etc/hosts
    KAFKA_ADVERTISED_HOST_NAME: kafka
schema-registry:
  image: confluent/schema-registry
  container_name: registry
  net: confluent
  ports:
    - 8081:8081
  environment:
    SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: zookeeper:2181
rest-proxy:
  image: confluent/rest-proxy
  container_name: rest-proxy
  net: confluent
  ports:
    - 8082:8082
  environment:
    RP_SCHEMA_REGISTRY_URL: registry:8081
    RP_ZOOKEEPER_CONNECT: zookeeper:2181
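For completeness, the confluent network referenced by net: has to exist before the stack is started (the net: option just passes the name through to the container), so I'm bringing it up along these lines:

  # with Docker 1.9 this creates a user-defined bridge network by default
  docker network create confluent
  docker-compose up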
The logs from the registry container look like this:
Starting registry
Attaching to registry
registry | SLF4J: Class path contains multiple SLF4J bindings.
registry | SLF4J: Found binding in [jar:file:/usr/share/java/confluent-common/slf4j-log4j12-1.7.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
registry | SLF4J: Found binding in [jar:file:/usr/share/java/schema-registry/slf4j-log4j12-1.7.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
registry | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
registry | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
registry | [2015-12-18 17:04:48,636] INFO SchemaRegistryConfig values:
registry | master.eligibility = true
registry | port = 8081
registry | kafkastore.timeout.ms = 500
registry | kafkastore.init.timeout.ms = 60000
registry | debug = false
registry | kafkastore.zk.session.timeout.ms = 30000
registry | request.logger.name = io.confluent.rest-utils.requests
registry | metrics.sample.window.ms = 30000
registry | schema.registry.zk.namespace = schema_registry
registry | kafkastore.topic = _schemas
registry | avro.compatibility.level = backward
registry | shutdown.graceful.ms = 1000
registry | response.mediatype.preferred = [application/vnd.schemaregistry.v1+json, application/vnd.schemaregistry+json, application/json]
registry | metrics.jmx.prefix = kafka.schema.registry
registry | host.name = 2ea69d1754e9
registry | metric.reporters = []
registry | kafkastore.commit.interval.ms = -1
registry | kafkastore.connection.url = zookeeper:2181
registry | metrics.num.samples = 2
registry | response.mediatype.default = application/vnd.schemaregistry.v1+json
registry | kafkastore.topic.replication.factor = 3
registry | (io.confluent.kafka.schemaregistry.rest.SchemaRegistryConfig:135)
registry | [2015-12-18 17:04:49,152] INFO Initialized the consumer offset to -1 (io.confluent.kafka.schemaregistry.storage.KafkaStoreReaderThread:87)
registry | [2015-12-18 17:04:49,807] WARN The replication factor of the schema topic _schemas is less than the desired one of 3. If this is a production environment, it's crucial to add more brokers and increase the replication factor of the topic. (io.confluent.kafka.schemaregistry.storage.KafkaStore:201)
registry | [2015-12-18 17:04:49,886] INFO [kafka-store-reader-thread-_schemas], Starting (io.confluent.kafka.schemaregistry.storage.KafkaStoreReaderThread:68)
registry | [2015-12-18 17:05:49,967] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication:57)
registry | io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryInitializationException: Error initializing kafka store while initializing schema registry
registry | at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:164)
registry | at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:55)
registry | at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.setupResources(SchemaRegistryRestApplication.java:37)
registry | at io.confluent.rest.Application.createServer(Application.java:104)
registry | at io.confluent.kafka.schemaregistry.rest.Main.main(Main.java:42)
registry | Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreInitializationException: io.confluent.kafka.schemaregistry.storage.exceptions.StoreException: Failed to write Noop record to kafka store.
registry | at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:151)
registry | at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.init(KafkaSchemaRegistry.java:162)
registry | ... 4 more
registry | Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreException: Failed to write Noop record to kafka store.
registry | at io.confluent.kafka.schemaregistry.storage.KafkaStore.getLatestOffset(KafkaStore.java:363)
registry | at io.confluent.kafka.schemaregistry.storage.KafkaStore.waitUntilKafkaReaderReachesLastOffset(KafkaStore.java:220)
registry | at io.confluent.kafka.schemaregistry.storage.KafkaStore.init(KafkaStore.java:149)
registry | ... 5 more
registry exited with code 1
And while registry is trying to start, the kafka container logs this a bunch of times:
kafka | [2015-12-18 17:05:50,344] INFO Closing socket connection to /172.19.0.2. (kafka.network.Processor)
(I checked, and 172.19.0.2 was the IP address of the registry container.)
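For reference, I read that address with docker inspect along these lines (the format string is just one way to pull it out of the network settings):

  # print the registry container's address on its user-defined network
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' registry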
And then after it fails, the zookeeper container logs this:
zookeeper | [2015-12-18 17:05:50,344] WARN caught end of stream exception (org.apache.zookeeper.server.NIOServerCnxn)
zookeeper | EndOfStreamException: Unable to read additional data from client sessionid 0x151b60caa6b0004, likely client has closed socket
zookeeper | at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
zookeeper | at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
zookeeper | at java.lang.Thread.run(Thread.java:745)
zookeeper | [2015-12-18 17:05:50,348] INFO Closed socket connection for client /172.19.0.2:38991 which had sessionid 0x151b60caa6b0004 (org.apache.zookeeper.server.NIOServerCnxn)
zookeeper | [2015-12-18 17:05:50,349] WARN caught end of stream exception (org.apache.zookeeper.server.NIOServerCnxn)
zookeeper | EndOfStreamException: Unable to read additional data from client sessionid 0x151b60caa6b0005, likely client has closed socket
zookeeper | at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
zookeeper | at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
zookeeper | at java.lang.Thread.run(Thread.java:745)
zookeeper | [2015-12-18 17:05:50,350] INFO Closed socket connection for client /172.19.0.2:38992 which had sessionid 0x151b60caa6b0005 (org.apache.zookeeper.server.NIOServerCnxn)
zookeeper | [2015-12-18 17:05:56,000] INFO Expiring session 0x151b60caa6b0005, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
zookeeper | [2015-12-18 17:05:56,001] INFO Processed session termination for sessionid: 0x151b60caa6b0005 (org.apache.zookeeper.server.PrepRequestProcessor)
zookeeper | [2015-12-18 17:06:20,000] INFO Expiring session 0x151b60caa6b0004, timeout of 30000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
zookeeper | [2015-12-18 17:06:20,000] INFO Processed session termination for sessionid: 0x151b60caa6b0004 (org.apache.zookeeper.server.PrepRequestProcessor)
I'm OK with using links for now, but it would be nice to get this resolved.
Addendum: I Googled "Failed to write Noop record to kafka store" and found some issues on the schema-registry repo which mention it, but none of the workarounds there made any difference in this case.
@markrendle This looks like a limitation in docker-compose. See https://docs.docker.com/compose/faq/#how-do-i-get-compose-to-wait-for-my-database-to-be-ready-before-starting-my-application
What's happening (based on how I saw it run locally) is that the schema-registry is started before the services it depends on are ready (other services also start out of order, e.g. the rest-proxy image was pulled and its container started first).
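That FAQ entry basically suggests wrapping each container's command in a script that waits for its dependencies before starting the real process. A minimal sketch of that approach for the schema-registry container (the nc check and the /usr/bin/schema-registry-start path are assumptions for illustration, not something the current images ship):

  #!/bin/sh
  # wait-for-zookeeper.sh -- hypothetical wrapper entrypoint
  # Block until ZooKeeper accepts connections, then exec the real start script.
  until nc -z zookeeper 2181; do
    echo "waiting for zookeeper:2181..." >&2
    sleep 2
  done
  exec /usr/bin/schema-registry-start /etc/schema-registry/schema-registry.properties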
Unfortunately, docker-compose seems to assume that it exists in a vacuum, i.e. that nothing else in production (init daemons, YARN/Mesos/similar) would manage health checks and restart failed processes, and in turn that you never want to treat a non-zero exit code as a useful indication of errors... The version with links works because it correctly defines the dependencies, and for historical reasons docker-compose has to respect them.
To make this work reliably with docker-compose, I think we'd need to run everything in these docker-images under an init daemon configured to always restart processes that die, which is the behavior docker-compose seems to assume.
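As a rough illustration of that idea, the crudest version is an entrypoint that simply restarts the process whenever it exits (again only a sketch; the start command path is assumed, not taken from the image):

  #!/bin/sh
  # crude stand-in for an init daemon: restart the process whenever it exits
  while true; do
    /usr/bin/schema-registry-start /etc/schema-registry/schema-registry.properties
    echo "schema-registry exited with status $?; restarting in 5 seconds" >&2
    sleep 5
  done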
That said, even extending the timeouts by setting

  SCHEMA_REGISTRY_KAFKASTORE_INIT_TIMEOUT_MS: 300000
  SCHEMA_REGISTRY_KAFKASTORE_TIMEOUT_MS: 300000

under the schema-registry entry in the docker-compose.yml does not fix the problem. It looks like the Kafka broker startup might be getting interrupted somehow by a shutdown request (although the controller resignation is a bit confusing, as there is only a single Kafka broker):
kafka | [2015-12-22 05:23:56,009] ERROR Error handling event ZkEvent[Data of /controller changed sent to kafka.server.ZookeeperLeaderElector$LeaderChangeListener@7e768ee6] (org.I0Itec.zkclient.ZkEventThread)
kafka | java.lang.IllegalStateException: Kafka scheduler has not been started
kafka | at kafka.utils.KafkaScheduler.ensureStarted(KafkaScheduler.scala:114)
kafka | at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:86)
kafka | at kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:350)
kafka | at kafka.controller.KafkaController$$anonfun$2.apply$mcV$sp(KafkaController.scala:162)
kafka | at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply$mcZ$sp(ZookeeperLeaderElector.scala:138)
kafka | at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:134)
kafka | at kafka.server.ZookeeperLeaderElector$LeaderChangeListener$$anonfun$handleDataDeleted$1.apply(ZookeeperLeaderElector.scala:134)
kafka | at kafka.utils.Utils$.inLock(Utils.scala:535)
kafka | at kafka.server.ZookeeperLeaderElector$LeaderChangeListener.handleDataDeleted(ZookeeperLeaderElector.scala:134)
kafka | at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:549)
kafka | at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
kafka | [2015-12-22 05:23:56,063] ERROR [KafkaApi-0] error when handling request Name: TopicMetadataRequest; Version: 0; CorrelationId: 48; ClientId: producer-1; Topics: _schemas (kafka.server.KafkaApis)
kafka | kafka.admin.AdminOperationException: replication factor: 1 larger than available brokers: 0
kafka | at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:70)
kafka | at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:171)
kafka | at kafka.server.KafkaApis$$anonfun$19.apply(KafkaApis.scala:520)
kafka | at kafka.server.KafkaApis$$anonfun$19.apply(KafkaApis.scala:503)
kafka | at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
kafka | at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
kafka | at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
kafka | at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
kafka | at scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
kafka | at scala.collection.SetLike$class.map(SetLike.scala:93)
kafka | at scala.collection.AbstractSet.map(Set.scala:47)
kafka | at kafka.server.KafkaApis.getTopicMetadata(KafkaApis.scala:503)
kafka | at kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:542)
kafka | at kafka.server.KafkaApis.handle(KafkaApis.scala:62)
kafka | at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:59)
kafka | at java.lang.Thread.run(Thread.java:745)
I don't have any concrete next steps to suggest yet, aside from adjusting the different services to wait long enough for their dependencies to start up (which I'm not sure will work reliably if the images have to be pulled first too...), but I figured I'd at least document some initial findings.