Add helm chart for Kubernetes integration
Per discussion in #13, I think it makes a lot of sense to run oklog ingest alongside logspout as a DaemonSet, with oklog store running as a deployment. I'll put together a helm chart in a few weeks for easy k8s setup.
SGTM!
@kminehart
I'm trying to implement #32 and I'm a bit lost on how the clustering would work. Would each pod just set the "replication-factor" to "1" and throw them behind a service? I imagine that oklog does a lot more than that though.
You definitely can't just set replication-factor to 1 — the end user needs to provide that parameter explicitly, based on their operational requirements.
With regards to clustering: each node needs to start up with a -peer flag pointing to a routable host:port of at least 1 other node in the cluster, and ideally all nodes in the cluster (including itself is fine). They need to be the direct, routable host:ports of the other nodes, taken from some form of service discovery; pointing each node at a shared address, e.g. a service IP, won't cut it. If you know all the host:ports a priori and can start every node with the same set of -peers, that's best; if you only find out about them as the nodes start up in sequence, that's also acceptable; here's a bash example.
I'm not sure what the state of the art is in Kubernetes Land for this sort of thing; my instinct would be to look at how folks are deploying e.g. Consul clusters, as they have a similar clustering mechanism. A StatefulSet, perhaps?
Hmm. That should give me enough to work with.
I have an idea; I'll post back here soon.
I take that back. I don't think I know enough to do it correctly.
And by correctly I mean with a single configuration for oklog.
I'll try to explain myself; I'm not completely familiar with oklog, and definitely not a Kubernetes expert, but here's what I'm running into:
With regards to clustering: each node needs to start up with a -peer flag pointing to a routable host:port of at least 1 other node in the cluster, and ideally all nodes in the cluster (including itself is fine).
So this would mean, to start a cluster for oklog it'd look like this (with a StatefulSet as you suggested):
# Each DaemonSet (process that runs per-node in the cluster)
oklog ingest -peer oklog01.oklog.default -peer oklog02.oklog.default -peer oklog03.oklog.default
# Each StatefulSet (individual oklog storage service: oklog01, oklog02...)
oklog store -peer oklog01.oklog.default -peer oklog02.oklog.default -peer oklog03.oklog.default
And then if I were to shrink the number of replicas, I would have to remove the corresponding command line arguments and ensure that each -peer points to an active peer; and if I were to increase the number of replicas, I would have to add additional -peer flags. If my understanding is correct, increasing and decreasing the number of replicas involves restarting each pod.
Am I looking at it wrong here?
Note that OK Log ingest and store nodes should join the same OK Log cluster. So, if you want to deploy an OK Log ingest node to each node in the Kubernetes cluster, and an OK Log store node to a subset of nodes in the Kubernetes cluster, that's fine, but they should have the same set of -peer addresses.
…increasing and decreasing the number of replicas involves restarting each pod.
As a best practice, yes, this is correct. This shouldn't be a problem, because an OK Log cluster shouldn't be resized very often, unlike a stateless web app, for example. Edit: I believe the ConfigMap resource may help here.
With that said, it's worth noting that OK Log instances can survive if their configured -peers aren't 100% correct. For example, if you start up a cluster on 3 nodes,
a~$ oklog ingeststore -peer a -peer b -peer c ...
b~$ oklog ingeststore -peer a -peer b -peer c ...
c~$ oklog ingeststore -peer a -peer b -peer c ...
and then later kill one of the nodes and start another one, like this
c~$ shutdown -h now
d~$ oklog ingeststore -peer a -peer b -peer d
then things will still work: the new node d will connect to peers a and b, and they will all update their configuration to know about each other; nodes a and b will gossip node c's address to node d, and they will all continuously try to reconnect to it—but they will work just fine in its absence.
Of course, if you use this strategy to cycle an entire cluster's nodes, or even do it repeatedly, you're gonna have a bad time :)
---
How does Consul on Kubernetes handle cluster reconfiguration? Or e.g. Minio? What are some other clustered applications that have Helm charts, how do they work?
Minio is incredibly well documented. I actually spent a good part of yesterday setting up their cluster locally, and it's fantastic.
So if I understand correctly, Minio works very similarly to OK Log. When you install the helm chart, you specify the size of the cluster; that creates a StatefulSet with that many replicas, and then, just like oklog, each node is pointed at the other nodes in the cluster using the stable identifiers the StatefulSet assigns.
I don't think it's possible to dynamically resize Minio with kubectl scale either, so maybe for tools like Minio and OK Log it's not as big a deal.
Thank you for all of the information. I'll see if I can get a PR in soon. :)
Here's the Minio documentation for Kubernetes. https://docs.minio.io/docs/deploy-minio-on-kubernetes
I'll try to make a chart of that quality.
Ok, so we've made some progress.
I'm trying to use a StatefulSet as you suggested; at helm install time you specify the number of replicas.
From there, a ConfigMap is used to create a script which basically calls /oklog ingeststore -peer <ip of oklog-0> -peer <ip of oklog-1> ...
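A minimal sketch of what such a startup script could look like, assuming a headless service named `oklog` in the `default` namespace and a `REPLICAS` variable injected when the chart is rendered (all names here are hypothetical, not taken from the actual chart):

```shell
#!/bin/sh
# Build one -peer flag per StatefulSet replica, using the stable
# per-pod DNS names that a headless service provides.
REPLICAS="${REPLICAS:-3}"
PEERS=""
i=0
while [ "$i" -lt "$REPLICAS" ]; do
    PEERS="$PEERS -peer oklog-$i.oklog.default.svc.cluster.local:7659"
    i=$((i + 1))
done
# The real script would end with something like:
#   exec /oklog ingeststore $PEERS ...
echo "$PEERS"
```

Because the peer list is baked in when the chart is rendered, changing the replica count means re-rendering the script and restarting the pods, which matches the restart behavior discussed earlier.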
The first one loads up just fine.
ts=2017-04-05T17:32:28.469816853Z level=info cluster=0.0.0.0:7659
ts=2017-04-05T17:32:28.469862534Z level=warn err="this node advertises itself on an unroutable address" addr=0.0.0.0
ts=2017-04-05T17:32:28.469873452Z level=warn err="this node will be unreachable in the cluster"
ts=2017-04-05T17:32:28.469880591Z level=warn err="provide -cluster as a routable IP address or hostname"
ts=2017-04-05T17:32:28.469958346Z level=info fast=tcp://0.0.0.0:7651
ts=2017-04-05T17:32:28.469977234Z level=info durable=tcp://0.0.0.0:7652
ts=2017-04-05T17:32:28.469991969Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-04-05T17:32:28.470006996Z level=info API=tcp://0.0.0.0:7650
ts=2017-04-05T17:32:28.470327106Z level=info ingest_path=data/ingest
ts=2017-04-05T17:32:28.470409004Z level=info store_path=data/store
ts=2017-04-05T17:32:28.472215054Z level=debug component=cluster Join=1
ts=2017-04-05T17:32:38.472827806Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:32:43.472757648Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:32:48.473905345Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:32:53.473044826Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:32:58.472775018Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:33:03.473503215Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:33:18.472686551Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:33:23.472747874Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:33:28.473306342Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:33:33.472761258Z level=warn component=cluster NumMembers=1
And then the second one enters a "Terminating" status, and I can't seem to identify the reason.
ts=2017-04-05T17:35:16.736540587Z level=info cluster=0.0.0.0:7659
ts=2017-04-05T17:35:16.736574649Z level=warn err="this node advertises itself on an unroutable address" addr=0.0.0.0
ts=2017-04-05T17:35:16.736581382Z level=warn err="this node will be unreachable in the cluster"
ts=2017-04-05T17:35:16.736584682Z level=warn err="provide -cluster as a routable IP address or hostname"
ts=2017-04-05T17:35:16.736626044Z level=info fast=tcp://0.0.0.0:7651
ts=2017-04-05T17:35:16.736638483Z level=info durable=tcp://0.0.0.0:7652
ts=2017-04-05T17:35:16.736647877Z level=info bulk=tcp://0.0.0.0:7653
ts=2017-04-05T17:35:16.736657079Z level=info API=tcp://0.0.0.0:7650
ts=2017-04-05T17:35:16.736822975Z level=info ingest_path=data/ingest
ts=2017-04-05T17:35:16.736906303Z level=info store_path=data/store
ts=2017-04-05T17:35:16.739414654Z level=debug component=cluster Join=1
ts=2017-04-05T17:35:26.739955166Z level=warn component=cluster NumMembers=1
ts=2017-04-05T17:35:31.742763171Z level=warn component=cluster NumMembers=1
There doesn't seem to be any output specific to the second pod that I would think would cause it to terminate... I'll have to keep poking around.
Check out those level=warn messages earlier in the output.
I wouldn't think that a few warnings would prevent the cluster from starting completely. I've modified the startup script to point -cluster at the pod itself:
» kubectl logs happy-nightingale-oklog-0
ts=2017-04-05T19:55:31.63240948Z level=info cluster=happy-nightingale-oklog-0.happy-nightingale-oklog.default.svc.cluster.local:1009
ts=2017-04-05T19:55:31.632902695Z level=info fast=tcp://0.0.0.0:1001
ts=2017-04-05T19:55:31.632964907Z level=info durable=tcp://0.0.0.0:1002
ts=2017-04-05T19:55:31.633084756Z level=info bulk=tcp://0.0.0.0:1003
ts=2017-04-05T19:55:31.633249335Z level=info API=tcp://0.0.0.0:1000
ts=2017-04-05T19:55:31.633610319Z level=info ingest_path=data/ingest
ts=2017-04-05T19:55:31.634346458Z level=info store_path=data/store
ts=2017-04-05T19:55:31.685453946Z level=debug component=cluster Join=1
ts=2017-04-05T19:55:41.689037431Z level=warn component=cluster NumMembers=1
ts=2017-04-05T19:55:46.687560851Z level=warn component=cluster NumMembers=1
» kubectl logs happy-nightingale-oklog-1
ts=2017-04-05T19:55:34.010837024Z level=info cluster=happy-nightingale-oklog-1.happy-nightingale-oklog.default.svc.cluster.local:1009
ts=2017-04-05T19:55:34.011087201Z level=info fast=tcp://0.0.0.0:1001
ts=2017-04-05T19:55:34.011146245Z level=info durable=tcp://0.0.0.0:1002
ts=2017-04-05T19:55:34.01117196Z level=info bulk=tcp://0.0.0.0:1003
ts=2017-04-05T19:55:34.01121508Z level=info API=tcp://0.0.0.0:1000
ts=2017-04-05T19:55:34.011670149Z level=info ingest_path=data/ingest
ts=2017-04-05T19:55:34.013837619Z level=info store_path=data/store
ts=2017-04-05T19:55:34.075413221Z level=debug component=cluster Join=2
ts=2017-04-05T19:55:44.07655387Z level=warn component=cluster NumMembers=1
So to me it looks like the process itself is working fine. No more warnings, no errors... yet the pods for oklog don't seem to spin up correctly.
» kubectl get pods
happy-nightingale-oklog-0 1/1 Running 0 3m
happy-nightingale-oklog-1 1/1 Terminating 0 1m
Progress is being made though. 👍
This is probably where someone more experienced in Kubernetes than I am could chime in.
See #55 for my helm chart so far.
For the startup script, https://github.com/oklog/oklog/pull/55/files#diff-e634cba63c2ce895cd4df8e8e587927b
Update:
The 4 oklog nodes are now up and running, as is Logspout. Logspout is sending data to the oklog instances, at least to an extent.
counter 1/1 Running 0 10m
winning-cricket-oklog-0 1/1 Running 0 15m
winning-cricket-oklog-1 1/1 Running 0 15m
winning-cricket-oklog-2 1/1 Running 0 15m
winning-cricket-oklog-3 1/1 Running 0 15m
winning-cricket-oklog-logspout-ms73q 1/1 Running 2 15m
~ » kubectl port-forward winning-cricket-oklog-3 3000 3001 3002 3003 3009
...
~ » oklog query -store localhost:3000 -from 5m
378: Wed Apr 5 21:57:48 UTC 2017
379: Wed Apr 5 21:57:50 UTC 2017
ts=2017-04-05T21:57:50.126983157Z level=warn component=cluster NumMembers=1
ts=2017-04-05T21:57:50.140165456Z level=warn component=cluster NumMembers=1
...
However, for the other nodes
~ » oklog query -store localhost:3000 -from 5m
They come up empty. To me this means that the other nodes aren't working properly?
I'm realizing now that they are identifying their peers.
I'm going to take a break for the day.
Here's where I'm stuck:
The instances of oklog are not finding each other.
They exist on flabby-kangaroo-oklog-0, flabby-kangaroo-oklog-1, flabby-kangaroo-oklog-2...
They all start up like this (from 0...N-1):
/oklog ingeststore \
-api tcp://0.0.0.0:3000 \
-ingest.fast tcp://0.0.0.0:3001 \
-ingest.durable tcp://0.0.0.0:3002 \
-ingest.bulk tcp://0.0.0.0:3003 \
-cluster tcp://flabby-kangaroo-oklog-0.flabby-kangaroo-oklog.default.svc.cluster.local:3009 \
-store.segment-target-size 1000000 \
-store.segment-replication-factor 2 \
-store.segment-retain 30m \
-store.segment-purge 5m \
-peer flabby-kangaroo-oklog-0.flabby-kangaroo-oklog.default.svc.cluster.local:3009 \
-peer flabby-kangaroo-oklog-1.flabby-kangaroo-oklog.default.svc.cluster.local:3009 \
-peer flabby-kangaroo-oklog-2.flabby-kangaroo-oklog.default.svc.cluster.local:3009 \
...
I have confirmed that the hostnames are all reachable from each pod, and that each pod can reach itself the same way.
Here's a portion of the logs from the first node (they're all basically the same):
ts=2017-04-05T23:00:43.213663004Z level=warn component=cluster NumMembers=1
ts=2017-04-05T23:00:43.414480668Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-05T23:00:44.514042221Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-05T23:00:45.514299415Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-05T23:00:46.514459555Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-05T23:00:47.613729661Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-05T23:00:48.21366717Z level=warn component=cluster NumMembers=1
ts=2017-04-05T23:00:48.614119205Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
So I'm not too sure what to try here. But I can tell we're getting close.
Also worth mentioning: Each node seems to get this right:
node 1:
ts=2017-04-05T23:00:18.21318931Z level=debug component=cluster Join=1
node 2:
ts=2017-04-05T23:00:18.21318931Z level=debug component=cluster Join=2
node 3:
ts=2017-04-05T23:00:18.21318931Z level=debug component=cluster Join=3
etc.
We've also tried setting the -cluster to tcp://0.0.0.0:3009 with no luck.
Your thoughts?
Setting -cluster to 0.0.0.0 will always fail, so you can strike that.
It's encouraging that each node sees an incrementing Join=n number as they boot up. But something is going wrong later on:
ts=… level=warn component=cluster NumMembers=1
ts=… level=warn component=Consumer state=gather replication_factor=2
available_peers=1 err="replication currently impossible"
This is saying that the node only sees itself in the cluster (component=cluster NumMembers=1, and component=Consumer available_peers=1), and that since the replication factor is 2, there's no way it can successfully replicate any segments, so it's not even going to try.
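In other words, with R the replication factor and N the peers a node can currently see, replication is attempted only when N >= R. A trivial sketch of that check (the function name is mine, not oklog's):

```shell
#!/bin/sh
# Replication is only possible when the number of reachable peers
# is at least the replication factor ($1 = available peers, $2 = factor).
can_replicate() {
    [ "$1" -ge "$2" ]
}
if can_replicate 1 2; then
    echo "replication possible"
else
    echo "replication currently impossible"
fi
```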
This is very similar behavior to #51, which ended up being a problem with the given cluster address resolving to the IPv6 loopback address ::. As a quick test, is it possible to supply IP addresses instead of hostnames? Otherwise, the next step in debugging would be to use current master e926a9e and pass the new -debug flag to each node. Is that possible?
As a quick test, is it possible to supply IP addresses instead of hostnames?
I don't believe that it is. The helm configuration files are generated when you run helm install, which happens before the pods come up. We could try resolving the hostnames to IP addresses in the startup script and calling oklog with those. I'll investigate that.
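One way to try that, assuming the container image ships `getent` (the commented-out hostname below is hypothetical), is to resolve each peer name to an IPv4 address inside the startup script before invoking oklog:

```shell
#!/bin/sh
# Resolve a hostname to its first IPv4 address; asking for ahostsv4
# specifically avoids accidentally picking up an IPv6 answer.
resolve_v4() {
    getent ahostsv4 "$1" | awk '{ print $1; exit }'
}
# The real script would loop over every peer hostname, e.g.:
#   PEER_IP=$(resolve_v4 oklog-0.oklog.default.svc.cluster.local)
#   PEER_FLAGS="$PEER_FLAGS -peer $PEER_IP:3009"
resolve_v4 localhost
```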
Otherwise, the next step in debugging would be to use current master e926a9e and pass the new -debug flag to each node. Is that possible?
Definitely. I'll make a new docker image and post my update here.
I've recently cut v0.2.1 so you might prefer that instead.
Each node runs this:
/oklog ingeststore \
-api tcp://0.0.0.0:3000 \
-ingest.fast tcp://0.0.0.0:3001 \
-ingest.durable tcp://0.0.0.0:3002 \
-ingest.bulk tcp://0.0.0.0:3003 \
-debug \
-cluster tcp://$HOSTNAME.quaffing-chipmunk-oklog.default.svc.cluster.local:3009 \
-store.segment-target-size 1000000 \
-store.segment-replication-factor 2 \
-store.segment-retain 30m \
-store.segment-purge 5m \
-peer quaffing-chipmunk-oklog-0.quaffing-chipmunk-oklog.default.svc.cluster.local:3009 \
-peer quaffing-chipmunk-oklog-1.quaffing-chipmunk-oklog.default.svc.cluster.local:3009 \
-peer quaffing-chipmunk-oklog-2.quaffing-chipmunk-oklog.default.svc.cluster.local:3009 \
-peer quaffing-chipmunk-oklog-3.quaffing-chipmunk-oklog.default.svc.cluster.local:3009 \
flabby-kangaroo-oklog-0
ts=2017-04-06T12:58:53.084757996Z level=info cluster=flabby-kangaroo-oklog-0.flabby-kangaroo-oklog.default.svc.cluster.local:3009
ts=2017-04-06T12:58:53.084823743Z level=info fast=tcp://0.0.0.0:3001
ts=2017-04-06T12:58:53.08483711Z level=info durable=tcp://0.0.0.0:3002
ts=2017-04-06T12:58:53.084847049Z level=info bulk=tcp://0.0.0.0:3003
ts=2017-04-06T12:58:53.084860247Z level=info API=tcp://0.0.0.0:3000
ts=2017-04-06T12:58:53.084998933Z level=info ingest_path=data/ingest
ts=2017-04-06T12:58:53.085054296Z level=info store_path=data/store
ts=2017-04-06T12:59:08.094629104Z level=debug component=cluster Join=3
ts=2017-04-06T12:59:24.195151787Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
flabby-kangaroo-oklog-1
~/Work/oklog(master*) » kubectl logs flabby-kangaroo-oklog-1
ts=2017-04-06T12:58:52.178098093Z level=info cluster=flabby-kangaroo-oklog-1.flabby-kangaroo-oklog.default.svc.cluster.local:3009
ts=2017-04-06T12:58:52.178347589Z level=info fast=tcp://0.0.0.0:3001
ts=2017-04-06T12:58:52.178375275Z level=info durable=tcp://0.0.0.0:3002
ts=2017-04-06T12:58:52.178423129Z level=info bulk=tcp://0.0.0.0:3003
ts=2017-04-06T12:58:52.178444504Z level=info API=tcp://0.0.0.0:3000
ts=2017-04-06T12:58:52.178592333Z level=info ingest_path=data/ingest
ts=2017-04-06T12:58:52.178686867Z level=info store_path=data/store
ts=2017-04-06T12:59:07.220475222Z level=debug component=cluster Join=3
ts=2017-04-06T12:59:20.323192333Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:20.323209775Z level=warn component=Consumer state=replicate err="failed to fully replicate" want=2 have=1
ts=2017-04-06T12:59:21.225124729Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:21.228453949Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:21.228580223Z level=warn component=Consumer state=replicate err="failed to fully replicate" want=2 have=0
ts=2017-04-06T12:59:22.125780363Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:22.128774144Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:22.128834248Z level=warn component=Consumer state=replicate err="failed to fully replicate" want=2 have=0
ts=2017-04-06T12:59:23.128187924Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:23.128378275Z level=warn component=Consumer state=replicate err="failed to fully replicate" want=2 have=1
ts=2017-04-06T12:59:24.126886054Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:24.128264329Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:24.128345429Z level=warn component=Consumer state=replicate err="failed to fully replicate" want=2 have=0
ts=2017-04-06T12:59:25.126495425Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:25.131582453Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T12:59:25.13539062Z level=warn component=Consumer state=replicate err="failed to fully replicate" want=2 have=0
ts=2017-04-06T12:59:26.224087757Z level=warn component=Consumer state=replicate replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T12:59:26.422569546Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T12:59:27.2231378Z level=warn component=cluster NumMembers=1
flabby-kangaroo-oklog-2
~/Work/oklog(master*) » kubectl logs flabby-kangaroo-oklog-2
ts=2017-04-06T12:58:54.619023275Z level=info cluster=flabby-kangaroo-oklog-2.flabby-kangaroo-oklog.default.svc.cluster.local:3009
ts=2017-04-06T12:58:54.619101424Z level=info fast=tcp://0.0.0.0:3001
ts=2017-04-06T12:58:54.619153748Z level=info durable=tcp://0.0.0.0:3002
ts=2017-04-06T12:58:54.619170489Z level=info bulk=tcp://0.0.0.0:3003
ts=2017-04-06T12:58:54.61919614Z level=info API=tcp://0.0.0.0:3000
ts=2017-04-06T12:58:54.619885324Z level=info ingest_path=data/ingest
ts=2017-04-06T12:58:54.619995743Z level=info store_path=data/store
ts=2017-04-06T12:59:04.630119568Z level=debug component=cluster Join=3
ts=2017-04-06T12:59:27.631406482Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T12:59:28.631733112Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T12:59:29.631592417Z level=warn component=cluster NumMembers=1
flabby-kangaroo-oklog-3
ts=2017-04-06T12:58:53.771527327Z level=info cluster=flabby-kangaroo-oklog-3.flabby-kangaroo-oklog.default.svc.cluster.local:3009
ts=2017-04-06T12:58:53.774168565Z level=info fast=tcp://0.0.0.0:3001
ts=2017-04-06T12:58:53.774214864Z level=info durable=tcp://0.0.0.0:3002
ts=2017-04-06T12:58:53.774229707Z level=info bulk=tcp://0.0.0.0:3003
ts=2017-04-06T12:58:53.774245126Z level=info API=tcp://0.0.0.0:3000
ts=2017-04-06T12:58:53.774382701Z level=info ingest_path=data/ingest
ts=2017-04-06T12:58:53.77443604Z level=info store_path=data/store
ts=2017-04-06T12:59:08.812437518Z level=debug component=cluster Join=3
ts=2017-04-06T12:59:32.918334606Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T12:59:33.817699774Z level=warn component=cluster NumMembers=1
Interesting to note that flabby-kangaroo-oklog-1 has significantly different output than the rest:
target=[::]:3000
That's IPv6, isn't it? In which case, #51 would be very relevant. I'll read up some more on that solution then.
Whoops. Those were pods that were left on the cluster overnight.
Output of 0:
ts=2017-04-06T13:09:18.092922182Z level=info cluster=ignorant-bumblebee-oklog-0.ignorant-bumblebee-oklog.default.svc.cluster.local:3009
ts=2017-04-06T13:09:18.09315329Z level=info fast=tcp://0.0.0.0:3001
ts=2017-04-06T13:09:18.093183459Z level=info durable=tcp://0.0.0.0:3002
ts=2017-04-06T13:09:18.093206234Z level=info bulk=tcp://0.0.0.0:3003
ts=2017-04-06T13:09:18.093224321Z level=info API=tcp://0.0.0.0:3000
ts=2017-04-06T13:09:18.09336701Z level=info ingest_path=data/ingest
ts=2017-04-06T13:09:18.093435766Z level=info store_path=data/store
ts=2017-04-06T13:09:18.093449841Z level=debug component=cluster cluster_addr=ignorant-bumblebee-oklog-0.ignorant-bumblebee-oklog.default.svc.cluster.local cluster_port=3009 ParseIP=<nil>
ts=2017-04-06T13:09:18.093663038Z level=debug component=cluster received=NotifyJoin node=e4ae6a09-4bfd-4075-b801-36ceb83c2af9 addr=:::3009
ts=2017-04-06T13:09:18.122420683Z level=debug component=cluster Join=1
ts=2017-04-06T13:09:18.22409012Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T13:09:19.224373954Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T13:09:20.157107051Z level=debug component=cluster received=NotifyJoin node=1e9f6c72-79b3-4aea-9d58-671fd7e60df3 addr=:::3009
ts=2017-04-06T13:09:22.486003713Z level=debug component=cluster received=NotifyJoin node=690b8fb3-1e34-4673-ad9b-02eeefecb842 addr=:::3009
ts=2017-04-06T13:09:24.666278341Z level=debug component=cluster received=NotifyJoin node=48c3c1bf-a27f-4a6a-88ef-738c5ae6b2cd addr=:::3009
ts=2017-04-06T13:09:27.09511008Z level=debug component=cluster received=NotifyLeave node=1e9f6c72-79b3-4aea-9d58-671fd7e60df3 addr=:::3009
ts=2017-04-06T13:09:30.095697488Z level=debug component=cluster received=NotifyLeave node=690b8fb3-1e34-4673-ad9b-02eeefecb842 addr=:::3009
ts=2017-04-06T13:09:33.93334628Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T13:09:33.933453525Z level=warn component=Consumer state=replicate err="failed to fully replicate" want=2 have=1
ts=2017-04-06T13:09:34.826031911Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T13:09:34.828391129Z level=warn component=Consumer state=replicate target=[::]:3000 during=/replicate got="500 Internal Server Error"
ts=2017-04-06T13:09:34.828553489Z level=warn component=Consumer state=replicate err="failed to fully replicate" want=2 have=0
Output of 1:
ts=2017-04-06T13:09:20.154391526Z level=info cluster=ignorant-bumblebee-oklog-1.ignorant-bumblebee-oklog.default.svc.cluster.local:3009
ts=2017-04-06T13:09:20.154598406Z level=info fast=tcp://0.0.0.0:3001
ts=2017-04-06T13:09:20.154622638Z level=info durable=tcp://0.0.0.0:3002
ts=2017-04-06T13:09:20.154642731Z level=info bulk=tcp://0.0.0.0:3003
ts=2017-04-06T13:09:20.154659663Z level=info API=tcp://0.0.0.0:3000
ts=2017-04-06T13:09:20.154783408Z level=info ingest_path=data/ingest
ts=2017-04-06T13:09:20.154839958Z level=info store_path=data/store
ts=2017-04-06T13:09:20.154852773Z level=debug component=cluster cluster_addr=ignorant-bumblebee-oklog-1.ignorant-bumblebee-oklog.default.svc.cluster.local cluster_port=3009 ParseIP=<nil>
ts=2017-04-06T13:09:20.155040605Z level=debug component=cluster received=NotifyJoin node=1e9f6c72-79b3-4aea-9d58-671fd7e60df3 addr=:::3009
ts=2017-04-06T13:09:20.157560536Z level=debug component=cluster received=NotifyJoin node=e4ae6a09-4bfd-4075-b801-36ceb83c2af9 addr=:::3009
ts=2017-04-06T13:09:20.195293494Z level=debug component=cluster Join=2
ts=2017-04-06T13:09:22.503838636Z level=debug component=cluster received=NotifyJoin node=690b8fb3-1e34-4673-ad9b-02eeefecb842 addr=:::3009
ts=2017-04-06T13:09:24.673958088Z level=debug component=cluster received=NotifyJoin node=48c3c1bf-a27f-4a6a-88ef-738c5ae6b2cd addr=:::3009
ts=2017-04-06T13:09:27.162791402Z level=debug component=cluster received=NotifyLeave node=e4ae6a09-4bfd-4075-b801-36ceb83c2af9 addr=:::3009
ts=2017-04-06T13:09:30.155854225Z level=debug component=cluster received=NotifyLeave node=690b8fb3-1e34-4673-ad9b-02eeefecb842 addr=:::3009
ts=2017-04-06T13:09:37.156561018Z level=debug component=cluster received=NotifyLeave node=48c3c1bf-a27f-4a6a-88ef-738c5ae6b2cd addr=:::3009
ts=2017-04-06T13:09:37.295689114Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T13:09:38.295802597Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
output of 2:
ts=2017-04-06T13:09:22.482060864Z level=info cluster=ignorant-bumblebee-oklog-2.ignorant-bumblebee-oklog.default.svc.cluster.local:3009
ts=2017-04-06T13:09:22.482149063Z level=info fast=tcp://0.0.0.0:3001
ts=2017-04-06T13:09:22.482230978Z level=info durable=tcp://0.0.0.0:3002
ts=2017-04-06T13:09:22.482267619Z level=info bulk=tcp://0.0.0.0:3003
ts=2017-04-06T13:09:22.482296722Z level=info API=tcp://0.0.0.0:3000
ts=2017-04-06T13:09:22.482520717Z level=info ingest_path=data/ingest
ts=2017-04-06T13:09:22.482648391Z level=info store_path=data/store
ts=2017-04-06T13:09:22.482673659Z level=debug component=cluster cluster_addr=ignorant-bumblebee-oklog-2.ignorant-bumblebee-oklog.default.svc.cluster.local cluster_port=3009 ParseIP=<nil>
ts=2017-04-06T13:09:22.483041613Z level=debug component=cluster received=NotifyJoin node=690b8fb3-1e34-4673-ad9b-02eeefecb842 addr=:::3009
ts=2017-04-06T13:09:22.48640225Z level=debug component=cluster received=NotifyJoin node=e4ae6a09-4bfd-4075-b801-36ceb83c2af9 addr=:::3009
ts=2017-04-06T13:09:22.504078124Z level=debug component=cluster received=NotifyJoin node=1e9f6c72-79b3-4aea-9d58-671fd7e60df3 addr=:::3009
ts=2017-04-06T13:09:22.511276828Z level=debug component=cluster Join=3
ts=2017-04-06T13:09:24.687986785Z level=debug component=cluster received=NotifyJoin node=48c3c1bf-a27f-4a6a-88ef-738c5ae6b2cd addr=:::3009
ts=2017-04-06T13:09:27.504713025Z level=debug component=cluster received=NotifyLeave node=e4ae6a09-4bfd-4075-b801-36ceb83c2af9 addr=:::3009
ts=2017-04-06T13:09:29.488021585Z level=debug component=cluster received=NotifyLeave node=1e9f6c72-79b3-4aea-9d58-671fd7e60df3 addr=:::3009
ts=2017-04-06T13:09:35.489549623Z level=debug component=cluster received=NotifyLeave node=48c3c1bf-a27f-4a6a-88ef-738c5ae6b2cd addr=:::3009
output of 3:
ts=2017-04-06T13:09:24.653716885Z level=info cluster=ignorant-bumblebee-oklog-3.ignorant-bumblebee-oklog.default.svc.cluster.local:3009
ts=2017-04-06T13:09:24.653800842Z level=info fast=tcp://0.0.0.0:3001
ts=2017-04-06T13:09:24.653835078Z level=info durable=tcp://0.0.0.0:3002
ts=2017-04-06T13:09:24.653850386Z level=info bulk=tcp://0.0.0.0:3003
ts=2017-04-06T13:09:24.653863782Z level=info API=tcp://0.0.0.0:3000
ts=2017-04-06T13:09:24.654002201Z level=info ingest_path=data/ingest
ts=2017-04-06T13:09:24.65410395Z level=info store_path=data/store
ts=2017-04-06T13:09:24.6541364Z level=debug component=cluster cluster_addr=ignorant-bumblebee-oklog-3.ignorant-bumblebee-oklog.default.svc.cluster.local cluster_port=3009 ParseIP=<nil>
ts=2017-04-06T13:09:24.654344993Z level=debug component=cluster received=NotifyJoin node=48c3c1bf-a27f-4a6a-88ef-738c5ae6b2cd addr=:::3009
ts=2017-04-06T13:09:24.666619682Z level=debug component=cluster received=NotifyJoin node=690b8fb3-1e34-4673-ad9b-02eeefecb842 addr=:::3009
ts=2017-04-06T13:09:24.666685616Z level=debug component=cluster received=NotifyJoin node=e4ae6a09-4bfd-4075-b801-36ceb83c2af9 addr=:::3009
ts=2017-04-06T13:09:24.674298139Z level=debug component=cluster received=NotifyJoin node=1e9f6c72-79b3-4aea-9d58-671fd7e60df3 addr=:::3009
ts=2017-04-06T13:09:24.695206152Z level=debug component=cluster Join=4
ts=2017-04-06T13:09:29.67495384Z level=debug component=cluster received=NotifyLeave node=e4ae6a09-4bfd-4075-b801-36ceb83c2af9 addr=:::3009
ts=2017-04-06T13:09:29.692920224Z level=debug component=cluster received=NotifyLeave node=1e9f6c72-79b3-4aea-9d58-671fd7e60df3 addr=:::3009
ts=2017-04-06T13:09:31.65703115Z level=debug component=cluster received=NotifyLeave node=690b8fb3-1e34-4673-ad9b-02eeefecb842 addr=:::3009
ts=2017-04-06T13:09:31.696547909Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T13:09:32.696885913Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T13:09:33.796177279Z level=warn component=Consumer state=gather replication_factor=2 available_peers=1 err="replication currently impossible"
ts=2017-04-06T13:09:34.695924625Z level=warn component=cluster NumMembers=1
Yep, so (long story short) when you set -cluster to that hostname, the memberlist library binds a TCP listener there, and later introspects on the bound net.TCPConn.Addr to deduce an advertise address. It's that call which seems to be returning the IPv6 localhost :: address.
I guess the best way to solve this problem (as well as #51) is to provide an opt-in way to explicitly set the AdvertiseHost, rather than relying on that codepath. I'll see what I can do this evening :)
Perfect!
Slightly off-topic: I noticed that the new release, v0.2.1, doesn't have a tag on Docker Hub yet. Does Travis CI give you a way to automatically docker push tags / releases?
I can open a new issue for this if you're interested.
I have no idea who is creating the OK Log Docker images. It's not me.
@m247suppport apparently :joy:
Hi! I'll try to kick off a v0.2.1 image build as quickly as I can. I was wondering why the downloads total went over 2k all of a sudden! Also, let me know if you'd like a better approach to building images in the future. So far, this is just my attempt at creating images for my own use (please see the notes section in README.md).
Sorry for the delay. You should be able to pull a v0.2.1 image now. Let me know if you have any issues / thoughts.
Hi @kminehart,
root@docker:~# helm version
Client: &version.Version{SemVer:"v2.1.3", GitCommit:"5cbc48fb305ca4bf68c26eb8d2a7eb363227e973", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.1.3", GitCommit:"5cbc48fb305ca4bf68c26eb8d2a7eb363227e973", GitTreeState:"clean"}
root@docker:~# kubectl api-versions
apps/v1alpha1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1beta1
autoscaling/v1
batch/v1
batch/v2alpha1
certificates.k8s.io/v1alpha1
extensions/v1beta1
policy/v1alpha1
rbac.authorization.k8s.io/v1alpha1
storage.k8s.io/v1beta1
v1
This is slightly off topic, but I'm new to k8s / helm and thought I'd try out what you're trying to accomplish.
I cloned https://github.com/wehco/oklog and tried to helm install oklog/helm, and I get the following error:
root@docker:~# helm install oklog/helm/
Error: release elevated-ostrich failed: error validating "": error validating data: the server could not find the requested resource
root@docker:~# helm list
NAME REVISION UPDATED STATUS CHART
elevated-ostrich 1 Sun Apr 9 10:26:42 2017 FAILED oklog-0.0.1
Is this due to a yaml formatting issue, or is something else missing? I can see in the values.yaml file (commit 904de82) that the chart is still using the v0.2.0 Docker image, so are more recent changes going to be pushed at some point? Thanks in advance!
It's possible that the change only exists locally 😓 I'll update it this afternoon!
It's updated.
Also, try helm install ./oklog/helm instead. It's possible that helm thinks you're trying to install a chart in its repository.
Thanks! Tracked this down to Issue #10318 and PR #10447. I believe my cluster's API versions are too old for helm to install the oklog chart on it? What do you recommend / what are you using? v1.x?
It looks like the one you're missing is apps/v1beta1.
I'm using that apiVersion for a StatefulSet, which gives us the ability to reference each oklog instance individually. These were previously known as PetSets, but they have since been renamed to StatefulSets.
Read more about StatefulSets here.
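For anyone following along, here's a rough sketch of the pattern being described: a headless Service plus an apps/v1beta1 StatefulSet, which together give each pod a stable DNS name like oklog-0.oklog.default.svc.cluster.local. Names, image, and replica count below are made up for illustration, not taken from the chart.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: oklog
spec:
  clusterIP: None          # headless: per-pod DNS records instead of one VIP
  selector:
    app: oklog
  ports:
    - name: cluster
      port: 3009
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: oklog
spec:
  serviceName: oklog       # ties pod DNS names to the headless Service above
  replicas: 3
  template:
    metadata:
      labels:
        app: oklog
    spec:
      containers:
        - name: oklog
          image: oklog/oklog:v0.2.1   # assumed image name
          ports:
            - containerPort: 3009
```

The stable per-pod hostnames are what make it possible to pass a fixed set of -peer addresses to every node, which is what the clustering mechanism wants.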
hey @peterbourgon, I noticed that you closed #51. Once the next oklog release is pushed, I think we'll be good to go on this helm chart. Maybe I'll replace logspout with fluentd since it's a CNCF project, but otherwise I think we're just about ready!
Nice! I want to capture a few more improvements for the next release. No ETA just yet. Thanks for working on the Helm chart, I'm excited to try it! I would also like to see a switch to fluentd for the reason you mention; I have no particular allegiance to logspout, it was just extremely simple to set up.
@kminehart, great job with the helm chart btw! You can get it working now by adding the following environment variable to the StatefulSet manifest:
...
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
...
and to the configuration:
-cluster tcp://$POD_IP:3009 \
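Putting the env var and the flag together, the container spec might look something like the sketch below. This is illustrative only: the image name, peer hostnames, and ports are assumptions based on this thread, not the actual chart.

```yaml
containers:
  - name: oklog
    image: oklog/oklog:v0.2.1        # assumed image name
    env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
    # Wrapping the command in sh -c lets the shell expand $POD_IP at
    # container start.
    command:
      - sh
      - -c
      - |
        exec oklog ingeststore \
          -cluster tcp://$POD_IP:3009 \
          -peer oklog-0.oklog.default.svc.cluster.local:3009 \
          -peer oklog-1.oklog.default.svc.cluster.local:3009
```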
Looking forward to seeing the helm chart merged. :)
Cheers!