thanos icon indicating copy to clipboard operation
thanos copied to clipboard

[Feature] Per-Store TLS configuration in Thanos Query

Open daviddyball opened this issue 5 years ago • 74 comments

Request

Feature request to add support for per-store tls configuration.

Use Case

I have store running alongside query on my main "monitoring" cluster and we don't use TLS for comms between pods on the same cluster. I also want to pull metrics from my remote Prometheus instances which exist on different clusters, but these endpoints are TLS encrypted because they are cross-cluster

Suggestion

Change the --store argument to use a more definitive style for remote endpoints e.g.:

  • grpc://<ip-address>:<port>
  • grpc+tls://<ip-address>:<port>

This would allow specifying per-store TLS settings and allow a combination of secure and insecure comms for store endpoints.

Related PR from @adrien-f : https://github.com/improbable-eng/thanos/pull/899

daviddyball avatar Mar 26 '19 14:03 daviddyball

Definitely going to try and throw something together for this.

Ideas so far are:

a) use JDBC-like URLs to configure individual stores (e.g. grpc+tls://<addr>:<port>/?Option1=y&Option2=n)

or

b) use file-based configuration like the object store.

@bwplotka If I go for option "b" should I re-use the object-store file config stuff, or are you happy for me to implement a query-store specific config file?

daviddyball avatar Mar 28 '19 14:03 daviddyball

So I think we talked about it here: https://github.com/improbable-eng/thanos/pull/899

I would say for this specifc use case, let's start with option B and essentially limit, by saying --store are explicitly out of this feature.

I re-use the object-store file config stuff

What? We should never mix object store stuff with discovery stuff. In fact you have ready FileSD logic which is compatible with Prometheus FileSD. It allows arbitrary labels, so we can implement easily this:

One solution would be to do that work in the FileSD config and have labels such as tls_server_name to override TLS config.

I would say this is the way to go. What do you think? (:

Thanks for proposing this and sorry for massive delay!

bwplotka avatar Apr 13 '19 22:04 bwplotka

@bwplotka Apologies, I meant I can look at writing some similar code to what is in place for FileSD to implement FileStoreConfig or something :-P Definitely don't want to mix the two 😆

daviddyball avatar Apr 15 '19 08:04 daviddyball

Yea, I would say no specifc config just literally special labels we document properly in our docs

bwplotka avatar Apr 15 '19 12:04 bwplotka

IMO people just coud just run envoy/nginx/etc proxy and handle TLS outside of Thanos. As then we don't have to work on cert reloading all the different options, etc.

@bwplotka wdyt?

povilasv avatar Apr 16 '19 13:04 povilasv

I attempted to use nginx, but it appears that the grpc call is not sending a host header. Can anyone confirm this is the case? (Or at least anyone with better knowledge of the code and go than myself)

mleklund avatar Apr 16 '19 14:04 mleklund

Why does it need that?

I think you should be able to proxy all the traffic to GPRC backend.

Something like:

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent"';
 
    server {
        listen 80 http2;
 
        access_log logs/access.log main;
 
        location / {
            # Replace localhost:50051 with the address and port of your gRPC server
            # The 'grpc://' prefix is optional; unencrypted gRPC is the default
            grpc_pass grpc://localhost:50051;
        }
    }
}

povilasv avatar Apr 16 '19 16:04 povilasv

In my case I would like to park the grpc on a virtual host, instead of running a special nginx instance just for grpc.

mleklund avatar Apr 16 '19 19:04 mleklund

I don't think we pass host headers nor we have an option to configure them. So right now I don't think it will work

povilasv avatar Apr 17 '19 08:04 povilasv

@mleklund Try getambassador.io for a nice GRPC proxy. It's what we ended up using to get around the problem of not being able to mix TLS and non-TLS stores.

@povilasv I'm not sure I understand your objection to adding this feature. Do you think it serves no purpose?

daviddyball avatar Apr 17 '19 08:04 daviddyball

I'ts not an objection, I just wanted to point out that it won't work right now.

Whether we want to add host header or no is a seperate discussion and it will take a bit of time.

BTW a lot of people right now run isitio / linkerd, the sidecar per service approach, so I think it's not a wide problem.

povilasv avatar Apr 17 '19 09:04 povilasv

Up to this point nginx ingress has worked for us and I would rather not complicate things any more then I need to. I have been looking at envoy and ambassador, and it may be a good fit down the road. For the moment I am using port based tcp for nginx ingress since most of our use cases are behind vpn or firewall.

I was curious if I was missing something or if there were plans to pass header info since grpc is built on top of http(2).

mleklund avatar Apr 17 '19 15:04 mleklund

@daviddyball , I'm not sure if I understand your use case correctly, but does this pr https://github.com/improbable-eng/thanos/pull/508 solve your problem?

There are flags like --grpc-server-tls-cert/--grpc-client-tls-cert added by this pr in query doc and other components' docs

benjaminhuo avatar May 08 '19 11:05 benjaminhuo

@benjaminhuo I don't think it does solve the problem here. The issue is that I'd like query to talk to different stores in different ways (e.g. TLS vs non-TLS) and there is no way to configure per-store endpoint TLS configuration

daviddyball avatar May 08 '19 12:05 daviddyball

For now we won't be implementining Per Store TLS configuration. Our general suggestion is run sidecar proxy envoy / nginx / etc.

Hopefully we will update docs soon. We will close the ticket once the docs are there.

povilasv avatar May 08 '19 12:05 povilasv

I would also like something like this.

My idea, which may be wrong, is:

  • a "central" cluster which gathers metrics from all other clusters
  • query instances on the central cluster talk to query instances in other clusters over TLS
  • query instances on the central cluster also talks to local cluster's store and sidecars
    • right now this needs to be over TLS as well, and needs to have one endpoint for each sidecar, an extra hop et al

ideally, I would like to define something like:

    - query.cluster1.foobar.com:443
    - query.cluster2.foobar.com:443
    - query.cluster3.foobar.com:443
    - query.cluster4.foobar.com:443
    - dnssrv+_grpc._tcp.thanos-sidecar-grpc.thanos.svc.cluster.local
    - dnssrv+_grpc._tcp.thanos-query-grpc.thanos.svc.cluster.local

so, it seems that without setting --grpc-client-tls-secure the secure endpoints won't work, and if I enable it the dnssrv records won't work :(

as far as I understand there is no way of doing this right now.

caarlos0 avatar Jun 27 '19 21:06 caarlos0

Try getambassador.io for a nice GRPC proxy. It's what we ended up using to get around the problem of not being able to mix TLS and non-TLS stores.

@daviddyball I'm considering using ambassador as an edge proxy in each cluster too. Ambassador has tls termination ability, and we can use it to accept tls connection from central monitoring cluster.

But for the central monitoring cluster, is it possible for ambassador to originate tls connection to the remote cluster(egress grpc tls traffic)?

Would you share your config? thanks.

@bwplotka @caarlos0 , I found a blog post https://improbable.io/blog/thanos-architecture-at-improbable saying

In terms of requests to a remote cluster, Envoy has been used securely to proxy our request between many clusters; meaning that a request will go via an Envoy sidecar, an edge Envoy egress proxy, and over the public internet to an edge Envoy ingress proxy (all over a secure connection). This latter will then forward the request onto the Thanos Sidecar to retrieve the data for the time period and labels specified. All of this is done using server streaming gRPC.

Just wondering where I can find envoy config examples like this for ingress and especially egress tls traffic?

Thanks very much Ben

benjaminhuo avatar Jul 16 '19 06:07 benjaminhuo

@benjaminhuo I had thanos-query talking directly to ambassador directly, so didn't use Ambassador for outbound. You'd probably want to use Envoy directly for that part, no?

daviddyball avatar Jul 16 '19 08:07 daviddyball

@benjaminhuo I had thanos-query talking directly to ambassador directly, so didn't use Ambassador for outbound. You'd probably want to use Envoy directly for that part, no?

Yeah, I'm considering using envoy to manage outbound tls.

Then you've the same tls client config for all the store specified in thanos query including the one in the same cluster as thanos query?

benjaminhuo avatar Jul 16 '19 10:07 benjaminhuo

We're not validating TLS at the moment @benjaminhuo , so there's no TLS config on query, just a list of store endpoints :+1:

daviddyball avatar Jul 16 '19 10:07 daviddyball

We're not validating TLS at the moment @benjaminhuo , so there's no TLS config on query, just a list of store endpoints 👍

I see :)

benjaminhuo avatar Jul 16 '19 11:07 benjaminhuo

It looks like there are mix of things in this feature requests.

So is it about host header or per store TLS config?

bwplotka avatar Sep 11 '19 14:09 bwplotka

Per-store TLS config. The ability to specify TLS for some stores and disable it for others. We've since worked around the issue, so if it doesn't fit with the intended design of Thanos, then it can be closed as Won't Fix :+1:

daviddyball avatar Sep 11 '19 14:09 daviddyball

How you worked around it? If you will quickly describe this it would be awesome for users (:

I guess some forward proxy?

bwplotka avatar Sep 11 '19 14:09 bwplotka

We ended up switching off TLS for the time being and using Ambassador as a proxy.

The eventual goal is to use cross-cluster services via Istio, so we can just go direct without proxies.

daviddyball avatar Sep 11 '19 14:09 daviddyball

We ran into this issue with a Thanos in Kubernetes. We have stores in the cluster we wanted to communicate with in plaintext, and others in remote (bare metal) datacenters with which we wanted to use TLS.

We worked around the problem with one Envoy pod per remote datacenter, terminating the TLS connection in the cluster hosting the Thanos query. Each is tagged like a local store to be automatically picked up by Thanos' dnssrv discovery mechanism.

Still, it would probably be simpler to do away with the proxies and let Thanos speak directly (and securely) with the remote endpoints.

jqueuniet avatar Sep 11 '19 15:09 jqueuniet

We ran into this issue with a Thanos in Kubernetes. We have stores in the cluster we wanted to communicate with in plaintext, and others in remote (bare metal) datacenters with which we wanted to use TLS.

We worked around the problem with one Envoy pod per remote datacenter, terminating the TLS connection in the cluster hosting the Thanos query. Each is tagged like a local store to be automatically picked up by Thanos' dnssrv discovery mechanism.

Still, it would probably be simpler to do away with the proxies and let Thanos speak directly (and securely) with the remote endpoints.

@jqueuniet I'm running into the exact same issue. I've been struggling to get a working Envoy configuration in place. Would you be open to sharing your solution here or discussing it privately? Thanks!

vbeausoleil avatar Nov 22 '19 15:11 vbeausoleil

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 11 '20 03:01 stale[bot]

Any update? Why this issue become stale? I have the exact same issue...

omerlh avatar Jan 21 '20 12:01 omerlh

It became stale as no one said anything for long time (: I marked this thread as pinned as it's not solved and it's fair feature request.

We recently added way for different TLS for Ruler Query API and alertmanagers, so we might want to follow same logic and allow that for querier. Help wanted (:

bwplotka avatar Jan 25 '20 19:01 bwplotka