Connecting to an external Elasticsearch created by ECK fails with "health check timeout: no Elasticsearch node available"

Open luxurine opened this issue 4 years ago • 14 comments

Error message:

2020/12/16 04:02:17 maxprocs: Updating GOMAXPROCS=1: using minimum allowed GOMAXPROCS
{"level":"info","ts":1608091337.2560472,"caller":"flags/service.go:117","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1608091337.2560928,"caller":"flags/service.go:123","msg":"Mounting expvar handler on admin server","route":"/debug/vars"}
{"level":"info","ts":1608091337.2562084,"caller":"flags/admin.go:121","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1608091337.2562501,"caller":"flags/admin.go:127","msg":"Starting admin HTTP server","http-addr":":14269"}
{"level":"info","ts":1608091337.256265,"caller":"flags/admin.go:113","msg":"Admin server started","http.host-port":"[::]:14269","health-status":"unavailable"}
{"level":"fatal","ts":1608091342.4766536,"caller":"command-line-arguments/main.go:74","msg":"Failed to init storage factory","error":"failed to create primary Elasticsearch client: health check timeout: no Elasticsearch node available","stacktrace":"main.main.func1\n\tcommand-line-arguments/main.go:74\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:826\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:914\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:864\nmain.main\n\tcommand-line-arguments/main.go:133\nruntime.main\n\truntime/proc.go:204"}

External Elasticsearch 7.10.1 YAML file (ECK operator 1.3.1): https://www.elastic.co/guide/en/cloud-on-k8s/1.3/k8s-deploy-eck.html

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: jaeger-storage
spec:
  version: 7.10.1
  http:
    tls:
      selfSignedCertificate:
        subjectAltNames:
        - dns: jaeger-storage-es-http.observability.svc.cluster.local
        - dns: jaeger-storage-es-http.observability.svc
        - dns: jaeger-storage-es-http
  nodeSets:
  - name: default
    count: 1
    config:
      node.roles:
      - master
      - data
      node.store.allow_mmap: false
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
        storageClassName: xxx

Jaeger YAML file:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: tds-jaeger-dev
spec:
  strategy: production
  collector:
    maxReplicas: 3
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
  storage:
    type: elasticsearch
    options:
      es:
        # server-urls: https://jaeger-storage-es-http:9200
        # server-urls: https://jaeger-storage-es-http.observability.svc:9200
        server-urls: https://jaeger-storage-es-http.observability.svc.cluster.local:9200
        tls: "true"
        tls.ca: /es/certificates/ca.crt
        tls.cert: /es/certificates/tls.crt
        tls.key: /es/certificates/tls.key
      secretName: jaeger-secret
  volumeMounts:
    - name: certificates
      mountPath: /es/certificates/
      readOnly: true
  volumes:
    - name: certificates
      secret:
        secretName: jaeger-storage-es-http-certs-internal

I have tried all of these URLs, but I still get the same error:

        # server-urls: https://jaeger-storage-es-http:9200
        # server-urls: https://jaeger-storage-es-http.observability.svc:9200
        server-urls: https://jaeger-storage-es-http.observability.svc.cluster.local:9200

I have also tried the same config as in this issue: https://github.com/jaegertracing/jaeger-operator/issues/496, but it still does not work.

The error reports a health check timeout, so I used curl to test as follows:

curl -X HEAD -u "elastic:fZdf2d80D238g0mEr03CDM3e" -k "https://172.21.10.94:9200" -v
* About to connect() to 172.21.10.94 port 9200 (#0)
*   Trying 172.21.10.94...
* Connected to 172.21.10.94 (172.21.10.94) port 9200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=jaeger-storage-es-http.observability.es.local,OU=jaeger-storage
* 	start date: Dec 16 03:39:31 2020 GMT
* 	expire date: Dec 16 03:49:31 2021 GMT
* 	common name: jaeger-storage-es-http.observability.es.local
* 	issuer: CN=jaeger-storage-http,OU=jaeger-storage
* Server auth using Basic with user 'elastic'
> HEAD / HTTP/1.1
> Authorization: Basic ZWxhc3RpYzpmWmRmMmQ4MEQyMzhnMG1FcjAzQ0RNM2U=
> User-Agent: curl/7.29.0
> Host: 172.21.10.94:9200
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 557
<

With -X HEAD, the response code is 200, but curl gets stuck and the request never finishes.

curl --head -u "elastic:fZdf2d80D238g0mEr03CDM3e" -k "https://172.21.10.94:9200" -v
* About to connect() to 172.21.10.94 port 9200 (#0)
*   Trying 172.21.10.94...
* Connected to 172.21.10.94 (172.21.10.94) port 9200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* 	subject: CN=jaeger-storage-es-http.observability.es.local,OU=jaeger-storage
* 	start date: Dec 16 03:39:31 2020 GMT
* 	expire date: Dec 16 03:49:31 2021 GMT
* 	common name: jaeger-storage-es-http.observability.es.local
* 	issuer: CN=jaeger-storage-http,OU=jaeger-storage
* Server auth using Basic with user 'elastic'
> HEAD / HTTP/1.1
> Authorization: Basic ZWxhc3RpYzpmWmRmMmQ4MEQyMzhnMG1FcjAzQ0RNM2U=
> User-Agent: curl/7.29.0
> Host: 172.21.10.94:9200
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
content-type: application/json; charset=UTF-8
< content-length: 557
content-length: 557

<
* Connection #0 to host 172.21.10.94 left intact

With --head, the request finishes successfully.

So I'm wondering whether the problem is with how the health check method is processed in Elasticsearch.
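
A note on the two curl runs above: according to the curl manual, -X only changes the method word sent in the request and does not change how curl behaves. With -X HEAD, curl therefore still waits for the 557-byte body announced by content-length, which a HEAD response never carries. The hang is likely a curl artifact rather than an Elasticsearch health check problem. Below is a minimal sketch of a HEAD probe that cannot hang. It assumes ES_USERNAME and ES_PASSWORD are exported in the shell, and the function name es_head_check is hypothetical.

```shell
# es_head_check URL
# Sends a real HEAD request and prints only the HTTP status code.
# Assumes ES_USERNAME and ES_PASSWORD are exported by the caller.
es_head_check() {
  # --head makes curl stop after the response headers.
  # --max-time bounds the whole request so a stall cannot hang the shell.
  # -k mirrors the original test and skips certificate verification.
  curl --head --silent --show-error --max-time 10 -k \
    --user "${ES_USERNAME}:${ES_PASSWORD}" \
    --output /dev/null --write-out '%{http_code}\n' \
    "$1"
}
```

Usage: es_head_check "https://172.21.10.94:9200"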

luxurine avatar Dec 16 '20 04:12 luxurine

I am getting a similar error while trying to connect a Jaeger instance to Elasticsearch created by the Red Hat Elasticsearch operator on OpenShift Container Platform 4.6.

 {"level":"fatal","ts":1610517606.3812678,"caller":"collector/main.go:70","msg":"Failed to init storage factory","error":"failed to create primary Elasticsearch client: health check timeout: no Elasticsearch node available","stacktrace":"main.main.func1\n\t/go/src/github.com/jaegertracing/jaeger/cmd/collector/main.go:70\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:762\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:852\ngithub.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/jaegertracing/jaeger/vendor/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/jaegertracing/jaeger/cmd/collector/main.go:129\nruntime.main\n\t/opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/proc.go:203"}

Here is my Jaeger config:

kind: Jaeger
metadata:
  name: afjaeger
spec:
  strategy: production
  collector:
    maxReplicas: 1
    resources:
      limits:
        cpu: 400m
        memory: 512Mi
  query:
    resources:
      limits:
        cpu: 256m
        memory: 128Mi
  agent:
    strategy: DaemonSet
    resources:
      limits:
        cpu: 256m
        memory: 128Mi
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 1
          memory: 2Gi
    options:
      es:
        server-urls: https://elasticsearch.openshift-logging.svc.cluster.local:9200
        tls.ca: /es/certificates/tls.crt
    secretName: jaeger-secret-ka
  volumeMounts:
    - name: certificates
      mountPath: /es/certificates/
      readOnly: true
  volumes:
    - name: certificates
      secret:
        secretName: test

I used the admin-ca from the Elasticsearch secret here:

tls.ca: /es/certificates/tls.crt

priyavj08 avatar Jan 13 '21 11:01 priyavj08

I have the same issue.

fsm-xyz avatar Jan 16 '21 04:01 fsm-xyz

I have solved my problem, @luxurine @priyavj08. My config was wrong: the jaeger-secret must contain the keys ES_PASSWORD and ES_USERNAME, but I had written elastic=xxxx directly.

Check the environment in the container. The wrong way:

$ env
elastic=235n718z40NkHPX3Sf6EjcyY123

The right way:

$ kubectl create secret generic jaeger-secret --from-literal=ES_PASSWORD=changeme --from-literal=ES_USERNAME=elastic
$ env
ES_USERNAME=elastic
ES_PASSWORD=Qkt5ij7wQhM75sa1q45YB178

Maybe this is helpful to you, @luxurine.
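
The same secret can be written as a manifest. This is a minimal sketch, equivalent to the kubectl create secret command above. The namespace is an assumption (the thread only shows that Elasticsearch runs in observability), and the password is a placeholder.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: jaeger-secret
  namespace: observability   # assumption: use the namespace of the Jaeger instance
type: Opaque
stringData:
  # The key names matter: the container receives them as environment variables.
  ES_USERNAME: elastic
  ES_PASSWORD: changeme      # placeholder, replace with the real password
```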

fsm-xyz avatar Jan 17 '21 01:01 fsm-xyz

I have checked my jaeger-secret secret, it seems correct.

image

luxurine avatar Jan 21 '21 09:01 luxurine

@luxurine , given that you are using -k with your curl call, I suspect the issue might be related to the TLS certs, not with the password. Depending on the Jaeger version that you are using, you can use a -debug image, like jaegertracing/jaeger-collector-debug:1.21.0, and check whether the certs are correct. If you can make a curl call without the insecure mode set (-k), send us the results and we can check from there.
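
A minimal sketch of the verified check suggested above (no -k). It assumes the CA is stored under the ca.crt key of the jaeger-storage-es-http-certs-internal secret, which matches the tls.ca path mounted in the Jaeger YAML earlier in this thread, and that the command runs somewhere the service name resolves, for example inside the cluster. The function name es_verified_check is hypothetical.

```shell
# es_verified_check NAMESPACE SECRET URL
# Extracts ca.crt from a Kubernetes secret and sends a HEAD request that
# verifies the server certificate against it.
# Assumes ES_USERNAME and ES_PASSWORD are exported by the caller.
es_verified_check() {
  ns="$1"; secret="$2"; url="$3"
  ca_file="$(mktemp)"
  # Secret data is base64-encoded; the dot in the key name must be escaped.
  kubectl get secret "$secret" -n "$ns" -o jsonpath='{.data.ca\.crt}' \
    | base64 --decode > "$ca_file"
  # --cacert replaces -k: curl fails if the certificate chain or the
  # hostname in the URL does not match.
  curl --head --silent --show-error --max-time 10 \
    --cacert "$ca_file" \
    --user "${ES_USERNAME}:${ES_PASSWORD}" \
    "$url"
  status=$?
  rm -f "$ca_file"
  return "$status"
}
```

Usage: es_verified_check observability jaeger-storage-es-http-certs-internal "https://jaeger-storage-es-http.observability.svc.cluster.local:9200"

If this call fails while the -k variant succeeds, the problem is in the certificate chain or in the hostname used in server-urls, not in the credentials.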

jpkrohling avatar Jan 21 '21 10:01 jpkrohling

I have the same issue, but I don't use a self-signed cert. Mine is valid, so I don't need to provide a CA. We also don't use password authentication. curl works without the -k option, and curator can reach the ES cluster.

teoyaomiqui avatar Aug 03 '21 00:08 teoyaomiqui

I had the same issue and figured it out. My Elasticsearch account lacked the "monitor" permission, so the HEAD / request that Jaeger's Elasticsearch client sends as a health check failed.
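
A minimal sketch of a role that grants the "monitor" cluster privilege, written in Elasticsearch's file-based roles.yml format and wrapped in a Kubernetes secret. It assumes an ECK-managed cluster; the secret name, role name, index pattern, and index privileges are all assumptions and should be adapted.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: jaeger-es-roles            # hypothetical secret name
  namespace: observability         # namespace of the Elasticsearch resource in this thread
stringData:
  roles.yml: |
    jaeger_client:                 # hypothetical role name
      cluster: [ 'monitor' ]       # the privilege the HEAD / health check needs
      indices:
        - names: [ 'jaeger-*' ]    # assumption: default Jaeger index prefix
          privileges: [ 'all' ]    # assumption: narrow this for production
```

With ECK, such a secret is referenced from the Elasticsearch resource under spec.auth.roles, and the user Jaeger authenticates as still has to be assigned the role. On a cluster not managed by ECK, the same role body can be created through the Elasticsearch security role API instead.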

linw1995 avatar Aug 18 '21 04:08 linw1995

Thank you, this ended up being my issue as well. My Jaeger secret was correct, and I noticed this in the Jaeger logs:

Failed to init storage factory","error":"failed to create primary Elasticsearch client: health check timeout: no Elasticsearch node available"

I will let you know if granting monitor to my role in RBAC resolves the issue.

Edit: yes, that was it. Thank you for the crucial input, linw1995.

perezjasonr avatar Oct 25 '21 16:10 perezjasonr

Hi @perezjasonr @linw1995, could you please share the details of how you solved this issue? I am not clear on how to add the permission you mentioned.

ye0321 avatar Nov 05 '21 02:11 ye0321

Does anybody know how to solve this issue? I am stuck on it right now.

By the way, @jpkrohling, could you please help me?

ye0321 avatar Nov 06 '21 07:11 ye0321

I am hitting this as well.

seaurching avatar Mar 22 '22 09:03 seaurching

@luxurine, I hope you have resolved the issue. For anyone else facing it, I suspect I have found the reason: the indentation in the Jaeger YAML file is incorrect. secretName: jaeger-secret is a sub-option of storage, not of storage.options.

rishiravikumar-tul-scm avatar Dec 16 '22 07:12 rishiravikumar-tul-scm

Hi @luxurine, we have resolved the issue. Thanks for your YAML files, they helped us a lot.

You have just one mistake in your Jaeger YAML file: secretName is placed under options, as a sibling of es. In the correct format, secretName and options are siblings directly under storage. The official documentation shows the YAML format: https://docs.openshift.com/container-platform/4.4/jaeger/jaeger_install/rhbjaeger-deploying.html#jaeger-deploy-production_jaeger-deploying
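
A sketch of the corrected layout, based on the Jaeger YAML posted at the top of this thread. The only change is that secretName moves up one level; the collector section is left out for brevity, and all other values are copied from the original.

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: tds-jaeger-dev
spec:
  strategy: production
  storage:
    type: elasticsearch
    # secretName sits directly under storage, as a sibling of options
    secretName: jaeger-secret
    options:
      es:
        server-urls: https://jaeger-storage-es-http.observability.svc.cluster.local:9200
        tls: "true"
        tls.ca: /es/certificates/ca.crt
        tls.cert: /es/certificates/tls.crt
        tls.key: /es/certificates/tls.key
  volumeMounts:
    - name: certificates
      mountPath: /es/certificates/
      readOnly: true
  volumes:
    - name: certificates
      secret:
        secretName: jaeger-storage-es-http-certs-internal
```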

Points to be noted :-

  1. We created a separate secret named es-secret that contains the ES_USERNAME and ES_PASSWORD details, and passed that secret name in the Jaeger YAML under storage.
  2. You have to provide the http-certs-internal secret name under volumes.
  3. If Elasticsearch and Jaeger are installed in separate namespaces, you also need to create the es-secret and http-certs-internal secrets in the Jaeger namespace.
  4. Currently, Elasticsearch 8.x is not compatible with Jaeger, so avoid that version.
  5. We also tried installing Elasticsearch 7.x using Helm, but it gave an error, so we installed Elasticsearch using the operator (CRD) instead. For that, you first need to install ECK in your cluster: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-deploy-eck.html

If you follow these steps, I am sure you will resolve the issue.

@harshalzambre-TEKsystems-scm @rishiravikumar-tul-scm Thanks a lot! It could be an error in the documentation. I just copied and edited YAML manifests at the time and did not notice the wrong config format, but it's all right. Thank you again for your response.

luxurine avatar Dec 16 '22 09:12 luxurine