
TLS handshake errors on nats container in K8

Open amit12cool opened this issue 4 months ago • 4 comments

Observed behavior

I have a publisher publishing messages to a NATS server. The publisher is an Azure Function publishing messages at regular intervals.

I see the following errors on the publisher:

2024-02-23T18:10:33.016Z] nats: encountered error
Traceback (most recent call last):
  File "C:\Users\abcd\Code\Veli-IoT-System\message-ingestor\.venv\lib\site-packages\nats\aio\client.py", line 714, in _close
    await self._transport.wait_closed()
  File "C:\Users\abcd\Code\Veli-IoT-System\message-ingestor\.venv\lib\site-packages\nats\aio\transport.py", line 180, in wait_closed
    return await self._io_writer.wait_closed()
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\asyncio\streams.py", line 359, in wait_closed
    await self._protocol._get_close_waiter(self)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\asyncio\sslproto.py", line 534, in data_received
    ssldata, appdata = self._sslpipe.feed_ssldata(data)
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\asyncio\sslproto.py", line 206, in feed_ssldata
    self._sslobj.unwrap()
  File "C:\Users\abcd\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 949, in unwrap
    return self._sslobj.shutdown()
ssl.SSLError: [SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2756)

I see errors in the NATS server containers as below:

[7] 2024/02/23 18:14:27.845444 [ERR] 10.240.0.10:29783 - cid:206 - TLS handshake error: read tcp 10.244.6.44:4222->10.240.0.10:29783: i/o timeout
2/23/2024, 11:44:28 PM
nats-2
d13670d74711ba3ebf50e939852eb2efdc7dbe6010bb37abd81dcf8f7f2e8542
[7] 2024/02/23 18:14:28.144152 [ERR] 10.240.0.10:29783 - cid:206 - TLS handshake timeout
[7] 2024/02/23 18:14:28.806477 [ERR] 10.240.0.12:39668 - cid:207 - TLS handshake error: read tcp 10.244.6.44:4222->10.240.0.12:39668: i/o timeout
2/23/2024, 11:44:31 PM
nats-2
d13670d74711ba3ebf50e939852eb2efdc7dbe6010bb37abd81dcf8f7f2e8542
[7] 2024/02/23 18:14:31.045227 [ERR] 10.240.0.14:17010 - cid:208 - TLS handshake error: read tcp 10.244.6.44:4222->10.240.0.14:17010: i/o timeout

Expected behavior

There should be no TLS errors.

Server and client version

NATS server: 2.10.4-alpine
Client: nats-py==2.6.0

Host environment

Kubernetes (K8s)

Steps to reproduce

TLS auth

amit12cool avatar Feb 23 '24 18:02 amit12cool

I think you may have to bump the default tls { timeout = X } and authorization { timeout = X } for your deployment, have you tried tuning those?
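Roughly, a minimal sketch of those settings in the server config would be something like this (the values here are only placeholders, not recommendations):

tls {
  # certificate settings omitted; timeout is the time allowed
  # for the TLS handshake to complete, in seconds
  timeout: 5
}

authorization {
  # time allowed for the client to finish authenticating, in seconds
  timeout: 5
}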

wallyqs avatar Feb 23 '24 18:02 wallyqs

I think you may have to bump the default tls { timeout = X } and authorization { timeout = X } for your deployment, have you tried tuning those?

@wallyqs where can I bump these in the https://github.com/nats-io/k8s/blob/main/helm/charts/nats/values.yaml Helm chart?

Also, I didn't find a timeout option here: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/authorization
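From the chart structure, my guess is that it goes through the merge blocks, something like the following (unverified, values are placeholders):

config:
  nats:
    tls:
      merge:
        timeout: 5        # guess: merged into the server-level tls block
  merge:
    authorization:
      timeout: 5          # guess: config.merge adds arbitrary server config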

amit12cool avatar Mar 11 '24 18:03 amit12cool

@wallyqs with timeout = 50 in tls, the timeout errors went away. But I still see TLS handshake errors like the ones below:

[7] 2024/03/13 19:50:46.143685 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-1.nats-headless": lookup nats-1.nats-headless on 10.0.0.10:53: no such host
[7] 2024/03/13 19:50:46.143905 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats-headless": lookup nats-0.nats-headless on 10.0.0.10:53: no such host
[7] 2024/03/13 19:50:46.172442 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats-headless": lookup nats-0.nats-headless on 10.0.0.10:53: no such host
[7] 2024/03/13 19:50:46.173120 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-1.nats-headless": lookup nats-1.nats-headless on 10.0.0.10:53: no such host
[7] 2024/03/12 18:49:37.957961 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-0.nats-headless": lookup nats-0.nats-headless on 10.0.0.10:53: no such host

[7] 2024/03/12 17:40:23.567594 [ERR] 10.240.0.4:59217 - cid:44 - TLS handshake error: read tcp 10.244.2.14:4222->10.240.0.4:59217: read: connection reset by peer

Points to note:

  • 10.0.0.10:53 is the cluster DNS address, and when I run the command kubectl exec -it nats-0 -- nslookup nats-1.nats-headless.nats-playground-3.svc.cluster.local it gives this output:
Server:         10.0.0.10
Address:        10.0.0.10:53


Name:   nats-1.nats-headless.nats-playground-3.svc.cluster.local
Address: 10.244.3.198

So the name above resolves to 10.244.3.198 from outside the pod, but not from inside the pod (see the checks after this list).

  • I have no clients connected to the NATS server, i.e. to any of the NATS pods in K8s.
  • 10.244.2.14:4222 is my NATS pod's server IP, that much I know. But I don't know whose IP 10.240.0.4:59217 is. Any idea whom the NATS server is making this TCP connection with?
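For reference, these are the checks I can run to narrow this down (assuming kubectl access to the cluster; 10.240.0.4 is just the unknown address from the log above):

# does the short route name used in the config resolve from inside the pod?
kubectl exec -it nats-0 -- nslookup nats-1.nats-headless
# the fully qualified name, which did resolve
kubectl exec -it nats-0 -- nslookup nats-1.nats-headless.nats-playground-3.svc.cluster.local
# look for a pod or node that owns 10.240.0.4
kubectl get pods -A -o wide | grep 10.240.0.4
kubectl get nodes -o wide | grep 10.240.0.4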

I have the nats.conf below:

{
  "cluster": {
    "name": "nats",
    "no_advertise": true,
    "port": 6222,
    "routes": [
      "nats://nats-0.nats-headless:6222",
      "nats://nats-1.nats-headless:6222",
      "nats://nats-2.nats-headless:6222"
    ]
  },
  "http_port": 8222,
  "jetstream": {
    "max_file_store": 10Gi,
    "max_memory_store": 0,
    "store_dir": "/data"
  },
  "lame_duck_duration": "30s",
  "lame_duck_grace_period": "10s",
  "pid_file": "/var/run/nats/nats.pid",
  "port": 4222,
  "server_name": $SERVER_NAME,
  "tls": {
    "ca_file": "/mnt/nats-certificate/rootCA-playground.crt",
    "cert_file": "/mnt/nats-certificate/nats-playground-server.crt",
    "key_file": "/mnt/nats-certificate/nats-playground-server.key",
    "timeout": 50,
    "verify": true
  }
}

Here are my Helm chart values:

config:
  cluster:
    enabled: true
    replicas: 3
    port: 6222
  jetstream:
    enabled: true
    fileStore:
      pvc:
        size: 10Gi
  nats:
    tls:
      enabled: true
      merge: {
        verify: true,
        cert_file: '/mnt/nats-certificate/nats-playground-server.crt',
        key_file: '/mnt/nats-certificate/nats-playground-server.key',
        ca_file: '/mnt/nats-certificate/rootCA-playground.crt',
        timeout: 50
      }
  
podTemplate:
  topologySpreadConstraints:
    kubernetes.io/hostname:
      maxSkew: 1
      whenUnsatisfiable: DoNotSchedule
  patch:
  - op: add
    path: /spec/volumes/-
    value:
      name: nats-certificate-volume
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: 'nats-certificate'

service:
  merge:
    spec:
      type: LoadBalancer

promExporter:
  enabled: true
  podMonitor:
    enabled: true

container:
  image:
    repository: nats
    image: 2.10.11-alpine
  patch:
  - op: add
    path: /volumeMounts/-
    value:
      name: nats-certificate-volume
      mountPath: /mnt/nats-certificate

reloader:
  patch:
  - op: add
    path: /volumeMounts/-
    value:
      name: nats-certificate-volume
      mountPath: /mnt/nats-certificate

amit12cool avatar Mar 12 '24 17:03 amit12cool

@neilalexander @wallyqs please check above. This is critical for me.

amit12cool avatar Mar 13 '24 20:03 amit12cool