
acryl-datahub-actions pod is not able to connect to AWS MSK with TLS only brokers

atul-chegg opened this issue 2 years ago · 9 comments

We are seeing messages like this in the logs of the acryl-datahub-actions pod:

%6|1661937385.394|FAIL|rdkafka#consumer-1| [thrd:<broker-1>:9094/b]: b-1.<broker-1>:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY, 3 identical error(s) suppressed)

I have these settings under global for kafka

kafka:
    bootstrap:
      server: ${kafka_bootstrap_servers}
    zookeeper:
      server:  ${kafka_zookeeper_servers}
    schemaregistry:
      url: "http://datahub-prerequisites-cp-schema-registry:8081"
    partitions: 3
    replicationFactor: 3

  springKafkaConfigurationOverrides:
    security.protocol: SSL
    kafkastore.security.protocol: SSL
    ssl.protocol: TLS

atul-chegg avatar Aug 31 '22 09:08 atul-chegg

@atul-chegg, can you please elaborate on what you were trying to do when you hit this issue? How would I be able to reproduce it?

treff7es avatar Sep 15 '22 17:09 treff7es

Hi @treff7es, we have not tried anything related to acryl-datahub-actions yet. I see these errors in the container, although the container is running and not crashing.

These container log messages suggest that it is not able to connect to Kafka, which is why I reported this issue.

We are using AWS MSK for Kafka. The endpoint is unauthenticated but available only over TLS.

[Screenshot 2022-09-16 at 10:35:36 AM]

To reproduce this issue, you could create a similar setup in AWS MSK.

I will try to use acryl-datahub-actions to see if this really breaks any functionality or is just a spurious error message.

Here are some more logs:

2022/09/14 00:23:21 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:23 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:25 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:27 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:29 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:31 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:33 Received 200 from http://datahub-datahub-gms:443/health
No user action configurations found. Not starting user actions.
[2022-09-14 00:23:34,679] INFO     {datahub_actions.cli.actions:68} - DataHub Actions version: unavailable (installed editable via git)
%6|1663115015.318|FAIL|rdkafka#consumer-1| [thrd:b-2.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-2.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY)
[2022-09-14 00:23:35,436] INFO     {datahub_actions.cli.actions:98} - Action Pipeline with name 'ingestion_executor' is now running.
%6|1663115015.441|FAIL|rdkafka#consumer-1| [thrd:b-1.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-1.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY)
%6|1663115015.515|FAIL|rdkafka#consumer-1| [thrd:b-2.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-2.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%6|1663115015.681|FAIL|rdkafka#consumer-1| [thrd:b-1.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-1.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%6|1663115016.317|FAIL|rdkafka#consumer-1| [thrd:b-3.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-3.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY)
 

atul-chegg avatar Sep 16 '22 05:09 atul-chegg

Thanks for the super detailed comment. Let me dig into this.

treff7es avatar Sep 16 '22 05:09 treff7es

@atul-chegg, for datahub-actions you have to set these properties in the recipe; the config above is for the GMS server. See https://github.com/acryldata/datahub-actions/blob/main/docs/sources/kafka-event-source.md
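For reference, a minimal sketch of what such a recipe-level override might look like, based on the kafka-event-source doc linked above (the broker address, certificate path, and env-var names here are placeholders, not verified defaults):

```yaml
name: "pipeline-name"
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
      # librdkafka consumer properties for a TLS-only listener:
      consumer_config:
        security.protocol: SSL
        ssl.ca.location: /mnt/certs/ca.pem
```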

treff7es avatar Sep 16 '22 13:09 treff7es

@treff7es we ran into the same issue using an on-prem Confluent cluster with mTLS auth active.

We found the packaged ingestion_executor config in the datahub-actions container at the following path: /etc/datahub/actions/system/conf/executor.yaml. However, this config doesn't include the necessary environment-variable substitutions mentioned in the link you provided above.

This issue was previously spotted here: https://github.com/datahub-project/datahub/issues/5706

We worked around this shortcoming with the following Helm values, which overwrite the default bundled configuration with a custom one provided as a ConfigMap:

acryl-datahub-actions:
  [...]
  extraVolumes:
    - configMap:
        defaultMode: 292
        name: datahub-executor
      name: datahub-custom-actions-config
  extraVolumeMounts:
    - mountPath: /etc/datahub/actions/system/conf
      name: datahub-custom-actions-config

However, this workaround doesn't seem to be documented anywhere, does it?

We added the following values to the default configuration shipped with the datahub-actions container:

#[source.config.connection]
consumer_config: 
    security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT}
    ssl.keystore.location: ${KAFKA_PROPERTIES_SSL_KEYSTORE_LOCATION:-/mnt/certs/keystore}
    ssl.truststore.location: ${KAFKA_PROPERTIES_SSL_TRUSTSTORE_LOCATION:-/mnt/certs/truststore}
    ssl.keystore.password: ${KAFKA_PROPERTIES_SSL_KEYSTORE_PASSWORD:-keystore_password}
    ssl.key.password: ${KAFKA_PROPERTIES_SSL_KEY_PASSWORD:-keystore_password}
    ssl.truststore.password: ${KAFKA_PROPERTIES_SSL_TRUSTSTORE_PASSWORD:-truststore_password}
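As an aside, the `${VAR:-default}` syntax above resolves to the environment variable if it is set, otherwise to the default after `:-`. A rough Python sketch of that substitution logic, for illustration only (this is not the actual datahub-actions code):

```python
import os
import re

# Matches ${NAME} or ${NAME:-default}, bash-style.
_PATTERN = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand(value: str, env=os.environ) -> str:
    """Substitute ${VAR:-default} placeholders from an environment mapping."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        return env.get(name, default if default is not None else "")
    return _PATTERN.sub(repl, value)
```

So with no env vars set, `security.protocol` falls back to `PLAINTEXT`, which matches the broker disconnects seen against a TLS-only listener.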

Afterwards we sadly ran into another error that we can't yet explain. Any thoughts on this one?

---- (full traceback above) ----
File "/usr/local/lib/python3.9/site-packages/datahub_actions/pipeline/pipeline_util.py", line 67, in create_event_source
    event_source_instance = event_source_class.create(event_source_config, ctx)
File "/usr/local/lib/python3.9/site-packages/datahub_actions/plugin/source/kafka/kafka_event_source.py", line 140, in create
    return cls(config, ctx)
File "/usr/local/lib/python3.9/site-packages/datahub_actions/plugin/source/kafka/kafka_event_source.py", line 120, in __init__
    self.consumer: confluent_kafka.Consumer = confluent_kafka.DeserializingConsumer(
File "/usr/local/lib/python3.9/site-packages/confluent_kafka/deserializing_consumer.py", line 103, in __init__
    super(DeserializingConsumer, self).__init__(conf_copy)

KafkaException: KafkaError{code=_INVALID_ARG,val=-186,str="Java TrustStores are not supported, use `ssl.ca.location` and a certificate file instead. See https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka for more information."}

danielhass avatar Oct 18 '22 13:10 danielhass

We dug into this further, and with a little help from @Masterchen09 and the links from the error message above, we came up with the following workaround in the executor.yaml config:

#[source.config.connection]
consumer_config:
   ssl.ca.location: "/mnt/datahub/certs/ca.pem"
   ssl.certificate.location: "/mnt/datahub/certs/client.pem"
   ssl.key.location: "/mnt/datahub/certs/key.pem"

It seems the Python library used by the Actions Framework cannot digest Java keystores (which makes sense, I guess); however, this requires users to publish certs and keys both in keystores and as PEM files in order to get all components to work. After setting the consumer_config shown above, as documented here: https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka#configure-librdkafka-client, the actions pod seems to run fine (final tests from our data engineers are pending).
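To make the distinction concrete, here is a hedged Python sketch of the PEM-based settings librdkafka expects; the paths and helper name are illustrative, and the resulting dict is what would ultimately be handed to `confluent_kafka.DeserializingConsumer` as in the traceback above:

```python
def build_ssl_consumer_config(bootstrap: str, group_id: str,
                              ca: str, cert: str, key: str) -> dict:
    """Build a librdkafka consumer config using PEM files.

    librdkafka (the C client underneath confluent-kafka-python) cannot read
    Java keystores/truststores, so the SSL material must be PEM-encoded.
    """
    return {
        "bootstrap.servers": bootstrap,
        "group.id": group_id,
        "security.protocol": "SSL",
        "ssl.ca.location": ca,             # CA bundle (replaces the truststore)
        "ssl.certificate.location": cert,  # client cert (replaces the keystore)
        "ssl.key.location": key,           # client private key
    }
```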

Ultimately this leaves us with two questions:

  • What is the intended way to alter the shipped default executor.yaml? Overwriting it via a ConfigMap mount, as done in our case, seems a bit hacky, especially with respect to upgrades of the shipped defaults down the road.
  • Are there plans to unify the certificate and key handling between the Java and Python components? The current approach requires users to maintain these in two forms (Java keystores + PEM-encoded files).
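On the second point: until such unification exists, a JKS keystore can typically be converted to the PEM files librdkafka needs using keytool and openssl. A sketch under stated assumptions (file names and passwords are placeholders; for a real JKS you would first run `keytool -importkeystore -srckeystore keystore.jks -destkeystore demo.p12 -deststoretype PKCS12`; the demo below generates a throwaway PKCS12 store just to illustrate the openssl extraction step):

```shell
set -e
# Demo only: create a throwaway self-signed key pair and pack it into a
# PKCS12 store (stand-in for a keystore converted from JKS via keytool).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -keyout demo-key.pem -out demo-cert.pem 2>/dev/null
openssl pkcs12 -export -inkey demo-key.pem -in demo-cert.pem \
  -out demo.p12 -passout pass:changeit

# Extract the pieces librdkafka wants as PEM files:
openssl pkcs12 -in demo.p12 -passin pass:changeit -nokeys -out client.pem
openssl pkcs12 -in demo.p12 -passin pass:changeit -nocerts -nodes -out key.pem
```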

Let me know if I can help with further insights on such a certificate-based deployment.

FYI - @treff7es @atul-chegg

danielhass avatar Oct 25 '22 13:10 danielhass

FYI: @hsheth2 might have an opinion on this.

szalai1 avatar Oct 26 '22 13:10 szalai1

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Nov 26 '22 02:11 github-actions[bot]

I still think that the points from my comment above need to be sorted out, as they are blocking a proper deployment of DataHub with an external mTLS-enabled Kafka cluster:

  • What is the intended way to alter the shipped default executor.yaml? Overwriting it via a ConfigMap mount, as done in our case, seems a bit hacky, especially with respect to upgrades of the shipped defaults down the road.
  • Are there plans to unify the certificate and key handling between the Java and Python components? The current approach requires users to maintain these in two forms (Java keystores + PEM-encoded files).

Dunno if this is still in the scope of this issue, but I think a maintainer should look into it.

danielhass avatar Nov 26 '22 14:11 danielhass