datahub
datahub copied to clipboard
acryl-datahub-actions pod is not able to connect to AWS MSK with TLS only brokers
We are seeing messages like this in logs of the pod of acryl-datahub-actions
%6|1661937385.394|FAIL|rdkafka#consumer-1| [thrd:<broker-1>:9094/b]: b-1.<broker-1>:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY, 3 identical error(s) suppressed)
I have these settings under global for kafka
kafka:
bootstrap:
server: ${kafka_bootstrap_servers}
zookeeper:
server: ${kafka_zookeeper_servers}
schemaregistry:
url: "http://datahub-prerequisites-cp-schema-registry:8081"
partitions: 3
replicationFactor: 3
springKafkaConfigurationOverrides:
security.protocol: SSL
kafkastore.security.protocol: SSL
ssl.protocol: TLS
@atul-chegg, please, can you elaborate more on what you tried to do when you got this issue? How I would be able to reproduce it?
Hi @treff7es We have not tried anything related to acryl-datahub-actions yet. I show these errors in container. Although, container is running and not crashing.
These container log messages suggested that it is not able to connect to Kafka. That's why I reported this issue.
We are using AWS MSK for Kafka. Endpoint is unauthenticated but available only on SSL.

To reproduce this issue, maybe, you can create similar setup in AWS MSK.
I will try to use acryl-datahub-actions to see if it is really breaking any functionality or just some false error messages.
Here are the some more logs.
2022/09/14 00:23:21 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:23 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:25 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:27 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:29 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:31 Problem with request: Get "http://datahub-datahub-gms:443/health": dial tcp 172.20.46.42:443: connect: connection refused. Sleeping 1s
2022/09/14 00:23:33 Received 200 from http://datahub-datahub-gms:443/health
No user action configurations found. Not starting user actions.
[2022-09-14 00:23:34,679] INFO {datahub_actions.cli.actions:68} - DataHub Actions version: unavailable (installed editable via git)
%6|1663115015.318|FAIL|rdkafka#consumer-1| [thrd:b-2.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-2.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY)
[2022-09-14 00:23:35,436] INFO {datahub_actions.cli.actions:98} - Action Pipeline with name 'ingestion_executor' is now running.
%6|1663115015.441|FAIL|rdkafka#consumer-1| [thrd:b-1.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-1.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY)
%6|1663115015.515|FAIL|rdkafka#consumer-1| [thrd:b-2.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-2.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%6|1663115015.681|FAIL|rdkafka#consumer-1| [thrd:b-1.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-1.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%6|1663115016.317|FAIL|rdkafka#consumer-1| [thrd:b-3.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/b]: b-3.testdatahub.kbb3mi.c14.kafka.us-west-2.amazonaws.com:9094/bootstrap: Disconnected while requesting ApiVersion: might be caused by incorrect security.protocol configuration (connecting to a SSL listener?) or broker version is < 0.10 (see api.version.request) (after 1ms in state APIVERSION_QUERY)
Thanks for the super detailed comment. Let me dig into this.
@atul-chegg for datahub-actions, you have to set these properties in the recipe. The above config is for gms server. -> https://github.com/acryldata/datahub-actions/blob/main/docs/sources/kafka-event-source.md
@treff7es we ran into the same issue using a on-prem Confluent cluster with mTLS auth active.
We found the packaged ingestion_executor
config in the datahub-actions container in the following path: /etc/datahub/actions/system/conf/executor.yaml
. However this config doesn't include the necessary environment substitutions mentioned in the link provided by you above.
This issue was previously spotted here: https://github.com/datahub-project/datahub/issues/5706
We worked around this shortcoming with the following Helm values which overwrites the default bundled configuration with an custom one provided as ConfigMap:
acryl-datahub-actions:
[...]
extraVolumes:
- configMap:
defaultMode: 292
name: datahub-executor
name: datahub-custom-actions-config
extraVolumeMounts:
- mountPath: /etc/datahub/actions/system/conf
name: datahub-custom-actions-config
However this workaround doesn't seem to be documented anywhere isn't it?
We added the following values to the default configuration shipped with the datahub-actions container:
#[source.config.connection]
consumer_config:
security.protocol: ${KAFKA_PROPERTIES_SECURITY_PROTOCOL:-PLAINTEXT}
ssl.keystore.location: ${KAFKA_PROPERTIES_SSL_KEYSTORE_LOCATION:-/mnt/certs/keystore}
ssl.truststore.location: ${KAFKA_PROPERTIES_SSL_TRUSTSTORE_LOCATION:-/mnt/certs/truststore}
ssl.keystore.password: ${KAFKA_PROPERTIES_SSL_KEYSTORE_PASSWORD:-keystore_password}
ssl.key.password: ${KAFKA_PROPERTIES_SSL_KEY_PASSWORD:-keystore_password}
ssl.truststore.password: ${KAFKA_PROPERTIES_SSL_TRUSTSTORE_PASSWORD:-truststore_password}
Afterwards we sadly ran into another error we can't yet explain, any thoughts on that one?
---- (full traceback above) ----
File "/usr/local/lib/python3.9/site-packages/datahub_actions/pipeline/pipeline_util.py", line 67, in create_event_source
event_source_instance = event_source_class.create(event_source_config, ctx)
File "/usr/local/lib/python3.9/site-packages/datahub_actions/plugin/source/kafka/kafka_event_source.py", line 140, in create
return cls(config, ctx)
File "/usr/local/lib/python3.9/site-packages/datahub_actions/plugin/source/kafka/kafka_event_source.py", line 120, in __init__
self.consumer: confluent_kafka.Consumer = confluent_kafka.DeserializingConsumer(
File "/usr/local/lib/python3.9/site-packages/confluent_kafka/deserializing_consumer.py", line 103, in __init__
super(DeserializingConsumer, self).__init__(conf_copy)
KafkaException: KafkaError{code=_INVALID_ARG,val=-186,str="Java TrustStores are not supported, use `ssl.ca.location` and a certificate file instead. See https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka for more information."}
We digged into this further with a little help from @Masterchen09 and the links from the error message above we came up with the following workaround in the executor.yaml
config:
#[source.config.connection]
consumer_config:
ssl.ca.location: "/mnt/datahub/certs/ca.pem"
ssl.certificate.location: "/mnt/datahub/certs/client.pem"
ssl.key.location: "/mnt/datahub/certs/key.pem"
It seems like the Python library used by the Actions Framework is not able to digest Java keystores (which makes sense I guess), however this requires users to publish certs and keys in keystores and as pem files in order to get all components to work. After setting the above shown consumer_config
as documented here: https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka#configure-librdkafka-client the actions pod seems to run fine (final tests from your data engineers pending).
Ultimately this leaves us with two questions:
- what is the intended way to alter the shipped and default
executor.yaml
as an overwrite via a ConfigMap mount as done in our case seems a bit hacky especially with respect to upgrades of the shipped defaults down the road. - are there plans to unify the certificate and key handling between the Java and Python components? The current way requires users to maintain these in two forms (Java Keystores + PEM encoded files).
Let me know if I can help with further insights on such an certificate based deployment.
FYI - @treff7es @atul-chegg
fyi: @hsheth2 might have opinion on this.
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io
I sure think that the points from my comment above need to be sorted out as they are blocking a proper deployment of DataHub with an external mTLS enabled Kafka cluster:
- what is the intended way to alter the shipped and default executor.yaml as an overwrite via a ConfigMap mount as done in our case seems a bit hacky especially with respect to upgrades of the shipped defaults down the road.
- are there plans to unify the certificate and key handling between the Java and Python components? The current way requires users to maintain these in two forms (Java Keystores + PEM encoded files).
Dunno if this is still in the scope of this issue but a maintainer should look into this I think.