DataflowTemplates
[Bug]: unable to find valid certification path to requested target
Related Template(s)
BigQuery to Elasticsearch
What happened?
I have an Elasticsearch instance that is reachable by the Dataflow workers. Its certificates are self-signed, and I don't know how to tell the pipeline to ignore the certificate or how to give it one.
Is there a way to skip certificate validation? Or a way to explicitly tell the pipeline where the certificate is?
Thank you in advance
Beam Version
Newer than 2.43.0
Relevant log output
Caused by: javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:844)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:259)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:246)
at com.google.cloud.teleport.v2.elasticsearch.utils.ElasticsearchIO.getBackendVersion(ElasticsearchIO.java:1615)
at com.google.cloud.teleport.v2.elasticsearch.utils.ElasticsearchIO$Write$WriteFn.setup(ElasticsearchIO.java:1368)
at com.google.cloud.teleport.v2.elasticsearch.utils.ElasticsearchIO$Write$WriteFn$DoFnInvoker.invokeSetup(Unknown Source)
at org.apache.beam.sdk.transforms.reflect.DoFnInvokers.tryInvokeSetupFor(DoFnInvokers.java:53)
at org.apache.beam.runners.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.deserializeCopy(DoFnInstanceManagers.java:86)
at org.apache.beam.runners.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.peek(DoFnInstanceManagers.java:68)
at org.apache.beam.runners.dataflow.worker.UserParDoFnFactory.create(UserParDoFnFactory.java:100)
at org.apache.beam.runners.dataflow.worker.DefaultParDoFnFactory.create(DefaultParDoFnFactory.java:75)
at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory.createParDoOperation(IntrinsicMapTaskExecutorFactory.java:267)
at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory.access$000(IntrinsicMapTaskExecutorFactory.java:89)
at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$1.typedApply(IntrinsicMapTaskExecutorFactory.java:186)
at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory$1.typedApply(IntrinsicMapTaskExecutorFactory.java:168)
at org.apache.beam.runners.dataflow.worker.graph.Networks$TypeSafeNodeFunction.apply(Networks.java:67)
at org.apache.beam.runners.dataflow.worker.graph.Networks$TypeSafeNodeFunction.apply(Networks.java:54)
at org.apache.beam.runners.dataflow.worker.graph.Networks.replaceDirectedNetworkNodes(Networks.java:91)
at org.apache.beam.runners.dataflow.worker.IntrinsicMapTaskExecutorFactory.create(IntrinsicMapTaskExecutorFactory.java:128)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:361)
at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:314)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:162)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
In Python, when you connect to Elasticsearch, you can use:
Elasticsearch(hosts=[address], basic_auth=[user, password], verify_certs=False)
and with curl you can use the -k parameter:
curl -k https://<elasticsearch_url>:9200
Is there a way to do the same thing in a Dataflow run?
Or maybe a way to tell it where to get the certificate from?
I dug a little deeper and found that this is the point where the self-signed certificate ends up not being trusted.
Why can't we have this as a parameter?
https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/65b9a15caa0939151386b51e67ecd6d5b997178d/v2/elasticsearch-common/src/main/java/com/google/cloud/teleport/v2/elasticsearch/utils/ElasticsearchIO.java#L321
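For reference, the Java counterpart of `verify_certs=False` / `curl -k` is roughly an `SSLContext` whose trust manager accepts any certificate chain. The following is only a sketch using plain JDK classes; the class name `InsecureSslContexts` and the method `trustAll` are hypothetical and are not part of the template or of ElasticsearchIO. How the resulting context would be handed to the Elasticsearch REST client inside the template is not shown here, and hostname verification is a separate check that this does not touch.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper, for illustration only. It disables certificate
// validation entirely, so it should only ever be enabled through an
// explicit option and never by default.
public final class InsecureSslContexts {
  private InsecureSslContexts() {}

  /** Returns an SSLContext whose trust manager accepts every certificate chain. */
  public static SSLContext trustAll() throws GeneralSecurityException {
    TrustManager[] trustAllManagers = {
      new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
          // Intentionally empty: no client certificate is rejected.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
          // Intentionally empty: no server certificate is rejected.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
          return new X509Certificate[0];
        }
      }
    };
    SSLContext context = SSLContext.getInstance("TLS");
    // No key managers (no client certificate) and the default SecureRandom.
    context.init(null, trustAllManagers, null);
    return context;
  }
}
```

This removes protection against man-in-the-middle attacks, which is why it only makes sense behind an opt-in parameter.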
I think that is fine and something we can do, as long as it comes from an option and is not the default. WDYT? Want to send a PR for that?
Thanks, I created a small PR for this. Not really sure how to test it properly though. Let me know what I can do about it, if there is anything needed
Thank you @salvob41.
Pull request is merged, can we close this?
Ciao @bvolpato, I am afraid the real solution is finding a way to add the certificate inside the workers (as we more or less discussed in the PR). As it stands, the change does not properly solve the case of a "custom" certificate coming from an on-premises Elasticsearch installation :/
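To make the idea of "adding the certificate" concrete, here is a rough sketch of trusting one specific certificate instead of disabling validation. It uses only JDK classes; the class `CustomCaSslContexts` and its method names are hypothetical, and where the certificate bytes would come from on a Dataflow worker is exactly the open question here, so the `InputStream` source is left as an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Hypothetical helper, for illustration only.
public final class CustomCaSslContexts {
  private CustomCaSslContexts() {}

  /** Builds an in-memory trust store holding only the given certificates. */
  public static KeyStore trustStoreOf(Certificate... certs)
      throws GeneralSecurityException, IOException {
    KeyStore trustStore = KeyStore.getInstance(KeyStore.getDefaultType());
    trustStore.load(null, null); // initialise an empty store
    for (int i = 0; i < certs.length; i++) {
      trustStore.setCertificateEntry("custom-ca-" + i, certs[i]);
    }
    return trustStore;
  }

  /** Builds an SSLContext that trusts only the entries of the given store. */
  public static SSLContext fromTrustStore(KeyStore trustStore)
      throws GeneralSecurityException {
    TrustManagerFactory factory =
        TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    factory.init(trustStore);
    SSLContext context = SSLContext.getInstance("TLS");
    context.init(null, factory.getTrustManagers(), null);
    return context;
  }

  /** Reads one X.509 certificate (PEM or DER) and trusts only that certificate. */
  public static SSLContext fromCertificate(InputStream certificateStream)
      throws GeneralSecurityException, IOException {
    Certificate cert =
        CertificateFactory.getInstance("X.509").generateCertificate(certificateStream);
    return fromTrustStore(trustStoreOf(cert));
  }
}
```

Compared with a trust-all context, this keeps validation on and only widens trust to the one self-signed certificate, but it still needs some mechanism to deliver that certificate to every worker.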
I don't know honestly :(
This issue has been marked as stale due to 180 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the issue at any time. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.