PubSub to Splunk template fails after latest update
Our Splunk Dataflow job started failing after being cloned and restarted today. I noticed the template version had been bumped to 2021-09-13-00_RC00.
Forcing a previous version of the template (2021-03-08-01_RC00) fixed the issue. This is likely the version we were using before the job was restarted.
Stack trace:
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 27; received: 0)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
at com.google.api.client.http.HttpResponse.ignore(HttpResponse.java:427)
at com.google.api.client.http.HttpResponse.disconnect(HttpResponse.java:441)
at com.google.cloud.teleport.splunk.SplunkEventWriter.flush(SplunkEventWriter.java:266)
at com.google.cloud.teleport.splunk.SplunkEventWriter.processElement(SplunkEventWriter.java:184)
It looks like a commit was shipped today to prevent this error from disrupting the job: https://github.com/GoogleCloudPlatform/DataflowTemplates/commit/d7a1d3ce5f3301ef6c79b4d1b40b2c2ff5700cbd
Therefore this might be a non-issue for the next version, but I still wanted to raise the flag.
+1 I'm getting the same message after restarting our Splunk Dataflow job today (I just cloned the old one) because I realized it had a public IP and shouldn't have. We're still getting logs, but they're coming in at much different intervals than expected, which triggered some PagerDuty alerts.
We have a watchdog alert that checks in once every two minutes from two hosts. As you can see, the chart goes crazy right after I deployed the new template. I'm not even sure how to roll back with Dataflow templates since they don't seem to be versioned, so if anyone can help out there, I'd appreciate it.
All the templates (including old ones) are available in the public dataflow-templates GCS bucket. The latest release happened on 09/20/2021. To find all Splunk template releases in the current year:
gsutil ls "gs://dataflow-templates/2021-*" | grep Splunk$
If you want to pick up a different version of the template in the UI, choose the custom template option under the Dataflow Template dropdown and provide the appropriate template path, e.g. dataflow-templates/2021-09-13-00_RC00/Cloud_PubSub_to_Splunk for a previous version.
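If you prefer launching from the CLI, here is a minimal sketch using gcloud with an explicitly pinned template path (the job name, region, and parameter values are placeholders; check the template documentation for the exact parameters your version expects):
gcloud dataflow jobs run pubsub-to-splunk-pinned \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates/2021-09-13-00_RC00/Cloud_PubSub_to_Splunk \
  --parameters=inputSubscription=projects/MY_PROJECT/subscriptions/MY_SUB,url=https://splunk.example.com:8088,token=MY_HEC_TOKEN,outputDeadletterTopic=projects/MY_PROJECT/topics/splunk-deadletter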
Thanks @prathapreddy123 I'll try that.
Update: Ok that worked, but the 09/13 release was still busted with the same error. I ended up going back to the 2021-08-03 release which is what was working before. It's unsettling that this could have been so obviously broken for so long without anyone seeming to notice 😞
2021-08-30-00_RC00 is the newest template we can get to work; 2021-09-13 breaks it.
This also coincides with the underlying Beam SDK changing from 2.29 to 2.32, AFAIK.
With no other changes apparent, the 2021-10-04-00_RC00 template seems to work better, in that it works at all now, but we're still seeing a slightly different variation of that same error being logged:
2021-10-12T17:20:24.269Z Error trying to disconnect from Splunk: Premature end of Content-Length delimited message body (expected: 27; received: 0) Messages should still have either been published or prepared for error handling, but there might be a connection leak. Stack Trace:
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:101)
at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:142)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:172)
at com.google.api.client.http.HttpResponse.ignore(HttpResponse.java:427)
at com.google.api.client.http.HttpResponse.disconnect(HttpResponse.java:441)
at com.google.cloud.teleport.splunk.SplunkEventWriter.flush(SplunkEventWriter.java:273)
at com.google.cloud.teleport.splunk.SplunkEventWriter.processElement(SplunkEventWriter.java:184)
at com.google.cloud.teleport.splunk.AutoValue_SplunkEventWriter$DoFnInvoker.invokeProcessElement(Unknown Source)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:232)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:188)
at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:339)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:212)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:163)
at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1435)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:165)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1111)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
It seems it may relate to the change in https://github.com/GoogleCloudPlatform/DataflowTemplates/commit/d7a1d3ce5f3301ef6c79b4d1b40b2c2ff5700cbd
Folks, as noted above, the issue was unfortunately introduced in the 2021-09-13-00_RC00 release when we upgraded the underlying HTTP Java client library, which changed how the HTTP response disconnect is implemented.
The issue was mitigated (not fixed) as of the 2021-09-27-00_RC00 release by safely catching these errors (and logging warnings instead) in order to ensure log delivery is uninterrupted. Yes, that means there will be a warning message per batched request (not per log message). That warning can be safely ignored until we get a fix from the dependent library (HTTP client) that we can incorporate in an upcoming release. If you wish to reduce worker log verbosity, see the Dataflow docs on setting worker log levels, which include setting the log level for a specific class; see the example below.
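For example, a minimal sketch of the worker log level pipeline options for the Java Dataflow runner, assuming you build and stage the pipeline (or your own template build) yourself; the class name is the SplunkEventWriter seen in the stack traces above:
--defaultWorkerLogLevel=WARN
--workerLogLevelOverrides={"com.google.cloud.teleport.splunk.SplunkEventWriter":"ERROR"}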
For production workloads, and especially given current release cadence (almost once per week), it's highly recommended to:
- Test new template version end-to-end in a test environment before a staged roll-out to prod. Note we continue to invest in our template integration tests, yet we encourage you to test e2e with your own specific pipeline settings and potentially custom UDFs.
- Pin to a tested working version (e.g. gs://dataflow-templates/2021-09-27-00_RC00/Cloud_PubSub_to_Splunk rather than gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk) to minimize potential outages and data loss.
Thanks for reporting this. For future potential issues, consider also filing a Google Cloud support case for a faster path to resolution.
Thanks for pointing out this problem and the workaround. Is there any progress on a permanent fix?
When I try to use the old template, unfortunately I get connectivity issues related to TLS; the job logs show that the sender and receiver do not have shared ciphers. From the GCP side:
Error writing to Splunk: Received fatal alert: handshake_failure
From the Splunk side:
WARN HttpListener - Socket error from XXX while idling: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher
Interestingly, on one of the job shells, there are common ciphers:
$ openssl s_client -connect SPLUNKSERVER -tls1_2
[...]
New, TLSv1.2, Cipher is ECDHE-RSA-AES256-GCM-SHA384
[...]
$ openssl ciphers -s
TLS_AES_256_GCM_SHA384:[...]
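A quick way to narrow this down (a hedged sketch; the HEC port 8088 is an assumption, and the cipher name is taken from the output above) is to force that single cipher and see whether the handshake still succeeds:
$ openssl s_client -connect SPLUNKSERVER:8088 -tls1_2 -cipher ECDHE-RSA-AES256-GCM-SHA384 </dev/null
If the forced cipher succeeds here but the Dataflow worker still fails with "no shared cipher", the mismatch is more likely in the cipher list offered by the worker's Java client than in the server configuration (e.g. Splunk's cipherSuite setting).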
The dependent HTTP Java client library was recently upgraded to 1.40.1, which includes the fix. A new template release is forthcoming. Will update here.
@erhanX could you file a new issue for the error you're seeing? Please include the template version, (non-sensitive) parameter values, and the specific SSL cipher or cipher suite used for the Splunk server certificate. For a fast response, consider also filing a support ticket from your Cloud Console if you're a GCP customer.
Thank you very much for your help. I will wait for the update. We got the latest version partially working by setting the maximum number of worker nodes to 1. I had no solution for the older version and its error, but I think it is obsolete for my case now anyway.
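In case it helps others, a hedged sketch of capping the worker count when launching the template (job name, region, and parameters are placeholders):
gcloud dataflow jobs run pubsub-to-splunk \
  --region=us-central1 \
  --max-workers=1 \
  --gcs-location=gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk \
  --parameters=<template parameters>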
Is this update included in the latest release 2022-02-07-00_RC01 or still forthcoming?
Yes. The release contains fixes currently in the repository for all released templates. I've updated the release notes to mention this.
Since there are a few years' worth of unpublished releases, I'm not sure it's feasible to document every change specifically. Going forward, we should be capturing each change in the relevant release notes.
Thanks for the good news and your work! Is this already available in gs://dataflow-templates/latest/Cloud_PubSub_to_Splunk ?
Yes, it should be.
The problem has been resolved.
--
This issue has been stale for some time now. Please reopen it if there is a follow up or any related questions.