KAFKA-2170 [WIP]: Updated Fixes For Windows Platform
During stress testing of kafka 0.10.2.1 on a Windows platform, our group has encountered some issues that appear to be known to the community but not fully addressed by kafka. Using:
https://github.com/apache/kafka/pull/154
as a guide, we have made derived changes to the source code and automated tests such that the "clients" and "core" tests pass for us on Windows and Linux platforms. Our stress tests succeed as well.
This pull request adapts those changes to merge and build with kafka/trunk. The "clients" and "core" tests from kafka/trunk pass on Linux for us with these changes in place, and all tests pass on Windows except:
ConsumerBounceTest (intermittent failures)
TransactionsTest
DeleteTopicTest.testDeleteTopicWithCleaner
EpochDrivenReplicationProtocolAcceptanceTest.offsetsShouldNotGoBackwards
Our intention is to help efforts to further kafka support for the Windows platform. Our changes are the work of engineers from Nexidia building upon the work found in the aforementioned pull request link, and they are contributed to the community per kafka's open source license.
We welcome all feedback and look forward to working with the kafka community.
Matt Briggs
Principal Software Engineer
Nexidia, a NICE Analytics Company
www.nexidia.com
Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/5105/ Test PASSed (JDK 7 and Scala 2.11).
Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/5089/ Test PASSed (JDK 8 and Scala 2.12).
@ijuma This patch does not expose the errors I saw in the release candidate for 0.11.0.0 (it needs a rebase though). However, I personally have not been able to run all tests on Windows with this patch. The core tests always hang after a few hundred of them run (with a few randomly failing).
Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/5729/ Test FAILed (JDK 8 and Scala 2.12).
Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/5743/ Test PASSed (JDK 7 and Scala 2.11).
Thanks @vahidhashemian and @ijuma for taking a look at our patches. With the latest kafka release now out (congrats!), would there be time for a detailed review of the patches? We have found the first two:
https://github.com/apache/kafka/pull/3283/commits/6ee3c167c6e2daa8ce4564d98f9f63967a0efece
https://github.com/apache/kafka/pull/3283/commits/a0cd773a8d89d7df90fc75ce55a46fd8bb93d368
to be essential for us in getting kafka to run robustly on Windows. Our efforts up to this point focused on 0.10.2.1, so we don't expect the patches to fully address 0.11.x on Windows (but they should not introduce any regressions on Linux or Windows).
If we can get some feedback on the viability of our patches, we'd very much like to continue our work and get 0.11.x fully running for us on Windows.
Thanks again!
Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/5854/ Test FAILed (JDK 7 and Scala 2.11).
Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/5839/ Test PASSed (JDK 8 and Scala 2.12).
@nxmbriggs404 Thanks for the PR. It seems to fix some of the issues with running unit tests on Windows. However, I'm still not able to run all unit tests on Windows without an error. In fact, they always hang somewhere in the middle of running the core unit tests (I run gradlew.bat test). Not that this is an issue with your patch, but I was just wondering if you are able to run all the tests without an error or a hang. Thanks.
Hi @vahidhashemian. No, unfortunately I am not able to run all of the unit tests successfully and I do encounter occasional hangs as well. Our original changes were based on the 0.10.2.1 release and they did allow us to successfully run all of the "clients" and "core" unit tests on Windows and Linux. There have been significant changes to the trunk which we have yet to work through, in particular related to new memory-mapped file usage and unit tests that do not fully clean up resources on tear down. I have updated the pull request recently to fix trunk merge conflicts and to ensure compiles are successful and Linux tests pass.
Hi @junrao and @jkreps
Forgive me for pinging you directly, but I'm attempting to get this pull request reviewed and I saw you guys listed as maintainers for the Log subsystem of kafka. The quick summary is that I've had to apply some tactical patches to core kafka in order for us to run 0.10.2.1 robustly on the Windows platform. Essentially the permissive nature of Linux file operations introduces some subtle behavior dependencies that must be identified and worked around in order to adapt to the more restrictive file operations of Windows.
My intention is to communicate what's been observed in order to raise awareness, as well as to receive guidance as to how the patches should be improved to be pull-worthy. If I can get some dialogue going on the latter front, I believe my employer will fund efforts to continue the work needed to bring these patches to trunk.
Thanks!
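For readers unfamiliar with the file-operation difference described in the comment above: on Windows, renaming or deleting a file can fail while a handle or memory mapping to it is still open, whereas Linux generally permits it. The following is a minimal Java sketch of the general "release the handle, then rename" pattern. It is not code from this pull request or from kafka itself; the class name `RenameSketch`, the helper `writeThenRename`, and the file names are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; not taken from the patch under review.
class RenameSketch {

    // Writes data to source, closes the channel, and only then renames to target.
    static Path writeThenRename(Path source, Path target, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(
                source, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(true); // flush content and metadata before closing
        } // the file handle is released here, before the rename is attempted

        // With no open handle left, the rename does not depend on the
        // platform tolerating operations on open files.
        return Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rename-sketch");
        Path source = dir.resolve("example.log");          // hypothetical file name
        Path target = dir.resolve("example.log.renamed");  // hypothetical file name
        Path moved = writeThenRename(source, target, "abc".getBytes(StandardCharsets.UTF_8));
        System.out.println("renamed to " + moved.getFileName());
    }
}
```

The sketch covers only channel handles. It does not address memory-mapped buffers, which keep the file in use until the mapping itself is released.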
@nxmbriggs404 I wanted to give this a bump as I'm currently looking at the same issue. Have you heard anything regarding this PR out of band?
Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/6664/ Test PASSed (JDK 7 and Scala 2.11).
Refer to this link for build results (access rights to CI server needed): https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/6649/ Test PASSed (JDK 8 and Scala 2.12).
@jasonaliyetti Thanks for taking a look at the PR! Unfortunately, no, I haven't heard anything out of band about it. I was really hoping to get some detailed review before attempting to get the trunk further updated, but not much luck so far. It seems to me that advancing first class Windows support is going to take some higher level efforts beyond just PRs, in particular getting a Windows build going in the Apache farm and also drawing core dev attention to the low level file subsystem differences in Windows. Otherwise it seems like it's going to be a struggle keeping pace with trunk changes that unknowingly undermine the Windows experience.
@nxmbriggs404 For what it's worth, I've been digging into the hanging build on Windows a bit. One culprit I'm looking at is the TransactionBounceTest, which seems to be hanging on shutting down one of the brokers it spins up in the JVM. I am wondering if this is just test fragility around having multiple brokers in the JVM on Windows for some reason.
Hi, we are facing the same problem. Is this bugfix expected to be in the next Kafka release? Andreas
We've been digging into this a little bit, and it seems that the hanging build on Windows is due to tests involving transactions hanging. Here's a summary from a co-worker:
The transaction gets stuck when it attempts to complete (either abort or commit). The Abort message is successfully sent to the broker, which moves into the prepare abort stage. At this point the producer has received a response to its abort message, so it considers the transaction aborted and moves on to begin a new transaction. That new transaction hangs and continuously retries with CONCURRENT_TRANSACTIONS errors because the broker hasn't actually finished aborting the previous transaction.
The TransactionCoordinator writes the pre-abort state to the local log, and then attempts to send the abort message to all brokers involved in the transaction through TransactionMarkerChannelManager. This is done using an instance of NetworkClient created in the TransactionMarkerChannelManager. When it goes to send the abort message, it realizes it doesn't have a connection established, so it initiates the connection asynchronously (creating an OP_CONNECT watching SelectionKey) and then re-enqueues the send so it will get completed once the connection is established. The SelectionKey waiting for the connection complete event is never triggered though, so it gets stuck in a loop here.
On the receiving end of this connection (in SocketServer.scala) I see the connection get received, accepted, and added to the active connections list. But the client end still never gets its connection key triggered. I tried running the broker in a separate JVM to see if it might be an issue related to Broker, Producer, and Zookeeper all running in the same JVM in the test, but that had no effect. There's not a general issue with the connection logic in NetworkClient because I see other connections successfully getting established and used.
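To make the asynchronous connect step in the summary above concrete, here is a minimal, self-contained Java NIO sketch of a non-blocking connect that registers for OP_CONNECT and waits for the key to fire. This is not kafka's NetworkClient or Selector code; the class name `ConnectSketch` and the method `connectWithTimeout` are hypothetical. The sketch adds a deadline (an assumption of this example, not something the summary describes) so that a key that never fires shows up as a `false` return rather than an endless loop.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical illustration only; not kafka's networking code.
class ConnectSketch {

    // Returns true if the non-blocking connect completes within timeoutMs.
    static boolean connectWithTimeout(InetSocketAddress target, long timeoutMs) throws IOException {
        try (Selector selector = Selector.open();
             SocketChannel channel = SocketChannel.open()) {
            channel.configureBlocking(false);
            // A non-blocking connect may finish immediately (e.g. on loopback).
            if (channel.connect(target)) {
                return true;
            }
            // Otherwise ask the selector to signal when the connect can be finished.
            channel.register(selector, SelectionKey.OP_CONNECT);
            long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
            while (true) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    return false; // the OP_CONNECT key never fired in time
                }
                selector.select(remainingMs);
                for (SelectionKey key : selector.selectedKeys()) {
                    // finishConnect() throws IOException if the connect failed.
                    if (key.isConnectable() && channel.finishConnect()) {
                        return true;
                    }
                }
                selector.selectedKeys().clear();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Local listener on an ephemeral port so the example needs no external server.
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            InetSocketAddress address = (InetSocketAddress) server.getLocalAddress();
            System.out.println("connected: " + connectWithTimeout(address, 5000));
        }
    }
}
```

In the situation described in the summary, the equivalent of the `isConnectable()` branch is never reached on the client side even though the server has accepted the connection.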
@nxmbriggs404 Replying here because I didn't want to totally hijack the other PR.
I was under the impression that the same ReplicaManager code was being used under the hood. Our current thinking is that the hanging tests may be due to a bug around managing connections in the TransactionMarkerChannelManager (see my above comment). We will continue to investigate, but I'll keep an eye on this in case you find anything. If we can determine the issue I'll make sure to let you know.
I am also experiencing a log retention issue on the Windows platform. I just found this pull request but it seems like there has been no progress on it since August 2017.
In order to revive the story I opened a new pull request which contains finished test adaptations and the latest trunk merged into it.
I hope together we can finally fix and close the old KAFKA-1194 issue, which affects all kafka versions on Windows.
Out of curiosity, why can't this PR be merged?
Can this be rebased? Also, if it's no longer WIP it should be removed from the title. Thanks!
Anything? Please do merge :)
Hi, I have created a new pull request. See #12331. It fixes some issues mentioned in this pull request.