KAFKA-1194: changes needed to run on Windows
Hi,
I have come across this comment by Maksim Zinal.
I've created a quick patch which seems to fix the issue on my system and on the Kafka version I use (2.8.0).
I believe the reason for the lock failures is that the RandomAccessFile Java API is used to open files in some cases, while FileChannel.open() is used in others. When files are opened with RandomAccessFile under Windows, the FILE_SHARE_DELETE flag is not set, which leads to "access denied" errors when trying to rename or delete the open files. FileChannel.open() sets FILE_SHARE_DELETE by default, as I verified on JDK 8 and 11.
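To illustrate the difference described above, here is a minimal sketch (the class and method names are mine, not Kafka's). The key point is that FileChannel.open() routes through NIO.2, which on Windows passes FILE_SHARE_DELETE, whereas a channel obtained from RandomAccessFile does not:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class OpenForReadWrite {

    // Opening through FileChannel.open() lets the JDK set FILE_SHARE_DELETE
    // on Windows, so the file can still be renamed or deleted while the
    // channel is open.
    static FileChannel open(Path path) throws IOException {
        return FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE);
    }

    // By contrast, new RandomAccessFile(file, "rw").getChannel() opens the
    // underlying Windows handle without FILE_SHARE_DELETE, so a later
    // rename or delete of the same file fails with "access denied".
}
```

On Linux and macOS the two open paths behave the same for rename/delete purposes, which is why the problem only shows up on Windows.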
Here's the link to the branch based on tag 2.8.0: https://github.com/zinal/kafka/tree/2.8.0_KAFKA-1194
Here are the exact changes implemented: https://github.com/zinal/kafka/compare/2.8.0...zinal:2.8.0_KAFKA-1194 (plus jcenter and grgit stuff needed to run the build).
This change addresses some "access denied" errors which occur on Windows. KAFKA-1194, KAFKA-2427, KAFKA-6059, KAFKA-6188, KAFKA-6200, KAFKA-7575, KAFKA-8145, KAFKA-9458
Stack Overflow questions 45141541, 45599625, 48114040, 50755827, 67555122
Testing: I deployed Kafka 2.8.1 with the original patch provided by Maksim Zinal in January on ~50 Windows servers. Each Kafka instance has only one queue. Several processes per Kafka instance write their logs into the queue, and one process reads the queue and processes the data.
It seems that the issues mentioned on Stack Overflow by Windows users are gone.
I am currently testing Kafka 3.3 with the latest version of the patch, which was modified according to @divijvaidya's requests.
Kafka 3.3 log files
You can see in the logs that Kafka managed to delete timeindex files.
Thank you for your time.
Thank you for your help, I really appreciate it.
$ ./jmh-benchmarks/jmh.sh FileRecordsBenchmark
Benchmark Mode Cnt Score Error Units
FileRecordsBenchmark.testOpenChannelAfter thrpt 25 0.333 ± 0.004 ops/ms
FileRecordsBenchmark.testOpenChannelBefore thrpt 25 0.001 ± 0.001 ops/ms
JMH benchmarks done
@divijvaidya, thank you for your help. JDK 17 and Scala 2.13 build and tests finished successfully. Can we now request a committer for review on this PR?
@divijvaidya thank you for your help. I updated the summary of the PR.
@mimaison perhaps you might be interested in looking at this one? I have done an initial review.
I'm afraid I have no experience running Kafka on Windows so I'm not familiar with the issues on that platform.
@ijuma, would you like to take a look at this, or perhaps tag a committer who would be best suited to look into it? (Note that we can review this PR through the lens of changing the file preallocation logic to use the new Java NIO.2 APIs. The new APIs handle Windows better, which is nice.)
I also had a problem with KRaft, which did not start on Windows. The latest commit https://github.com/apache/kafka/pull/12331/commits/f7cc920771735576d9cfba2afe6f26fdcfb2ccd4 fixes it. The commit https://github.com/apache/kafka/pull/12331/commits/77ae23d816ea1a30a3ec970b7dfe77fd35f7fe00 most likely fixes KAFKA-7575 and KAFKA-2427, but I will have to do more testing to confirm that.
@MPeli what is the latest on this one?
This is a really important fix for Windows. Can someone help move this along, please?
@sjetha you should probably send an email to the [email protected] mailing list, explaining the urgency and asking for someone to take a look at this.
@MPeli do the tests pass for you? I ran the unitTest task on Windows 10 and some of them are failing.
test-results.zip
@MPeli I ran into an AccessDenied exception again when a topic is deleted. This causes a crash the same way it used to when deleting messages. Could the same fix be applied to topic deletion as well?
ERROR Error while renaming dir for contentIndexing-0 in log dir d:\3rdParty\kafka-logs (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.nio.file.AccessDeniedException: d:\3rdParty\kafka-logs\contentIndexing-0 -> d:\3rdParty\kafka-logs\contentIndexing-0.57b7fab3260c43d29cb7941747b5e3a5-delete
    at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89)
    at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
    at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:403)
    at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:293)
    at java.base/java.nio.file.Files.move(Files.java:1432)
    at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:951)
    at kafka.log.LocalLog.$anonfun$renameDir$2(LocalLog.scala:111)
    at kafka.log.LocalLog.renameDir(LocalLog.scala:711)
    at kafka.log.UnifiedLog.$anonfun$renameDir$2(UnifiedLog.scala:627)
    at kafka.log.UnifiedLog.renameDir(UnifiedLog.scala:1729)
    at kafka.log.LogManager.asyncDelete(LogManager.scala:1151)
    at kafka.log.LogManager.$anonfun$asyncDelete$3(LogManager.scala:1186)
    at scala.Option.foreach(Option.scala:437)
    at kafka.log.LogManager.$anonfun$asyncDelete$2(LogManager.scala:1184)
    at kafka.log.LogManager.$anonfun$asyncDelete$2$adapted(LogManager.scala:1182)
    at scala.collection.mutable.HashSet$Node.foreach(HashSet.scala:435)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:361)
    at kafka.log.LogManager.asyncDelete(LogManager.scala:1182)
    at kafka.server.ReplicaManager.stopPartitions(ReplicaManager.scala:503)
    at kafka.server.ReplicaManager.stopReplicas(ReplicaManager.scala:440)
    at kafka.server.KafkaApis.handleStopReplicaRequest(KafkaApis.scala:304)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:185)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:145)
    at java.base/java.lang.Thread.run(Thread.java:833)
    Suppressed: java.nio.file.AccessDeniedException: d:\3rdParty\kafka-logs\contentIndexing-0 -> d:\3rdParty\kafka-logs\contentIndexing-0.57b7fab3260c43d29cb7941747b5e3a5-delete
        at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89)
        at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
        at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:317)
        at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:293)
        at java.base/java.nio.file.Files.move(Files.java:1432)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:948)
        ... 18 more
I'm going to leave a -1 here because there are a number of issues in the patch and it's not ready to commit yet. I would encourage you to split this patch into several parts and clearly explain what issue each one solves and how.
At a high level, the challenge for us is that Windows does things quite differently in some cases, but we don't want to degrade performance on all the other platforms. So a lot of thought is probably required here. Ideally other platforms would not be impacted at all.
Agree that this change should be ideally confined to the Windows platform w/o any code change for others.
@Hangleton : Agree that this change should be ideally confined to the Windows platform w/o any code change for others.
Yeah. To be totally clear, it's OK to change the main code paths, but it really raises the stakes because then we want to understand that there haven't been performance or correctness regressions.
@MPeli: I really do want this to get in eventually but in its current form it's very difficult to review. As a first step, how about splitting out just the FileRecords.java change, for example. If you want to do preallocation differently on windows, then explain how it will be different and why, and have a PR that just does that. Since preallocation seems to be working fine on other platforms this particular change could be put behind an isOsWindows check.
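A minimal sketch of what "behind an isOsWindows check" could look like, under my own assumptions (the Preallocate class and isOsWindows helper are illustrative, not Kafka's actual code): the existing setLength-based preallocation is kept for other platforms, and only Windows takes a separate branch.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

public class Preallocate {

    // Hypothetical helper: detect Windows from the os.name system property.
    static boolean isOsWindows() {
        return System.getProperty("os.name").toLowerCase().startsWith("windows");
    }

    // Sketch of the reviewer's suggestion: the existing preallocation path
    // is untouched on other platforms; only Windows takes a different branch.
    static void preallocate(Path file, long initFileSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            if (!isOsWindows()) {
                // Existing behaviour on Linux/macOS: grow the file up front.
                raf.setLength(initFileSize);
            } else {
                // Windows-specific strategy (e.g. NIO.2 FileChannel-based
                // preallocation, as discussed in this PR) would go here.
            }
        }
    }
}
```

Gating the change this way makes the performance and correctness review for non-Windows platforms trivial, since their code path is byte-for-byte unchanged.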
Can someone help Kafka on Windows, please? Two years have passed!
@Hangleton @cmccabe I encountered a bug while using KRaft where Kafka would mark .checkpoint files in the __cluster_metadata-0 folder for deletion and then fail to delete them because they are read-only. The root cause is that it tried to delete them with the Files.deleteIfExists method, which cannot delete read-only files on Windows. I made changes to KafkaMetadataLog.scala and Snapshots.java, where I created a new method for this purpose. path.toFile().setWritable(true) works if there is no file at the path on Windows, but I'm not sure how it behaves on Linux, so I can create a new unit test for it or put it behind an isOsWindows check. Should I create a new PR for this?
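For reference, a minimal sketch of the workaround described above (the ForceDelete class name is mine; the actual change lives in KafkaMetadataLog.scala and Snapshots.java): clear the read-only attribute before attempting the delete.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ForceDelete {

    // Files.deleteIfExists cannot remove a read-only file on Windows, so
    // clear the read-only attribute first. File.setWritable simply returns
    // false when no file exists at the path, which is harmless here because
    // deleteIfExists then returns false as well.
    static boolean deleteIfExists(Path path) throws IOException {
        path.toFile().setWritable(true);
        return Files.deleteIfExists(path);
    }
}
```

On POSIX systems setWritable(true) sets the owner-write permission bit, which should be a no-op for files Kafka itself created, but as noted above this deserves its own unit test or an isOsWindows guard.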
Hi Julius (@AB027PS) - just to clarify, this isn't the same as the AccessDeniedException addressed by this CR right?
I work with Martin Pelak and he suggested that I add my change to this PR, but my change is unrelated to the rest of this PR. So this doesn't address the errors mentioned in the first comment if that's what you're asking. I should have been clearer about this.
Who can resolve these test failures?