kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-1194: changes needed to run on Windows

Open MPeli opened this issue 3 years ago • 7 comments

Hi,

I have come across this comment by Maksim Zinal.

I've created a quick patch which seems to fix the issue on my system and on the Kafka version I use (2.8.0).

I believe that the reason of lock failures is the use of RandomAccessFile Java API to open the files in some cases, while in other cases FileChannel.open() is used instead. When opening files with RandomAccessFile under Windows, FILE_SHARE_DELETE flag is not set, which leads to "access denied" errors when trying to rename or delete the open files. FileChannel.open() sets the FILE_SHARE_DELETE by default, as I checked on JDK 8 and 11.

Here's the link to the branch based on tag 2.8.0: https://github.com/zinal/kafka/tree/2.8.0_KAFKA-1194

Here are the exact changes implemented: https://github.com/zinal/kafka/compare/2.8.0...zinal:2.8.0_KAFKA-1194 (plus jcenter and grgit stuff needed to run the build).

This change addresses some "access denied" errors which occur on Windows. KAFKA-1194, KAFKA-2427, KAFKA-6059, KAFKA-6188, KAFKA-6200, KAFKA-7575, KAFKA-8145, KAFKA-9458

Stackowerflow 45141541, 45599625, 48114040, 50755827, 67555122

Testing I deployed Kafka 2.8.1. with the original patch provided by Maksim Zinal in January on ~50 Windows servers. Each Kafka instance has only one queue. Several processes per Kafka instance write their logs in to the queue and one process reads the queue and processed the data.

It seems that these issues mentioned on stackowerflow by Windows users are gone.

I am currently testing Kafka 3.3. with the latest version of the patch which was modified according to @divijvaidya requests.

Kafka 3.3. log files

You can see in the logs that Kafka managed to delete timeindex files.

Thank you for your time.

MPeli avatar Jun 22 '22 12:06 MPeli

Thank you for your help, I really appreciate it.

$ ./jmh-benchmarks/jmh.sh FileRecordsBenchmark

Benchmark                                    Mode  Cnt  Score    Error   Units
FileRecordsBenchmark.testOpenChannelAfter   thrpt   25  0.333 ▒  0.004  ops/ms
FileRecordsBenchmark.testOpenChannelBefore  thrpt   25  0.001 ▒  0.001  ops/ms
JMH benchmarks done

MPeli avatar Jun 29 '22 20:06 MPeli

@divijvaidya, thank you for your help. JDK 17 and Scala 2.13 build and tests finished successfully. Can we now request a committer for review on this PR?

MPeli avatar Jul 06 '22 15:07 MPeli

@divijvaidya thank you for your help. I updated the summary of the PR.

MPeli avatar Jul 08 '22 13:07 MPeli

@mimaison perhaps you might be interested in looking at this one? I have done an initial review.

divijvaidya avatar Jul 19 '22 09:07 divijvaidya

I'm afraid I have no experience running Kafka on Windows so I'm not familiar with the issues on that platform.

mimaison avatar Jul 19 '22 10:07 mimaison

@ijuma, would like to take a look at this or perhaps tag a committer who would be most suited to look into this one? (note that we can review this PR from the lens of change the file preallocation logic to use the new Java NIO.2 APIs. The new APIs are better at handling Windows OS which is nice.)

divijvaidya avatar Jul 19 '22 10:07 divijvaidya

I also had a problem with Kraft which did not start on Windows. The latest commit https://github.com/apache/kafka/pull/12331/commits/f7cc920771735576d9cfba2afe6f26fdcfb2ccd4 fixes it. The commit https://github.com/apache/kafka/pull/12331/commits/77ae23d816ea1a30a3ec970b7dfe77fd35f7fe00 most likely fixes KAFKA-7575 and KAFKA-2427, but I will have to do more testing to confirm that.

MPeli avatar Jul 19 '22 15:07 MPeli

@MPeli what is the latest on this one?

divijvaidya avatar Dec 07 '22 11:12 divijvaidya

This is a really important fix for Windows. Can someone help move this along, please?

fasterinnerlooper avatar Jan 12 '23 18:01 fasterinnerlooper

@sjetha you should probably send an email to the [email protected] mailing list, explaining the urgency and asking for someone to take a look at this.

divijvaidya avatar Jan 24 '23 11:01 divijvaidya

@MPeli do tests pass for you? I ran unitTest task on Windows 10 and some are failing test-results.zip

chester89 avatar Mar 18 '23 19:03 chester89

@MPeli I found AccessDenied exception again when a topic is deleted. This causes a crash same way it used to while deleting message. Could the same fix be applied to delete topic as well?

ERROR Error while renaming dir for contentIndexing-0 in log dir d:\3rdParty\kafka-logs (org.apache.kafka.storage.internals.log.LogDirFailureChannel) java.nio.file.AccessDeniedException: d:\3rdParty\kafka-logs\contentIndexing-0 -> d:\3rdParty\kafka-logs\contentIndexing-0.57b7fab3260c43d29cb7941747b5e3a5-delete at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89) at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:403) at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:293) at java.base/java.nio.file.Files.move(Files.java:1432) at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:951) at kafka.log.LocalLog.$anonfun$renameDir$2(LocalLog.scala:111) at kafka.log.LocalLog.renameDir(LocalLog.scala:711) at kafka.log.UnifiedLog.$anonfun$renameDir$2(UnifiedLog.scala:627) at kafka.log.UnifiedLog.renameDir(UnifiedLog.scala:1729) at kafka.log.LogManager.asyncDelete(LogManager.scala:1151) at kafka.log.LogManager.$anonfun$asyncDelete$3(LogManager.scala:1186) at scala.Option.foreach(Option.scala:437) at kafka.log.LogManager.$anonfun$asyncDelete$2(LogManager.scala:1184) at kafka.log.LogManager.$anonfun$asyncDelete$2$adapted(LogManager.scala:1182) at scala.collection.mutable.HashSet$Node.foreach(HashSet.scala:435) at scala.collection.mutable.HashSet.foreach(HashSet.scala:361) at kafka.log.LogManager.asyncDelete(LogManager.scala:1182) at kafka.server.ReplicaManager.stopPartitions(ReplicaManager.scala:503) at kafka.server.ReplicaManager.stopReplicas(ReplicaManager.scala:440) at kafka.server.KafkaApis.handleStopReplicaRequest(KafkaApis.scala:304) at kafka.server.KafkaApis.handle(KafkaApis.scala:185) at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:145) at java.base/java.lang.Thread.run(Thread.java:833) Suppressed: java.nio.file.AccessDeniedException: d:\3rdParty\kafka-logs\contentIndexing-0 -> d:\3rdParty\kafka-logs\contentIndexing-0.57b7fab3260c43d29cb7941747b5e3a5-delete at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89) at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103) at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:317) at java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:293) at java.base/java.nio.file.Files.move(Files.java:1432) at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:948) ... 18 more

sandybis avatar May 04 '23 09:05 sandybis

I'm going to leave a -1 here because there are a number of issues in the patch and it's not ready to commit yet. I would encourage you to split this patch into several parts and clearly explain what issue each one solves and how.

From the high level, the challenge for us is that Windows does things quite differently in some cases, but we don't want to degrade performance on all the other platforms. So a lot of thought is probably required here. Ideally other platforms would not be impacted at all.

cmccabe avatar May 04 '23 16:05 cmccabe

Agree that this change should be ideally confined to the Windows platform w/o any code change for others.

Hangleton avatar May 04 '23 16:05 Hangleton

@Hangleton : Agree that this change should be ideally confined to the Windows platform w/o any code change for others.

Yeah. To be totally clear, it's OK to change the main code paths, but it really raises the stakes because then we want to understand that there haven't been performance or correctness regressions.

@MPeli: I really do want this to get in eventually but in its current form it's very difficult to review. As a first step, how about splitting out just the FileRecords.java change, for example. If you want to do preallocation differently on windows, then explain how it will be different and why, and have a PR that just does that. Since preallocation seems to be working fine on other platforms this particular change could be put behind an isOsWindows check.

cmccabe avatar May 05 '23 08:05 cmccabe

Can someone help kafka on windows , please? Two years have passed !

huxiangquan avatar May 05 '23 09:05 huxiangquan

@Hangleton @cmccabe I encountered a bug while using KRaft where Kafka would mark .checkpoint files in the _cluster-metadata_0 folder for deletion and then would fail to delete them because they are read-only. The root cause is that it tried to do it using the Files.deleteIfExists method, which can't delete read-only files on Windows. I made changes to KafkaMetadataLog.scala and Snapshots.java where I created a new method for this purpose. path.toFile().setWritable(true) works if there is no file at the path on Windows, but I'm not sure how it behaves on Linux, so I can create a new unit test for it or put it behind an isOsWindows check. Should I create a new PR for this?

AB027PS avatar May 11 '23 08:05 AB027PS

Hi Julius (@AB027PS) - just to clarify, this isn't the same as the AccessDeniedException addressed by this CR right?

Hangleton avatar May 11 '23 08:05 Hangleton

I work with Martin Pelak and he suggested that I add my change to this PR, but my change is unrelated to the rest of this PR. So this doesn't address the errors mentioned in the first comment if that's what you're asking. I should have been clearer about this.

AB027PS avatar May 11 '23 08:05 AB027PS

Who can reslove that test failures?

ckmbks avatar Mar 29 '24 07:03 ckmbks