hdfs
Add ability to start new blocks without waiting for acks
Hi, I am using FileWriter to perform file transfers and am hitting a performance degradation on cross-zone connections. Network latency between the two hosts is 76ms, and throughput is 10~15MB/s. Both CPU and network are under-utilized.
When performing transfer on intra-zone connections, throughput can go up to 250MB/s, with network latency 1ms.
And I am seeing a lot of spikes in my dashboard so I am guessing the bottleneck is the goroutine waiting for ACKs. The application sends a lot of data and waits for ACKs to finish, creating a lot of spikes.
Also please correct me if I am wrong, don't we need to break the outer for-loop at L240 here:
https://github.com/colinmarc/hdfs/blob/c5e93df697d823976844a79ec9a70e51105b4639/internal/rpc/block_write_stream.go#L217-L255
Huh - this is a pretty interesting performance case. At a guess, the pauses are between blocks, not between writes. Can you tell if it's writing about 1MB of data before every pause? The issue would be in startNewBlock; note the TODO: https://github.com/colinmarc/hdfs/blob/c5e93df697d823976844a79ec9a70e51105b4639/file_writer.go#L245-L247
Also please correct me if I am wrong, don't we need to break the outer for-loop at L240 here
Hm, yes, that seems possible. It may be that it's relying on HDFS not sending any other packets after an error one.
I don't think startNewBlock is the reason, but I didn't verify it. I am guessing the reason is that the roundtrip of sending and acking packets is expensive in a cross-zone setup. I switched to WebHDFS and throughput went up to around 200MB/s. AFAIK, WebHDFS sends and acks packets the same way, which is done by the WebHDFS proxy on the same host as the namenode.
Sorry, I should have been clearer. I agree the problem is acks. During a block, acks can be handled asynchronously, so they shouldn't affect write speed. But in between blocks, the code waits for all acks to come back, which should take around 75ms if the transfer is very fast. If you have a 1MB block size, dividing 1s by 75ms gives you about 13 blocks per second, or roughly 13MB/s, which is right around what you're reporting.
Would it be possible for you to set the block size higher as a test, to see if throughput increases? You can do this for just one file with CreateFile.
Q: What is webhdfs prot?