
Add ability to start new blocks without waiting for acks

Open yetanotherbot opened this issue 6 years ago • 6 comments

Hi, I am using FileWriter to perform file transfers and I'm hitting a performance degradation on cross-zone connections. Network latency between the two hosts is 76ms, and throughput is 10~15MB/s. Both CPU and network are under-utilized. On intra-zone connections, where network latency is 1ms, throughput can go up to 250MB/s.

yetanotherbot avatar Nov 01 '18 23:11 yetanotherbot

I am also seeing a lot of spikes in my dashboard, so I am guessing the bottleneck is the goroutine waiting for ACKs: the application sends a burst of data, then stalls waiting for ACKs, which shows up as spikes. [screenshot: screen shot 2018-11-01 at 10 36 53 pm]

yetanotherbot avatar Nov 02 '18 05:11 yetanotherbot

Also, please correct me if I am wrong, but don't we need to break out of the outer for-loop at L240 here:

https://github.com/colinmarc/hdfs/blob/c5e93df697d823976844a79ec9a70e51105b4639/internal/rpc/block_write_stream.go#L217-L255
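The concern is the classic nested-loop pitfall in Go: a plain `break` only exits the innermost `for` (or `select`), so an error ack would leave the outer read loop running. A minimal, self-contained sketch of that loop shape (names and types are hypothetical, not the library's actual code):

```go
package main

import "fmt"

// readAcks consumes per-packet ack status slices from a channel and
// stops at the first failed status. Hypothetical sketch of the loop
// shape discussed above, not the library's actual code.
func readAcks(acks <-chan []bool) (processed int, ok bool) {
	ok = true
outer:
	for statuses := range acks { // outer loop: one entry per ack packet
		for _, s := range statuses { // inner loop: per-replica status
			if !s {
				ok = false
				// A plain `break` here would only exit the inner loop,
				// and the outer loop would keep reading acks after the
				// error; the labeled break exits both.
				break outer
			}
		}
		processed++
	}
	return processed, ok
}

func main() {
	acks := make(chan []bool, 3)
	acks <- []bool{true, true}
	acks <- []bool{true, false} // error ack: must stop the outer loop too
	acks <- []bool{true, true}
	close(acks)
	n, ok := readAcks(acks)
	fmt.Println(n, ok)
}
```

The alternative, as in the comment below, is that the code relies on no further packets arriving after an error one, in which case a plain `break` happens to be harmless.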

yetanotherbot avatar Nov 02 '18 05:11 yetanotherbot

Huh - this is a pretty interesting performance case. At a guess, the pauses are between blocks, not between writes. Can you tell if it's writing about 1MB of data before every pause? The issue would be in startNewBlock; note the TODO: https://github.com/colinmarc/hdfs/blob/c5e93df697d823976844a79ec9a70e51105b4639/file_writer.go#L245-L247

> Also, please correct me if I am wrong, but don't we need to break out of the outer for-loop at L240 here

Hm, yes, that seems possible. It may be that it's relying on HDFS not sending any further packets after an error one.

colinmarc avatar Nov 19 '18 15:11 colinmarc

I don't think startNewBlock is the reason, but I didn't verify it. I am guessing the reason is that the round trip of sending and acking packets is expensive in a cross-zone setup. I switched to WebHDFS and throughput went up to around 200MB/s. AFAIK, WebHDFS sends and acks packets the same way, but that is done by the WebHDFS proxy, which runs on the same host as the namenode.

yetanotherbot avatar Nov 20 '18 04:11 yetanotherbot

Sorry, I should have been clearer. I agree the problem is acks. During a block, acks can be handled asynchronously, so they shouldn't affect write speed. But in between blocks, the code waits for all acks to come back, which should take around 75ms if the transfer is otherwise very fast. If you have a 1MB block size, dividing 1s by 75ms gives you about 13MB/s, which is right around what you're reporting.
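That back-of-envelope math can be checked directly. A small sketch, using only the numbers from this thread (1MB blocks, ~75ms ack round trip), of the throughput cap when the writer stalls for one round trip at every block boundary:

```go
package main

import "fmt"

// throughputMBs estimates the throughput cap, in MB/s, when the writer
// stalls for one ack round trip (ackRTTms) after every block of
// blockSizeMB. The actual send time is assumed negligible by comparison.
func throughputMBs(blockSizeMB, ackRTTms float64) float64 {
	blocksPerSecond := 1000.0 / ackRTTms
	return blocksPerSecond * blockSizeMB
}

func main() {
	// 1MB blocks over a 75ms round trip: ~13.3 MB/s, right in the
	// 10~15MB/s range reported at the top of this thread.
	fmt.Printf("%.1f MB/s\n", throughputMBs(1, 75))
	// Larger blocks amortize the same stall over far more data.
	fmt.Printf("%.1f MB/s\n", throughputMBs(128, 75))
}
```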

Would it be possible for you to set the block size higher as a test, to see if throughput increases? You can do this for just one file with CreateFile.
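As a rough way to pick a test value for that block size: if each block boundary costs one ~75ms stall, you can estimate how large a block has to be so the stall stays below a target fraction of total time. A sketch under those assumptions, using the 250MB/s intra-zone figure from above as the link's uncongested speed:

```go
package main

import "fmt"

// minBlockSizeMB estimates the block size (MB) needed so that a
// per-block stall of rttMs costs at most maxOverhead (a fraction, e.g.
// 0.05) of total transfer time on a link sustaining linkMBs otherwise.
func minBlockSizeMB(linkMBs, rttMs, maxOverhead float64) float64 {
	stallSec := rttMs / 1000.0
	// Sending one block of size S takes S/linkMBs seconds, so:
	//   overhead = stall / (stall + S/linkMBs) <= maxOverhead
	//   =>  S >= linkMBs * stall * (1 - maxOverhead) / maxOverhead
	return linkMBs * stallSec * (1 - maxOverhead) / maxOverhead
}

func main() {
	// Keeping a 75ms per-block stall under 5% of the time on a 250MB/s
	// link needs blocks of at least ~356MB.
	fmt.Printf("%.0f MB\n", minBlockSizeMB(250, 75, 0.05))
}
```

So a block size in the hundreds of MB (the HDFS default of 128MB is already far better than 1MB) would be a reasonable value to pass to CreateFile for the test.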

colinmarc avatar Nov 20 '18 09:11 colinmarc

> I switched to WebHDFS and throughput went up to around 200MB/s. AFAIK, WebHDFS has the same way of sending and acking packets, which is done by WebHDFS proxy in the same host as the namenode's.

Q: What is the WebHDFS protocol?

g10guang avatar Mar 31 '19 10:03 g10guang