sftp icon indicating copy to clipboard operation
sftp copied to clipboard

Question about generating a digest value during sftp transfer

Open drimmeer opened this issue 1 year ago • 18 comments

Please advice on how to generate a digest (specifically, the Sha256 value) of the file being transferred via this sftp tool? The counterpart in Java is like this:

		MessageDigest messageDigestSha256=MessageDigest.getInstance( "SHA-256");
			while ((bytesRead = inStream.read(dataBytes)) != -1) {
				outStream.write(dataBytes, 0, bytesRead);
				messageDigestSha256.update(dataBytes, 0, bytesRead); // generate digest
			}
		String EncodedDigestSha256 = Hex.encodeHexString(messageDigestSha256.digest());
...

I know that to generate the Sha256 of a local file, the Go code would be like this:

			file, err := os.Open("file.txt")
			hash := sha256.New()
			_, err := io.Copy(hash, file)
			hashBytes := hash.Sum(nil)

But how to get that during the sftp transfer?

BTW, currently I wrote code like below to do file transfer (error handling being removed just to show the major logic):

	fDestination, err := sftpClient.Create(destNameRemote)
	fSource, err := os.Open(srcNameLocal)
	_, err = io.Copy(fDestination, fSource)

or

	fSource, err := sftpClient.Open(srcNameRemote)
	fDestination, err := os.Create(destNameLocal)
	_, err = io.Copy(fDestination, fSource)

Thank you in advance.

drimmeer avatar Oct 01 '24 19:10 drimmeer

I've used https://pkg.go.dev/github.com/icub3d/gop/wrapio

lespea avatar Oct 01 '24 21:10 lespea

While using wrapio.NewHashReader() would definitely make it easy enough to get both a hash and the file at the same time, it would however break any ReadFrom or WriteTo that we implement, which is usually how we can ensure the best performance, rather than reading individual blocks and then waiting for the response. If you’re using WithConcurrentReads(true) with v1 of the API, or if you’re using v2 of the API, then you will see significantly reduced performance with a wrapped IO than you would get by downloading the file locally, and then generating the SHA hash after it has arrived.

puellanivis avatar Oct 02 '24 01:10 puellanivis

Good to know, thanks for the tip!

lespea avatar Oct 02 '24 06:10 lespea

Thank you Cassondra very much for the tips.

We are indeed relying on the WithConcurrentReads(true) for the required performance, cause we might transfer hundreds of GBs, or even TBs.

However, we also need a way to get both a hash and the file at the same time, because the source is not a physical file, but a pipe, while the destination is a remote file. So there is no way to generate the SHA out of a local file, but only a remote file, which will double the transfer data then!

Do you have any suggestion on how we could overcome this? Is it possible to wrap the ReadFrom() or WriteTo() that you implement to get hash out of them?

drimmeer avatar Oct 02 '24 14:10 drimmeer

Is it possible to do in this way while performance of your code won't be affected?

fSource, err := os.Open(pipeNameLocal)
fDestination, err := sftpClient.Create(destNameRemote)

s := sha256.New()
hr := NewHashReader(s, fSource)

io.Copy(fDestination, hr)
// Use the Sum()s which in this case we'll just print it out.
fmt.Println(hex.EncodeToString(s.Sum(nil)))

drimmeer avatar Oct 02 '24 15:10 drimmeer

I tested the above code and compared it with other approach, and confirmed that it has significantly reduced performance. Here is the benchmark:

  1. To transfer a 900M file (not a pipe) from z/OS to Linux, the above code took 14~16 minutes;
  2. While without any sha256 hash calculation, it took only 1 minute;
  3. Then I tried calculating the hash at the beginning, it took only 1 minute for transfer, and 1~6 seconds for hash calculation)

But as I said, we cannot do like #3 above, because in our case the local file is a pipe. What can I do then?

drimmeer avatar Oct 03 '24 01:10 drimmeer

I think, yeah, you should be able to wrap the ReadFrom or WriteTo functions on your own to perform a double write, once to a hash, then to the underlying writer.

Hm, you might also be able to use a NewHashWriter to wrap the local file operation, and still get the WriteTo functionality from this package?

puellanivis avatar Oct 03 '24 07:10 puellanivis

Thanks Cassondra for the confirmation. But as seen from my testing above, using the NewHashReader significantly reduced the performance. NewHashReader is also a wrapper to ReadFrom(), am I not right? What could be different if I do the wrapping by my own code?

Or you meant using NewHashWriter won't affect the performance as using NewHashReader does?

drimmeer avatar Oct 03 '24 20:10 drimmeer

Yeah, I meant using NewHashWriter as that would allow our WriteTo to still work.

puellanivis avatar Oct 04 '24 18:10 puellanivis

Hi Cassondra, I tried wrapping the writer by my own but it also was very slow. Here is how I did it:

I created a wrapper like this:

// sha256_writer.go implements the io.Writer interface.
type sha256_writer struct {
	sha256_calculator hash.Hash
	origin_writer     io.Writer
}

// Write implements the io.Writer interface.
func (this_writer *sha256_writer) Write(write_out_buffer []byte) (int, error) {
	this_writer.sha256_calculator.Write(write_out_buffer)
	return this_writer.origin_writer.Write(write_out_buffer)
}

func NewSha256Writer(writer_to_be_wrap io.Writer) *sha256_writer {
	return &sha256_writer{sha256_calculator: sha256.New(), origin_writer: writer_to_be_wrap}
}

func (this_writer *sha256_writer) GetHashSum() []byte {
	return this_writer.sha256_calculator.Sum(nil)
}

Then I used it like this:

	fSource, err := os.Open(srcNameLocal)
	fDestination, err := sftpClient.Create(destNameRemote)
	hashWriter := NewSha256Writer(fDestination)
	_, err = io.Copy( hashWriter, fSource)
	log.Println("File sent with sha256 checksum: " + hex.EncodeToString(hashWriter.GetHashSum()))

But it still took 12 minutes to transfer that 900M file, while it took only 1 minute without the wrapper...

drimmeer avatar Oct 04 '24 20:10 drimmeer

Using the NewHashWriter like this is even slower, taking 14 minutes to tranfer.


	fSource, err := os.Open(srcNameLocal)
	fDestination, err := sftpClient.Create(destNameRemote)
	sha256 := sha256.New()
	hashWriter := wrapio.NewHashWriter(sha256, fDestination)
	_, err = io.Copy( hashWriter, fSource)
	log.Println("File sent with sha256 checksum: " + hex.EncodeToString(sha256.Sum(nil)))

drimmeer avatar Oct 04 '24 21:10 drimmeer

Good news! I finally got what 'WriteTo' mean by you, and wrapped my reader around os.File instead of io.Reader. That way it worked perfectly! The performance was not dropped anymore!

Thank you for the help!

drimmeer avatar Oct 06 '24 01:10 drimmeer

So, yeah, either way the io.Copy goes, you want to ensure that you’re wrapping the other reader/writer from our sftp.File object. This will ensure the sftp implementation of either WriteTo or ReadFrom will be used.

puellanivis avatar Oct 06 '24 10:10 puellanivis

I played around with this some more and I actually got the best performance with this wrapper, the key being to impl the Size() int64 method.

type HashReader struct {
	size int64
	r    io.ReadCloser
	h    hash.Hash
}

func NewHashReader(r io.ReadCloser, size int64, h hash.Hash) HashReader {
	return HashReader{size, r, h}
}

func NewHashReaderFile(path string, h hash.Hash) (HashReader, error) {
	fh, err := os.Open(path)
	if err != nil {
		return HashReader{}, err
	}

	if stat, err := fh.Stat(); err != nil {
		return HashReader{}, err
	} else {
		return NewHashReader(fh, stat.Size(), h), nil
	}
}

func (ha HashReader) Read(p []byte) (int, error) {
	n, err := ha.r.Read(p)
	if n > 0 {
		_, _ = ha.h.Write(p[:n])
	}
	return n, err
}

func (ha HashReader) Hash() []byte {
	return ha.h.Sum(nil)
}

func (ha HashReader) HashStr() string {
	return hex.EncodeToString(ha.Hash())
}

func (ha HashReader) Size() int64 {
	return ha.size
}

func (ha HashReader) Close() error {
	return ha.r.Close()
}

Then just make sure you use fh.ReadFrom(wrapper) where fh is an sftp.File

lespea avatar Nov 17 '24 16:11 lespea

Also do not wrap the os.File object with bufio that also tanks performance.

lespea avatar Nov 17 '24 17:11 lespea

I strongly recommend making your HashReader a pointer. Since it contains interfaces, it’s functionality is already non-value-like, and mixed value-like and reference-like behavior can lead to very unexpected and difficult to debug bugs.

You also might want to use the WriteTo or ReadFrom calls on the sftp.File directly, which will ensure you get the SFTP implementation, and not the other-side implementation. (os.File’s WriteTo is faster than a naïve read to buffer write from buffer, but it doesn’t put multiple reads/writes in flight to the SFTP file.)

puellanivis avatar Nov 17 '24 17:11 puellanivis

I actually started with it as a pointer but making it a non-pointer lead to a consistent 30-40 mb/s increase in transfer speed (which surprised me I only tried it on a whim). And yeah it was necessary to use ReadFrom to get the best speed.

lespea avatar Nov 17 '24 17:11 lespea

Huh… interesting. That’s something I wouldn’t have expected.

puellanivis avatar Nov 17 '24 19:11 puellanivis