
b2 upload-file --sha1 <hash> gives no error when it receives a wrong hash.

Open viniciusmr opened this issue 6 years ago • 11 comments

I'm trying to automate some single-file backup tasks, and I've found that while the web API apparently works, I don't know how much I can trust the b2 CLI.

When I call the function with the correct SHA1 as argument:

```
$ b2 upload-file --sha1 f871a33490af899cf5476c002c513d1bc4e537b9 futuroapiteste test_file.txt test_file_b2_cli.txt
test_file.txt: 100%|██████████████████████████████████████████████████████████| 5.52k/5.52k [00:01<00:00, 5.42kB/s]
URL by file name: https://f002.backblazeb2.com/file/futuroapiteste/test_file_b2_cli.txt
URL by fileId: https://f002.backblazeb2.com/b2api/v1/b2_download_file_by_id?fileId=4_zeee396220d1bff11694a0a1d_f107a92e113ff0c11_d20180731_m193133_c002_v0001095_t0024
{
    "action": "upload",
    "fileId": "4_zeee396220d1bff11694a0a1d_f107a92e113ff0c11_d20180731_m193133_c002_v0001095_t0024",
    "fileName": "arquivo_teste_b2_cli.txt",
    "size": 5520,
    "uploadTimestamp": 1533065493000
}
```

======================================================

When I call the function with the wrong SHA1 as argument:

```
$ b2 upload-file --sha1 1111111111111111111111111111111111111111 futuroapiteste test_file.txt test_file_b2_cli_wrong_sha1.txt
test_file.txt: 100%|██████████████████████████████████████████████████████████| 5.52k/5.52k [00:00<00:00, 6.10kB/s]
URL by file name: https://f002.backblazeb2.com/file/futuroapiteste/test_file_b2_cli_wrong_sha1.txt
URL by fileId: https://f002.backblazeb2.com/b2api/v1/b2_download_file_by_id?fileId=4_zeee396220d1bff11694a0a1d_f116da3bb244871bc_d20180731_m193245_c002_v0001094_t0044
{
    "action": "upload",
    "fileId": "4_zeee396220d1bff11694a0a1d_f116da3bb244871bc_d20180731_m193245_c002_v0001094_t0044",
    "fileName": "arquivo_teste_b2_cli_wrong_sha1.txt",
    "size": 5520,
    "uploadTimestamp": 1533065565000
}
```

=======================================================

Using: b2 command line tool, version 1.3.0

When I use b2 list-file-names, it shows that B2 has the correct SHA1 for both files, so the upload itself went fine. But then: did it check the hash I submitted? If so, why was there no error? (Maybe it recalculated the hash anyway before submitting?) (Or does b2 upload-file compare the locally computed hash against the hash B2 calculates and returns once the file gets there?)

The built-in CLI help states:

"By default, upload_file will compute the sha1 checksum of the file to be uploaded. But, if you already have it, you can provide it on the command line to save a little time."

This seems to suggest that it does, but it only says "computes", not "compares", and the observed behavior suggests otherwise.
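As a practical workaround, one can verify an upload independently of the CLI by computing the SHA1 locally and comparing it with the contentSha1 that B2 reports back (per the thread above, list-file-names shows the server-side hash). This is a minimal sketch, not part of the b2 tool:

```python
import hashlib

def local_sha1(path, chunk_size=64 * 1024):
    """Compute the SHA1 of a local file, reading in chunks to bound memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare local_sha1("test_file.txt") against the SHA1 shown by
# `b2 list-file-names` for the uploaded file; a mismatch means corruption.
```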

viniciusmr avatar Jul 31 '18 20:07 viniciusmr

b2.bucket._upload_small_file does not use upload_source.get_content_sha1(), so --sha1 doesn't seem to do anything at all. There are no tests for the negative --sha1 scenario...

ppolewicz avatar Jul 31 '18 20:07 ppolewicz

All references to get_content_sha1 were probably removed when I implemented sending the hash at the end. So currently, specifying the hash on the command line does nothing. The hash is still computed internally by the b2 CLI and then compared on both the client and server side when uploading files. Note that specifying the hash on the command line can only be useful for small files (less than 100MB). For large files, a hash of each individual upload part is required.

svonohr avatar Jul 31 '18 20:07 svonohr

Given that --sha1 doesn't do anything, and that the CLI always computes the hash of small files anyway, donavan's commit seems especially interesting:

add contentSha1 to output of b2 upload-file #439

For example: besides checking that the upload went through correctly, I also need to keep my files' hashes stored in a database. Since the current b2 CLI behavior is to suppress the SHA1 it gets back from the API call, I have to calculate the hashes myself beforehand. And since it disregards --sha1 and recalculates, it unnecessarily doubles the CPU effort on my server. I don't believe this is a very fringe case.

IMHO, outputting the SHA1 should be the default behavior, at the very least when the user does not provide it. The info is already there anyway =)

viniciusmr avatar Aug 01 '18 12:08 viniciusmr

Fix plan for this issue:

If --sha1 is provided by the user and:

  1. it is a large file - add large_file_sha1 to fileInfo when making the b2_start_large_file request
  2. it is a small file and a progress listener is enabled - do not calculate the hash; send the provided hash either in a header or at the end of the stream, whichever is easier to implement
  3. it is a small file and no progress listener is enabled - do not calculate the hash; just open the file and upload with the hash in a header, using shutil.copyfileobj, pysendfile, or os.sendfile, whichever is fastest among those that install cleanly on the platforms we support. pysendfile looks like the best candidate.
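The three cases above can be sketched as a simple dispatch. This is illustrative only; the names, threshold constant, and return values are hypothetical, not b2 internals:

```python
LARGE_FILE_THRESHOLD = 100 * 1024 * 1024  # 100 MB, per the plan above

def choose_upload_path(size, user_sha1=None, progress_listener=None):
    """Pick which of the three cases in the fix plan applies (sketch)."""
    if user_sha1 is None:
        # No --sha1 given: keep today's behavior of hashing internally.
        return "compute-sha1-internally"
    if size > LARGE_FILE_THRESHOLD:
        return "large_file_sha1-in-fileInfo"    # case 1
    if progress_listener is not None:
        return "skip-hashing-keep-progress"     # case 2
    return "skip-hashing-sendfile-fast-path"    # case 3
```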

It might be possible to use the high-performance calls with a progress reporter enabled and still get a performance improvement: a 99 MB file can be chunked into 1000 portions of ~99 kB for perfect (0.1%) progress-bar granularity, and it should still significantly improve CPU utilization over the current implementation.
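The chunking idea can be pictured as a generator that yields fixed-size pieces and reports progress after each one, with no hashing in the path. A minimal sketch under those assumptions (not the CLI's actual progress machinery):

```python
import io

def chunked_progress_read(f, total_size, chunks=1000, report=lambda frac: None):
    """Yield a file's contents in ~`chunks` equal pieces, calling `report`
    with the fraction transferred after each piece. No hash is computed,
    illustrating the reduced-CPU path described above."""
    chunk_size = max(1, total_size // chunks)
    sent = 0
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        sent += len(data)
        report(sent / total_size)
        yield data
```

With `chunks=1000` each progress callback advances the bar by about 0.1%, matching the granularity mentioned above.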

ppolewicz avatar Apr 25 '19 22:04 ppolewicz

See if https://github.com/Backblaze/B2_Command_Line_Tool/issues/523 can also be fixed when fixing this.

ppolewicz avatar Apr 25 '19 22:04 ppolewicz

This plan sounds good.

And pysendfile would be more efficient. But do we know that copying into memory and then to the network is a problem right now? I'd also be fine with just copying the normal way, so there's only one code path.

bwbeach avatar Apr 25 '19 23:04 bwbeach

I looked even closer at the current code. The opened file seems to be passed straight into requests, except for the hashing and progress reporting, which are implemented in Python (and not very well), so if we can simply not plug those into the chain, performance should improve due to reduced processing. In order to use pysendfile here, we would have to turn b2http.B2Http.post_content_get_json into a very specialized implementation.
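The kind of in-Python processing being discussed can be pictured as a file-like wrapper that updates a SHA1 as the HTTP library pulls bytes through read(). This is an illustrative sketch of the pattern, not the actual b2http code:

```python
import hashlib

class HashingReader:
    """File-like wrapper that computes a SHA1 over everything read through it.

    Every read() passes through Python and updates the hash, which is the
    per-byte overhead that skipping the wrapper would remove.
    """

    def __init__(self, f):
        self._f = f
        self._sha1 = hashlib.sha1()

    def read(self, size=-1):
        data = self._f.read(size)
        self._sha1.update(data)
        return data

    def hex_digest(self):
        return self._sha1.hexdigest()
```

Passing the bare file object to the HTTP layer instead of such a wrapper is what lets the kernel-level fast paths (sendfile and friends) apply.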

ppolewicz avatar Apr 26 '19 00:04 ppolewicz

I like the idea of improving performance without adding complexity.

bwbeach avatar Apr 26 '19 01:04 bwbeach

I think that even if we add a little bit of complexity, as long as we contain it properly (as we have done a few times in this project where the *magic* comments are), it's an acceptable solution. While it's true that this makes the project harder to maintain, when properly labeled and kept to a small fraction of the project, such contained improvements can provide benefits impossible to achieve without the complexity.

Restore speed, to pick a particular example, is important to me: I worked at a company that provided a backend for backups. If you find yourself needing to run a restore to bring a company or service back into shape after a catastrophic failure, discovering a severe performance bottleneck is a terrible experience. I would rather sacrifice a little maintainability to help users when they need us most, which will only happen once or twice in the life of a customer (if at all). I know Backblaze Backup even sometimes sends people hard drives of data by courier, which is not the most maintainable solution I can think of, but some situations make the extra effort worthwhile. ~~Since the B2 service has no such HDD-by-mail restore~~ (see below), the performance of restore operations in the SDK/CLI is not something we can ignore just because it would complicate one function in a secluded corner of the internal b2sdk interface.

ppolewicz avatar Apr 26 '19 16:04 ppolewicz

I'm fine with the optimization, as long as it's a big enough improvement to outweigh the cost of the complexity.

And, BTW, Backblaze does offer HDD-by-mail for B2. You can create a Snapshot of some or all files in a bucket and have it sent to you on an HDD. And, since a year ago, you don't even have to pay for the drive: if you return it, you get a full refund.

bwbeach avatar Apr 26 '19 17:04 bwbeach

oh, that's neat! Must have missed news about it.

I just checked the transit time for a courier from Backblaze to my location, though: it's a week (I'm in the EU). Add a 4 TB encrypted write and read on top and it's even longer. I know for a fact that it is possible to download from B2 at 320 Mbps using the CLI (after #504 was merged), which would give a ~28h result, but that can still be improved.
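The ~28h figure checks out arithmetically (assuming decimal units, i.e. 4 TB = 4×10¹² bytes and 320 Mbps = 320×10⁶ bits/s):

```python
# Rough transfer-time check for the figure above.
size_bits = 4e12 * 8   # 4 TB expressed in bits
rate_bps = 320e6       # 320 Mbps
hours = size_bits / rate_bps / 3600
# hours ≈ 27.8, i.e. the "~28h" quoted above
```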

When I have a moment, I will run a quick PoC to see how much we can gain by altering the transfer path.

UPDATE: preliminary results are very promising

ppolewicz avatar Apr 27 '19 11:04 ppolewicz