internetarchive
ia upload --checksum doesn't always skip existing files
Situation: I'm uploading a large dataset to IA (cf. #288). As of right now, 130 of the 159 total files are uploaded. The local files should not have been modified since I started the upload.
Running ia upload $identifier * --checksum --no-derive (regarding the last flag, see #288) starts uploading the first file again, which is already on IA, instead of resuming at the 131st file. On a previous resume, when 62 files had already been uploaded, the existing files were skipped correctly.
I can obviously work around this by explicitly listing the missing files, but --checksum clearly isn't doing what it's supposed to do...
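For anyone who needs the same workaround, here is a rough Python sketch of how the list of missing files could be computed. It is not how ia implements --checksum; it only compares local MD5 digests against a name-to-MD5 mapping. The helper names (md5_of, files_still_missing) and the remote_md5_by_name mapping are made up for illustration, and building that mapping from the item's file listing is left to the caller.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_still_missing(local_paths, remote_md5_by_name):
    """Return the local paths that are absent remotely or whose MD5 differs.

    remote_md5_by_name is a hypothetical {filename: md5} mapping that the
    caller builds from the item's file listing (an assumption, not ia's API).
    """
    missing = []
    for path in map(Path, local_paths):
        # A name that is not in the mapping yields None, which never equals a digest.
        if remote_md5_by_name.get(path.name) != md5_of(path):
            missing.append(str(path))
    return missing
```

The returned paths can then be passed explicitly to ia upload in place of the * glob.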
The checksum option only works when there are no queued or running tasks. I'm guessing you had queued or running tasks when you used it?
If so, it looks like this needs to be documented better (I thought it was).
Oh, I see. Yes, that is indeed the case. (The derive that was queued due to #288.)
Maybe it could be checked whether there are queued tasks? Although I guess that would require a fix for #167 first.
@jjjake What would you think about either returning an error or waiting for existing tasks to finish (with a warning message of course so the user knows what's going on) when the --checksum option is used?
Spreadsheet uploads would complicate this somewhat though; checking and erroring/waiting every time the item identifier column changes seems most reasonable to me there.
I don't like the idea of waiting for existing tasks to finish because that could be hours to even days, depending on the task.
I think erroring out might make the most sense, but I don't love this either. I think most people use --checksum to avoid having to re-upload the same file again (i.e. to save some time). I feel like erroring out or stalling, rather than just re-uploading, would be confusing and annoying for most users.
I'd prefer to keep the behavior we currently have and add a warning message. Alternatively, perhaps there should be another option that would support what you're talking about?
Just my 2 cents though, open to feedback! :)
Yeah, you're right, it wouldn't be the best UX. I'd be fine with a warning and an option to make it an error. The rest can be done with a small wrapper script when desired.
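To make the "warning by default, error on request" idea concrete, a wrapper script could apply a policy along these lines. This is only a sketch under assumptions: check_pending_tasks, PendingTasksError, and the shape of the summary dict (counts under "queued" and "running") are hypothetical, and how the counts are obtained for an item is deliberately left out.

```python
import sys


class PendingTasksError(RuntimeError):
    """Raised in strict mode when an item still has queued or running tasks."""


def check_pending_tasks(summary, strict=False, stream=None):
    """Return True if checksum-based skipping can be trusted for this item.

    summary is a hypothetical dict of task counts, e.g. {"queued": 1, "running": 0}.
    With strict=False a warning is printed and False is returned;
    with strict=True a PendingTasksError is raised instead.
    """
    pending = summary.get("queued", 0) + summary.get("running", 0)
    if pending == 0:
        return True
    message = (
        f"{pending} task(s) queued or running; "
        "--checksum may re-upload files that already exist"
    )
    if strict:
        raise PendingTasksError(message)
    print(f"warning: {message}", file=stream or sys.stderr)
    return False
```

A wrapper could call this once per item identifier (for spreadsheet uploads, each time the identifier column changes) and decide whether to continue, retry later, or abort.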