internetarchive

ia upload --checksum doesn't always skip existing files

Open JustAnotherArchivist opened this issue 6 years ago • 5 comments

Situation: I'm uploading a large dataset to IA (cf. #288). As of right now, 130 of the 159 total files are uploaded. The local files should not have been modified since I started the upload.

Running ia upload $identifier * --checksum --no-derive (regarding the last flag, see #288) starts uploading the first file again, which is already on IA, instead of resuming at the 131st file. The files were skipped correctly on a previous resume when 62 files had been uploaded before.

Obviously, I can work around this by explicitly listing the missing files, but --checksum clearly isn't doing what it's supposed to do...

JustAnotherArchivist, Jan 22 '19 19:01
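
The workaround of explicitly listing the missing files could be scripted roughly as follows. This is only a sketch: it assumes the remote file list is available as dicts with "name" and "md5" keys, and that remote names equal local basenames. The helper names md5_of and files_needing_upload are made up for illustration and are not part of the ia tool.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_needing_upload(local_paths, remote_files):
    """Return the local paths that are missing remotely or differ by MD5.

    remote_files is assumed to be an iterable of dicts with "name" and
    "md5" keys; remote names are assumed to match local basenames.
    """
    remote_md5 = {f["name"]: f.get("md5") for f in remote_files}
    return [
        str(p)
        for p in map(Path, local_paths)
        if remote_md5.get(p.name) != md5_of(p)
    ]
```

The returned list could then be passed to ia upload in place of the * glob, so only the files that are actually missing get sent.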

The checksum option only works when there are no queued or running tasks. I'm guessing you had queued or running tasks when you used it?

If so, it looks like this needs to be documented better (I thought it was).

jjjake, Jan 22 '19 19:01

Oh, I see. Yes, that is indeed the case. (The derive that was queued due to #288.)

Maybe it could be checked whether there are queued tasks? Although I guess that would require a fix for #167 first.

JustAnotherArchivist, Jan 22 '19 19:01

@jjjake What would you think about either returning an error or waiting for existing tasks to finish (with a warning message of course so the user knows what's going on) when the --checksum option is used? Spreadsheet uploads would complicate this somewhat though; checking and erroring/waiting every time the item identifier column changes seems most reasonable to me there.

JustAnotherArchivist, Feb 10 '22 13:02
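
For the spreadsheet case, checking each time the identifier column changes might look something like the sketch below. It assumes a CSV with "identifier" and "file" columns. The callbacks has_pending_tasks and upload_rows are hypothetical placeholders for whatever actually queries the item's tasks and performs the upload. Skipping an item is just one possible reaction, chosen here to keep the example short.

```python
import csv
import io
from itertools import groupby


def rows_by_identifier(rows, column="identifier"):
    """Yield (identifier, rows) for each consecutive run of rows sharing an identifier."""
    for identifier, group in groupby(rows, key=lambda row: row[column]):
        yield identifier, list(group)


def process_spreadsheet(csv_text, has_pending_tasks, upload_rows):
    """Check once per identifier run and skip runs whose item has pending tasks.

    has_pending_tasks(identifier) and upload_rows(identifier, rows) are
    hypothetical callbacks supplied by the caller.
    """
    skipped = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for identifier, rows in rows_by_identifier(reader):
        if has_pending_tasks(identifier):
            skipped.append(identifier)
            continue
        upload_rows(identifier, rows)
    return skipped
```

Because groupby only merges consecutive rows, an identifier that reappears later in the sheet is checked again, which matches the "every time the column changes" idea.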

I don't like the idea of waiting for existing tasks to finish because that could be hours to even days, depending on the task.

I think erroring out might make the most sense, but I don't love this either. I think most people use --checksum to avoid having to re-upload the same file again (i.e. to save some time). I feel like erroring out or stalling, rather than just re-uploading, would be confusing and annoying for most users.

I'd prefer to keep the behavior we currently have and add a warning message. Alternatively, perhaps there should be another option that would support what you're talking about?

Just my 2 cents though, open to feedback! :)

jjjake, Feb 10 '22 18:02

Yeah, you're right, it wouldn't be the best UX. I'd be fine with a warning and an option to make it an error. The rest can be done with a small wrapper script when desired.

JustAnotherArchivist, Feb 10 '22 19:02
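
The "warning by default, with an option to make it an error" behavior agreed on above could be sketched as below. It is not the library's implementation. The task_summary dict with "queued" and "running" counts is an assumed shape, how it is fetched is left to the caller, and the names PendingTasksError and check_before_checksum_upload are invented for this example.

```python
import warnings


class PendingTasksError(RuntimeError):
    """Raised in strict mode when an item still has queued or running tasks."""


def check_before_checksum_upload(identifier, task_summary, strict=False):
    """Return True if no tasks are pending; otherwise warn, or raise if strict.

    task_summary is assumed to be a dict with integer "queued" and
    "running" counts; missing keys are treated as zero.
    """
    pending = task_summary.get("queued", 0) + task_summary.get("running", 0)
    if pending == 0:
        return True
    message = (
        f"{identifier} has {pending} queued or running task(s); "
        "--checksum may re-upload files that already exist"
    )
    if strict:
        raise PendingTasksError(message)
    warnings.warn(message)
    return False
```

A wrapper script could call this before invoking ia upload with --checksum, and either proceed, retry later, or abort depending on the return value.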