bagit-python icon indicating copy to clipboard operation
bagit-python copied to clipboard

Download files in fetch.txt

Open kba opened this issue 7 years ago • 6 comments

Adds a new method Bag.fetch_files_to_be_fetched() that fetches files listed in fetch.txt, c.f. #118.

If this is useful for someone, can be further refined (CLI, overrideable fetch implementation, anti-hammering interval).

kba avatar Nov 27 '18 14:11 kba

Coverage Status

Coverage decreased (-1.7%) to 81.842% when pulling 1de38c481bcb88ff2a4bf1258f44532f60bcb8a7 on kba:fetch-files into c39b650a80837a1d32314599143d0b5b04159248 on LibraryOfCongress:master.

coveralls avatar Nov 27 '18 15:11 coveralls

Coverage Status

Coverage increased (+0.09%) to 83.595% when pulling 6fda25c0103ec38c040052b57f5aec83f3dc1840 on kba:fetch-files into 8a8263e02d4b2bb95de4835e8d49fc70fd964f97 on LibraryOfCongress:master.

coveralls avatar Nov 27 '18 15:11 coveralls

Thanks for diving in to add the fetch functionality @kba. I wonder if it might be a bit more readable to rename fetch_files_to_be_fetched() to fetch() and have it take an optional force parameter that would re-download things that are already present?

edsu avatar Dec 06 '18 17:12 edsu

Also, I haven't worked with fetch.txt files much before. But I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory.

See: https://github.com/LibraryOfCongress/bagit-python/blob/master/test.py#L1023

edsu avatar Dec 06 '18 17:12 edsu

I wonder if it might be a bit more readable to rename fetch_files_to_be_fetched() to fetch()

Fine with me, I wanted to avoid confusion with fetch_entries.

and have it take an optional force parameter that would re-download things that are already present?

Sure.

I'm kind of surprised that the test suite considers a bag valid if it has a fetch.txt containing an item that is not present in the payload directory.

These tests break the "Every file listed in the fetch file MUST be listed in every payload manifest" rule and it isn't validated. fetch_entries should not just check for unsafe filenames but ensure files is also listed in payload_entries. The validation only checks data on disk and manifest entries. That is a bug.

Since the manifests determines the number and size of files, it could make sense to allow "bag with holes" validation against only the files not mentioned in fetch.txt with a special parameter though, if you don't want to fetch the whole thing. By default,

if it has a fetch.txt containing an item that is not present in the payload directory

should not be valid, you're right.

kba avatar Dec 07 '18 12:12 kba

I guess we could consider validation as a separate issue from this PR though.

edsu avatar Dec 07 '18 12:12 edsu