
How to handle empty directories

Open nkrabben opened this issue 8 years ago • 6 comments

Current BagIt libraries differ in how they handle empty directories added as a payload. For example, with the following sample structure,

|-directory
| |-empty_directory
| |-payload_file

... bagit-java (and Bagger) and bagins create the following data directory,

|-data
| |-payload_file

... and bagit-python creates,

|-data
| |-empty_directory
| |-payload_file

Both are valid according to the spec, but I think that dropping the empty directory is not expected behavior. The spec has a suggestion in 2.1.3 to create a zero-length file with the same name as the directory and keep as its extension (pointed out to me by @andrewjbtw), but I haven't seen any bagging tools follow this suggestion. Andrew and I do not see this as an expected behavior either.

Can the specification be more explicit about how to handle empty directories? I prefer bagit-python's strategy, but realize that this means empty directories have no bearing on the completeness or fixity of a bag.

nkrabben avatar Jan 23 '17 21:01 nkrabben

Since the 02 draft (~July 2008) the spec has clarified that empty directories cannot be stored:

https://tools.ietf.org/html/draft-kunze-bagit-02#section-6

The behaviour you're seeing in bagit-python is probably something we should warn about: the empty directory isn't referenced in the manifest but it's “preserved” during the bag creation process because bagit-python does just an atomic move to the data directory, and it's not reported in validation because that only looks at filenames. We should at the very least start reporting empty directories during validation to avoid surprises if the bag is copied somewhere else using a manifest-driven tool rather than a simple filesystem copy.
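
As a rough illustration of the kind of check described above, here is a minimal sketch that walks a payload directory and reports directories with no entries. The function name find_empty_directories is hypothetical and is not part of bagit-python or any other BagIt tool; it only assumes the payload lives in an ordinary directory on disk.

```python
import os


def find_empty_directories(payload_root):
    """Return sorted paths (relative to payload_root) of directories with no entries.

    Hypothetical helper for illustration; not an API of any BagIt library.
    If payload_root itself is empty, it is reported as ".".
    """
    empty = []
    for dirpath, dirnames, filenames in os.walk(payload_root):
        # A directory is empty when it has neither subdirectories nor files.
        if not dirnames and not filenames:
            empty.append(os.path.relpath(dirpath, payload_root))
    return sorted(empty)
```

A validator could print these paths as warnings, since a manifest-driven copy would silently drop them.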

acdha avatar Jan 23 '17 22:01 acdha

My issue is two-fold.

  1. Can the spec have stricter language about how empty dirs should be handled, so I can refer to that documentation when creating issues with the various bagging tools?
  2. Is dir_name.keep the strategy that should be required by that stricter language?

nkrabben avatar Jan 23 '17 22:01 nkrabben

One consideration is maintaining the compatibility between the manifest format and tools like md5sum and md5deep.

justinlittman avatar Jan 24 '17 12:01 justinlittman

@justinlittman I would be curious how many people manually create a bag using md5sum or md5deep today. It would seem that, with so many readily available bagit tools, if you manually create a bag, it is on that person to ensure it is compatible with the bagit specification.

johnscancella avatar Jan 24 '17 14:01 johnscancella

I'd suggest that it is less for bag creation than for bag validation. md5sum/md5deep are readily available on most *nix platforms and allow checking a bag without installing any BagIt software. Furthermore, the existing manifest format is readily recognizable to most technical folks, even if they have no knowledge of BagIt.

I know in the past, I've reached for md5sum/md5deep for quick bag checking.
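
For readers without md5sum at hand, the same manifest-driven check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name verify_md5_manifest is made up, the manifest is assumed to be named manifest-md5.txt, and each line is assumed to be a checksum, whitespace, then a path relative to the bag directory.

```python
import hashlib
import os


def verify_md5_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Return a list of (path, problem) tuples; an empty list means all listed files matched.

    Hypothetical helper for illustration; not an API of any BagIt library.
    """
    problems = []
    with open(os.path.join(bag_dir, manifest_name), encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\n")
            if not line.strip():
                continue
            # Each line is "<checksum><whitespace><path>".
            expected, rel_path = line.split(None, 1)
            # md5sum marks binary mode with a leading "*" before the path.
            if rel_path.startswith("*"):
                rel_path = rel_path[1:]
            full_path = os.path.join(bag_dir, rel_path)
            if not os.path.isfile(full_path):
                problems.append((rel_path, "missing"))
                continue
            digest = hashlib.md5()
            with open(full_path, "rb") as payload:
                for chunk in iter(lambda: payload.read(65536), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                problems.append((rel_path, "checksum mismatch"))
    return problems
```

Note that a check like this only visits paths listed in the manifest, so an empty directory under data can never appear in its output, whether it is present or not.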

justinlittman avatar Jan 24 '17 16:01 justinlittman

This adds complexity, but would it be possible to add a separate file in the data directory representing the file/folder tree and then validate that as a separate step?
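
One way to picture this idea is the sketch below, which records every file and directory under the payload and later compares the record with what is on disk. Everything here is an assumption for illustration: the function names list_tree and diff_tree, the trailing "/" used to mark directories, and the choice to return plain lists rather than write a particular file. The spec defines none of this.

```python
import os


def list_tree(payload_root):
    """Return a sorted list of every entry under payload_root, using "/" separators.

    Directories carry a trailing "/" so that empty ones are still recorded.
    Hypothetical helper for illustration; not defined by the BagIt spec.
    """
    entries = []
    for dirpath, dirnames, filenames in os.walk(payload_root):
        for name in dirnames:
            rel = os.path.relpath(os.path.join(dirpath, name), payload_root)
            entries.append(rel.replace(os.sep, "/") + "/")
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), payload_root)
            entries.append(rel.replace(os.sep, "/"))
    return sorted(entries)


def diff_tree(payload_root, recorded):
    """Compare a recorded listing with the current tree.

    Returns (missing, unexpected): entries recorded but absent on disk,
    and entries on disk that were never recorded.
    """
    current = set(list_tree(payload_root))
    recorded = set(recorded)
    return sorted(recorded - current), sorted(current - recorded)
```

If the listing were saved as a file inside data, it would get its own manifest entry like any other payload file, so the manifest format itself would stay compatible with md5sum and md5deep. The listing file would also need to be excluded from (or included in) its own contents, which this sketch leaves to the caller.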

andrewjbtw avatar Jan 25 '17 23:01 andrewjbtw