Auto-generate `File` metadata fields
If you crate.add_file() a local file, I think it makes sense to populate metadata fields such as contentSize and sha256.
This seems to be done already for remote files, so it makes sense to do the same for local ones.
Some other fields like height, width and duration could be automatically determined for images and videos respectively, but this is much harder and less essential.
Happy to help with this!
On contentSize I agree, it's just a system call to get a number.
However, generating the sha256 of a file can take quite some time, especially if the file is large. Imagine if you have to do the same for 10000 files... This could lead to unwanted overheads when creating an RO-Crate.
Maybe a boolean option to calculate the sha256, which defaults to false then?
Even a system call can introduce a consistent overhead, so we'd need boolean options for both properties. However, the FileOrDir and File code are already complicated enough, and a library user that needs to set those properties can easily compute them in client code and pass them through the properties argument:
size = ...
checksum = ...
crate.add_file(source, dest, properties={
    "contentSize": size,
    "sha256": checksum
})
It's better for the library to stay lean and leave optional stuff like this to client code.
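For reference, here is a minimal sketch of how the `size = ...` and `checksum = ...` placeholders in the snippet above could be filled in by client code. The helper names `file_size_and_sha256` and `size_and_checksum_properties` are hypothetical and not part of ro-crate-py. Storing contentSize as a string is an assumption too.

```python
import hashlib
import os


def file_size_and_sha256(path, chunk_size=1024 * 1024):
    """Return (size in bytes, hex SHA-256 digest) for a local file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are never loaded fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def size_and_checksum_properties(path):
    """Build a dict suitable for the `properties` argument shown above."""
    size, checksum = file_size_and_sha256(path)
    # Assumption: contentSize is recorded as a string.
    return {"contentSize": str(size), "sha256": checksum}
```

With this, the call from the snippet would become something like crate.add_file(source, dest, properties=size_and_checksum_properties(source)).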
In the case of remote files, note that contentSize is only added if validate_url is True and it's copied from the Content-Length HTTP header that we already have from checking the URL.
So that's a no to adding these arguments? The annoying thing for end users is that we have to do the size and checksum calculation for every file, and there are likely to be many of these. So without a built-in argument, I suspect many of us will end up implementing our own wrapper functions to do this.
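To make the concern concrete, the kind of wrapper users might end up writing could look roughly like the sketch below. Everything here is hypothetical: `add_file_with_metadata`, `with_size` and `with_sha256` are made-up names, and `crate` is only assumed to expose add_file(source, dest, properties=...) as in the snippet earlier in the thread. Both flags default to False, so the extra cost is opt-in.

```python
import hashlib
from pathlib import Path


def add_file_with_metadata(crate, source, dest_path=None, properties=None,
                           with_size=False, with_sha256=False):
    """Hypothetical wrapper: call crate.add_file with opt-in extra properties."""
    props = dict(properties or {})  # copy, so the caller's dict is not mutated
    path = Path(source)
    if with_size:
        # Assumption: contentSize is stored as a string.
        props["contentSize"] = str(path.stat().st_size)
    if with_sha256:
        digest = hashlib.sha256()
        with path.open("rb") as f:
            # Hash in 1 MiB chunks to keep memory use flat for large files.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        props["sha256"] = digest.hexdigest()
    return crate.add_file(source, dest_path, properties=props)
```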
Well, those properties are not required by the RO-Crate spec at any level, so some end users might want to add them while others might not. I'd rather not add the burden of more code to maintain to the library for something that's optional. I'll leave this open for others to chime in.
Other things to consider:
- `sha256` is not in the RO-Crate 1.1 context (but it's present in the forthcoming 1.2)
- One might want to use other checksum algorithms (e.g., `sha512`). These are also not in the RO-Crate 1.1 context (but several are present in the workflow-run context)
I think having flags to turn these fields on (and others, width/height/etc.) would be the way to go.
On having ro-crate-py doing it, or staying lean and asking clients to implement this, maybe there could be other options too. Like having plugins in ro-crate-py, like ro-crate-py-fileutils or so. When installed, then that brings code that tries to populate file information, mime type, whatnot.
This way ro-crate-py focuses only on RO-Crate and Python, and anything more specific but that helps users/implementation-devs would go to these plug-ins, and implementations can decide to use it or not.
Discussed with @stain last Thursday at the Workflow Run RO-Crate meeting. We decided to add contentSize: while it's not mentioned explicitly in the spec, the property shows up in many examples, so we can consider it kind of a recommendation. The implementation is in #201. Notes:
- The `record_size` flag is in `add_file` and is propagated to `FileOrDir.__init__`, but the `contentSize` property is actually added in `File.write` when the file is written to disk. If we added it when the file entity is created, it could get invalidated multiple times (e.g. if data is appended to the file before the crate is written).
- I ran a quick performance test that involved creating an RO-Crate with 10K files: the process is about 15% slower when `contentSize` is added to each `File` entity. For this reason I introduced the `record_size` flag and set its default to `False`.
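The reason for recording the size at write time rather than at creation time can be illustrated with a toy example. This is only a sketch: `SizedFile` is a hypothetical stand-in and not ro-crate-py's `File` class. Only the `record_size` name and the write-time behaviour are taken from the notes above.

```python
import os
import shutil
import tempfile


class SizedFile:
    """Toy stand-in for a file entity; NOT ro-crate-py's File class."""

    def __init__(self, source, record_size=False):
        self.source = source
        self.record_size = record_size
        self.properties = {}

    def write(self, dest_dir):
        dest = os.path.join(dest_dir, os.path.basename(self.source))
        shutil.copy(self.source, dest)
        if self.record_size:
            # Measured only now, so it reflects what was actually written.
            self.properties["contentSize"] = str(os.path.getsize(dest))
        return dest


def demo():
    """Return (size when the entity was created, contentSize at write time)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "data.txt")
        with open(src, "w") as f:
            f.write("abc")
        entity = SizedFile(src, record_size=True)
        size_at_creation = os.path.getsize(src)
        with open(src, "a") as f:  # the file grows before the "crate" is written
            f.write("defgh")
        out = os.path.join(tmp, "out")
        os.mkdir(out)
        entity.write(out)
        return size_at_creation, entity.properties["contentSize"]
```

Here the file grows from 3 to 8 bytes between entity creation and writing, so a size captured at creation would already be stale by the time the crate is written.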
Regarding the recording of a checksum such as sha256, we observed that:
- it's not mentioned in the RO-Crate spec at all
- there are too many types, and the terms are not in the RO-Crate context
- users might want to record more than one checksum, e.g., sha256 and sha512.
So this is better left to user-level code.
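For anyone who does want several checksums in user-level code, a minimal sketch using only the standard library could compute them in a single pass over the file. The function name `multi_checksum` is hypothetical. As noted above, the resulting keys are not in the RO-Crate 1.1 context, so using them as property names is left to the user.

```python
import hashlib


def multi_checksum(path, algorithms=("sha256", "sha512"), chunk_size=1 << 20):
    """Return {algorithm name: hex digest}, reading the file only once."""
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # Feed the same chunk to every hash object.
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Reading the file once matters mostly for large files, where I/O rather than hashing tends to dominate the cost.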
A fair approach! Thanks for summarizing it here, @simleo!
Implemented in #201