ro-crate-py icon indicating copy to clipboard operation
ro-crate-py copied to clipboard

Auto-generate `File` metadata fields

Open multimeric opened this issue 1 year ago • 9 comments

If you crate.add_file() a local file, I think it makes sense to populate:

This seems to be done already for remote files, so it makes sense to do the same for local ones.

Some other fields like height, width and duration could be automatically determined for images and videos respectively, but this is a much harder and less essential.

Happy to help with this!

multimeric avatar Sep 23 '24 04:09 multimeric

On contentSize I agree, it's just a system call to get a number.

However, generating the sha256 of a file can take quite some time, especially if the file is large. Imagine if you have to do the same for 10000 files... This could lead to unwanted overheads when creating an RO-Crate.

rsirvent avatar Sep 25 '24 10:09 rsirvent

Maybe a boolean option to calculate the sha256, which defaults to false then?

multimeric avatar Sep 25 '24 11:09 multimeric

Even a system call can introduce a consistent overhead, so we'd need boolean options for both properties. However, the FileOrDir and File code are already complicated enough, and a library user that needs to set those properties can easily compute them in client code and pass them through the properties argument:

size = ...
checksum = ...
crate.add_file(source, dest, properties={
    "contentSize": size,
    "sha256": checksum
})

It's better for the library to stay lean and leave optional stuff like this to client code.

In the case of remote files, note that contentSize is only added if validate_url is True and it's copied from the Content-Length HTTP header that we already have from checking the URL.

simleo avatar Sep 25 '24 13:09 simleo

So that's a no to adding these arguments? The annoying thing for end users is that we have to do the size and checksum calculation for every file, and there are likely to be many of these. So without a built-in argument, I suspect many of us will end up implementing our own wrapper functions to do this.

multimeric avatar Sep 25 '24 14:09 multimeric

So that's a no to adding these arguments? The annoying thing for end users is that we have to do the size and checksum calculation for every file, and there are likely to be many of these. So without a built-in argument, I suspect many of us will end up implementing our own wrapper functions to do this.

Well, those properties are not required by the RO-Crate spec at any level, so some end users might want to add them while others might not. I'd rather not add the burden of more code to maintain to the library for something that's optional. I'll leave this open for others to chime in.

simleo avatar Sep 25 '24 14:09 simleo

Other things to consider:

  • sha256 is not in the RO-Crate 1.1 context (but it's present in the forthcoming 1.2)
  • One might want to use other checksum algorithms (e.g., sha512). These are also not in the RO-Crate 1.1 context (but several are present in the workflow-run context)

simleo avatar Sep 26 '24 08:09 simleo

I think having flags to turning these fields (and others, width/height/etc.) would be the way to go.

On having ro-crate-py doing it, or staying lean and asking clients to implement this, maybe there could be other options too. Like having plugins in ro-crate-py, like ro-crate-py-fileutils or so. When installed, then that brings code that tries to populate file information, mime type, whatnot.

This way ro-crate-py focuses only on RO-Crate and Python, and anything more specific but that helps users/implementation-devs would go to these plug-ins, and implementations can decide to use it or not.

kinow avatar Sep 26 '24 09:09 kinow

Discussed with @stain last Thursday at the Workflow Run RO-Crate meeting. We decided to add contentSize: while it's not mentioned explicitly in the spec, the property shows up in many examples, so we can consider it kind of a recommendation. The implementation is in #201. Notes:

  • The record_size flag is in add_file and is propagated to FileOrDir.__init__, but the contentSize property is actually added in File.write when the file is written to disk. If we added it when the file entity is created, it could get invalidated multiple times (e.g. if data is appended to the file before the crate is written).
  • I ran a quick performance test that involved creating an RO-Crate with 10K files: the process is about 15% slower when contentSize is added to each File entity. For this reason I introduced the record_size flag and set its default to False.

Regarding the recording of a checksum such as sha256, we observed that:

  • it's not mentioned in the RO-Crate spec at all
  • there are too many types, and the terms are not in the RO-Crate context
  • users might want to record more than one checksum, e.g., sha256 and sha512.

So this is better left to user-level code.

simleo avatar Sep 30 '24 10:09 simleo

A fair approach! Thanks for summarizing it here, @simleo !

kinow avatar Sep 30 '24 10:09 kinow

Implemented in #201

simleo avatar Nov 13 '24 09:11 simleo