GZIP as a container trigger

Open thorsted opened this issue 2 years ago • 4 comments

Our Rosetta Working Group has identified a couple of file formats that use GZIP, rather than regular ZIP, as a container for an existing format.

  • Some institutions use GZIP to compress WARC files. These files retain the .gz extension, but need a PUID as gzipped WARC.
  • Adobe Premiere Pro project files are also XML compressed with GZIP, but they use ".prproj" as the extension.

Could DROID add a "container" trigger and parser to identify files like these, similar to ZIP/OLE?

CC2022-S01.prproj.zip

thorsted avatar May 19 '22 17:05 thorsted

@thorsted there's a bit of a discussion about GZIP here, though I don't know exactly how close the gzip part of that discussion is to what you're looking for: https://github.com/digital-preservation/droid/issues/221

ross-spencer avatar May 20 '22 06:05 ross-spencer

@ross-spencer ohh, thank you. I have a vague memory of seeing this but only searched for open issues. Do you feel this is the right direction for identification or is there another way?

thorsted avatar May 20 '22 16:05 thorsted

@thorsted I haven't looked at it in a while. That piece was for a client (one of the statistical outputs of Dataverse) but I didn't have a massive amount of time to dig into it. I feel like we're both describing a clear two-step process: identify gzip -> identify the contents of the gzip, which are known and together form their own discrete "thing". That looks a lot like container identification. Issue 221 had a broad scope which a few people strongly disagreed with, but I also feel that if it technically makes sense to treat gzip like a container, or it can be treated like one, then there are benefits.
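
A minimal sketch of that two-step idea in Python, outside of DROID and not how DROID's container identification is implemented. The inner signatures (`WARC/` and `<?xml`), the labels, and the function name `identify_gzip_container` are assumptions for illustration only, not PRONOM signatures:

```python
import gzip
import io
import zlib

GZIP_MAGIC = b"\x1f\x8b"

# Hypothetical inner signatures, for illustration only (not taken from PRONOM).
INNER_SIGNATURES = {
    "gzipped WARC": b"WARC/",
    "gzipped XML": b"<?xml",
}


def identify_gzip_container(data: bytes, peek: int = 64):
    """Return a label for gzip-wrapped content, or None if not gzip."""
    # Step 1: the "trigger". Check the outer gzip magic bytes.
    if not data.startswith(GZIP_MAGIC):
        return None
    # Step 2: decompress only the first few bytes and match an inner signature.
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
            head = gz.read(peek)
    except (OSError, EOFError, zlib.error):
        return None  # looked like gzip but could not be decompressed
    for label, signature in INNER_SIGNATURES.items():
        if head.startswith(signature):
            return label
    return "gzip (unrecognised contents)"
```

Only the start of the stream is decompressed here, which keeps the check cheap for large WARC files.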

ross-spencer avatar May 23 '22 07:05 ross-spencer

Ran across another format based on gzip: the "Art of Illusion" AOI 3D format. The format is gzipped with a variable filename. Double problem.
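
If "variable filename" here refers to the optional original-name (FNAME) field stored in the gzip header, which is an assumption on my reading, then a rough Python sketch of reading that field per RFC 1952 looks like this. The function name `gzip_original_name` is hypothetical:

```python
import struct

FEXTRA = 0x04  # header flag: extra field present
FNAME = 0x08   # header flag: original file name present


def gzip_original_name(data: bytes):
    """Return the FNAME field from a gzip header, or None if absent."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        return None
    flags = data[3]
    pos = 10  # fixed header: magic(2) CM(1) FLG(1) MTIME(4) XFL(1) OS(1)
    if flags & FEXTRA:
        if len(data) < pos + 2:
            return None
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen  # skip the length field and the extra data
    if not flags & FNAME:
        return None
    end = data.find(b"\x00", pos)  # the name is zero-terminated
    if end == -1:
        return None
    return data[pos:end].decode("latin-1")
```

Because that name is optional and changes from file to file, a signature could not match on it the way ZIP container signatures match on fixed internal paths.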

thorsted avatar Jul 05 '23 19:07 thorsted