SigMF Compression for SigMF Archives

Item from GRCon discussions:

So, we had previously decided in the discussion about archive formats #15 to disallow compression, with the reasoning that compressing IQ recordings rarely gives you anything and complicates the reader / writer applications. But, a couple of folks from the National Labs pointed out that you sometimes have recordings of mostly zero values in some systems. They are currently using HDF5 and considering moving over to SigMF in a number of their programs, but would like to see a compression capability, which HDF5 allows for.

I think this argument is reasonable, and it does make sense. There will definitely be recordings where the Vpp changes little enough that compression is still useful.

I'd like to solicit thoughts and opinions on both this topic, generally, as well as what forms of compression might make the most sense for SigMF archives.

Sep 20 '17 15:09 bhilburn

Not having a compression format in the spec doesn't stop people from using compression, though?

Using tar (which we've used as our archive format) as an example, tar doesn't itself specify any compression, but it's almost always compressed, with the compression format tacked on after .tar, so .tar.gz or .tar.bz2, etc.

Could we recommend a similar syntax, where we use .sigmf.gz, .sigmf.bz2, .sigmf.whatever_works_for_you?

Pros:

simplicity of the spec
simplicity of reader/writer applications (they only have to support untarring/tarring, leaving the user to use an external {de}compression utility first, or they could detect the common compression formats and do that for you as well).
flexibility to use best compression format for your application

Cons:

someone could theoretically compress all their sigmf data with some tool/format that is not free, not OSS, not available on all OS's, etc
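As a rough sketch of the "detect the common compression formats and do that for you" option above, assuming a reader written in Python: the standard `tarfile` module can open a tar stream with transparent compression detection via mode `"r:*"`. The helper names and the member path here are hypothetical, and the metadata payload is only a placeholder.

```python
import io
import tarfile


def list_archive_members(path):
    """Return member names of a tar-based archive, compressed or not."""
    # "r:*" asks tarfile to detect the compression (none, gzip, bzip2, xz)
    # from the file contents, so the extension does not matter.
    with tarfile.open(path, mode="r:*") as archive:
        return archive.getnames()


def write_demo_archive(path, mode):
    """Write a tiny demo archive; mode is e.g. "w", "w:gz" or "w:bz2"."""
    payload = b'{"global": {}}'  # placeholder metadata, not a valid recording
    info = tarfile.TarInfo(name="rec/rec.sigmf-meta")  # hypothetical path
    info.size = len(payload)
    with tarfile.open(path, mode=mode) as archive:
        archive.addfile(info, io.BytesIO(payload))
```

With this approach the reader code is the same whether the user hands it a plain or a compressed archive; only formats that `tarfile` knows about are covered.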

Sep 20 '17 16:09 djanderson

This is a side issue to the question of doing it at all but note that extensions like .sigmf.gz are bad for use in (at least some) desktop environments because .gz would end up matched to a generic decompressor rather than a SigMF viewer which can decompress transparently. This is why when I proposed standardizing an archive format I specified that the extension should be .sigmf and not .tar.

Sep 20 '17 16:09 kpreid

I'm actually using the SigMF archive format, and I can tell you that 100% of the time I just change the extension to .tar when I download the file so that the built-in archive utility recognizes it and opens it. That's fine, and I get and agree with the point of using the .sigmf extension, but also hiding the compression format would just make the file impossible to open with external tools on a system without a dedicated sigmf reader installed.

You can always open the files directly in the sigmf reader that supports compression to avoid it being picked up by a decompression utility first.

Sep 20 '17 16:09 djanderson

We can also, very easily, add tools for common OS's to identify .sigmf as tar, and .sigmf.gz as tar.gz etc. I like @djanderson's approach -- it's common practice, and solves the problem by making it not-our-problem. The issue that people could start using non-free compressors is a thing, but when does that actually happen these days? WinRAR anyone? I haven't used ARJ since the era of distributing video games on 3.5" floppies :older_man:

Oct 03 '17 19:10 mbr0wn

We can also, very easily, add tools for common OS's to identify .sigmf as tar, and .sigmf.gz as tar.gz etc.

@mbr0wn I don't think there is usually the ability to assign a meaning to an extension with another dot in the middle, as .sigmf.gz has. I could be wrong, but I've never seen it actually done. An extension like .sigmfgz would not have this problem (if it is a problem) and would still be documenting the format.

Oct 04 '17 03:10 kpreid

It doesn't sound like anyone has any arguments against allowing for compression, so great! We'll move ahead with that. Now, to figure out how to spec it, hah =)

@kpreid has an interesting point regarding enabling transparent reading (and thus decompression) if we have a unique extension. Using OS built-ins can be really convenient, too, as @djanderson points out.

It seems like a key question is: do we expect most people to be interacting with SigMF Archives using archive utilities or SigMF applications? Let's say we have a SigMF reader application that is popular and widespread - how many people will still want to just download the archive, peek into it, and open the metadata in vim / less / whatever?

To be honest, I don't think I have a strong opinion one way or the other right now. I'll spin on it a bit, and am really interested to hear everyone else's opinion, if you feel strongly about it.

Oct 26 '17 20:10 bhilburn

Just want to confirm that some of my real-world raw I-Q data files do compress very well (50% or more). int16 interleaved; could be more effective if the data is in ieee754. This feature will help transferring data files over the Internet a lot.

One alternative to compressing the whole archive file could be compressing the raw I-Q data file only, and offer a SigMF label for this. (compression_type: raw, gz, etc.)
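A minimal sketch of that alternative, assuming Python and gzip: only the interleaved int16 I/Q bytes are compressed, and the metadata carries a label saying so. The `compression_type` key is the hypothetical label suggested above, not an existing SigMF field, and `pack_recording` is a made-up helper.

```python
import gzip
import json
import struct


def pack_recording(iq_pairs):
    """Gzip interleaved little-endian int16 I/Q samples and label the result."""
    flat = [value for pair in iq_pairs for value in pair]
    raw = struct.pack("<%dh" % len(flat), *flat)
    compressed = gzip.compress(raw)
    metadata = {
        "compression_type": "gz",  # hypothetical label, not part of SigMF
        "uncompressed_bytes": len(raw),  # hypothetical helper field
    }
    return compressed, json.dumps(metadata)
```

A reader would check the label first and only then decide whether it needs a decompression library, which is the trade-off raised in the replies.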

Nov 01 '17 01:11 cityscapesc

@cityscapesc - Thanks for the input and data point!

You raise an interesting idea about just compressing the datafile. The primary issue I can think of, there, is that then we are almost forcing Reader / Writer applications to have de-compression libs built into them. If we do it at the archive level, it could be a more natural part of the workflow to do it prior to application loading. Either route carries some risks, though, in terms of application complexity.

I think @cityscapesc's data point is a good one, though, and backs up the feedback we got at GRCon: compression would be useful.

So, going back to the question of extension: I just pulled up the documentation on MIME Extensions, and as it turns out, the specification is that the longest pattern has the heaviest weight. From https://specifications.freedesktop.org/shared-mime-info-spec/shared-mime-info-spec-latest.html#idm140625828677088 :

If several patterns of the same weight match then the longest pattern SHOULD be used. In particular, files with multiple extensions (such as Data.tar.gz) MUST match the longest sequence of extensions (eg '.tar.gz' in preference to '.gz').

My read of this is that we could safely do sigmf.gz, as long as we also ship a MIME Extension XML definition with SigMF. Thoughts?
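To make the longest-pattern rule concrete, here is a small Python sketch. It assumes all patterns carry the same weight and ignores the rest of the matching procedure in the spec; the glob table and the two SigMF type names are invented for illustration.

```python
from fnmatch import fnmatch

# Hypothetical glob table; the SigMF type names are made up for this sketch.
GLOBS = {
    "*.gz": "application/gzip",
    "*.sigmf": "application/x-sigmf-archive",
    "*.sigmf.gz": "application/x-sigmf-archive+gzip",
}


def guess_type(filename, globs=GLOBS):
    """Return the type of the longest glob that matches, or None."""
    matches = [pattern for pattern in globs if fnmatch(filename, pattern)]
    if not matches:
        return None
    # Equal weights assumed, so the longest matching pattern wins.
    return globs[max(matches, key=len)]
```

Under these assumptions `capture.sigmf.gz` resolves to the SigMF-specific entry rather than the generic gzip one, which is the behaviour the quoted rule describes for `.tar.gz`.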

Nov 02 '17 20:11 bhilburn

We've seen in the past couple of years that most people are compressing SigMF recordings. Even when the gains from compressing the raw samples are low (as expected), the gains can be meaningful in recordings with significant annotations.

In #99, I sort of sloppily added support for gzip. Realistically, supporting gzip and bzip2 is probably safe given their ubiquity. It's entirely possible people will want to use other forms of compression, but making those canonical would be problematic on some systems (thinking of 7zip, for example).

My inclination is to allow compression of archives using gzip and bzip2. I'm leaving this issue open for disagreeing opinions for a bit longer, in case anyone wants to comment.

Jul 12 '19 15:07 bhilburn

So I have an alternate proposal. From the conversation I've picked up on the following desires:

use some compression to shrink the filesize because the data portion is large
be able to use tools to work with this compressed record

I think that the experience of using a library to work with a compressed tar would be fairly painful because it requires you to decompress the whole archive in order to read the metadata and provides no random access to the data (so you have to decompress and untar the whole thing). This was also pointed out by @cityscapesc. I propose:

add an option around the datatype to compress just the data file portion of a record using compression formats that support pseudo-random access (see https://stackoverflow.com/questions/429987/compression-formats-with-good-support-for-random-access-within-archives)
leave the archive as is. If people want to compress the archive that would be OK too, but there's marginal benefit to it

This would allow a lightweight reading of the metadata, then randomly seeking to samples of interest without decompressing the entire archive. This means it's now possible to deal with very large recordings in a compressed manner without a very large memory requirement and no need to make a duplicate decompressed copy on disk. The expense is that of CPU power to decompress on the fly.
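One way to picture the pseudo-random access idea is block-wise compression with an offset index. The sketch below is an assumption-laden illustration in Python using `zlib`, not a proposed on-disk format: the block size, the index layout, and both function names are arbitrary.

```python
import zlib

BLOCK_BYTES = 1 << 16  # uncompressed bytes per block (arbitrary choice)


def compress_blocks(data, block_bytes=BLOCK_BYTES):
    """Compress each block independently; return (blob, byte offsets)."""
    blob = bytearray()
    offsets = [0]  # offsets[i]:offsets[i + 1] brackets compressed block i
    for start in range(0, len(data), block_bytes):
        blob += zlib.compress(data[start:start + block_bytes])
        offsets.append(len(blob))
    return bytes(blob), offsets


def read_range(blob, offsets, start, length, block_bytes=BLOCK_BYTES):
    """Return data[start:start + length], inflating only the blocks it touches."""
    if length <= 0:
        return b""
    first = start // block_bytes
    last = min((start + length - 1) // block_bytes, len(offsets) - 2)
    pieces = [
        zlib.decompress(blob[offsets[i]:offsets[i + 1]])
        for i in range(first, last + 1)
    ]
    skip = start - first * block_bytes
    return b"".join(pieces)[skip:skip + length]
```

Smaller blocks make seeks cheaper but usually hurt the compression ratio, so the block size is the main knob in a scheme like this.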

Lots of attention and care to selecting the right compression format is required, but I'm curious if this sounds like a reasonable approach to the folks with the use case for it.

Apr 24 '20 15:04 n-west

@n-west - I like your suggestion a lot. Reading that thread, it actually sounds like both gzip and bzip2 are also capable of pseudo-random access to the compressed blocks, but it's not clear to me if those are natively available in the primary packages that are distributed.

I would like to have the feature described by Nathan 👆👆 in v1.0.0. Marking as such

May 26 '21 14:05 bhilburn

I don't see this being ready for v1.0... moving to 2.x (can pull in to 1.1 or something sooner if we want to, this is a new feature and should not break anything)

Jul 29 '21 12:07 jacobagilbert

As a possibly related bit of info, there's IQZIP by the Libre Space Foundation which implements a CCSDS lossless compression standard. They did some interoperability work with SigMF and it looks like that was removed. https://gitlab.com/librespacefoundation/sdrmakerspace/iqzip

Nov 01 '21 17:11 dkozel