
Schema vs Model Distinction

Open bhilburn opened this issue 6 years ago • 17 comments

One of the comments we got at GRCon about SigMF was that it seemed to make working with datasets difficult.

Specifically, what this person wanted to be able to do was SELECT something in a database, parametrically, based on the metadata, and then have it return a chunk of samples. The obvious solution is chunking the SigMF data file by capture segment and then storing those chunks with the segments as keys - but this no longer represents a compliant recording per the standard. Possible? Yes. But not standard.

Is this something we should address? I agree that it is a useful structure and I think a lot of users will want to use something like it. Even if we don't want to make this a compliance requirement, are there things we can do in the standard to make it easier to accomplish?
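For concreteness, here is a minimal sketch of that chunk-by-capture approach in Python. It assumes a cf32_le dataset and the standard captures array with core:sample_start keys; the function name and the storage step are illustrative, not anything the spec defines:

```python
import json
import numpy as np

def chunk_by_capture(meta_path, data_path):
    """Yield (sample_start, samples) pairs, one per capture segment."""
    with open(meta_path) as f:
        meta = json.load(f)
    # Assumes a cf32_le dataset, which maps to numpy's complex64.
    data = np.fromfile(data_path, dtype=np.complex64)
    captures = sorted(meta["captures"], key=lambda c: c["core:sample_start"])
    for i, cap in enumerate(captures):
        start = cap["core:sample_start"]
        end = (captures[i + 1]["core:sample_start"]
               if i + 1 < len(captures) else len(data))
        yield start, data[start:end]

# Each chunk could then be stored with its sample_start as the key, e.g.:
# chunks = dict(chunk_by_capture("rec.sigmf-meta", "rec.sigmf-data"))
```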

bhilburn avatar Sep 20 '17 15:09 bhilburn

I claim that as a general principle of software engineering, one should not call an application noncompliant just because:

  • it internally, as an implementation detail, stores data in a format different than the standard, or
  • it is capable of returning the data in a nonstandard but useful format.

Rather, compliance of an application should be judged on conditions such as:

  • it can read/import/intake files in the standard format (if the application reads such files);
  • it can write/export files in the standard format (if the application writes such files);
  • it does not produce nonstandard-format files it claims are standard; and
  • there are no compliant files which it cannot read, other than due to size limits.

kpreid avatar Sep 20 '17 16:09 kpreid

@bhilburn, did you get any more insight into what specifically makes it difficult with databases? I think the fact that we split metadata from data, break data into capture segments, and provide unique keys in the form of sample_start to find those capture segments makes it pretty straightforward to load into a database. For the record, I'm storing SigMF data in a relational db, though I don't give each capture its own row. Though I do store in a db for more efficient searching/filtering/seeking into data, I wouldn't want the actual SigMF format to be anything other than a flat file.

I'm honestly not sure what we could do to make SigMF easier to drop into a database, and as @kpreid said, there's nothing about the spec that stops or even discourages them from creating an application that does so.
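For illustration, here is a minimal sketch of the SELECT-a-chunk workflow the original question describes, using a hypothetical one-row-per-capture SQLite schema (again, nothing the spec prescribes or forbids):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE captures (
        recording_id TEXT,
        sample_start INTEGER,
        samples      BLOB,
        PRIMARY KEY (recording_id, sample_start)
    )
""")
# One row per capture segment, keyed by sample_start.
conn.execute("INSERT INTO captures VALUES (?, ?, ?)",
             ("rec-001", 8192, b"\x00" * 64))  # placeholder sample bytes

# Parametric SELECT that returns just the chunk of samples for a segment:
(chunk,) = conn.execute(
    "SELECT samples FROM captures WHERE recording_id = ? AND sample_start = ?",
    ("rec-001", 8192),
).fetchone()
```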

djanderson avatar Sep 20 '17 17:09 djanderson

The biggest proponent of this, actually, was @namccart. He was explaining that one of the reasons he really likes VITA49 for this particular application is that it provides data that is pre-chunked, or at least easily 'chunkable'.

So, based on my understanding from @namccart, for example: if you load a SigMF recording into a database and search over sample_start as a key, once you identified the one you wanted, you would still have to load the entire dataset to index to that key. As you said, @djanderson, "[you] don't give each capture its own row", which I think is the crux of Nick's issue?

Nick, can you comment?

bhilburn avatar Sep 20 '17 17:09 bhilburn

I'm also not convinced this is really a SigMF problem. I see how it makes writing SQL <-> SigMF converters a bit more complicated, but they also solve really, really different problems.

mbr0wn avatar Oct 03 '17 19:10 mbr0wn

I think Ben captured my issue pretty well. Given what the DARPA folks want to do with SigMF, I think you really want to consider how SigMF plays nicely with the overall problem of data retrieval from big RF data archives. Hearing Tom talk about his gnuradio-SQL idea (which I also want and think is inevitable), everything starts with being able to retrieve arbitrary I/Q based on reasonable query strategies. I'm sure there are better solutions to this problem than I can imagine. I think HDF5 has an entirely different solution to searching the archive than trying to chunk the data into a database... what I don't know is whether HDF5 plays nicely with HDFS or other distributed setups. I know very little about HDF5 except that it's intriguing.

In any event, if you accept that this is in fact a problem SigMF should address, I think it makes sense to move in a direction that supports one or more existing ways of archiving and searching lots of data (time-series data or otherwise). For me, those candidates are databases (CouchDB, MongoDB, Postgres), database-like things (Elasticsearch), and HDF5...
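(For anyone curious, a minimal sketch of what the HDF5 answer looks like via h5py: chunked storage means a slice only reads the chunks it touches, so arbitrary I/Q ranges can be retrieved without loading the whole archive. The dataset name and chunk size here are made up.)

```python
import h5py
import numpy as np

# Write a chunked I/Q dataset; HDF5 stores it as independently
# addressable 64k-sample chunks on disk.
with h5py.File("archive.h5", "w") as f:
    iq = np.zeros(1_000_000, dtype=np.complex64)  # placeholder samples
    f.create_dataset("iq", data=iq, chunks=(65536,))

# Reading a slice touches only the chunks that overlap it.
with h5py.File("archive.h5", "r") as f:
    segment = f["iq"][250_000:260_000]
```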

Cheers, Nick M.


namccart avatar Oct 04 '17 01:10 namccart

Okay, so, SigMF already provides a solution to this, but we should discuss whether there are changes that would improve it:

So, what @namccart cares about, per my comment above, is the ability to load smaller "chunks" of data than the entire dataset, which makes it much easier to work with databases. SigMF allows for this using the offset field of the core namespace, which lets you break a dataset up into multiple files that together represent a continuous recording. You could, for example, break a dataset into five .sigmf-data files with five matching .sigmf-meta metadata files, with offsets that connect each one to the one that precedes it.

So, the question here, then, is: "What, if anything, could we do to make this better?" Is there some change we should recommend? If we just provide a tool that cleanly splits your dataset into multiple files of a parameterizable size, does that solve the issue?
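To make that concrete, here is a minimal sketch of such a chunker in Python. It assumes a cf32_le dataset and uses the core:offset global field (per my comment above) to tie each chunk back to its position in the original recording; note that it naively copies the metadata wholesale, where a real tool would also need to rebase captures and annotations into each chunk:

```python
import json
import numpy as np

def split_recording(meta_path, data_path, chunk_size, out_prefix):
    """Split one recording into smaller ones linked by core:offset."""
    with open(meta_path) as f:
        meta = json.load(f)
    data = np.fromfile(data_path, dtype=np.complex64)  # assumes cf32_le
    for i, offset in enumerate(range(0, len(data), chunk_size)):
        data[offset:offset + chunk_size].tofile(f"{out_prefix}-{i}.sigmf-data")
        chunk_meta = dict(meta)
        chunk_meta["global"] = dict(meta["global"], **{"core:offset": offset})
        # NOTE: captures/annotations are copied wholesale here; a real
        # tool would need to rebase them into each chunk as well.
        with open(f"{out_prefix}-{i}.sigmf-meta", "w") as mf:
            json.dump(chunk_meta, mf)
```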

bhilburn avatar Oct 09 '17 22:10 bhilburn

Spun on this a bit. @kpreid had a really good point, early on, that we shouldn't call something non-compliant because of anything it does "internally" or "locally". It's really all about the ingress and egress.

Per my previous comment, what @namccart wants to do is already pretty doable with SigMF. We could make it easier by providing a tool, for example, that shows you how to chunk the data based on metadata segments, but there really isn't anything difficult here, in my opinion.

So, I think the final question that should be debated is whether or not this is a format that we want to be able to distribute SigMF Recordings in. Right now, a compliant recording cannot be distributed where the binary data has been chunked into a bunch of files and one metadata file references all of them. We specifically decided against allowing the 1-to-many case in #19. It was in the context of multiple streams of data, but the reasoning still applies here, I think.

So, before we close this issue out with either a "do nothing" or a "make an example chunker program" decision, does anyone think we should revisit 1-to-many given this use case? We do now have an archive format described, which we didn't at the time of #19, so it would (presumably) be easier to distribute multiple files in a recording.

bhilburn avatar Oct 26 '17 19:10 bhilburn

I'm new here, so if these comments are missing the point, I apologize.

One feeling I had as I read the spec (as an experienced spec reader and writer) is that the current draft spec conflates the semantic content of the metadata with the transfer encoding/format of the data.

In plain English: it seems to me the definition of "what are the allowed tags and values in SigMF metadata" can (and should) be separate from HOW the tag value pairs are encoded.

I'm all for SigMF metadata including "datatype" and "sample_rate" and "version", and so forth; consider this the "schema" of "SigMF metadata". But I think the spec would be strengthened by separating out the fact that "it must be a JSON file".

I feel the SigMF spec SHOULD say: when SigMF metadata is written to a file, it then must be a JSON-structured UTF-8 file, with a single object per file, using the specified extension.

If ALSO a standard way of "writing a SigMF object to a SQL database" is needed... then that should be specified as an alternate way to store SigMF metadata (and maybe the dataset, too).

Should one write the JSON version of the metadata as a text blob to a single VARCHAR field? Or should each field of the metadata get its own SQL field? Personally, I don't care; I find both of these reasonable in certain cases. Should the SigMF spec weigh in on the "correct"/standard way to do this? Only if the community thinks it is helpful.
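To illustrate the two options, here are minimal sketches using SQLite; both schemas are hypothetical, and neither is blessed by the spec:

```python
import json
import sqlite3

# A tiny stand-in metadata object for demonstration.
meta = {"global": {"core:datatype": "cf32_le",
                   "core:sample_rate": 1e6,
                   "core:version": "0.0.1"}}
conn = sqlite3.connect(":memory:")

# Option 1: the whole metadata object as a text blob in one column.
conn.execute("CREATE TABLE meta_blob (recording_id TEXT, metadata TEXT)")
conn.execute("INSERT INTO meta_blob VALUES (?, ?)",
             ("rec-001", json.dumps(meta)))

# Option 2: one SQL column per metadata field.
conn.execute("""CREATE TABLE meta_fields
                (recording_id TEXT, datatype TEXT,
                 sample_rate REAL, version TEXT)""")
g = meta["global"]
conn.execute("INSERT INTO meta_fields VALUES (?, ?, ?, ?)",
             ("rec-001", g["core:datatype"],
              g["core:sample_rate"], g["core:version"]))
```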

And then what if I want to store SigMF data -- both metadata and the dataset -- in a document database such as MongoDB? Do we need to define a "standard" -- that is "compliant" -- way to do that?

MY MAIN POINT is that because the verbiage of the spec conflates the metadata model with its encoding ("SigMF metadata is a JSON object with this format"), I think it leads to the ambiguity that is being discussed in this thread.

My advice: separate sections for the semantic part of "what is SigMF metadata", and then requirements for how they should be serialized into a file (JSON), and -- if desired by the community -- recommendations for "best practices" when stored in relational records, or a document database, or -- as needed -- in other portable files/containers/transfer mechanisms.

dharasty avatar Nov 13 '17 20:11 dharasty

I agree that distinguishing consistently between schema/model and encoding would be useful, but I think that "separate sections" is a bad idea unless those sections are interleaved: the value of making the distinction clear is less than the value of making it obvious how to implement SigMF's intended primary use — an interchange file format.

kpreid avatar Nov 14 '17 03:11 kpreid

@kpreid: It is a pretty common technique in standards documents to separate the schema from encodings. In fact, in many standards documents, ALL the encodings show up as examples/supplementary information in appendices.

For SigMF to really catch on, I think it needs to address its motivating use case of FILE interchange, but it ought to give SOME consideration to logical next steps, such as storing both the dataset and metadata in either relational or document databases. (After all, a filesystem and a tarfile are each simply one instance of a "document database" or "document datastore".)

Actual file storage might be many users' primary use case... but for me, it probably won't be. Minor adjustments to the contents and the format of the spec might ensure my use case is well covered, too. This will be a boon to the spec if we can achieve it without impeding the file use case... and I feel we can.

All that said, I have no trouble if we inline/interleave JSON-file examples in the text, provided that 1) there is a clear editorial distinction between "schema requirements" and "JSON-file encoding requirements", and 2) there is some other place in the document that addresses the needs of other encodings (possibly appendices).

dharasty avatar Nov 14 '17 17:11 dharasty

So, it's taken me far too long to address this.

@dharasty - I think you make really excellent points, and I appreciate you providing your insight here. I would like to make the change you suggest (i.e., distinguishing between schema and model) as part of the v0.0.2 work I'm hacking on now.

I'm interested to know your thoughts on the best way to go about doing this. Is there any chance you would be up for putting together a PR that demonstrates an approach you think works well?

bhilburn avatar Jul 17 '18 19:07 bhilburn

Some minor changes that clearly distinguish between the schema and file encoding will be made in the v0.0.2 release per the discussion above.

bhilburn avatar Jul 12 '19 15:07 bhilburn

I feel like this is an important conversation, but it should probably be pushed to v1.1+ so as not to delay the timely release of v1.0.0.

jacobagilbert avatar May 27 '21 02:05 jacobagilbert

@bhilburn do you agree with @jacobagilbert 's comment? I do.

gmabey avatar Jun 14 '21 21:06 gmabey

@bhilburn ping

gmabey avatar Jul 09 '21 23:07 gmabey

I actually think the fundamental change we are talking about here is super simple and pretty light-touch. I'll get a PR together that does it once we've got the major churn done (merging #135 and #140).

bhilburn avatar Jul 28 '21 18:07 bhilburn

@bhilburn It is pretty exciting to see that this is the only issue still languishing in the "Not Started" bucket for the 1.0 release ... I wait with bated breath for progress :-D

gmabey avatar Aug 04 '21 21:08 gmabey