SigMF
Untarless
I've now added another class that allows for accessing data without un-tar-ing an archive. The intent is to address the case where an archive is so large that it is onerous/infeasible/obnoxious to unpack it. Please consider exercising this patch with this script:
#!/usr/bin/env python3
import json
import numpy
import sigmf
import warnings

warnings.filterwarnings("always")

global_info = {
    "core:author": "Glen M",
    "core:datatype": "ri16_le",
    "core:license": "https://creativecommons.org/licenses/by-sa/4.0/",
    "core:num_channels": 2,
    "core:sample_rate": 48000,
    "core:version": "1.0.0",
}
capture_info = {
    "core:datetime": "2021-06-18T23:17:51.163959Z",
    "core:sample_start": 0,
}
NUM_ROWS = 5

for dt in "ri16_le", "ci16_le", "rf32_le", "rf64_le", "cf32_le", "cf64_le":
    global_info["core:datatype"] = dt
    for num_chan in 1, 3:
        global_info["core:num_channels"] = num_chan
        base_filename = dt + '_' + str(num_chan)
        archive_filename = base_filename + '.sigmf'
        data_filename = base_filename + '.sigmf-data'
        # generate a ramp of samples in the requested datatype and write the raw dataset
        a = numpy.arange(NUM_ROWS * num_chan * (2 if 'c' in dt else 1))
        if 'i16' in dt:
            b = a.astype(numpy.int16)
        elif 'f32' in dt:
            b = a.astype(numpy.float32)
        elif 'f64' in dt:
            b = a.astype(numpy.float64)
        else:
            raise ValueError('whoops')
        b.tofile(data_filename)
        # build the metadata and pack data + metadata into a .sigmf archive
        meta = sigmf.SigMFFile(data_file=data_filename, global_info=global_info)
        meta.add_capture(0, metadata=capture_info)
        meta.tofile(archive_filename, toarchive=True)
        # access the samples directly from the archive, without un-tar-ing it
        archi = sigmf.SigMFArchiveReader(archive_filename, skip_checksum=True)
        print(f'{dt} with {num_chan} channels: .ndim = {archi.ndim} .shape = {archi.shape}')
        print(archi.sigmffile._memmap[:])
        print(archi[:])
        print([i for i in archi])
        print()
Should we close #222 and just review this PR?
@777arc I had asked @gmabey to split these out into separate requests since they are unrelated. We should be able to work through separating these things when it's merge time.
@gmabey in order to make this reviewable we really need to remove things that are not related to the untarless operation. Can you please cherry-pick the appropriate commits onto a clean copy of this branch and force-push?
Regarding your latest commit 8ba036b: I think the data should be mapped read-only everywhere; this declutters the API, and I do not believe sigmf provides any constructs for modifying archive datafiles in place.
@jacobagilbert I'm going to suggest that #209 be fully integrated before reviewing/revising this one, since this builds on that. That's on me for not making more progress on revising my PR with @Teque5's suggestions -- I'll commit to doing that this week.
Regarding read-only, I'm definitely OK with changing the default to read-only, but that's not what you suggested. I'm trying to objectively weigh my affinity for this feature against the batch_size feature that I objected to. My rationale in dissing batch_size was that it's not akin to anything in the numpy API, and my rationale for liking map_readonly is that it does have an analog in numpy-land -- kind of, since it's only part of numpy.memmap(), not numpy.ndarray.
I'm very comfortable with a parameter that toggles the read-only behavior, probably because I've used numpy.memmap() a lot over the years -- not unlike @Teque5's relationship with batch_size ... what do you think?
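For reference, the numpy analog in question is memmap's mode argument; a map_readonly toggle could reduce to something like this sketch (the function name and parameter spelling here are illustrative, not the final API):

import numpy as np

def open_dataset(path, dtype, map_readonly=True):
    # 'r' maps the file read-only; 'r+' maps it read-write, so assignments
    # through the returned array are flushed back to the file on disk
    mode = 'r' if map_readonly else 'r+'
    return np.memmap(path, dtype=dtype, mode=mode)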
We can rely on #209 and I'll focus on getting that merged. This will still need to be rebased / squashed though.
I am not following the other discussion. Can you describe the reason this sigmf module would open the dataset as read/write?
Yes, I'm sure you're right that a rebase will be expedient.
Regarding writing to an archive, here's a purely hypothetical situation. Suppose you have an RF recording of complex samples that is very large, and you're working on an algorithm to detect anomalous transmissions: deviations from what your algorithm has previously determined to be a "normal state" for that frequency range. To test it, you want to add some other signal to the samples (like +=) -- maybe a wideband, short-duration burst -- such that you have ground truth on the frequency and sample offset of the anomaly. If you open the archive read-write, you could do this pretty painlessly. Granted, you should update the core:sha512 field (if you're using it) after doing so [and I have a plan for doing that, but it's not ready yet], but then you're golden -- no un-tarring and re-tarring, and no running out of disk space.
A simpler case is: add some AWGN to increase the noise floor (thereby reducing the SNR) over a particular interval of samples.
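To make the hypothetical concrete, here is a sketch of what that in-place edit could look like through a writable map (the filename, dtype, and offsets are made up for illustration; an archive-backed map would go through the same numpy.memmap machinery):

import numpy as np

# map the raw dataset read-write; assumes cf32_le samples in a plain .sigmf-data file
samples = np.memmap('recording.sigmf-data', dtype=np.complex64, mode='r+')

# inject a short burst at a known offset so there is ground truth for the detector
offset, duration = 100_000, 1_024
rng = np.random.default_rng(0)
burst = (rng.standard_normal(duration) + 1j * rng.standard_normal(duration)).astype(np.complex64)
samples[offset:offset + duration] += burst

# or simply raise the noise floor over an interval with AWGN
noise = 0.01 * (rng.standard_normal(duration) + 1j * rng.standard_normal(duration)).astype(np.complex64)
samples[offset:offset + duration] += noise

# push the changes back to disk; core:sha512 would need to be recomputed afterwards
samples.flush()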
Whadaya think?
Both of these use cases are very real situations that are relevant to Python DSP, but I do not think these types of operations are appropriate for SigMF to address, much less provide tooling to support. I don't think it is appropriate for a library like sigmf to modify datasets at all, and we might actually want to provide a guarantee that sigmf will not [directly] modify datasets.
This is something an external tool would be used for (and that tool could use sigmf to update checksum or metadata information).
Hmm, I'm very surprised to hear this response; maybe we could work through that a bit.
I guess I really didn't think of this as sigmf doing the modifying, nor even really facilitating the modifying. Instead, I claim that it's a builtin numpy feature, as the writing to disk would go straight through the memmap. Like, as compared to other features, it sure doesn't seem like a very "hard" thing to support. So, WRT the "tooling to support" objection, it seems like a pretty light lift to me -- or really no lift at all, since numpy does literally all of the actual work (or, what I would call "tooling").
I realize that prior to #209 the only supported way to get at the data was to literally copy every single sample out of the file and convert it to a [complex] floating point type. Now, if a similarly large effort were required to write portions of an archive in place, then I would probably agree with you -- and object to the maintenance burden (especially all of those pesky dtype permutations). However, since this PR builds on #209, I think I disagree.
Instead, I would propose that the API guarantee that sigmf will not [directly] modify datasets, as long as map_readonly is asserted (which I agree needs to become the default).
And, if an external tool were to provide this support, would it not literally have to do all of the same operations that SigMFArchiveReader (oops that might need a name change) now does, except pass a different value to mode? Seems kinda silly-redundant to me, if 99% of the support is already there ... like, how bad is it to pass around one additional boolean through 2 layers of method calls?
I suppose you might also have an objection to modifying the metadata of an archive? I suppose that would then rule out adding an annotation? But if we allow that, then we probably shouldn't object to facilitating the updating of other metadata, maybe even sha512 ...
Adding a new annotation seems like a common task to me, and it would be cool to support that with archives, as long as it's not a ton of code to write/maintain. Modifying the signal data itself in place seems like a corner case to me that could be accomplished with application-side code, but if it's really already built into numpy and it's not much additional code for us to write/maintain, I don't see a huge problem with it. We should include a nice example of using it, though. I do like the idea of having the Reader be read-only by default, so maybe if you go to do a write and haven't set the right mode, it can spit out a nice error explaining what to do.
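For the annotation side, the non-archive workflow is already just a few lines with the existing SigMFFile API; something like this sketch (the filenames are hypothetical, and whether the same thing can be applied in place inside an archive is the open question here):

import sigmf

# load existing metadata and tag a region of interest
meta = sigmf.sigmffile.fromfile('recording.sigmf-meta')
meta.add_annotation(
    start_index=100_000,
    length=1_024,
    metadata={'core:comment': 'injected wideband burst'},
)
meta.tofile('recording.sigmf-meta')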
@777arc I'll look at what can be done regarding the error thingy.
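One light-lift option, sketched here with an illustrative (not final) message, would be to catch the ValueError that numpy raises when assigning through a read-only map and re-raise it with a hint:

from sigmf import error

def write_samples(mapped, start, values):
    # assigning through a mode='r' memmap raises "ValueError: assignment destination is read-only"
    try:
        mapped[start:start + len(values)] = values
    except ValueError as exc:
        raise error.SigMFFileError(
            'Dataset is mapped read-only; re-open the archive with map_readonly=False to modify samples.'
        ) from exc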
There is another mode in play here, which I didn't think was worth exposing, but y'all should at least be aware of it.
There is no objection to modification of metadata, and allowing metadata to be modified in place is a good idea, given that's part of the job of sigmf.
Also, if this becomes a widely requested or needed feature then I am not opposed to a larger discussion, but the complexity and implied contracts that come from allowing modification of datasets are something we currently avoid; instead we focus on management of the metadata (as is done with the reference in the README).
I'd like to hear whether these features are (at this point) of interest. I guess I'm surprised to realize (just now) that the fb89305ae5a9fe2bd85260857f4cded27046318b commit hasn't gone anywhere (or, attracted more interest). Furthermore, I'm confused as to why both commits are in this single PR -- and I would be happy to separate them, if they are of interest. My git acumen is increasing ...
Oh, sorry for the noise, but I now see that the first commit is really #209 ... anybody interested in reviewing that PR?
I think this is mergeable with the one comment above addressed ... that said, I'll need to cherry-pick the commit unless we can get #209 reviewed first.
I'll definitely revert that __version__ line, but I have found another bug related to reading from a buffer that I'll push a fix for at the same time (should be later today). That raises in my mind the lack of a unit test for this new class ...
I see also that you're right, this change is independent of #209.
If you rebase this onto sigmf-v1.x it should take care of whatever is going on. If that rebase gets upside down, you might need to cherry-pick just the second commit onto sigmf-v1.x and force push that to your untarless branch.
Just a bit of context regarding accessing a sigmf archive from a buffer, for the record. You see, I've got a SQL database with a record (haha) of type BLOB or BYTEA that is the actual SigMF archive. So, my app has got this byte array and wants to plot a specgram, and it seems pretty silly to have to write it out to a temp file in order for these utils to access the data and the metadata, especially when tarfile and numpy support file-like objects so well already.
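For concreteness, the no-temp-file pattern is roughly this sketch (the database fetch is hypothetical, and a real reader would derive the dtype and shape from core:datatype and core:num_channels rather than hard-coding them):

import io
import json
import tarfile
import numpy as np

blob = fetch_archive_bytes_from_database()  # hypothetical: a BLOB/BYTEA column holding the .sigmf archive

# tarfile reads happily from any file-like object, so no temp file is needed
with tarfile.open(fileobj=io.BytesIO(blob), mode='r') as tar:
    members = {m.name: m for m in tar.getmembers()}
    meta_name = next(name for name in members if name.endswith('.sigmf-meta'))
    data_name = next(name for name in members if name.endswith('.sigmf-data'))
    metadata = json.load(tar.extractfile(members[meta_name]))
    samples = np.frombuffer(tar.extractfile(members[data_name]).read(), dtype=np.int16)  # dtype from core:datatype in practice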
Also, after pushing this last commit, I did some sort of within-GitHub conflict-merge thingy that I've never done before, but it seemed to go OK and I hope that's really the case.
Note the lack of a test that exercises SigMFArchiveReader. I think I could in fact implement one next week, but I am also curious whether any other contributors to this project like this concept well enough to learn its interface through the process of writing a test.
I like this idea and I think it's ready pending two items:
- At least one test added to tests/, mostly just consisting of your example at the top of the PR.
- An example in the README as to how to use this approach.
OK, I think I've addressed all of these points, except that my test isn't working quite right. I think that someone more familiar with pytest could probably give me a hint to finish it up.
Here's what I'm seeing:
    test_sigmffile.data_file = None
    with tempfile.NamedTemporaryFile() as temp:
>       with pytest.raises(error.SigMFFileError):
E       Failed: DID NOT RAISE <class 'sigmf.error.SigMFFileError'>

tests/test_archivereader.py:49: Failed
====================== short test summary info =======================
FAILED tests/test_archivereader.py::test_access_data_without_untar - Failed: DID NOT RAISE <class 'sigmf.error.SigMFFileError'>
==================== 1 failed, 29 passed in 0.47s ====================
@gmabey I think the solution to your pytest error is to remove the line "with pytest.raises(error.SigMFFileError):" -- that unit test isn't trying to exercise an error/invalid condition like the unit test I assume you copied the code from. Remove that with block, unindent the rest of the lines, and it should be good to go. I'll look into setting up a pipeline to run the unit tests automatically.
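For reference, once that with block is gone the body of the test could look something like this sketch (assuming the test_sigmffile fixture from conftest.py and the tofile/SigMFArchiveReader calls used in the script at the top of this PR):

import tempfile
import sigmf

def test_access_data_without_untar(test_sigmffile):
    # pack the fixture into an archive, then map the samples without un-tar-ing it
    with tempfile.NamedTemporaryFile(suffix='.sigmf') as temp:
        test_sigmffile.tofile(temp.name, toarchive=True)
        arc = sigmf.SigMFArchiveReader(temp.name, skip_checksum=True)
        assert arc.ndim >= 1
        assert arc.shape[0] > 0
        assert len(arc[:]) == arc.shape[0]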