pycbc icon indicating copy to clipboard operation
pycbc copied to clipboard

Error in pycbc.catalog.Merger.strain() while retrieving GW170817-v2 (O1_O2-Preliminary release) event data

Open joao-aveiro opened this issue 2 years ago • 5 comments

Description

An error is raised while using pycbc.catalog.Merger.strain() to retrieve L1 strain data for the GW170817-v2 event from the O1_O2-Preliminary release. It seems that the data file is correctly downloaded and the problem is related to the way the channel name is defined in the class method:

ver = url.split('/')[-1].split('-')[1].split('_')[-1]
sampling_map = {4096: "4KHZ", 16384: "16KHZ"}
channel = "{}:GWOSC-{}_{}_STRAIN".format(ifo, sampling_map[sample_rate], ver)

which does not conform to the channel name provided in this dataset - L1:LOSC-STRAIN. This probably affects other datasets that are not from the GWT catalogs.

Test code

import pycbc.catalog


print("\nSelecting merger...")
m = pycbc.catalog.Merger("GW170817-v2", source="O1_O2-Preliminary")

print("Retrieving strain data...")
ts = m.strain("L1", sample_rate=4096, duration=2048).time_slice(m.time - 32, m.time + 32)

Error

XLAL Error - XLALFrameUFrChanRead_FrameL_ (LALFrameL.c:899): Channel L1:GWOSC-4KHZ_V1_STRAIN not found
XLAL Error - XLALFrameUFrChanRead_FrameL_ (LALFrameL.c:899): Wrong name
XLAL Error - XLALFrFileQueryChanVectorLength (LALFrameIO.c:199): Wrong name
Traceback (most recent call last):
  File "~/miniconda3/envs/gw-det/lib/python3.9/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 8, in <module>
  File "~/miniconda3/envs/gw-det/lib/python3.9/site-packages/pycbc/catalog/__init__.py", line 143, in strain
    return read_frame(str(filename), str(channel))
  File "~/miniconda3/envs/gw-det/lib/python3.9/site-packages/pycbc/frame/frame.py", line 219, in read_frame
    data_length = lalframe.FrStreamGetVectorLength(first_channel, stream)
RuntimeError: Wrong name

joao-aveiro avatar May 13 '22 15:05 joao-aveiro

@joao-aveiro Thanks for reporting this. I think you've identified the issue correctly. If you have a suggestion we have happy to accept PRs. I think the main issue here is how to know the channel name. For the GWTC-* sets the patter was as written in the formula. For others, it appears to not be the case. It's not entirely clear from the json metadata (or wasn't, maybe this has been updated?).

Any ideas?

ahnitz avatar May 13 '22 16:05 ahnitz

I think looking at this those datasets are using the bulk data naming convention before it was changed. We could try a fallback here in case the standard changes?

Perhaps a simple try / except https://github.com/gwastro/pycbc/blob/master/pycbc/catalog/init.py#L139

If this fails, retry with the 'IFO: LOSC-STRAIN' , name?

@joao-aveiro If you'd like to submit a PR to make this change, I think we'd be happy to review it short order and merge it.

ahnitz avatar May 13 '22 16:05 ahnitz

It seems that the data is retrieved in .gwf files. I am not familiarized with this format, so some of the statements or ideas I might present from hereon might not make sense or be wrong.

@ahnitz I don't know how many naming conventions there are, but assuming there is only the one for the GTWC-* data and this one that is raising error, I believe an if/else statement to switch between the two cases depending on the catalog used - and a listing of which catalog uses each convention - would be clearer. More so, if for one catalog it isn't specified which convention is used, then we could use a try/except statement and, if that fails, raise a NotImplemententedError exception with an informative error message. I believe this would be helpful if in the future more catalogs are added and/or new conventions emerge; the exception and error message would help future users to more easily identify the problem and further expand this package to work with the new datasets.

joao-aveiro avatar May 13 '22 18:05 joao-aveiro

Nevertheless, I think this might not be the best approach either. Let me explain and propose a different approach:

I was able to find the correct channel name somewhat by trial-and-error and by analyzing the raw .gwf file (in plain-text/binary). I couldn't easily find documentation/metadata files where this name was specified and, for what I could find, no package (LALFrame, GWPy, PyCBC) had a method for checking which channels are present in a .gwf file. I don't know if this is correct, but I am assuming that a channel corresponds to a stream of data from a particular event and a .gwf file might have various channels corresponding to the same event but representing data obtained from different sensors, ADCs, ... . I believe it would be extremely helpful to have a function/method that, given a .gwf file, lists the available channel names. Apart from being a great tool for the end user, this would allow for automatically retrieving the channel name without using any formulas for single-channel files. For files with more channels, then the user would have to provide the desired channel name, and if the provided name isn't correct then an error message with a list of channel names could be given.

@ahnitz I await your feedback to see if any of this makes sense. If it does, I can try to put something together (either this or the approach described previously) and submit a PR.

joao-aveiro avatar May 13 '22 18:05 joao-aveiro

@joao-aveiro You are correct that that would be the better behavior. If you think you can add that I think we'd be happy to accept a PR. The underlying library we use (lalframe) doesn't expose that cleanly, which is why we don't do it that way already. However, if you think you can add this (or barring that the previous solution), we'd be happy to accept a PR.

ahnitz avatar May 13 '22 23:05 ahnitz