specutils
specutils copied to clipboard
Which (FITS) header should be stored in Spectrum1D.meta (and where)?
Loaders for FITS files are saving the full information from the header cards in Spectrum1D.meta['header']
. In https://github.com/astropy/specutils/pull/608#discussion_r395632586 I noted some ambiguity as to which header should be stored.
Most of the default_loaders
for such formats, namely apogee
, hst_*
, muscles
, subaru_pfs
and sdss
spec_loader
, are pulling the info from the FITS header of the first HDU, while the spectral data are read from one (or several) of the extension HDUs.
This primary HDU header will usually contain general information on the target, observing run etc., while the HDU with the spectrum will include more specific info on the dataset. In principle both could be of interest. Should there be a general recommendation for new default (or custom) loaders (and a fix to existing ones) how to handle this?
I the following options:
-
Leave it at the primary HDU header as with the majority of current
default_loaders
. Pros: no API changes Cons: discarding potentially useful information; inconsistent with Astropy'sio.fits
behaviour forTable
(see below) -
Always read the header for the HDU containing the data. This is already implicitly done by
generic_spectrum_from_table
andspectrum_from_column_mapping
as they are loadingtable.meta
under the hood. Pros: Matches theTable.meta
data; possibility to useio.fits
mechanism to filter relevant keywords (i.e. excludingREMOVE_KEYWORDS
and all, that have already been used in defining column formats and units). Cons: Losing info from the primary HDU header (among the current formats with data in a BINTABLE extension, at least the Apogee, HST, MUSCLES and SDSS also have potentially interesting information in the primary HDU). Some formats, e.g.apogee
, are parsing spectrum data from several different extensions. -
Combine all header info from primary HDU and any extension HDU from which data are read. This is already implemented in the
jwst_reader
by using the (FITS)header.extend
method to update the primary header with the data extension header. Pros: No or minimal loss of information. Cons: Possibly excessive accumulation of values inspectrum.meta['header']
; writing the spectrum back will produce a FITS file with different headers (but currently there is no guarantee that any of the loaders will write back identically formatted files anyway).
I tend to the 3rd option, using a similar implementation to jwst_reader
. It might still be preferable to use the Astropy Table.meta
dict for updating to take advantage of the io.fits
mechanism for stripping excess cards.
Addendum
In the above I was assuming that the Spectrum1D
format always has the header info collected in one dictionary within the meta dictionary as Spectrum1D.meta['header']
; this is also suggested in the docs for creating a custom loader by prescribing
meta = {'header': header}
and followed by most of the default_loaders
.
But in fact neither parsing_utils
nor jwst_reader
follow this scheme, but instead put all saved header cards directly into Spectrum1D.meta
.
I therefore suggest to first settle on a consistent scheme for organising the meta
dict.
I'm a bit worried about carrying around a lot of metadata by default in a data structure that could otherwise be relatively lightweight. Many users will want to do various things with the spectrum, perhaps combining it with data from other files. When writing out the data, they can go back to their FITS files and extract the relevant metadata, but I think it's very hard to anticipate in any general way what they will want.
Perhaps, at least as an interim solution, this could be handled in the docs by adding example or two of how to copy your FITS metadata into the Spectrum1D object.
That leaves the user free to put in the header or headers or select the subset of keywords they want. And also gives them a lot of freedom on the data structure to uses within meta
-- e.g. a header string just copied from the FITS header as meta={'header': header}
, a dictionary of key-value pairs, or a dictionary key-(value,comment) tuples, or a hierarchical dictionary {ext: {key:(value,comment)}}.
I'm a bit worried about carrying around a lot of metadata by default in a data structure that could otherwise be relatively lightweight. Many users will want to do various things with the spectrum, perhaps combining it with data from other files. When writing out the data, they can go back to their FITS files and extract the relevant metadata, but I think it's very hard to anticipate in any general way what they will want.
So your suggestion would be as a 4th option to not read any header info (by default) at all?
I would make this at least selectable with a kwarg, as I am sceptical how willing users would be to open their files separately through io.fits
just to access the metadata.
If the available/reasonable/useful data are read in, say with
Spectrum1D.read(..., read_header=True)
the dictionary in Spectrum1D.meta
can still be filtered, reformatted, stripped down etc.
Ah. Good point about having to use io.fits
separately to get the metadata. But even if we add an argument read_header=True
, there is still the issue of "which FITS header?"
I don't think one can default to storing both the primary and extension header in the same string or flat dictionary, because there can be naming collisions between the two headers. So one either needs dictionary of strings or a nested dictionary, or something like an HDU list.
Yes, the header.extend
method that jwst_reader
is currently using does provide some control over that through the update
and unique
keywords, but it does not include a convenient method to store duplicate cards under unique names.
I had been thinking about (my original) option 4, making spectrum.meta['header']
a list, but that seemed most likely to break current usage. But since the current default_loaders
do not return a consistent format right now, maybe it's acceptable.
Since we already have a dependency on io.fits
to read the file -- and with the same drawback of breaking current usage -- might it make sense to have something like spectrum.meta['fits_headers']
be a list of astropy.io.fits.header.Header
objects? (If read_header=True
...I'd still suggest not making this the default).
I think it's totally fine (and good) to collect header/meta data. The problem I always encounter is that there is not enough meta data to fully characterize the data, rather than the other way around.
I don't have strong feelings about read_header = True | False
but I do think that when a Spectrum1D object is created, it should contain a header with a minimal amount of information.
I would as much as possible like to take advantage of and mirror the functionality which exists already for headers so I think I am in favor of meta
including a list of header objects.
Having gone back through this with a bit of hindsight, I think I want to vote for @dhomeier's original option 3. My reasoning is twofold:
- As I user I don't want to do
spectrum.meta['something'][psosibly_somthing_else]['t_exp']
. I wantspectrum.meta['t_exp']
. Anything more feels like putting effort on the user that should be instead in the reader. - A
Spectrum1D
object is meant to represent either a single spectrum, or a vector of similar spectra, not a file. So it doesn't make sense to have themeta
be the metadata of a spectrum, not of all the information that was in one file.
2 in particular leads to the impedance mismatch we're struggling with: FITS or similar files in principle can store multiple spectra in one file, but Spectrum1D
by design is not so loose. So I think we necessarily have to do some funneling, and that's OK.
As a middle-ground of https://github.com/astropy/specutils/issues/617#issuecomment-604667053 and this option 3, another possibility (Option 3.5?) is to do option 3 and also what @hcferguson suggests in https://github.com/astropy/specutils/issues/617#issuecomment-604667053 - then the user can see the "originals" if they really want to, but we tell them the only thing that will be used by downstream tools is what's in meta
itself.
As for read_header=False
/option 4, I'm broadly against that. I think in general it's better to assume that if the user doesn't know what they're doing, we're better off just carrying as much of the information they want as possible (and using warnings or exceptions only when the information is in conflict). Moreover, I think for the user who really wants/has the time to customize their meta
in this way, you tell them instead to just do something like:
spectrum = Spectrum1D.read(...)
spectrum.meta.clear()
That's only one line more, and is more explicit in that it makes it clear that the user is intentionally starting fresh. And that leaves us free to implement option 3 in read
and in this case that @hcferguson raises that would be option 4, the user still gets what they want.
I'm excited to move this discussion forward and am happy with the header data (a dictionary) be in the meta
. If we are all in favor, should we document this decision somewhere?
In the docs, I think we need to include info about meta
in
- Basic Spectrum Creation
- Somewhere in the readers. E.g., does the iraf reader put the IRAF header into the Spectrum1d
meta
?
If there's code which needs to be written for the writers/readers, I'm happy to contribute, especially if someone pointed me in the right direction. E.g., open some issues and assign me. :)
I've done some "round trip" testing using the tabular-fits
format and it doesn't handle multiple dictionaries in meta
well at all. For example, if I have two dictionaries in meta['header']
and meta['header2']
, only one of them gets preserved in the resulting FITS file. And when I read it back in, all the header cards are in meta
and not in meta['header']
. Bottom line, any dictionary sub-structure within the meta
dictionary is not currently preserved and it would take some work to make it so.
~~As a result, I re-affirm my vote that we recommend that all header cards be stored directly in meta
.~~
Update Nov 16, 2023. I change my vote to store the Primary header in meta['header']
.
I can also update the tabular-fits
writer to be less sensitive to there being a meta['header']
and instead find all the dictionaries in meta
.
I agree with putting everthing directly in meta because that's the most usable.
I do want to point out something though to keep in mind for both users and developers: Operations that the user performs on the Spectrum1D
object can make the values saved in the header wrong, e.g. I can manually update the wavelength solution or divide the flux by 5. In those cases I should probably also update the meta info (e.g. divide the EXPTIME by 5), but that's easily forgotten and, given the range of keywords used in fits headers, not possible automatically. So, there is an argument not to propagate too much information - propagate only what's right and drop what's dubious unless the user explicitly chooses to keep e.g. the EXPTIME (presumably they know way). People have come to trust fits header and I would hate it to receive a file from a collaborator that looks exactly like a HST/COS spectrum, but has been modified in some way that's not recorded in the headers. So, I think writing out fewer keywords or at least header in a different format is "a good thing".
it's always going to be possible for data and metadata to get out of sync somehow (if i had a nickel for every time i've had this happen with a FITS file...). i don't think dropping metadata on the floor is the answer, though. anything in meta
should be included in an output file somehow. that's easier in some formats than others with FITS being particularly restrictive.
There seems to be consensus that the Primary Header, e.g., whatever you'd put in HDU0, "should" be in meta['header']
.
There is not a consensus around what to do with extension headers and I am currently ok with that.
I propose that docs be updated to reflect the general expectation that a header be put in meta['header']
. I am going to open a dedicated issue with this bytesize suggestion.
As a result, I re-affirm my vote that we recommend that all header cards be stored directly in
meta
.
I agree with putting everthing directly in meta because that's the most usable.
I understood those two just to be the opposite, storing the cards directly as keys of meta
?
Yes, you're right, I've changed my mind. I'm gonna edit my comment.
Since there are a lot of open issues on this topic, I went ahead and made a discussion to consolidate these discussions and so we can decide how to proceed - we have a ticket on our board this sprint to start this work. I have a proposed plan there (storing everything in 'meta' as Header objects, and keeping the primary header separate for those formats that want to preserve that structure, or combine them) but I would appreciate some input before I start this work to make sure we are in agreement. Thanks!
https://github.com/astropy/specutils/discussions/1125
I think most of the information should go into the primary header, according to my understanding of multi extension files their use case is to store different planes of the same data not just a pile of that, for that case another model should be used. I've also seen multidetector instruments use MEFs, where each detector is stored as a separate extension . In either case the user must be able to control where a given metadata is going to be stored.