modification of _strg for detecting valid FITS headers
Description
This is a continuation of this issue. If a value contains a doubled single quote (''), the regex terminates there instead of at the end of the string. I see that there are ways this is worked around, but the following might allow better fidelity in parsing (and allow errors to be raised when the single quotes inside the outermost non-comment single quotes are not correctly paired). The inline comment acknowledges that:
# The <strg> regex is not correct for all cases, but
# it comes pretty darn close. It appears to find the
# end of a string rather well, but will accept
# strings with an odd number of single quotes,
# instead of issuing an error. The FITS standard
# appears vague on this issue and only states that a
# string should not end with two single quotes,
# whereas it should not end with an even number of
# quotes to be precise.
#
# Note that a non-greedy match is done for a string,
# since a greedy match will find a single-quote after
# the comment separator resulting in an incorrect
# match.
>>> import re
>>> from astropy.io.fits import Card
>>> _strg = Card._strg
>>> re.match(_strg, "'a '' b'")
<re.Match object; span=(0, 5), match="'a ''"> <-- truncated after first ''
>>> re.match(_strg, "'a ' b ' /c'")
<re.Match object; span=(0, 4), match="'a '"> <-- truncated after the first ' even though the remainder is not a comment
Expected behavior
>>> __strg="'(?P<strg>(?:[ -&(-~]|'')*)'(?= *(?:$|/))"
>>> re.match(__strg, "'a '' b'")
<re.Match object; span=(0, 8), match="'a '' b'">
>>> re.match(__strg, "'a ' b ' /c'") is None
True
That pattern enforces that any single quote inside a string must be doubled, and that whatever follows the closing single quote is optional spaces followed by either the end of the line or a comment separator (forward slash).
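For readability, here is the same proposed pattern written with `re.VERBOSE` so each piece is annotated, plus a small wrapper that raises on malformed input. This is only a sketch: `STRICT_STRG` and `parse_string_value` are hypothetical names, not part of astropy, and trailing-space trimming is not handled.

```python
import re

# Proposed pattern from this issue, spelled out piece by piece.
# STRICT_STRG is a hypothetical name, not an astropy attribute.
STRICT_STRG = re.compile(
    r"""
    '                        # opening quote
    (?P<strg>
        (?: [ -&(-~]         # printable ASCII except the single quote
          | ''               # or an escaped single quote, written doubled
        )*
    )
    '                        # closing quote
    (?= [ ]* (?: $ | / ) )   # then only spaces up to end of text or a slash
    """,
    re.VERBOSE,
)


def parse_string_value(text):
    """Return the unescaped string, or raise ValueError if the quoting is invalid."""
    m = STRICT_STRG.match(text)
    if m is None:
        raise ValueError(f"malformed FITS string value: {text!r}")
    # Collapse each doubled quote back to a single quote.
    return m.group("strg").replace("''", "'")
```

Because the character class excludes the single quote, a greedy match cannot run past the closing quote into a comment such as `/ it's a comment`, which I believe is the concern behind the non-greedy match in the existing pattern.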
How to Reproduce
see above
Versions
import astropy
astropy.system_info()
platform
--------
platform.platform() = 'Windows-11-10.0.26100-SP0'
platform.version() = '10.0.26100'
platform.python_version() = '3.13.7'
packages
--------
astropy 7.1.1
numpy 2.2.3
scipy 1.15.2
matplotlib 3.10.1
pandas 2.2.3
pyerfa 2.0.1.5
I'm confused. I agree that Card._strg on its own does not match the whole string, but it is only one part of the full regex and is not meant to be used alone. With the full parsing code it works as expected:
In [2]: fits.Card.fromstring("TEST = 'a '' b'").value
Out[2]: "a ' b"
In [3]: fits.Card.fromstring("TEST = 'a ' b ' /c'").value
Out[3]: "a ' b"
(That is, if we consider that accepting a lone single quote is a feature that makes parsing more tolerant of what can be found in FITS files. I don't know the history of this one, but it was possibly done on purpose.)
Notice that part of the string is missing in Out[3]. Use of the current regex is not able to detect the error condition for a malformed first string (which, as I understand, In[3] is).
> Notice that part of the string is missing in Out[3].

Which part?
> Use of the current regex is not able to detect the error condition for a malformed first string (which, as I understand, In[3] is).
Yes, but changing that could make it impossible to parse some files, so in many cases like this io.fits is tolerant of errors. Ideally we could detect this case with .verify() and raise a warning or error, or fix it, when the card is verified. That is not easy here, because we replace doubled quotes with single ones when parsing, so verify cannot know that there was only a single quote.
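Since the distinction is lost once the doubled quotes are collapsed, any such check would have to look at the raw text of the value field. A minimal sketch of what that could look like, independent of astropy (`check_quoted_string` is a hypothetical helper, not an existing verify hook, and it does not check for non-printable characters):

```python
from typing import Optional


def check_quoted_string(raw: str) -> Optional[str]:
    """Return a description of the quoting problem, or None if it looks valid.

    `raw` is the raw value text starting at the opening single quote,
    before any '' -> ' replacement has been applied.
    """
    if not raw.startswith("'"):
        return "value does not start with a single quote"
    i = 1
    while i < len(raw):
        if raw[i] == "'":
            if raw[i + 1 : i + 2] == "'":
                i += 2  # doubled quote: still inside the string
                continue
            # Lone quote: treat it as the closing quote.
            rest = raw[i + 1 :].lstrip(" ")
            if rest == "" or rest.startswith("/"):
                return None
            return f"unexpected text after closing quote at index {i}: {rest!r}"
        i += 1
    return "no closing single quote found"
```

A tolerant parser could keep its current behaviour and only use something like this to emit a warning for the card.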
> which part
' /c'
It should take everything between the single quotes, ignoring ''. It does that, but there is more text after the first closing ', so it should not have matched.
Well, here the remaining part is treated as the comment, which I think makes sense and corresponds to what the inline comment says:
# Note that a non-greedy match is done for a string,
# since a greedy match will find a single-quote after
# the comment separator resulting in an incorrect
# match.
However, there are other cases that are not handled correctly with the current regex, e.g.:
In [7]: fits.Card.fromstring("TEST = 'a '' b '' /c'").value
Out[7]: "a ' b '"
Here, /c should be part of the value.
If disallowing a lone single quote inside a string makes it possible to fix this, I might reconsider what I said above :)
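As a rough check of that idea, here is a sketch that embeds the strict string pattern proposed above in a simplified value-plus-comment pattern. `VALUE_FIELD` and `split_value_and_comment` are hypothetical and much simpler than astropy's actual card regex, so this only illustrates the quoting rule, not a drop-in replacement.

```python
import re

# Hypothetical, simplified value-field pattern: a strictly quoted string,
# optional spaces, then an optional "/ comment". Not astropy's real regex.
VALUE_FIELD = re.compile(
    r"^ *'(?P<strg>(?:[ -&(-~]|'')*)' *(?:/ *(?P<comm>.*))?$"
)


def split_value_and_comment(field):
    """Return (value, comment), or None if the quoting is malformed."""
    m = VALUE_FIELD.match(field)
    if m is None:
        return None
    # Collapse doubled quotes only after the string boundary is known.
    return m.group("strg").replace("''", "'"), m.group("comm")


# The case from In [7]: /c stays inside the value.
print(split_value_and_comment("'a '' b '' /c'"))
# A real comment after the closing quote is still split off.
print(split_value_and_comment("'a '' b' / c"))
# The lone-quote case from In [3] is rejected instead of truncated.
print(split_value_and_comment("'a ' b ' /c'"))
```

With this rule the only way to close the string is a quote that is not doubled, so the slash in `/c'` can no longer be mistaken for a comment separator while still inside the quotes.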