rust-fitsio
rust-fitsio copied to clipboard
impl fitsio::tables::ReadsCol for OSString
Hello, I've had the misfortune of encountering a fits file in which random non-utf8 junk bytes are present after a null byte in a string column. Because fitsio impls ReadsCol for String, but not OSString, this means that it can't read these columns at all.
Here's what the data looks like in Python:
>>> from astropy.io import fits
>>> hdus=fits.open('file.uvf')
>>> print(repr(hdus['AIPS AN'].data['ANNAME'][0]))
b's0000\x00\xff\xff'
The first ANNAME should just be s0000. If there were an impl OSString, I could trim these junk bytes before converting to string and read the column correctly.
Thanks
Hi @d3v-null, thanks for raising this issue. I think you're right - the string handling needs to be better in rust-fitsio. I think you're probably right that these need to be OsStrings since they are not null terminated in cfitsio. It might make the rest of the string handling simpler as well since I think we're stripping trailing junk data from the strings as part of the reading process. If we read as OsString then we can strip in a cheaper way (I think, it's been a while... :joy:).
I don't suppose you have (or can you generate) a test file I can use for my testing?
@d3v-null do you have an example file that you can share with me? I'd like to investigate further the use of non-unicode characters in table values.
Hey @simonrw , sorry it took me so long to respond. I deleted my copy of the corrupt file and had to figure out how to get you a copy of the broken hdu without sending you the entire 9GB file. Here's a file with a copy of the AIPS AN table in question.
>>> test['AIPS AN'].data
/usr/lib/python3/dist-packages/astropy/io/fits/fitsrec.py:699: UserWarning: Field 2 has a repeat count of 0 in its format code, indicating an empty field.
warnings.warn(
FITS_rec([('s0000\x0002', [-2564976.05653377, 5085352.7292609 , -2861040.11633764]),
...
dtype=(numpy.record, [('ANNAME', 'S8'), ('STABXYZ', '>f8', (3,)), ('ORBPARM', '>f8', (0,)), ('NOSTA', '>i4'), ('MNTSTA', '>i4'), ('STAXOF', '>f4'), ('POLTYA', 'S1'), ('POLAA', '>f4'), ('POLCALA', '>f4'), ('POLTYB', 'S1'), ('POLAB', '>f4'), ('POLCALB', '>f4'), ('DIAMETER', '>f4')]))
the ANNAME column should just say s0000, not s0000\x0002
The original file is from here if you're curious
The other files are corrupt in different ways, but the root problem is the non-utf8 characters.
Cheers!
Hi, sorry I am confused here. If I read the data from that file using a String column, the column names read until the null byte ("\x00") in this case, and s0000 is returned correctly for the first row.
I'm happy to also add reading as OsString since the caller can do their own string manipulation, but I'm curious. The String implementation already reads until the null character anyway.
@d3v-null have you had a chance to reflect on my last comment?
Hey Simon,
when I use the example below, I get the s0000, but instead of stopping at the null byte, skips it, and \x0002, becomes an ASCII 02 at the end. I would actually prefer it if it stopped at the null byte like you described.
Otherwise, you could be right about OsString not handling things better, maybe Vec<u8> be a better option then?
use std::error::Error;
use std::path::PathBuf;
use fitsio::FitsFile;
fn run() -> Result<(), Box<dyn Error>> {
let mut fitsfile = FitsFile::edit(PathBuf::from("test.fits"))?;
fitsfile.pretty_print().expect("printing fits file");
let ant_hdu = fitsfile.hdu("AIPS AN")?;
for name in ant_hdu.read_col::<String>(&mut fitsfile, "ANNAME")? {
println!("{:?}", name);
}
Ok(())
}
fn main() {
run().unwrap();
}
cargo run --example dev
file: "test.fits"
mode: READWRITE
extnum hdutype hduname details
0 IMAGE_HDU dimensions: [], type: UnsignedByte
1 BINARY_TBL AIPS AN num_cols: 13, num_rows: 512
"s000002"
"s000102"
"s000202"
"s000302"
"s000402"
"s000502"
"s000602"
compared to astropy fits, which gives me
python -c "from astropy.io import fits; names=fits.open('test.fits')['AIPS AN'].data['ANNAME']; print('\n'.join(map(repr, names)))"
's0000\x0002'
's0001\x0002'
's0002\x0002'
's0003\x0002'
's0004\x0002'
's0005\x0002'
's0006\x0002'
's0007\x0002'
's0008\x0002'