rust-fitsio icon indicating copy to clipboard operation
rust-fitsio copied to clipboard

impl fitsio::tables::ReadsCol for OSString

Open d3v-null opened this issue 2 years ago • 7 comments

Hello, I've had the misfortune of encountering a fits file in which random non-utf8 junk bytes are present after a null byte in a string column. Because fitsio impls ReadsCol for String, but not OSString, this means that it can't read these columns at all.

Here's what the data looks like in Python:

>>> from astropy.io import fits
>>> hdus=fits.open('file.uvf')
>>> print(repr(hdus['AIPS AN'].data['ANNAME'][0]))
b's0000\x00\xff\xff'

The first ANNAME should just be s0000. If there were an impl OSString, I could trim these junk bytes before converting to string and read the column correctly.

Thanks

d3v-null avatar May 15 '23 02:05 d3v-null

Hi @d3v-null, thanks for raising this issue. I think you're right - the string handling needs to be better in rust-fitsio. I think you're probably right that these need to be OsStrings since they are not null terminated in cfitsio. It might make the rest of the string handling simpler as well since I think we're stripping trailing junk data from the strings as part of the reading process. If we read as OsString then we can strip in a cheaper way (I think, it's been a while... :joy:).

simonrw avatar May 15 '23 16:05 simonrw

I don't suppose you have (or can you generate) a test file I can use for my testing?

simonrw avatar May 16 '23 19:05 simonrw

@d3v-null do you have an example file that you can share with me? I'd like to investigate further the use of non-unicode characters in table values.

simonrw avatar May 24 '23 20:05 simonrw

Hey @simonrw , sorry it took me so long to respond. I deleted my copy of the corrupt file and had to figure out how to get you a copy of the broken hdu without sending you the entire 9GB file. Here's a file with a copy of the AIPS AN table in question.

>>> test['AIPS AN'].data
/usr/lib/python3/dist-packages/astropy/io/fits/fitsrec.py:699: UserWarning: Field 2 has a repeat count of 0 in its format code, indicating an empty field.
  warnings.warn(
FITS_rec([('s0000\x0002', [-2564976.05653377,  5085352.7292609 , -2861040.11633764]),
...
         dtype=(numpy.record, [('ANNAME', 'S8'), ('STABXYZ', '>f8', (3,)), ('ORBPARM', '>f8', (0,)), ('NOSTA', '>i4'), ('MNTSTA', '>i4'), ('STAXOF', '>f4'), ('POLTYA', 'S1'), ('POLAA', '>f4'), ('POLCALA', '>f4'), ('POLTYB', 'S1'), ('POLAB', '>f4'), ('POLCALB', '>f4'), ('DIAMETER', '>f4')]))

the ANNAME column should just say s0000, not s0000\x0002

test.fits.zip

The original file is from here if you're curious

The other files are corrupt in different ways, but the root problem is the non-utf8 characters.

Cheers!

d3v-null avatar May 30 '23 09:05 d3v-null

Hi, sorry I am confused here. If I read the data from that file using a String column, the column names read until the null byte ("\x00") in this case, and s0000 is returned correctly for the first row.

I'm happy to also add reading as OsString since the caller can do their own string manipulation, but I'm curious. The String implementation already reads until the null character anyway.

simonrw avatar Jun 06 '23 21:06 simonrw

@d3v-null have you had a chance to reflect on my last comment?

simonrw avatar Apr 05 '24 16:04 simonrw

Hey Simon, when I use the example below, I get the s0000, but instead of stopping at the null byte, skips it, and \x0002, becomes an ASCII 02 at the end. I would actually prefer it if it stopped at the null byte like you described.

Otherwise, you could be right about OsString not handling things better, maybe Vec<u8> be a better option then?

use std::error::Error;
use std::path::PathBuf;
use fitsio::FitsFile;

fn run() -> Result<(), Box<dyn Error>> {
    let mut fitsfile = FitsFile::edit(PathBuf::from("test.fits"))?;
    fitsfile.pretty_print().expect("printing fits file");
    let ant_hdu = fitsfile.hdu("AIPS AN")?;
    for name in ant_hdu.read_col::<String>(&mut fitsfile, "ANNAME")? {
        println!("{:?}", name);
    }
    Ok(())
}

fn main() {
    run().unwrap();
}
cargo run --example dev
  file: "test.fits"
  mode: READWRITE
  extnum hdutype      hduname    details
  0      IMAGE_HDU               dimensions: [], type: UnsignedByte
  1      BINARY_TBL   AIPS AN    num_cols: 13, num_rows: 512
"s000002"
"s000102"
"s000202"
"s000302"
"s000402"
"s000502"
"s000602"

compared to astropy fits, which gives me

python -c "from astropy.io import fits; names=fits.open('test.fits')['AIPS AN'].data['ANNAME'];  print('\n'.join(map(repr, names)))"
's0000\x0002'
's0001\x0002'
's0002\x0002'
's0003\x0002'
's0004\x0002'
's0005\x0002'
's0006\x0002'
's0007\x0002'
's0008\x0002'

d3v-null avatar Apr 17 '24 02:04 d3v-null