
Wrong values returned when reading a FITS file

Open ptallada opened this issue 10 months ago • 11 comments

Hi Julien,

I've been using this wonderful software to read and ingest very large FITS files into Apache Hive. This week I've encountered an issue when reading a large FITS file (~122GiB). It returns wrong values for some rows, which scares me... I'm trying to create a smaller file to reproduce the issue, still no success.

One symptom I've found is that spark-fits returns very large positive numbers for a 'K' field, where astropy and fitsio correctly return very large negative numbers.
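For readers following along, here is a minimal sketch of how a negative 'K' value can turn into a large positive one. In a FITS binary table, 'K' is a 64-bit signed big-endian integer. The sample values and the one-byte shift below are hypothetical and chosen only for illustration; this is not a diagnosis of what spark-fits does.

```python
import struct


def decode_k(buf: bytes, offset: int) -> int:
    """Decode one FITS 'K' value (64-bit signed, big-endian) at a byte offset."""
    return struct.unpack_from(">q", buf, offset)[0]


# Two consecutive 'K' values laid out as they would be in a row buffer.
# These numbers are made up for the example.
values = [-9_000_000_000_000_000_000, -1]
buf = b"".join(struct.pack(">q", v) for v in values)

aligned = decode_k(buf, 0)  # read at the correct offset
shifted = decode_k(buf, 1)  # read one byte too far (hypothetical misalignment)

print("aligned:", aligned)
print("shifted:", shifted)
```

With these bytes the aligned read gives back the original negative value, while the shifted read yields a large positive number, so any offset error in a reader could produce this kind of symptom.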

Is there any known issue?

ptallada avatar May 26 '25 11:05 ptallada

I've been debugging the issue all day and I can only reproduce it with a very large FITS file. A 13.9 GiB file does NOT trigger it, a 35 GiB file DOES. I was hoping that there was something strange written in the data, but now I think there has to be some kind of pointer/counter in the code that is 32 bits wide, or otherwise too small to address a file that large.
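To make the 32-bit hypothesis concrete, here is a small sketch that simulates a signed 32-bit counter wrapping around and computes how far such a counter can reach for a given unit size. The helper names and the unit sizes are assumptions for illustration; which quantity, if any, actually overflows in spark-fits is not established here.

```python
GIB = 2**30


def to_int32(x: int) -> int:
    """Wrap an integer the way a signed 32-bit counter would."""
    return (x + 2**31) % 2**32 - 2**31


def max_reach_gib(unit_bytes: int) -> float:
    """Largest size (in GiB) a signed 32-bit count of unit_bytes-sized units can address."""
    return (2**31 - 1) * unit_bytes / GIB


# A byte offset 3 GiB into a file no longer fits in a signed 32-bit integer.
wrapped = to_int32(3 * GIB)
print("3 GiB as int32:", wrapped)

# Reach of a signed 32-bit counter for a few hypothetical unit sizes.
for unit in (1, 8, 2880):
    print(f"unit = {unit:5d} bytes -> reach just under {max_reach_gib(unit):.1f} GiB")
```

A counter of single bytes would already wrap just under 2 GiB, which the 13.9 GiB file exceeds without trouble, so if an overflow is involved the counted unit is presumably larger than one byte. That is only arithmetic, not evidence of where the bug is.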

Any ideas?

Best regards,

ptallada avatar May 26 '25 17:05 ptallada

Just as an example of the kind of corruption:

A file, opened with fitsio:

Image

Same, with spark-fits:

Image

I've just filtered by the values that are wrong. As you can see, the values are garbage...

ptallada avatar May 26 '25 17:05 ptallada

Hi Pau,

Hmm, thanks for reporting -- it looks like a bug indeed!

Are you reading several FITS files, or one big FITS file at once? And do you initially read data with spark-fits from HDFS? If yes, what is the block size?

Regarding Spark, which Spark, Scala, and Java versions are you using? And do you have a specific Spark configuration (i.e. any runtime options that might be worth mentioning)?

Also, would you be willing to share a small sample of that file so that I can scale it (with random values of the same types) to several dozen GB here to try to reproduce the error?

The agenda is tight this week with coming public holidays, but I will try to work on it next week.

JulienPeloton avatar May 26 '25 21:05 JulienPeloton

Hi,

Just a single huge 192GiB FITS file, stored on a shared Ceph filesystem (no HDFS).

Tomorrow I'll provide the rest of the info, and a small file sample :))

Thanks a lot!

ptallada avatar May 26 '25 21:05 ptallada

Hi again

I attach a sample of the file. It has 18 blocks of header and 13 blocks of data (exactly 32 rows). You should be able to replicate the last 13 blocks on and on (updating NAXIS2) to get a much larger file. The original file has NAXIS2 = 175406965 and 205226202240 bytes in size (~192GiB).
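As a sanity check on those numbers, here is a short sketch that derives the row width from the sample and recomputes the size of the original file. It assumes FITS 2880-byte blocks, that the 32 rows fill the 13 data blocks exactly, and that the full file has the same 18-block header, a single table, and no heap; the function names are made up for the example.

```python
BLOCK = 2880  # FITS block size in bytes


def row_bytes(data_blocks: int, rows: int) -> int:
    """Row width (NAXIS1) when the rows fill the data blocks exactly."""
    width, remainder = divmod(data_blocks * BLOCK, rows)
    if remainder:
        raise ValueError("rows do not fill the data blocks exactly")
    return width


def expected_file_size(header_blocks: int, naxis1: int, naxis2: int) -> int:
    """Header size plus table data padded up to a whole number of blocks."""
    data = naxis1 * naxis2
    padded = -(-data // BLOCK) * BLOCK  # ceiling to a multiple of BLOCK
    return header_blocks * BLOCK + padded


naxis1 = row_bytes(13, 32)
size = expected_file_size(18, naxis1, 175406965)
print(naxis1)  # 1170
print(size)    # 205226202240
```

Under these assumptions each row is 1170 bytes, and the computed size matches the reported 205226202240 bytes, so the same function can be used to predict the size of a scaled-up copy for any NAXIS2.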

Hadoop 3.2.3
Spark 3.4.2
Using spark.jars.packages com.github.astrolabsoftware:spark-fits_2.12:1.0.0

About spark config... the only thing that may be worth mentioning is having

spark.sql.execution.arrow.pyspark.enabled true
spark.sql.execution.arrow.pyspark.fallback.enabled false

Thanks a lot!

test.zip

ptallada avatar May 27 '25 07:05 ptallada

Thanks for the info! I will keep you posted.

JulienPeloton avatar May 27 '25 08:05 JulienPeloton

Hi Julien,

Any progress on this?

Do you need additional info?

ptallada avatar Sep 24 '25 08:09 ptallada

Hi Pau -- I apologize for the silence. I worked briefly on this at the end of May, but found nothing obviously wrong in the code. I have had no time since then, busy with summer holidays and then the Rubin Observatory commissioning :-(

I'm not sure I will have time in the coming weeks, but I'll keep you posted in case I have more news.

JulienPeloton avatar Sep 24 '25 08:09 JulienPeloton

Hi Julien,

Let me know if I can assist you in debugging this. I could provide you with an account at PIC (www.pic.es) with a Spark environment and a faulty file, if you need help reproducing the issue. I'm a bit worried because I don't know if I can trust the data I ingest using this :S

Best,

Pau.

ptallada avatar Sep 24 '25 08:09 ptallada

Hi Pau --

> I could provide you with an account at PIC (www.pic.es) with a Spark environment and a faulty file, if you need help reproducing the issue.

Good idea, that could indeed speed up the diagnosis!

> I'm a bit worried because I don't know if I can trust the data I ingest using this

yes, I agree...

JulienPeloton avatar Sep 25 '25 05:09 JulienPeloton

Hi Julien,

Please contact me on [email protected], and I'll provide instructions to get a PIC account so you can debug/test the issue at our data center.

Thanks a lot!

Pau.

ptallada avatar Oct 06 '25 11:10 ptallada