Wrong values returned when reading a FITS file
Hi Julien,
I've been using this wonderful software to read and ingest very large FITS files into Apache Hive. This week I've encountered an issue when reading a large FITS file (~122GiB): it returns wrong values for some rows, which scares me... I've been trying to create a smaller file that reproduces the issue, but no success so far.
One of the issues I've found is that a 'K' (64-bit integer) field returns very large positive numbers, where astropy and fitsio correctly return very large negative numbers.
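Just to illustrate the symptom (not claiming this is the actual cause in spark-fits), the same bytes of a big-endian signed 64-bit value, reinterpreted without the sign, give exactly this kind of huge positive number:

```python
import struct

# FITS 'K' columns are big-endian signed 64-bit integers ('>q').
value = -123_456_789_012_345
raw = struct.pack('>q', value)

signed = struct.unpack('>q', raw)[0]    # correct: the negative value
unsigned = struct.unpack('>Q', raw)[0]  # same bytes misread as unsigned: huge positive
```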
Is there any known issue?
I've been debugging the issue all day and I can only reproduce it with a very large FITS file. A 13.9 GiB file does NOT trigger it, a 35 GiB file DOES. I was hoping there was something strange written in the data itself, but now I suspect there is some kind of pointer/counter in the code that is 32 bits, or otherwise too small to address a file that large.
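To illustrate what I mean (just a sketch of the failure mode, I haven't checked the actual spark-fits code; the row size and index here are made up): if a byte offset is computed in 32-bit arithmetic, it silently wraps past 2 GiB and subsequent reads land on the wrong bytes, producing garbage values.

```python
import ctypes

row_size = 2880          # hypothetical bytes per row
row_index = 1_000_000    # a row deep inside a multi-GiB file

correct = row_index * row_size                 # Python int: no overflow
wrapped = ctypes.c_int32(row_index * row_size).value  # truncated to 32 bits
# `wrapped` is negative: an offset computed like this would seek nonsense.
```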
Any ideas?
Best regards,
Just as an example of the kind of corruption:
A file, opened with fitsio:
Same, with spark-fits:
I've just filtered for the rows with wrong values. As you can see, the values are garbage...
Hi Pau,
hum, thanks for reporting -- it looks like a bug indeed!
Are you reading several FITS files, or one big FITS file at once? And do you read the data with spark-fits from HDFS? If yes, what is the block size?
Regarding Spark, which Spark version are you using (and which Scala and Java versions)? And do you have any specific Spark configuration (i.e. any runtime options that might be worth mentioning)?
Also, would you be willing to share a small sample of that file, so that I can scale it up (with random values of the same types) to several dozen GB here and try to reproduce the error?
The agenda is tight this week with the coming public holidays, but I will try to work on it next week.
Hi,
Just a single huge 192GiB FITS file, stored on a shared Ceph filesystem (no HDFS).
Tomorrow I'll provide the rest of the info, and a small file sample :))
Thanks a lot!
Hi again
I attach a sample of the file. It has 18 header blocks and 13 data blocks (exactly 32 rows). You should be able to repeat the last 13 blocks over and over (updating NAXIS2) to get a much larger file. The original file has NAXIS2 = 175406965 and is 205226202240 bytes in size (~192GiB).
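In case it helps, here is a sketch of that recipe in Python (stdlib only; the header/data inputs and counts are placeholders, and the simple tiling only works because the 32 rows fill the 13 data blocks exactly, with no padding in between):

```python
BLOCK = 2880  # FITS logical record size in bytes

def replicate(header: bytes, data: bytes, nrows: int, factor: int) -> bytes:
    """Tile the data blocks `factor` times and patch the NAXIS2 card.

    `header` is the full header (a multiple of 2880 bytes) and `data` the
    data blocks holding `nrows` rows that end exactly on a block boundary.
    """
    # A FITS card is 80 bytes: keyword (8), "= ", value right-justified
    # in the next 20 bytes. Overwrite those 30 bytes in place.
    card = f"NAXIS2  = {nrows * factor:>20}".encode("ascii")
    i = header.index(b"NAXIS2  =")
    patched = header[:i] + card + header[i + len(card):]
    return patched + data * factor
```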
Hadoop 3.2.3
Spark 3.4.2
Using spark.jars.packages com.github.astrolabsoftware:spark-fits_2.12:1.0.0
About spark config... the only thing that may be worth mentioning is having
spark.sql.execution.arrow.pyspark.enabled true
spark.sql.execution.arrow.pyspark.fallback.enabled false
Thanks a lot!
thanks for the info! I will keep you in touch.
Hi Julien,
Any progress on this?
Do you need additional info?
Hi Pau -- I apologize for the silence. I worked briefly on this at the end of May, but found nothing obviously wrong in the code. I have had no time since then, busy with Summer holidays and then the Rubin Observatory commissioning :-(
I'm not sure I will have time in the coming weeks, but I'll keep you in touch in case I have more news.
Hi Julien,
Let me know if I can assist you in debugging this. I could provide you with an account at PIC (www.pic.es) with a Spark environment and a faulty file, if you need help reproducing the issue. I'm a bit worried because I don't know if I can trust the data I ingest using this :S
Best,
Pau.
Hi Pau --
I could provide you with an account at PIC (www.pic.es) with a Spark environment and a faulty file, if you need help reproducing the issue.
Good idea, that could indeed speed up the diagnosis!
I'm a bit worried because I don't know if I can trust the data I ingest using this
yes, I agree...
Hi Julien,
Please contact me on [email protected], and I'll provide instructions to get a PIC account so you can debug/test the issue at our data center.
Thanks a lot!
Pau.