Julien
The current implementation is not great for reading many files (100+).

### Current implementation, and why this is not great

The way we read and distribute the data from many...
I'm trying to benchmark spark-fits on S3 by internally looping over the same piece of code:

```python
path = "s3a://abucket/..."
fn = "afile.fits"  # 700 MB

for index in range(N):
    ...
```
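Since the snippet above is truncated, here is a minimal, Spark-independent sketch of the looped-benchmark idea: time each iteration and report summary statistics. `read_once` is a hypothetical stand-in for the actual spark-fits read plus an action.

```python
import time
import statistics

def read_once():
    # Hypothetical stand-in for the real workload, e.g. loading the
    # 700 MB FITS file with spark-fits followed by an action like count().
    return sum(range(10_000))

N = 5
timings = []
for index in range(N):
    start = time.perf_counter()
    read_once()
    timings.append(time.perf_counter() - start)

print(f"mean:  {statistics.mean(timings):.6f} s")
print(f"stdev: {statistics.stdev(timings):.6f} s")
```

Per-iteration timings (rather than one total) make it easy to spot warm-up effects, e.g. the first S3 read being slower than the rest.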
In an Image HDU header, the data type is not given by a TFORMn value (letters: L, K, J, ...) as in table HDUs. Instead, the number of bits used per image pixel...
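In the FITS standard, the bits-per-pixel value is carried by the BITPIX keyword. A minimal sketch of mapping BITPIX to an element type (the function name is illustrative, not the connector's actual API):

```python
# Mapping from the FITS BITPIX keyword (bits per image pixel) to an
# element type, per the FITS standard for image HDUs.
BITPIX_TO_DTYPE = {
    8: "uint8",      # unsigned 8-bit integer
    16: "int16",     # signed 16-bit integer
    32: "int32",     # signed 32-bit integer
    64: "int64",     # signed 64-bit integer
    -32: "float32",  # IEEE single-precision float
    -64: "float64",  # IEEE double-precision float
}

def dtype_from_bitpix(bitpix: int) -> str:
    try:
        return BITPIX_TO_DTYPE[bitpix]
    except KeyError:
        raise ValueError(f"Invalid BITPIX value: {bitpix}")

print(dtype_from_bitpix(-64))  # float64
```

Negative values denote floating-point types, positive values integer types, so the sign alone already splits the two families.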
PR #55 fixed a bug with the header check: the connector was checking all FITS headers before starting the job, which is a good idea until you have 10,000+ files. The fix...
A big change... the Data Source API v2 is new in Spark 2.3.0. Fortunately, according to https://databricks.com/session/apache-spark-data-source-v2, there is no immediate plan to deprecate v1!
We need to understand whether we can handle zipped files (that is, unpacking blocks in HDFS!). Does fpack do this, or do we need to implement something new? Or do...
It would be worth investigating whether [data serialization](https://spark.apache.org/docs/latest/tuning.html#data-serialization) plays a role here.
This could dramatically speed up the computation in some cases.
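As a starting point, the tuning guide linked above recommends switching from Java serialization to Kryo. A hedged sketch of the relevant `spark-defaults.conf` entries (the class name and keys come from the Spark docs; the buffer value is only a plausible example, not a measured setting for spark-fits):

```properties
# Switch from Java serialization to Kryo (Spark tuning guide).
spark.serializer                   org.apache.spark.serializer.KryoSerializer
# Require explicit class registration to fail loudly on unregistered types.
spark.kryo.registrationRequired    false
# Buffer ceiling may need raising for large records (example value).
spark.kryoserializer.buffer.max    128m
```

Whether this matters here depends on how much data is shuffled or cached; if the job is dominated by raw S3 reads, serialization may play only a minor role.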