FITS Handling: Introduce Lossless Compression for HDUs
This PR introduces lossless compression algorithms for Header/Data Units (HDUs) within our FITS file handling module (`RMS/Formats/FFfits.py`). The goal is to reduce FITS file sizes to improve storage efficiency, mitigate wear and tear on storage devices, and enhance data transfer speeds, all while maintaining data integrity and accessibility.
Changes Made:
- Implemented RICE_1 compression for data HDUs (maxpixel, maxframe, avepixel, stdpixel).
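For reference, a minimal sketch of what RICE_1 tile compression of the image HDUs looks like with astropy; the placeholder arrays and EXTNAME values here are illustrative, not the exact FFfits.py code:

```python
from astropy.io import fits
import numpy as np

# Placeholder planes standing in for the FF structure arrays.
maxpixel = np.zeros((720, 1280), dtype=np.uint8)
maxframe = np.zeros((720, 1280), dtype=np.uint8)
avepixel = np.zeros((720, 1280), dtype=np.uint8)
stdpixel = np.zeros((720, 1280), dtype=np.uint8)

hdus = [fits.PrimaryHDU()]
for name, img in [('MAXPIXEL', maxpixel), ('MAXFRAME', maxframe),
                  ('AVEPIXEL', avepixel), ('STDPIXEL', stdpixel)]:
    # CompImageHDU tile-compresses the pixel data internally while the
    # file as a whole remains standards-compliant FITS.
    hdus.append(fits.CompImageHDU(data=img, name=name, compression_type='RICE_1'))

fits.HDUList(hdus).writeto('FF_example_compressed.fits', overwrite=True)
```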
Impact:
- Expected reduction in FITS file sizes by approximately 30-60%, significantly saving disk space, reducing wear and tear on storage devices, and improving data transfer times.
- The size of a compressed 1080p FITS file is expected to be similar to that of an uncompressed 720p FITS file.
- Machines with marginal storage device performance are expected to benefit from the smaller file size.
- CPU load is expected to increase slightly.
Testing:
- Conducted compatibility testing on Rpi3, 4, and 5.
- Tested decompression in CMNbinViewer
Is this expected to be safe to run on stations that will upload, or should upload be turned off when testing this?
Can you please share some example compressed files? I'd like to check they're still compatible with other FITS viewing and analysis software.
Here are four FF FITS files. Unfortunately, it's cloudy here right now. It would be interesting to see what the size will be on a starry night - the size reduction might not be as dramatic.
- 1080p Uncompressed: 8.3 MB
- 1080p Compressed: 3.4 MB
- 720p Uncompressed: 3.7 MB
- 720p Compressed: 1.7 MB
[ Archive 2.zip ]
Dave, I suggest testing this separately until we can confirm that there are no effects on processing.
As noted by email, I have found that the compressed files are incompatible with other FITS handling software - both FITS Liberator and PixInsight refuse to open them (they actually crash FL!). For me this feels like an issue, as we don't know what the files might be getting used for downstream of RMS.
I'm also wondering whether compression is really beneficial. Storage is cheap, 3.5MB is not really that big, and the data are bzip compressed for upload to GMN, so I don't think it will save space on the GMN side or save time in uploads. I agree it means less data getting written to disk each night, but the lifespan of any decent SD card is pretty long (several years) and I've actually never had a card fail due to wear and tear.
On my end, CMNbinViewer, SAOImageDS9, and FITS Liberator have no issues handling it. I don't have access to Pixinsight but Pixinsight stated 'we have no interest in FITS, which we deprecated many years ago in PixInsight'.
Is anyone else having any issues? It would be strange for a FITS application to not support such a basic requirement.
Mark, you're making a very good point about the data being compressed before upload. The compressed and the uncompressed FITS files, once packed into a tar.bz2, have similar sizes. So there is indeed no benefit as far as transmitting the files goes. I don't know how the tar files are handled on the receiving end.
Regarding local storage, it would clearly be beneficial. So I think it would be valuable to get to the bottom of the compatibility issue before dropping this.
Luc
PixInsight will keep supporting FITS for a long time, although the PixInsight guys have been pushing their own private format for years. Nobody else is interested in it! My point was that PixInsight is a -very- widely used tool in the astro-imaging world, and if it can't open the files then it's a problem.
I think the problem with FITS Liberator is that there are two versions: version 3 is distributed by ESA from their website and can't open the compressed files. Version 4 from NOIRLab can handle the files.
Personally I actually don't agree there's much advantage in compressing the files, because storage is cheap and the days when 3-4MB was a big file are long gone. Saving 1-1.5MB per file isn't really very significant. I realise it'd mean we could keep the CapturedFiles data for a bit longer, but it's very rare that we need to look back more than a few days, and realistically we'd only be able to keep an extra day or so.
Anyway, for me, I would want this to be an optional feature.
OK, will do.
Hi all, Thank you for the thorough discussion. I agree about making this optional and releasing it to the community. We should announce this new feature and see what kind of feedback we get. If everyone is happy, we can make it the default. Squeezing a few more days of FF files could be very useful, as sometimes we're late to react and the data is gone forever. Any increases in this time window are very welcome. @Cybis320, could you add it as an option in the config file?
I added an 'hdu_compress' configuration setting to the .config [Compression] section (defaults to False), and I made a command line utility, RMS/Utils/ConvertCompressedFits.py, to convert all FITS files in a directory to uncompressed FITS.
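For anyone wanting to try it, the option would look roughly like this in .config (the section and option names are taken from the comment above; the inline comment wording is mine):

```ini
[Compression]
; Compress FF image HDUs with RICE_1 when saving (off by default)
hdu_compress: false
```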
Will try it out!
Update on compression ratio: my camera sees stars tonight and the compression still reduces the size by half.
Worth noting that AstroPy also supports full-file compression with gzip, zip or bzip2. Such files have a .fits.gz/.fits.zip/.fits.bz2 extension.
IMO it should be the preferred approach:
- files can be recognized at a glance as compressed from the extension alone
- all AstroPy-based tools will seamlessly accept such files; alternatively, an off-the-shelf decompressor can be used, so no knowledge of FITS internals is required
- I have serious doubts that "RICE_1" can achieve a comparable compression ratio
As for the question of whether this is justified, I would agree that reducing the size of individual files will be helpful. Besides retaining more nights on local storage before they're cleaned, it would also help with the lifespan of SD cards, as they can only survive a limited number of write cycles.
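To make the suggestion concrete, here is a minimal sketch of full-file compression with astropy, assuming nothing beyond writeto() being pointed at a .fits.gz path (file name and array are placeholders):

```python
from astropy.io import fits
import numpy as np

# Placeholder image; a real FF file carries the four uint8 planes.
data = np.zeros((720, 1280), dtype=np.uint8)

hdul = fits.HDUList([fits.PrimaryHDU(),
                     fits.ImageHDU(data=data, name='MAXPIXEL')])

# astropy chooses gzip purely from the extension; the bytes inside
# are ordinary, uncompressed FITS HDUs.
hdul.writeto('FF_example.fits.gz', overwrite=True)

# Reading is just as transparent, with decompression done in memory.
with fits.open('FF_example.fits.gz') as f:
    print(f['MAXPIXEL'].data.shape)
```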
As I've said earlier, I do not see any real benefit to compressing the FITS files, and so would prefer this to be an option and not the default. I understand that it'd save a bit of space, but as I've noted before disk space is cheap and it's rare that we need to recover raw data from more than a few days ago, especially now we have the EventMonitor in place and can recover data in near-realtime without camera-operator intervention. I understand the point about less wear and tear, but in my experience this isn't really a significant problem, plus many owners are migrating to Linux and/or SSD, which probably mitigates that issue anyway.
Finally, bear in mind that RMS is not the only app that reads/writes the FITS files. If we change the file format it could have unexpected impacts on how end-users process the data, in ways we cannot guess because we do not know what downstream processes the individual country networks have in place (we in UKMON don't use the FITS files, but it's possible others do). I'm sure it'd be a simple fix, but I am never in favour of introducing incompatibility.
In my use case, reducing storage size by ~50% is beneficial. Storage isn't cheap here, and it would allow us to use 64GB or even 32GB microSD cards on the Pis.
It seems the required I/O write throughput will be reduced as well, so we will be able to add more RMS instances on an x86_64 server, at the cost of higher CPU usage.
I really like this patch, but I'd suggest collecting more data. For instance, I'd like to see how the extra CPU usage affects the Pis, especially the Pi3 as they are at maximum load already, before making it the default. Compatibility with other software, as Mark said, is also important. Some people would prefer to have compression disabled for now, so they need to be aware and able to choose. I think most people don't use any external software, so I'm in favor of having it as the default as long as it doesn't break any existing RMS station.
I'll be happy to test it on BR0001, BR0002 and BR0003
To clarify what I meant above: it's not required to change the file format to get the benefits of compression. As such, I would argue against this patch with the current implementation of HDU-only compression.
The overall idea makes sense though. AstroPy can directly write files to disk in the current FITS format, but already inside a generic compressed archive, i.e. the best of both worlds: no changes to the underlying format and smaller individual files. It will result in a better compression ratio as well. I can open a branch if anybody is curious to test full-file compression?
Hm! Yes, it's a good balance if it can write the compressed file directly (and not by writing the full uncompressed file, then compressing it and deleting the full file). It would also be great if the other tools like SkyFit, bin viewer, star detection and so on were able to decompress it directly into RAM, without expanding it to disk first.
I would like to test it as well!
Would be interested to test it too. Agree we should not change the file format. I'd also argue strongly for using a common compression algo like gzip or zip to ensure portability across platforms. I know there are better algos, but they are sometimes not supported across Windows, macOS, and various Linux flavours. Definitely needs to be tested on the Pi3 as it might require too much memory or processor capacity. While the Pi3 is now very old, there are still a lot of RMS stations using them. I don't have any working Pi3 station however :(
Bumping this discussion. I think we need to address FITS file sizes as they're impacting the science GMN can perform.
Storage size directly affects how many nights of data we can keep on disk for revisiting interesting events. Currently, without compression, we can only go back half as many days as we could with compressed storage.
Current storage breakdown per 12hr night:
- CapturedFiles: ~18GB (would be ~9GB compressed)
- VideoFiles (24hr): ~20GB
- FrameFiles (24hr): ~0.1GB
This is particularly problematic for multi-cam setups where storage quickly becomes the limiting factor.
This isn't just an upload bandwidth issue (files are already compressed for transfer). The real impact is on data retention for analysis.
All compression schemes (RICE, gzip, etc.) achieve roughly the same ~50% ratio. We just need to pick one and implement it.
Option 1: Compress individual FF files
- Pros: Convenient, direct file handling
- Cons: Reduced compatibility with very old Pixinsight version.
Option 2: Compress entire night directories
- Pros: Preserves compatibility
- Cons: Less convenient for accessing individual files
Both could be made optional via config flag.
Thoughts on implementation approach? Any strong preferences between the options?
If you compressed an entire directory, how hard would it be to retrieve an individual file? I know some compression schemes are clever, but it feels expensive.
My own feeling is to compress the individual FITS files from the previous night just before capture on the next night is due to start.
I don't agree that this problem affects multi-camera stations more than single camera stations. I think the greatest benefit will come from doubling the recall ability of the many 128 GB SD card stations.
I just want to reiterate that I still don't think this is the right approach. Storage is cheap, and it'd be a lot simpler, less risky and more compatible just to keep the data uncompressed and advise buying a larger SD card. As I noted upthread, my quick tests indicate that any compression will make the data less compatible with other tools, which will make it harder for camera operators to examine and play with the data themselves. I think that ability is really important as it keeps people engaged and interested.
I also believe we can solve the problem by making sure that all new non-core capabilities, such as all-day timelapses, raw video capture, daytime monitoring, contribution to contrail monitoring etc., are disabled by default and come with a clear caveat that they'll use more storage, so either a 256GB SD card or an SSD will be required. Existing station owners would then be unaffected, and anyone enabling the new features would understand the risks and impact. I do worry that we're seeing a lot of "mission creep" that's impacting camera owners without fully informing them.
I also think this would work - I might be missing something here, but I have disabled all unnecessary features and my stations can still retain 9-10 days of data, which is the same as it was back in 2022. So I do think this approach would obviate the need to make the data less compatible.
If we do decide to compress data, I'd strongly recommend creating a standalone service that could be run via systemd or be triggered by RMS using a signal. I feel it should not be done within RMS itself, as this would create another thread and dependency that could fail or get stuck, leading to unexpected behaviour or data loss. Making the app more monolithic goes against best practice.
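To illustrate the decoupled approach, here is a rough sketch of a oneshot systemd unit; the unit name, user, paths and the script it calls are all placeholders, nothing like this exists in RMS today:

```ini
# compress-ff.service (hypothetical)
[Unit]
Description=Compress the previous night's FF FITS files outside the RMS process

[Service]
Type=oneshot
User=rms
# Placeholder script and data paths; RMS would only need to start or signal this unit.
ExecStart=/usr/bin/python3 /home/rms/source/RMS/Utils/CompressCapturedFits.py /home/rms/RMS_data/CapturedFiles
```

It could be paired with a .timer, or started just before the next capture session, so a hung compression run never blocks capture itself.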
Just to clear up a misconception, all-day time-lapse has a completely negligible impact on storage as of prerelease. It needs 1.1GB overhead plus 0.1GB per day.
Obviously, raw video (used for meteor observations) has a large impact and requires large storage (although it can be more efficient than FF files and produces higher quality observations).
Even if you asked operators to spend their money on larger storage, you would still only retain half as many days as you could if you just compressed for free - which is wasteful. Sometimes the storage you get is whatever is lying at the bottom of the drawer.
With everything turned off but core functions, a high latitude station with a 128GB drive can hold 2 days of uncompressed data vs 4 days of compressed data in the winter. Turning on all-day timelapse doesn't change these numbers.
At the other end of the spectrum, a high latitude 6-cam station with continuous raw video turned on, and a 2TB drive, can hold 5 days of data vs 7 days compressed. Same whether all-day timelapses are produced or not.
If we at least made it optional, I don't believe it would break any GMN pipeline. For operators needing their data to remain uncompressed, they would just turn the option off. For the majority of people who would rather store data efficiently, they would leave the options on.
This minimal PR accomplishes this without large changes to the code base.