ModelCube loading from files is slow
@zpace reported that doing batch operations with `ModelCube` loaded from files seems to be slower compared with using raw `astropy.io.fits`. He agreed to distill the issue to a simple script that we can test. We should profile the file loading and see if there is any obvious bottleneck. This is somewhat related to #298.
I have run a quick test with a DRP LOGCUBE instead of the DAP LOGCUBE (just b/c that's what's handiest for me to script), and I'll try to summarize. I wrote a couple of quick functions that respectively load through a bare wrapper around `fits.open()` and through `marvin.tools.Cube()`, and profiled the result.
```python
import marvin.tools
# `m` is a local helper module that resolves the LOGCUBE path
# and opens it with fits.open()

def get_fluxcube_fits(plate, ifu, MPL_V):
    '''a wrapper around fits.open()'''
    # can replace with fits.open(drp_logcube_filename)
    with m.load_drp_logcube(plate, ifu, MPL_V) as drp:
        return drp['FLUX'].data

def get_fluxcube_marvin(plate, ifu, **kwargs):
    '''a wrapper around marvin.tools.Cube'''
    drp_cube = marvin.tools.Cube('-'.join((plate, ifu)), **kwargs)
    return drp_cube.flux

def compare_two_loads(plate, ifu, MPL_V):
    '''runs both loaders'''
    fits_cube = get_fluxcube_fits(plate, ifu, MPL_V)
    marvin_cube = get_fluxcube_marvin(plate, ifu, release=MPL_V)
    return fits_cube, marvin_cube
```

```python
%prun -D loadcube.prof compare_two_loads(plate, ifu, MPL_V)
```
I then loaded `loadcube.prof` (which GitHub won't let me upload here) into snakeviz. Here's a screengrab:

[snakeviz screengrab of the `compare_two_loads` profile omitted]
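The same hotspots can also be read in text form with the standard-library `pstats` module (a minimal sketch, using the profile file dumped above):

```python
import pstats

# load the profile written by `%prun -D loadcube.prof` and print
# the ten calls with the largest cumulative time
stats = pstats.Stats('loadcube.prof')
stats.sort_stats('cumulative').print_stats(10)
```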
What jumped out at me is the large number of calls to `zlib.Decompress.decompress` (about 2/3 of the time taken by `get_fluxcube_marvin()`). Jose and I had talked in Ensenada about FITS files having to be decompressed prior to use, so this makes sense. IIRC, individual HDUs are decompressed as needed.
Just to verify this, I wrote another bare wrapper that accessed every HDU of the LOGCUBE in a list comprehension, and profiled it:
```python
def get_alldata_fits(plate, ifu, MPL_V):
    '''accesses the data of every HDU in the LOGCUBE'''
    with m.load_drp_logcube(plate, ifu, MPL_V) as drp:
        data = [hdu.data for hdu in drp]
    return data

def compare_two_fullloads(plate, ifu, MPL_V):
    fits_data = get_alldata_fits(plate, ifu, MPL_V)
    marvin_cube = get_fluxcube_marvin(plate, ifu, release=MPL_V)
    return fits_data, marvin_cube
```
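The profiling invocation mirrors the one above (`loadfull.prof` is just a hypothetical output name):

```python
%prun -D loadfull.prof compare_two_fullloads(plate, ifu, MPL_V)
```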
And sure enough, `get_alldata_fits()` takes 30-ish seconds. The culprit is once again the `zlib` decompression. I think users with lots of hard-disk space could profit from pre-extracting all the data files they plan to use, if marvin could "prioritize" ingesting uncompressed files over compressed ones.
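The pre-extraction itself needs nothing beyond the standard library; a minimal sketch (the SAS root is the path from my setup, and the glob pattern for DRP LOGCUBEs is an assumption):

```python
import gzip
import shutil
from pathlib import Path

# walk a local SAS mirror and write an uncompressed .fits next to
# every .fits.gz LOGCUBE, keeping the original
sas_dir = Path('/usr/data/minhas2/zpace/sdss/sas')
for gz in sas_dir.rglob('manga-*-LOGCUBE.fits.gz'):
    target = gz.with_suffix('')  # strip the trailing .gz, leaving .fits
    if not target.exists():
        with gzip.open(gz, 'rb') as fin, open(target, 'wb') as fout:
            shutil.copyfileobj(fin, fout)
```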
If there are any other tests that would be informative, please do let me know!
@zpace This is very useful. Just for clarification: your initial tests of opening up one extension with Marvin versus straight `fits.open` showed that Marvin took 2/3 longer to open up a single extension. Is that correct? But when you load the full FITS HDU list, the times become equivalent? That might be due to the fact that `Cube` always loads the full HDU list first before accessing `cube[extension]`.

Can you also profile astropy `fits.open` versus `fitsio` and see how they compare? Can you also unzip a MaNGA cube and redo the profiling? If we take away the decompression, do they behave equivalently, or are there still overheads in Marvin cube loading?
> @zpace This is very useful. Just for clarification: your initial tests of opening up one extension with Marvin versus straight `fits.open` showed that Marvin took 2/3 longer to open up a single extension. Is that correct? But when you load the full FITS HDU list, the times become equivalent? That might be due to the fact that `Cube` always loads the full HDU list first before accessing `cube[extension]`.

Yes, this is true.

> Can you also profile astropy `fits.open` versus `fitsio` and see how they compare? Can you also unzip a MaNGA cube and redo the profiling? If we take away the decompression, do they behave equivalently, or are there still overheads in Marvin cube loading?

I will take a stab at these in the next couple of days.
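For concreteness, the `astropy` versus `fitsio` comparison might look something like this (a sketch, assuming the `fitsio` package is installed and an uncompressed local copy of the cube exists; the filename is from my setup):

```python
import time

import fitsio
from astropy.io import fits

fname = 'manga-8083-12704-LOGCUBE.fits'  # uncompressed copy

t0 = time.time()
with fits.open(fname) as hdulist:
    flux_astropy = hdulist['FLUX'].data
print('astropy.io.fits: {:.3f} s'.format(time.time() - t0))

t0 = time.time()
flux_fitsio = fitsio.read(fname, ext='FLUX')
print('fitsio:          {:.3f} s'.format(time.time() - t0))
```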
I've made a couple of changes to my ecosystem of FITS-loader functions, which now let the user specify which file extensions are preferred. The default behavior tries `.fits` before `.fits.gz`, but it can be forced otherwise; a sketch of the preference logic follows. I have uncompressed one DRP LOGCUBE and tried once again to load the data from the `'FLUX'` extension; the timings are in the table below.
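Something like this (the function name and signature are placeholders for my actual loader):

```python
import os

from astropy.io import fits

def open_preferring_unzipped(filename, prefer_unzipped=True):
    '''open a FITS file, trying the uncompressed version first

    `filename` is the path without the trailing .gz
    '''
    candidates = [filename, filename + '.gz']
    if not prefer_unzipped:
        candidates = candidates[::-1]  # try .fits.gz first instead
    for cand in candidates:
        if os.path.isfile(cand):
            return fits.open(cand)
    raise FileNotFoundError(filename)
```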
| Retrieval method | Uncompressed preferred? | Flux cube retrieval time (ms) |
|---|---|---|
| `marvin.tools.Cube` | N | 3556 |
| `marvin.tools.Cube` | Y | ERROR |
| `astropy.io.fits` | N | 1230 |
| `astropy.io.fits` | Y | 20 |
When I try to load the same uncompressed file through marvin, though, I get:

```
OSError: filename /usr/data/minhas2/zpace/sdss/sas/mangawork/manga/spectro/redux/v2_5_3/8083/stack/manga-8083-12704-LOGCUBE.fits cannot be found: Not a gzipped file (b'SI')
```

(The `b'SI'` is presumably the start of the FITS `SIMPLE` card, which suggests marvin is unconditionally trying to gunzip the file.) I am using marvin version 2.3.2, which (I think) is the latest available on PyPI. Has there been a new update to allow uncompressed files?
It's pretty clear that, at least for non-marvin loading, the decompression is the really expensive part. So users who can spare the extra storage overhead (about 2x, I think) could get some speed gains just by uncompressing all the FITS files they're using.
I'll try to look at `fitsio` soon, but in case I don't get to it today, I wanted to post what I have.