
slicing a maps file for a spaxel takes a long time

havok2063 opened this issue on Aug 27, 2018 · 7 comments

Describe the bug: Extracting a spaxel from a local Maps file takes a long time, ~30 seconds in py3.6. Maybe it's doing something under the hood, but if so it should be more transparent. This is the central spaxel for a 127 IFU, so maybe it's the initialization of all the Bins for every map property?

To Reproduce

from marvin.tools import Cube
from marvin import config, marvindb
config.forceDbOff()
cube = Cube('8936-12705')
maps = cube.getMaps()
maps
<Marvin Maps (plateifu='8936-12705', mode='local', data_origin='file', bintype='HYB10', template='GAU-MILESHC')>

s=maps[37,37]
s
<Marvin Spaxel (plateifu=8936-12705, x=37, y=37; x_cen=0, y_cen=0, loaded=cube/maps)>

%timeit s=maps[37,37]
26.4 s ± 319 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

s.emline_gflux_ha_6564.bin
<BinInfo (binid=-1, n_spaxels=2727)>
s.stellar_vel.bin
<BinInfo (binid=-1, n_spaxels=3028)>

Expected behavior: A faster slice if possible, though maybe that's not achievable for the 127 IFUs.

havok2063 commented on Aug 27, 2018

I reran the code above and printed the cube and s._cube; both are coming from the files. This is using the vetround branch. Could it have anything to do with the Bin loading all the spaxels for every single property? I grabbed a central spaxel, and the haflux property has ~2700 spaxels associated with it. Is it loading all of those spaxels? I tried accessing the list to see whether they are loaded or not and got an error.

from marvin.tools import Cube
from marvin import config, marvindb
config.forceDbOff() 
cube = Cube('8936-12705')
cube
<Marvin Cube (plateifu='8936-12705', mode='local', data_origin='file')>

maps = cube.getMaps()
maps
<Marvin Maps (plateifu='8936-12705', mode='local', data_origin='file', bintype='HYB10', template='GAU-MILESHC')>

s=maps[37,37]
s
<Marvin Spaxel (plateifu=8936-12705, x=37, y=37; x_cen=0, y_cen=0, loaded=cube/maps)>

s._cube
<Marvin Cube (plateifu='8936-12705', mode='local', data_origin='file')>

b=s.emline_gflux_ha_6564.bin
b
<BinInfo (binid=-1, n_spaxels=2727)>

b.get_bin_spaxels()
---------------------------------------------------------------------------
MarvinError                               Traceback (most recent call last)
<ipython-input-15-f8568c8b0b77> in <module>()
----> 1 b.get_bin_spaxels()

~/Work/github_projects/Marvin/python/marvin/tools/quantities/base_quantity.py in get_bin_spaxels(self, lazy)
     82             raise marvin.core.exceptions.MarvinError(
     83                 'coordinates ({}, {}) do not correspond to a valid binid.'.format(self._spaxel.x,
---> 84                                                                                   self._spaxel.y))
     85
     86         spaxel_coords = zip(*numpy.where(self.binid_map.value == self.binid))

MarvinError: coordinates (37, 37) do not correspond to a valid binid.

havok2063 commented on Oct 17, 2018

I looked into this. So, I think there is no "bug" here. If you don't have a DB but have all the files on disk, Marvin opens them all and doesn't make any API calls. It also seems to open them the right number of times. The problem is that the files are compressed and we access a lot of points in the file (to retrieve all the properties). I profiled the script

from marvin.tools import Cube
from marvin import config, marvindb
config.forceDbOff()
cube = Cube('8485-1901')
maps = cube.getMaps()
s=maps[10,10]

And it takes ~5 seconds to run. Of those, 4.3 seconds are spent in ~168k calls to zlib.Decompress. I hacked things a bit to make Marvin work with uncompressed files; if the cube and maps files are uncompressed, that time goes down to 0.3 seconds.
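(For reference, a profile like this can be reproduced with cProfile; the snippet below is a sketch, with the output file name and sort key chosen for illustration rather than being the exact commands used.)

import cProfile
import pstats

from marvin import config
from marvin.tools import Cube

config.forceDbOff()

def slice_spaxel():
    # Open the cube and maps from file, then extract a single spaxel.
    cube = Cube('8485-1901')
    maps = cube.getMaps()
    return maps[10, 10]

cProfile.run('slice_spaxel()', 'spaxel.prof')
# Sorting by cumulative time surfaces the zlib decompress calls.
pstats.Stats('spaxel.prof').sort_stats('cumulative').print_stats(20)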

So, I'm not sure what we can do here. One possibility would be to uncompress files as we open them. That's not totally trivial, could have unexpected results, and I'm not sure how much time it would save (uncompressing the whole file is not instantaneous at all). Another option would be to make all access to properties/model quantities/etc. lazy, so that there is very little file access until needed. That's something we can consider for Marvin 3 but not for DR15.
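For what it's worth, the lazy option could be implemented with a caching descriptor along these lines. This is only a rough sketch of the idea; LazyQuantity and _read_extension are hypothetical names, not Marvin's API.

class LazyQuantity:
    """Descriptor that defers file access until the attribute is first used."""

    def __init__(self, loader):
        self.loader = loader

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = self.loader(obj)         # touch the file only now
        obj.__dict__[self.name] = value  # cache on the instance for next time
        return value

class Maps:
    # Hypothetical: the extension is read only on first attribute access.
    emline_gflux_ha_6564 = LazyQuantity(
        lambda self: self._read_extension('EMLINE_GFLUX'))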

I'm leaning towards not doing anything at this point. Thoughts?

albireox commented on Nov 7, 2018

Can you profile it with the code in my example, using 8936-12705? I don't think it used to take 30 seconds to slice and extract a 127 IFU spaxel from a cube/map using files. Here are timings for opening a cube and maps, and then a separate timing of the slice alone.

def test():
    from marvin.tools import Cube
    from marvin import config, marvindb
    config.forceDbOff()
    cube = Cube('8936-12705')
    maps = cube.getMaps()
    return maps    

1st time it's run

%timeit -n 1 -r 1 maps = test()
14.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Subsequent times after caching

%timeit maps = test()
4.87 s ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Spaxel slice

maps  = test()
%timeit s=maps[37,37]
26.6 s ± 538 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

havok2063 commented on Nov 8, 2018

Yeah, I think you're right. I reinstalled 2.2.6 and it was certainly faster, at least for the spaxel. In hindsight, and with more coffee, there is probably no reason to have 168k accesses to the file. I wonder if this may be related to the implementation of the Bin. I'll dig a bit more.

albireox commented on Nov 8, 2018

So, actually, that was not true. This seems to have been a problem at least since 2.2.6.

Have a look at PR #572. I'm unzipping the input file to a temporary location, which gets removed when it's no longer needed. This slows down the initial creation of the object a bit but then significantly speeds up access to the file. Using your same code

%timeit maps = test()
6.74 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit s=maps[37,37]
7.72 s ± 1.54 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
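
In outline, the change is something like the sketch below; a simplified stand-in for the PR, not its actual code, with an illustrative file name.

import gzip
import os
import shutil
import tempfile

def uncompress_to_temp(path):
    """Gunzip ``path`` to a temporary file and return the new path."""
    fd, tmp_path = tempfile.mkstemp(suffix='.fits')
    with gzip.open(path, 'rb') as f_in, os.fdopen(fd, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)  # stream the copy to limit memory use
    return tmp_path

# All subsequent access happens on the uncompressed copy, which is removed
# once the object no longer needs it.
tmp = uncompress_to_temp('manga-cube.fits.gz')
# ... astropy.io.fits.open(tmp), slice spaxels, etc. ...
os.remove(tmp)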

I'm also considering whether, when getting a Spaxel, we should get only the quantities from the object that created the Spaxel. Right now s=maps[37,37] will create a Spaxel with the quantities from both the cube and the maps. Maybe we should load only what the user initially created and have them use Spaxel.load() if they need to add more.
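
As usage, the proposal would look roughly like this (hypothetical behavior, sketching the idea rather than the current API):

s = maps[37, 37]  # would load only the maps quantities, since maps created it
s.load()          # hypothetical: explicitly pull in the cube quantities later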

albireox commented on Nov 9, 2018

Interesting. I'll have a look at the PR. So is this unrelated to the changes to Spaxel regarding Bin? When a Bin is initialized and gets all the spaxels associated with it, are those spaxels unloaded?

This solution clearly helps in some way, but I'm still not sure it addresses the underlying cause. I only noticed the ~30-second overhead when slicing a spaxel after the recent changes to the Spaxel and Bin classes. Does this change create more calls to the underlying compressed files?

In the new code, can you elaborate on the chain of events that occurs when (a) a maps file is opened and (b) a single spaxel is extracted? Once a file is loaded into memory, whether from a compressed or uncompressed state, why would slicing and accessing a single spectrum go back to the compressed file? Where do the 168k calls to the decompression algorithm come from? If it's from loading all the other data associated with that spaxel, is file access propagating downward from single spaxel -> all maps and model properties -> bins for every relevant property -> spaxel information for all spaxels in each bin?

havok2063 commented on Nov 9, 2018

Yes, this seems unrelated to the Bin. I disabled it and it didn't make a difference. I don't think anything has changed dramatically in terms of the number of calls recently. As I say, 2.2.6 seems to have the same problem. There may be two reasons we hadn't noticed it before:

  • We usually use a DB, which makes things quite fast.
  • In small datacubes (the 19- or 37-fiber IFUs), this is not a significant problem.

I admit it's a bit weird that we didn't notice this before, but this doesn't seem like a bug in Marvin. If you access all the extensions in a compressed cube to retrieve a single spaxel, you get similar access times; the snippet below shows a quick way to check that outside Marvin. It may be that something has changed in astropy, but I tried with a 2.x version and the times are similar.
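
(File name illustrative; any compressed MaNGA cube will do.)

import time

from astropy.io import fits

start = time.time()
with fits.open('manga-cube.fits.gz') as hdul:
    for hdu in hdul:
        _ = hdu.data  # accessing .data forces each extension to be read/decompressed
print('elapsed: {:.1f} s'.format(time.time() - start))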

Some answers to your questions:

  • When a Cube or a ModelCube (but not a Maps; this really doesn't matter for small files, and I think uncompressing may be counterproductive there) is opened from a file, the file is gunzipped to a temporary file and then read. From there on, all access happens on that uncompressed file.
  • Astropy doesn't work that way. When you open a file, you don't load it into memory; in fact, you get just the bare minimum. Each access to an extension means a delay while that part of the file is uncompressed (see the sketch after this list). I think there is some caching, since accessing the same data twice is much faster the second time. Uncompressing the whole file means that opening it is slower at the beginning but access is much faster after that. It's a bit of a trade-off: for people only interested in the cube itself this may slow things down a bit; for people accessing spaxels it will speed things up. I think the trade-off is reasonable, since it also helps with things like accessing the flux datacube in a Cube.
  • I'm not sure where the 168k accesses come from. We certainly don't access hundreds of thousands of things in the file; I assume that's how astropy works internally.
  • We are propagating the object downstream, with some caveats. In your example, when you do maps = cube.getMaps(), the maps doesn't know about the cube, so slicing maps means that the cube needs to be reopened.
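
To illustrate the lazy-access point above (extension and file names illustrative):

from astropy.io import fits

hdul = fits.open('manga-cube.fits.gz')  # fast: reads headers, not data
flux = hdul['FLUX'].data        # first access: this extension is decompressed now
flux_again = hdul['FLUX'].data  # much faster: served from astropy's cache
hdul.close()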

albireox commented on Nov 12, 2018