Location of I/O code in Gammapy
@MaxNoe in https://github.com/gammapy/gammapy/issues/1265#issuecomment-365260031 made this comment that I think we should discuss separately (i.e. in this new issue):
I think that core functionality must be completely agnostic of file formats.
Having an IO layer around the core functionality is nice to have and supporting the data structures of multiple existing telescopes and also new, better formats will be easy as soon as all library functionality does not depend on files but a common in-memory data representation.
So everything outside gammapy.io or something similar should not have anything to do with file formats. Functions should only get astropy Quantity or Tables or plain numpy arrays.
Writing high level science tools then becomes easy. But mangling IO and computation just makes it much more complex, harder to use, harder to maintain and reduces the number of people and experiments that can effectively work with gammapy.
Just to summarise what we have in Gammapy, and to add my 2 cents.
In Gammapy, we almost always have the pattern that __init__ takes simple objects, mostly Numpy arrays, or Astropy Quantity, SkyCoord, Time, Table, WCS objects, and those are used in analysis. I think we're in agreement that this is good.
I/O is currently often located in classmethods called read and write and then for the existing FITS formats in classmethods from_hdulist and to_hdulist that are located on the classes that represent the in-memory data structure. This is the point we should discuss, @MaxNoe is proposing to split all that code out into gammapy.io. Presumably that's how it's done in ctapipe.io?
Good examples for what we have in Gammapy are the maps, or to take a very simple example for discussion, see how the gammapy.data.PointingInfo is implemented and does I/O via classmethods attached to the container class.
(I've already admitted that the DataStoreObservation is not OK and should be rewritten as you propose; for that, please comment on #1303 if you have any suggestions how it should be done in detail.)
@mackaiver @MaxNoe @woodmd @registerrier @kosack @adonath @joleroi - do you think we should split off all I/O code into gammapy.io, or stick with the I/O coding pattern we have now, or some mix or something different?
Note that the suggestion by @mackaiver in #1285 is very similar - there the question is whether the existing plot methods in Gammapy should be split out into separate plot functions / classes.
Personally I'm a bit in favour of having I/O in classmethods on container classes.
I would think it's a bit easier to maintain, especially if I change anything about the in-memory representation I then only have to touch one class, not two. As a user, I also think it's nice to have EventList.read and Map.read, i.e. to not have to learn a separate set of functions / classes in gammapy.io to read and write data. I don't feel strongly about this and am open to discuss changing everything to gammapy.io, but for this I think a convincing argument would have to be made first in which sense it really is better.
I think having read and write methods for our objects is a nice approach.
If, indeed, we use mostly basic objects such ~astropy.table.Table and Quantity we don't really need a gammapy.iopackage since we rely already on astropy.io.
After all, Table have their own readand write methods.
A related question here is whether to use existing data structures (Table and meta dict) directly, or whether to introduce container classes. Numpy and scipy and scikit-image and scikit-learn are libraries of functions and classes that just use Numpy arrays and don't introduce any container objects. That makes them super flexible and re-usable and successful.
This is not what we have in Gammapy, and to a large degree also not what Astropy or ctapipe are doing (see Table, Time, SkyCoord, WCS classes). The design pattern we are mostly using in Gammapy is to introduce container classes (for maps, IRFs, events, spectra, ...) and there to use composition out of existing data structures (mostly Numpy arrays, Table, Time, SkyCoord, WCS), and to add conveniences to those classes. E.g. the eventlist has a table with the astropy.table.Table holding the data, and a time property that maps the data to an astropy.time.Time object. Both the events.table and the events.time are accessed a lot, the events.table is not a hidden implementation detail, rather EventList is a class that groups common properties and small computations to work with the data in an event list table. Usually computations are implemented in functions that just use simple objects as inputs / outputs, or the container objects are very easy and cheap to create (e.g. SkyCoord or Table or EventList or Map just wrap and hold Numpy arrays).
@maxnoe - I don't know if you were also suggesting the larger change to get rid of many classes in Gammapy?
Personally I think there are a few we should get rid of (e.g. SkyImageList), and a few abstractions are missing (e.g. PSF kernels), and mostly each abstraction / class needs to be discussed / decided in a separate issue. I think it's great what Numpy / Scipy / scikit-learn / scikit-image have done, but for higher-level, more domain-specific packages like Gammapy I think having ~ 90 classes with ~ 10 methods each and ~100 separate functions works better, instead of the equivalent code with no classes and ~ 1000 separate functions.
So my main complaint is not that container classes provide methods to be created from files. A high-level Table like interface if you will. Whether that's a an anti-pattern or not is of course a matter of debate. In my opinion it's okay-ish to leave it that way.
Heres a little story. Imagine a new user very new to all this trying to use gammapy for a very simple point-source analysis:
Little Peter wants to do a point source analysis
Peter build a little telescope that produces event lists. He goes to the gammapy documentation and reads this:
TODO: events.info() gives s KeyError: 'ONTIME'. Should we introduce a sub-class EventListIACT?
>>>
>>> from gammapy.data import EventListDataset
>>> filename = '$GAMMAPY_EXTRA/datasets/hess-crab4-hd-hap-prod2/run023400-023599/run023523/hess_events_023523.fits.gz'
>>> events = EventListDataset.read(filename)
>>> events.info()
Too bad since Peters EventList is in some CSV or FITS format. He is not in HESS. He stores things differently on his harddisk.
If Peter understands correctly one can certainly build it from an astropy table in the constructor. From the docs he can guess the expected column names and data types. Also there's a big TODO in the middle of the docs.
But alright fair enough. Let's try right here:
from astropy.table import Table
from gammapy.data import EventList
import numpy as np
import astropy.units as u
N = 500
time = np.linspace(0, 30, N) *u.s
ra = np.linspace(5, 5.1, N) * u.deg
dec = np.linspace(35, 35.1, N) * u.deg
energy = np.random.normal(loc=2, scale=0.2, size=N) * u.TeV
t = Table({'TIME':time, 'RA':ra, 'DEC':dec, 'ENERGY':energy})
event_list = EventList(t)
Now that actually works. But what does the EventList object actually provide that the table up there does not? Well plotting for example, but it is broken at the moment (see #1259 and #1258)
Now let's try event_list.altaz which of course does not work since we need an absolute time and date and location for that. If I try to call it we get KeyError: 'MJDREFI' which is neither documented in the gammapy docs or the eventlist format docs.
In any case. What does he know? Little Peter is a noob after all. He tries to use the ReflectedRegionsBackgroundEstimator it takes an ObservationList however so no dice there. If you check the docs for ObservationList and this thing called DataStoreObservation you're out of luck. Not much there. He has no idea how to create these objects.
The AdaptiveRingBackgroundEstimator, RingBackgroundEstimator and KernelBackgroundEstimator take something else entirely namely SkyImages. (also big TODOs in the docs here. )
So lemme do a full text search for EventList in the gammapy codebase (which is a ridiculous idea for a new user). Here I find this thing called
ring_background_estimate which I can use.
from astropy.coordinates import SkyCoord, Angle
pos = SkyCoord(ra=5.25*u.deg, dec=35.25*u.deg)
radius = Angle(0.05 *u.deg)
bkg = ring_background_estimate(pos=pos, on_radius=radius, inner_radius=radius, outer_radius=radius, events=event_list)
Looking good so far
len(bkg.off_events)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-783eebe08cb9> in <module>()
3 radius = Angle(0.05 *u.deg)
4 bkg = ring_background_estimate(pos=pos, on_radius=radius, inner_radius=radius, outer_radius=radius, events=event_list)
----> 5 len(bkg.off_events)
TypeError: object of type 'EventList' has no len()
???
bkg.on_events[0]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-45-4bca3b3626a0> in <module>()
3 radius = Angle(0.05 *u.deg)
4 bkg = ring_background_estimate(pos=pos, on_radius=radius, inner_radius=radius, outer_radius=radius, events=event_list)
----> 5 bkg.on_events[0]
TypeError: 'EventList' object does not support indexing
???
Peter is now sad and goes home.
Now I know Peter could have accessed EventList.table. But then He could have just kept his table in the first place.
Long story short: This class called EventList provides no further advantage over a simple table.
I could write more of these stories.
I could write more of these stories.
@mackaiver - Thanks. This feedback is very helpful.
Based on this, I already have a ton of things I want to clean up and improve in Gammapy. So I would suggest that either you wait a bit until things are better (or help make it better one small pull request at a time) before giving more feedback, or we work together to make a tutorial of how little Peter can feed his data to Gammapy for analysis now (by bringing it into the form Gammapy expects, either via files, or in-memory objects). E.g. analysis needs things like MJDREFI and ONTIME, so it's not unreasonable to ask Peter to add some info to the table meta dict before passing it to Gammapy following the standard events format. This part of Gammapy probably won't change, times have to be given in the FITS standard format. Other things you point out and notice above are issues or at least things we should do better (e.g. events.info can just print "not available" for missing pieces of info instead of erroring out).
This part of Gammapy probably won't change, times have to be given in the FITS standard format.
Why? Why not just use an astropy.time.Time? Why force an on disk format's specification for in memory representation when there is something better available in the already used ecosystem?
Why not just use an astropy.time.Time?
We could. Putting Time and SkyCoord objects in Table, as well as using using QTable was considered in #980 and it was a close decision whether to do it or not. I decided against it because I wanted to do the numpy array to Time / SkyCoord conversion not on event table read, but only if these objects are actually needed. Our current scheme has the additional advantage that it round-trips to the standard event list format and that I/O remains simpler than doing the to object mapping on I/O. Maybe we could / should re-open that discussion.
@MaxNoe @mackaiver - If you can join the call tomorrow evening 8 pm, please do. https://github.com/gammapy/gammapy-meetings/tree/master/2018-02-15 It will be focused on gammapy.maps, but map I/O will be a major topic, and we can discuss event I/O or other related points of how Gammapy should work that are in this thread as well.
Handling of I/O formats, location of I/O code and registries will be part of the Gammapy v2.0 roadmap.