How should data be downloaded/distributed?

Open namurphy opened this issue 8 years ago • 10 comments

The issue of how data should be downloaded and distributed was labelled as the biggest question on the wiki page. It seems worthwhile to raise it as an issue so as to allow some discussion.

One issue that someone else brought up is that the CHIANTI database does not have a license, as far as we know. This leaves the legal status of adapting the data stored by CHIANTI unclear, since we don't know whether modifications or redistributions of the database are allowed. If CHIANTI officially adopted the MIT license like fiasco, then it would clear up any confusion.

With respect to data storage and distribution, the easiest thing would be to store all of the data files within GitHub, though they may end up being too large. The git history itself might end up getting huge that way.

Another option would be to store the HDF5 files on a site like Zenodo. I really like Zenodo because:

  • Data and other research products are openly available.
  • Data sets are citable (e.g., they are each given a DOI)
  • Versioning is supported (e.g., you can link to updated or older versions of the same data set, while each version has a different DOI)
  • Licensing of the data is supported
  • It's at no cost to researchers
  • The maximum standard data set size is 50 GB, with exceptions possibly being made for larger data sets

One option would be to store the HDF5 files on Zenodo, and then have them be downloaded during the installation process.

namurphy avatar Sep 01 '17 17:09 namurphy

Also, I was thinking it would be great to coordinate with people from other databases (e.g., AtomDB/pyatomdb) who might end up storing data in HDF5 files in the future. It would be great for the format/structure of each of the files to be as similar as can be. This would allow direct comparisons between the different databases, and allow routines to be used interchangeably between them.

namurphy avatar Sep 01 '17 17:09 namurphy

Thanks for pulling this into an issue @namurphy. @cadair and @dpshelio also both brought this up on the Sunpy-dev mailing list. It is an issue that I've thought a lot about and one that does not seem to have a clear solution.

For the foreseeable future, I think the best way to go about this is to rebuild the (ASCII-format) CHIANTI database into HDF5 locally the first time a user installs fiasco (and after any updates to the database). It is not a terribly intensive computation (~30 mins on my ancient MacBook Pro) and it means that users who already have CHIANTI installed (e.g. through SolarSoft) can continue to use that same version of the database. If the user has not specified where to look for the ASCII database, or it cannot be found there, it could be downloaded automatically.

I do like services like Zenodo and I think hosting (and versioning!) the database is a good idea in principle. However, the responsibility of managing and updating a hosted database is not trivial. At this point, I think I'd rather not have this responsibility especially since this project is so new. The lack of a license may also be prohibitive here. All that being said, I think a hosted and versioned HDF5 (or another convenient format) version should be the end goal.

The license issue is another headache. It was brought up in chianti-atomic/ChiantiPy#76 and pretty much went nowhere. I'm in contact with a few of the CHIANTI developers so I could pose this question to them directly. From what I can gather, there isn't really any opposition to licensing the database; it's just that no one has thought to do it. I think if we present a convincing case they would probably be willing to add a license.

wtbarnes avatar Sep 01 '17 19:09 wtbarnes

Also a good point about AtomDB. I know comparisons between all of these different atomic databases/codes have been a headache in the past so if this package could make these comparisons easier that would be great.

The current layout of the HDF5 file is pretty much exactly the same as the CHIANTI directory layout. However, to facilitate easy comparisons, I don't think the file layouts necessarily have to be the same. The data just has to be exposed (i.e. through some API) in the same way or at least in a flexible way. One of my goals is to abstract the details of the database away from the user-facing code. I think this type of approach would make the kind of comparisons you're talking about much easier.
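To make the "same API, different layout" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the AtomicDatabase interface, the two backend classes, the energy_levels method, and the toy numbers are illustrations only, not part of fiasco, CHIANTI, or AtomDB.

```python
from abc import ABC, abstractmethod


class AtomicDatabase(ABC):
    """Hypothetical common interface; storage layout is a backend detail."""

    @abstractmethod
    def energy_levels(self, ion):
        """Return a list of level energies for an ion such as 'fe_15'."""


class DirectoryLayoutBackend(AtomicDatabase):
    """Data keyed by element, then ion, mirroring a directory tree."""

    def __init__(self, tree):
        self._tree = tree

    def energy_levels(self, ion):
        element = ion.split("_")[0]
        return list(self._tree[element][ion]["energy"])


class FlatLayoutBackend(AtomicDatabase):
    """Data stored as flat (ion, energy) rows instead."""

    def __init__(self, rows):
        self._rows = rows

    def energy_levels(self, ion):
        return [energy for name, energy in self._rows if name == ion]


def max_level_difference(db_a, db_b, ion):
    """Compare two databases through the shared interface only."""
    pairs = zip(db_a.energy_levels(ion), db_b.energy_levels(ion))
    return max(abs(a - b) for a, b in pairs)
```

The comparison function never touches the underlying layout, so a second database would only need its own backend class rather than a matching file structure.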

wtbarnes avatar Sep 01 '17 19:09 wtbarnes

Agreed - as long as there is a common API/user-facing code, then it would be enough to enable easy comparisons for users. It may come down to what ends up being simplest, i.e., is it easier in the long run to give the HDF5 files the same layout, or to have two sets of methods to access the different HDF5 files? This may end up being a decision fiasco doesn't have to make, as it is the first of these databases to convert to HDF5 (as far as I know, though my knowledge is limited). It will probably be whoever does this second who has to make that decision. In any case, yay HDF5! 👍

If I remember correctly, AtomDB currently uses .fits files.

namurphy avatar Sep 01 '17 20:09 namurphy

Just to be a bit more specific, here is what I'm thinking as far as downloading and accessing the data with fiasco... (this seemed the most logical place to record this and I wanted to write it down before I forgot it!)

As I mentioned above, I think it is best (for now) to rebuild the ASCII CHIANTI database on the user's end as an HDF5 file and not worry about distributing this ourselves. This could come later. Doing it this way, there are of course challenges with building and updating the user's database.

At import time, parse the config file ~/.fiasco/fiascorc. In my prototypes, I've structured it as follows,

[database]
dbase_root = '/path/to/chianti/dbase'
hdf5_dbase_root = '/path/to/hdf5/chianti/dbase.h5'

Read these paths into some defaults dict. If either key doesn't exist (or the rc file itself does not exist), default to ~/.fiasco/chianti_dbase and ~/.fiasco/chianti_dbase.h5, respectively. That way, everything is contained in ~/.fiasco unless explicitly stated by the user.
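A minimal sketch of this step using the standard-library configparser, assuming the rc file layout shown above. The parse_config name and the DEFAULTS dict are placeholders, not necessarily what fiasco uses. Because the prototype file quotes its values and configparser keeps quotes as part of the value, the sketch strips them.

```python
import configparser
from pathlib import Path

# Default locations from the description above; everything lives in ~/.fiasco.
FIASCO_HOME = Path.home() / ".fiasco"
DEFAULTS = {
    "dbase_root": str(FIASCO_HOME / "chianti_dbase"),
    "hdf5_dbase_root": str(FIASCO_HOME / "chianti_dbase.h5"),
}


def parse_config(rc_file=None):
    """Return the database paths, falling back to DEFAULTS for missing keys."""
    rc_file = Path(rc_file) if rc_file is not None else FIASCO_HOME / "fiascorc"
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips files that do not exist.
    parser.read(rc_file)
    settings = dict(DEFAULTS)
    if parser.has_section("database"):
        for key in DEFAULTS:
            if parser.has_option("database", key):
                # The prototype rc file quotes its values, so strip the quotes.
                settings[key] = parser.get("database", key).strip("'\"")
    return settings
```

Calling parse_config() with no argument reads ~/.fiasco/fiascorc if it exists and otherwise returns the defaults unchanged.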

Next, check if the dbase_root directory exists (some additional checking could be done on the contents). If it does not, download (and unzip) the CHIANTI database from here.

Finally, check if the hdf5_dbase_root file exists. If it does not, build it from the ASCII files. If the file does exist, there could be a check to see whether the ASCII files have been updated since the HDF5 file was created/updated, with the affected datasets then updated appropriately.

So in pseudocode,

defaults = parse_config('~/.fiasco/fiascorc')

# Fall back to locations inside ~/.fiasco for any missing key
if 'dbase_root' not in defaults:
    defaults['dbase_root'] = '~/.fiasco/chianti_dbase'
if 'hdf5_dbase_root' not in defaults:
    defaults['hdf5_dbase_root'] = '~/.fiasco/chianti_dbase.h5'

# Fetch the ASCII database if it is missing
if not exists(defaults['dbase_root']):
    download_dbase(CHIANTI_URL, defaults['dbase_root'])
# Build the HDF5 file from the ASCII files, or check whether it is stale
if not exists(defaults['hdf5_dbase_root']):
    build_hdf5_dbase(defaults['hdf5_dbase_root'])
else:
    check_for_updates(defaults['hdf5_dbase_root'])

This is just a rough outline and I'd be interested to hear people's thoughts on this.
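For the "check for updates" step, one simple possibility is to compare file modification times. This is only a sketch under that assumption: the function name is made up, and it reports whether a rebuild is needed rather than which datasets changed.

```python
import os


def ascii_newer_than_hdf5(dbase_root, hdf5_path):
    """Return True if any file under dbase_root was modified after hdf5_path."""
    hdf5_mtime = os.path.getmtime(hdf5_path)
    for dirpath, _dirnames, filenames in os.walk(dbase_root):
        for name in filenames:
            if os.path.getmtime(os.path.join(dirpath, name)) > hdf5_mtime:
                return True
    return False
```

Modification times can be misleading (for example after copying or unpacking files), so this is a cheap first check rather than a reliable version test.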

wtbarnes avatar Sep 20 '17 07:09 wtbarnes

Overall this sounds great to me!

One possible minor issue is that looking for ~/.fiasco/fiascorc might not work on a Windows machine. A possible way to fix this would be to have a different default file location on Windows, and then check which OS is being used to figure out what the default file location should be.

namurphy avatar Sep 20 '17 18:09 namurphy

Good point about Windows. We'll have to be careful about being cross-platform. Historically, CHIANTI (and ChiantiPy) have relied on setting the XUVTOP (no idea what that could stand for, eXtreme UltraViolet TOP directory???) environment variable, an unfortunate dependence I don't want to carry over into fiasco.

I think using an approach like the one outlined in this SO answer should work though I don't have a Windows machine to actually test this on.
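One way this could look, which is not necessarily what the linked answer suggests: let the standard library resolve the home directory, since pathlib.Path.home() works on Windows as well as on macOS and Linux. The default_config_file name is a placeholder.

```python
from pathlib import Path


def default_config_file():
    """Return the default rc file location inside the user's home directory."""
    # Path.home() resolves the home directory on Windows, macOS and Linux,
    # so no explicit check of the operating system is needed here.
    return Path.home() / ".fiasco" / "fiascorc"
```

This keeps the same relative layout on every platform and avoids any dependence on an environment variable like XUVTOP.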

wtbarnes avatar Sep 20 '17 18:09 wtbarnes

Over the past two or so days, I've pushed several commits that essentially implement the system I described above. The main parts are contained in

  • fiasco.util.download_dbase()
  • fiasco.util.build_hdf5_dbase()

both contained in fiasco/util/setup_db.py. They are a bit clunky (lots of if/else) and may not cover every corner case, but they'll do for now. In each case, the user is prompted before either downloading the data or building the HDF5 file.

One issue is where to do the downloading and file building. I don't want this to have to be a manual step for the user, but I also don't want to do too much under the hood. I originally did this at import fiasco, but ultimately decided to just do it when an IonBase object, which requires the existence of the HDF5 database, is instantiated.
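As a rough illustration of deferring the check until instantiation, here is a stand-in class. It is not the real IonBase: the constructor signature, the setup callback, and MissingDatabaseError are all hypothetical and only show where the check would sit.

```python
import os


class MissingDatabaseError(Exception):
    """Raised when the HDF5 database has not been built yet."""


class IonBase:
    """Stand-in for fiasco's IonBase; only the database check is sketched."""

    def __init__(self, ion_name, hdf5_dbase_root, setup=None):
        # Defer all database work until an object actually needs the file.
        if not os.path.isfile(hdf5_dbase_root):
            if setup is None:
                raise MissingDatabaseError(hdf5_dbase_root)
            setup(hdf5_dbase_root)  # e.g. prompt, download, then build
        self.ion_name = ion_name
        self.hdf5_dbase_root = hdf5_dbase_root
```

With this arrangement a plain import does no work, and nothing is downloaded or built until the user creates an object that needs the data.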

If others could try this out and/or give their thoughts on how best to handle the downloading that'd be great.

wtbarnes avatar Sep 27 '17 07:09 wtbarnes

@wtbarnes - regarding the updates, you could keep an MD5 checksum of the data files that CHIANTI offers. Ideally they would provide such signatures on their side (that's something that you could add to the comments when talking with them about the license). I completely agree with you that the easiest approach is to download it from CHIANTI directly and make the conversion on your side. I like the idea of Zenodo, but that should be done by CHIANTI as it's their database. I could imagine people getting anxious if you ended up getting the citations because the data was easier to find on Zenodo.

dpshelio avatar Dec 09 '17 15:12 dpshelio

@dpshelio Yeah I like the idea of some sort of hash/checksum to effectively version the data locally, though I'm really not sure about the best way to implement this. The CHIANTI team does also provide a version number each time they release an update to the database.
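One possible shape for the checksum idea, using only hashlib from the standard library. The function names and the idea of keeping a per-file manifest are assumptions for illustration, not an existing fiasco feature.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(dbase_root):
    """Map each file's path relative to dbase_root to its MD5 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(dbase_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, dbase_root)] = file_md5(full)
    return manifest


def changed_files(old_manifest, new_manifest):
    """Return sorted paths that were added, removed, or modified."""
    paths = set(old_manifest) | set(new_manifest)
    return sorted(p for p in paths if old_manifest.get(p) != new_manifest.get(p))
```

A manifest saved when the HDF5 file is built could later be compared against a fresh one, so that only the datasets belonging to changed files need rebuilding.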

wtbarnes avatar Dec 11 '17 20:12 wtbarnes