MassBank-data

Have GitHub and Zenodo releases synchronized

Open Adafede opened this issue 1 year ago • 6 comments

Hi,

Thank you for all the effort you have put into MassBank! I was trying to access its data and realized that https://github.com/MassBank/MassBank-data/releases and https://doi.org/10.5281/zenodo.3378723 are not synchronized.

This can be easily done by following https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content.

This way, each GitHub release is archived on Zenodo and gets its own DOI automatically.

Hope this makes sense!

Adafede avatar Aug 16 '23 08:08 Adafede

Thank you for bringing this to our attention. An automatic procedure should be in place, but apparently it's not working at the moment. I will look into this.

meier-rene avatar Aug 16 '23 09:08 meier-rene

I just checked and didn't find any differences. Could you please explain your finding in a bit more detail? What I did:

  • Downloaded the zip from zenodo: https://zenodo.org/record/8014263/files/MassBank/MassBank-data-2023.06.zip?download=1
  • Downloaded the release artifact from github: https://github.com/MassBank/MassBank-data/archive/refs/tags/2023.06.zip
  • Unzipped both archives and compared them
  • diff shows no differences on my system
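
The comparison described in the list above could be sketched in R roughly as follows. This is only a minimal sketch, not the procedure actually used here (which relied on diff): the helper name compare_dirs and all paths in the usage comments are hypothetical. The two archives may also unpack into differently named top-level folders, so the helper should be pointed at the corresponding folders inside them.

```r
# Hypothetical helper (not part of MassBank tooling): compare two
# already-unzipped directory trees by relative file path and MD5 checksum.
# Uses only base R and the 'tools' package that ships with R.
compare_dirs <- function(dir_a, dir_b) {
  # Relative paths of all files, including hidden ones, in each tree
  files_a <- sort(list.files(dir_a, recursive = TRUE, all.files = TRUE))
  files_b <- sort(list.files(dir_b, recursive = TRUE, all.files = TRUE))

  # Files present in both trees are compared by checksum
  common <- intersect(files_a, files_b)
  md5_a <- unname(tools::md5sum(file.path(dir_a, common)))
  md5_b <- unname(tools::md5sum(file.path(dir_b, common)))

  list(
    only_in_a = setdiff(files_a, files_b),
    only_in_b = setdiff(files_b, files_a),
    differing = common[md5_a != md5_b]
  )
}

# Example usage (file and folder names are assumptions):
# unzip("zenodo-2023.06.zip", exdir = "zenodo")
# unzip("github-2023.06.zip", exdir = "github")
# str(compare_dirs("zenodo/<top-level folder>", "github/<top-level folder>"))
```

If all three elements of the returned list are empty, the two trees contain the same files with the same contents, which corresponds to diff reporting no differences.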

meier-rene avatar Aug 16 '23 09:08 meier-rene

Wow, this is a fast reply!

I actually found the different json/sql/msp files available in the releases/tag/2023.06 very convenient and they do not seem to appear on Zenodo, but maybe I missed something?

P.S.: Is there any reason for having an sql and no sqlite which would make it directly readable by MsBackendMassbank? (Or did I miss something again here?)

Adafede avatar Aug 16 '23 09:08 Adafede

Yes, you are right. Zenodo only covers the txt files. That's a result of GitHub's automatic Zenodo release procedure. I don't know how to automatically attach the other release artifacts to the Zenodo release.

I have no answer to your second question at the moment. The sql file is released for the MsBackendMassbank package, but we did not put much effort into it. It's basically a dump of our internal data structure. Maybe this sql file needs to be processed into an sqlite file? I need to do some research. Maybe @jorainer didn't want to create additional workload on our side? I found this script: https://github.com/rformassspectrometry/MsBackendMassbank/blob/main/inst/scripts/massbank-to-sqlite.R. If that's the case, we can probably modify our scripts to create the sqlite artifact instead of the sql file.

meier-rene avatar Aug 16 '23 09:08 meier-rene

👍🏼 Having the different "ready-to-use" files on Zenodo would be a plus (I also don't know how to attach artifacts to Zenodo releases automatically... I will search a bit and come back if I find something). I was also using @jorainer's nice script, and probably many others are doing the same... so generating the sqlite directly would indeed add some work on your side, but it would avoid that work being replicated many times elsewhere.

Adafede avatar Aug 16 '23 09:08 Adafede

Note: my preferred way to access/use MassBank data in R is through AnnotationHub:

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "MassBank")
AnnotationHub with 3 records
# snapshotDate(): 2023-06-23
# $dataprovider: MassBank
# $species: NA
# $rdataclass: CompDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH107048"]]' 

             title                                
  AH107048 | MassBank CompDb for release 2021.03  
  AH107049 | MassBank CompDb for release 2022.06  
  AH111334 | MassBank CompDb for release 2022.12.1

So, as of now, these 3 releases are available through AnnotationHub. To use one of them:

mb <- ah[["AH107049"]]
mb
class: CompDb 
 data source: MassBank 
 version: 2022.06 
 organism: NA 
 compound count: 90190 
 MS/MS spectra count: 90190 

This CompDb can be used directly with Spectra (i.e. Spectra(mb) would get you all MS2 spectra). Besides being available through AnnotationHub, the resource (an sqlite file) is also cached locally: it is downloaded the first time, and any subsequent use loads it from the local cache.

There is, however, a manual step involved, since I need to convert the MassBank data structures into a CompDb SQLite (using this script) and then also upload and maintain these releases in Bioconductor's AnnotationHub... but I think that this should simplify the use of MassBank in R tremendously. The long-term goal is to also provide other annotation resources (as CompDb?) through AnnotationHub...

jorainer avatar Aug 23 '23 13:08 jorainer