
Methodology details, and `write.gmt` helper functions?

Open dereckmezquita opened this issue 2 years ago • 3 comments

Hi, I came across your package, which could potentially save me a lot of work, so thank you.

Could you publish the details of your method for converting genes from human to other species? I need this information in order to cite you in my research.

Also, would you consider adding helper functions to convert the data.frame output to a structure that can easily be written as a .gmt pathway file?

dereckmezquita • Apr 17 '22 19:04

Thank you for your interest. The gene conversion happens using a different package, babelgene. The vignette includes some background info, but let me know if anything is unclear. The code for pre-processing the data is available as well if you really want to dive deep.

There are a few different GMT writer functions available, such as cmapR::write_gmt, pathwayPCA::write_gmt, immcp::write_gmt, and rWikiPathways::writeGMT. I have not tried any of them, but I am not sure another function would solve any new problems.

igordot • Apr 18 '22 01:04

Thank you for that; I will look into babelgene.

And thank you for pointing out those GMT writer functions.

I've written one myself in the past. I suppose what I was really asking for is helper functions that select a collection (for example, hallmark), extract each gene set's name, description URL, and genes (in their original order), and put them into a format that can then be written to a file as a GMT.

For example, convert the HALLMARK collection to a list of character vectors (the list holds the gene sets; each vector is one gene set). This would be a list of 50 elements, since HALLMARK has only 50 gene sets. Each element holds a character vector with the pathway (gene set) name first, then the description URL as in the standard GMT distributed by Broad, and then the genes.

This object could then be written line by line, using \t as the separator.
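A minimal sketch of that conversion in base R. It assumes a long-format data frame with one row per gene and columns named `gs_name` and `gene_symbol` (the names msigdbr's output uses, as far as I know). The toy data, the `description` column, and the helper names `to_gmt_list` and `write_gmt_lines` are hypothetical and not part of any package:

```r
# Toy long-format table: one row per (gene set, gene) pair.
# gs_name and gene_symbol are assumed column names; description is hypothetical.
gene_sets <- data.frame(
  gs_name     = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  description = c("http://example.org/SET_A", "http://example.org/SET_A",
                  "http://example.org/SET_B", "http://example.org/SET_B",
                  "http://example.org/SET_B"),
  gene_symbol = c("TP53", "MYC", "EGFR", "KRAS", "BRAF"),
  stringsAsFactors = FALSE
)

# Build one character vector per gene set: name, description, then the genes
# in the order their rows appear in the table.
to_gmt_list <- function(df) {
  # Fix the factor levels so sets keep their first-seen order.
  pieces <- split(df, factor(df$gs_name, levels = unique(df$gs_name)))
  lapply(pieces, function(x) {
    c(x$gs_name[1], x$description[1], unique(x$gene_symbol))
  })
}

# Collapse each vector into one tab-separated line and write one line per set.
write_gmt_lines <- function(gmt_list, path) {
  lines <- vapply(gmt_list, paste, character(1), collapse = "\t")
  writeLines(lines, path)
  invisible(lines)
}

gmt_list <- to_gmt_list(gene_sets)
out_file <- tempfile(fileext = ".gmt")
write_gmt_lines(gmt_list, out_file)
```

The genes come out in table row order, so this only reproduces the Broad ordering if the input rows are already in that order.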


The tricky parts I am facing are extracting the rows that belong to specific gene set collections and recovering the original order of the genes in a given gene set.

Might you be able to tell me how I could recover the original order of the genes in a given gene set? As I understand it, the gene sets in GSEA GMT files list their genes in a specific order, from most to least important. I don't see this ordering in the datasets offered here; am I missing something?

As a proof of concept, I would like to convert the Homo sapiens data back into separate GMT files that match those distributed by Broad. I don't know how I would get the gene order, though.

I am looking for a way to extract the genes relating to these 5 specific pathway collections:

  • msigdb.v7.5.1.symbols.gmt.txt
  • c2.cp.kegg.v7.5.1.symbols.gmt.txt
  • c2.cp.reactome.v7.5.1.symbols.gmt.txt
  • c5.go.bp.v7.5.1.symbols.gmt.txt
  • h.all.v7.5.1.symbols.gmt.txt
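A rough sketch of how those subsets might be selected in base R. It assumes the table carries collection columns named `gs_cat` and `gs_subcat` with codes such as `H`, `C2` / `CP:KEGG`, `C2` / `CP:REACTOME`, and `C5` / `GO:BP`. That is my understanding of msigdbr 7.5.1, so check it against the installed version. The toy rows and the helper `select_collection` are hypothetical:

```r
# Toy table standing in for the full gene set table.
# gs_cat / gs_subcat names and codes are assumptions; the rows are made up.
toy <- data.frame(
  gs_cat      = c("H", "C2", "C2", "C5", "C2"),
  gs_subcat   = c("", "CP:KEGG", "CP:REACTOME", "GO:BP", "CGP"),
  gs_name     = c("HALLMARK_TOY", "KEGG_TOY", "REACTOME_TOY", "GOBP_TOY", "CGP_TOY"),
  gene_symbol = c("TP53", "MYC", "EGFR", "KRAS", "BRAF"),
  stringsAsFactors = FALSE
)

# Keep rows from one collection, optionally narrowed to one sub-collection.
select_collection <- function(df, cat, subcat = NULL) {
  keep <- df$gs_cat == cat
  if (!is.null(subcat)) {
    keep <- keep & df$gs_subcat == subcat
  }
  df[keep, , drop = FALSE]
}

# The unfiltered table corresponds to the all-collections file.
all_sets <- toy
hallmark <- select_collection(toy, "H")
kegg     <- select_collection(toy, "C2", "CP:KEGG")
reactome <- select_collection(toy, "C2", "CP:REACTOME")
go_bp    <- select_collection(toy, "C5", "GO:BP")
```

Each subset can then be converted and written to its own file.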

Finally, thank you again for the package; it is clearly a lot of work. Matching gene names between human and other species is not a trivial task.

dereckmezquita • Apr 18 '22 01:04