msigdbr
msigdbr copied to clipboard
Save the 'entrez_gene' columns in character mode
First thanks for this great package! Especially it directly outputs three different gene ID types, which saves a lot of time when switching between different gene ID types.
I have a small suggestion. Here in the output table, columns related to "entrez_gene" are stored as integers. I would suggest to change to characters, as what other Bioconducror annotation package does (e.g. org.Hs.eg.db).
gene_sets
# A tibble: 8,209 × 15
gs_cat gs_su…¹ gs_name gene_…² entre…³ ensem…⁴ human…⁵ human…⁶ human…⁷ gs_id gs_pmid gs_ge…⁸
<chr> <chr> <chr> <chr> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 H "" HALLMA… ABCA1 19 ENSG00… ABCA1 19 ENSG00… M5905 267710… ""
2 H "" HALLMA… ABCB8 11194 ENSG00… ABCB8 11194 ENSG00… M5905 267710… ""
3 H "" HALLMA… ACAA2 10449 ENSG00… ACAA2 10449 ENSG00… M5905 267710… ""
4 H "" HALLMA… ACADL 33 ENSG00… ACADL 33 ENSG00… M5905 267710… ""
5 H "" HALLMA… ACADM 34 ENSG00… ACADM 34 ENSG00… M5905 267710… ""
6 H "" HALLMA… ACADS 35 ENSG00… ACADS 35 ENSG00… M5905 267710… ""
7 H "" HALLMA… ACLY 47 ENSG00… ACLY 47 ENSG00… M5905 267710… ""
8 H "" HALLMA… ACO2 50 ENSG00… ACO2 50 ENSG00… M5905 267710… ""
9 H "" HALLMA… ACOX1 51 ENSG00… ACOX1 51 ENSG00… M5905 267710… ""
10 H "" HALLMA… ADCY6 112 ENSG00… ADCY6 112 ENSG00… M5905 267710… ""
# … with 8,199 more rows, 3 more variables: gs_exact_source <chr>, gs_url <chr>,
# gs_description <chr>, and abbreviated variable names ¹gs_subcat, ²gene_symbol,
# ³entrez_gene, ⁴ensembl_gene, ⁵human_gene_symbol, ⁶human_entrez_gene, ⁷human_ensembl_gene,
# ⁸gs_geoid
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Imagine we want to convert Entrez IDs to Refseq IDs, and we have a mapping vector (map
) where Entrez IDs are the names and Refseq IDs are the values. Then naturally, to convert, we can do:
map[gene_sets$entrez_gene]
This causes the problem because gene_sets$entrez_gene
are integers and it is actually treated as numeric indices for the map
vector, while not to match to the names in map
.
To do it correctly, we need to explicitly convert gene_sets$entrez_gene
to characters:
map[as.character(gene_sets$entrez_gene)]
The more severe consequence is, if the maximal numeric value in gene_sets$entrez_gene
is smaller than the length of map
, executing map[gene_sets$entrez_gene]
actually will not generate any warning or error message. And it would generate wrong results silently.